Ticket 2860 - Unable to run sacct on new system, permission denied
Summary: Unable to run sacct on new system, permission denied
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 15.08.7
Hardware: Linux Linux
: 3 - Medium Impact
Assignee: Tim Wickberg
Reported: 2016-06-27 11:42 MDT by Jeff White
Modified: 2016-06-27 12:36 MDT
Site: Washington State University


Description Jeff White 2016-06-27 11:42:45 MDT
I am working on getting XDMod running, which uses sacct.  On this system I cannot run sacct for some reason:

$ sacct --verbose
sacct: Jobs eligible from Sun Jun 26 00:00:00 2016 - Now
sacct: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
sacct: error: slurmdbd: Sending DbdInit msg: Access/permission denied
sacct: error: Problem talking to the database: Access/permission denied

This works on every other system I have.  It is running munge, has the correct key, and has the exact same version of Slurm as working systems (in fact it was deployed by SaltStack, which /forces/ systems to be in identical states).  I haven't found anything in any logs that shows anything useful as to why this system is giving a permission error.  In fact, nothing Slurmy seems to be working:

$ sinfo 
sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received
slurm_load_partitions: Zero Bytes were transmitted or received

Any ideas?
Comment 1 Tim Wickberg 2016-06-27 11:59:27 MDT
This looks an awful lot like a munge key mismatch. Can you check that munge has been restarted on the new host after installing the cluster key?

I think you'll see the same message if the clocks are out of sync on the systems by more than a minute.

There should be some log messages in slurmdbd / slurmctld that would narrow down the issue - can you test and provide those from the same time as your failed commands?
Comment 2 Jeff White 2016-06-27 12:36:31 MDT
You had it with the clock sync comment.  Looks like we lost access to our NTP server and this new host was the only one off enough to cause a problem.  From what I can see, Slurm and munge don't log this as an error.
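
For anyone hitting the same symptoms, the two checks discussed in this ticket (munge credentials decoding across hosts, and clock offset between hosts) can be sketched roughly as follows. This is only an illustration, not something taken from the ticket: the helper name clock_skew_ok is hypothetical, the host name "head-node" is a placeholder, and the 60-second default simply mirrors the "more than a minute" guess in Comment 1 rather than any documented limit.

```shell
# Hypothetical helper: compare two epoch timestamps and report the skew.
# $1, $2: epoch seconds taken on two hosts
# $3:     allowed skew in seconds (assumed default of 60, per Comment 1)
clock_skew_ok() {
    skew=$(( $1 - $2 ))
    # Take the absolute value so the argument order does not matter.
    [ "$skew" -lt 0 ] && skew=$(( -skew ))
    echo "skew=${skew}s"
    [ "$skew" -le "${3:-60}" ]
}

# Example usage (assumes ssh access to a placeholder host "head-node"):
#   clock_skew_ok "$(date +%s)" "$(ssh head-node date +%s)" || echo "clocks differ too much"
#
# Cross-host munge check, as described in the MUNGE documentation:
#   munge -n | ssh head-node unmunge
```

If the munge round trip fails while the key files are identical on both hosts, comparing the clocks is a reasonable next step, since that turned out to be the cause here.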