Hi,

Today we created a slurm account for a new user. When we attempt to run a job as that user, it fails with the following:

[root@lx-chmmqrslrm03 ~]# runuser -c "/logs/slurm/bin/srun -p veryshort hostname" swang1
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

In the slurm logs we see:

2018-03-12T23:04:27.27797 slurmctld: _job_create: invalid account or partition for user 14028, account '(null)', and partition 'veryshort'

From the slurm controller, I can resolve the user without issue.

[root@lx-chmmqrslrm03 ~]# getent passwd 14028
swang1:*:14028:11:Sheng Wang:/home/swang1:/bin/bash

I've deleted and recreated his slurm account without any change in behavior. This user's UID has not been changed (per bug ID 3575). I've also been able to reproduce this issue with another account I created for a user on my team who joined the firm 3 months ago.

Any help in diagnosing this problem is greatly appreciated. Please let me know if I can provide any other details.

Thanks,
Javier
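For anyone triaging the same error, the three fields in that slurmctld message (UID, account, partition) can be pulled out of the log mechanically. The following is a minimal sketch, assuming the log line has exactly the format quoted above; parse_job_create_error is a hypothetical helper name, not something shipped with Slurm.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads slurmctld log lines on stdin
# and prints uid, account, and partition for each _job_create rejection.
# Assumes the message format quoted in the report above.
parse_job_create_error() {
    sed -n "s/.*_job_create: invalid account or partition for user \([0-9][0-9]*\), account '\([^']*\)', and partition '\([^']*\)'.*/uid=\1 account=\2 partition=\3/p"
}
```

Feeding it the line from the report yields the UID 14028, the account as logged ('(null)', consistent with the srun command passing no account), and the partition veryshort. The UID can then be checked with getent passwd as shown above.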
Hi Javier,

Could you run "sacctmgr show user swang1" and send me the output? Did you add the user with something close to the following command?

sacctmgr add user swang1 account=$account_name
Hi Isaac,

jcardena@lx-chmmqrslrm03 ~$ sacctmgr show user swang1
      User   Def Acct     Admin
---------- ---------- ---------

jcardena@lx-chmmqrslrm03 ~$ sacctmgr show assoc where account=swang1
   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
   cluster     swang1                            1000                                                                                                                                                  normal
   cluster     swang1      14028                 1000         cpu=2300,mem+                                        2300                                                                                normal

sacctmgr show user swang1 returns no values (the same is true for working accounts). I've included other output for this user's account above.

This is the portion of the user creation process that creates the user.

$sacctmgr -i create account name=$user parent=baseusers FairShare=1000
$sacctmgr -i create user account=$user name=$uid FairShare=1000
$sacctmgr -i modify user $uid set MaxJobs=$corecount GrpCpus=$corecount FairShare=1000 GrpMem=$mem GrpTres=gres/io=400
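To make the creation steps above easier to review before they touch the accounting database, they could be wrapped in a function with a dry-run mode. This is only a sketch under assumptions: create_slurm_user, run, and DRY_RUN are hypothetical names introduced here, and the sacctmgr arguments are copied from the three commands quoted above.

```shell
#!/bin/sh
# Sketch only: wraps the three sacctmgr calls quoted above.
# create_slurm_user, run, and DRY_RUN are hypothetical, not Slurm features.
sacctmgr=${sacctmgr:-sacctmgr}   # path to sacctmgr, as in the original $sacctmgr

# Print the command instead of executing it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Arguments: account name, user name passed to sacctmgr, core count, memory limit.
create_slurm_user() {
    user=$1 uid=$2 corecount=$3 mem=$4
    run "$sacctmgr" -i create account name="$user" parent=baseusers FairShare=1000 &&
    run "$sacctmgr" -i create user account="$user" name="$uid" FairShare=1000 &&
    run "$sacctmgr" -i modify user "$uid" set MaxJobs="$corecount" GrpCpus="$corecount" \
        FairShare=1000 GrpMem="$mem" GrpTres=gres/io=400
}
```

With DRY_RUN=1 the function prints the three commands it would run, which makes it easy to see that the user record is created under the value passed as $uid (14028 in the association listing above), while swang1 is the account name.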
We are considering restarting slurmctld today to see if that might resolve the issue. I've already tried an scontrol reconfigure.
Ok, swang1 is the account name, not the user name; my mistake. The output you sent looks fine. Could you send the runuser script you used in your initial comment?
We restarted the slurmctld service and the 3 newly created accounts began to work. I created a new account for another user requesting access and that account also works now.

The runuser command wasn't part of any script. It was just trying to srun hostname in our veryshort partition.

runuser -c "/logs/slurm/bin/srun -p veryshort hostname" swang1

The only thing I can think of is that we did decommission an old LDAP server in the environment a couple of weeks ago. However, we see from the getent output in my initial comment that the controller was able to resolve the user, so I don't think it was the LDAP work, unless maybe the slurmctld was looking at the old LDAP server, which has slapd disabled now. Although, I imagine even existing users would fail unless the controller caches them permanently...
Ok, well I'll close this ticket for now, but should the problem resurface, please reopen it. Regards