Ticket 4905

Summary: Invalid account or account/partition combination specified
Product: Slurm
Reporter: Javier Cardenas <javier.cardenas>
Component: Accounting
Assignee: Director of Support <support>
Status: RESOLVED CANNOTREPRODUCE
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: nmajeran
Version: 17.02.5
Hardware: Linux
OS: Linux
Site: Confidential
Confidential Site: Screaming Hairy Armadillo

Description Javier Cardenas 2018-03-12 17:30:25 MDT
Hi,
Today we created a slurm account for a new user. When we attempt to run a job as that user, it fails with the following:

[root@lx-chmmqrslrm03 ~]# runuser -c "/logs/slurm/bin/srun -p veryshort hostname" swang1
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified


In the slurm logs we see:
2018-03-12T23:04:27.27797 slurmctld: _job_create: invalid account or partition for user 14028, account '(null)', and partition 'veryshort'


From the slurm controller, I can resolve the user without issue.
[root@lx-chmmqrslrm03 ~]# getent passwd 14028
swang1:*:14028:11:Sheng Wang:/home/swang1:/bin/bash


I've deleted and recreated his slurm account without any change in behavior.
This user's UID has not been changed (per bug ID 3575).

I've also been able to reproduce this issue with another account I created for a user on my team who joined the firm 3 months ago.

Any help in diagnosing this problem is greatly appreciated. Please let me know if I can provide any other details.

Thanks,
Javier
Comment 2 Isaac Hartung 2018-03-13 14:32:04 MDT
Hi Javier,

Could you run "sacctmgr show user swang1" and send me the output?

Did you add the user with something close to the following command?

sacctmgr add user swang1 account=$account_name
Comment 3 Javier Cardenas 2018-03-13 14:39:40 MDT
Hi Isaac,

jcardena@lx-chmmqrslrm03 ~$ sacctmgr show user swang1
      User   Def Acct     Admin
---------- ---------- ---------
jcardena@lx-chmmqrslrm03 ~$ sacctmgr show assoc where account=swang1
   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
   cluster     swang1                            1000                                                                                                                                                  normal
   cluster     swang1      14028                 1000         cpu=2300,mem+                                        2300                                                                                normal

"sacctmgr show user swang1" returns no values (the same is true for working accounts). I've included the association output for this user's account above.


This is the portion of our user creation process that creates the account and the user:

        $sacctmgr -i create account name=$user parent=baseusers FairShare=1000
        $sacctmgr -i create user account=$user name=$uid FairShare=1000
        $sacctmgr -i modify user $uid set MaxJobs=$corecount GrpCpus=$corecount FairShare=1000 GrpMem=$mem GrpTres=gres/io=400
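
A minimal sketch of how the result of these commands could be checked from the shell. It assumes the association list is requested in parsable form (sacctmgr -nP show assoc format=cluster,account,user), which prints one "|"-separated row per association. The helper name has_assoc is hypothetical and not part of Slurm.

```shell
#!/bin/sh
# has_assoc ACCOUNT USER  (hypothetical helper, not a Slurm command)
# Reads "cluster|account|user" rows on stdin (the layout assumed for
# `sacctmgr -nP show assoc format=cluster,account,user`) and succeeds
# only if some row pairs ACCOUNT with USER.
has_assoc() {
    awk -F'|' -v acct="$1" -v user="$2" '
        $2 == acct && $3 == user { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example using the two associations listed earlier in this comment
# (the account-level row, then the user-level row):
printf 'cluster|swang1|\ncluster|swang1|14028\n' | has_assoc swang1 14028 \
    && echo "association present"
```

On a live cluster the same function could be fed by "sacctmgr -nP show assoc where account=swang1 format=cluster,account,user". This only confirms what the accounting database holds, not what the running slurmctld has loaded.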
Comment 4 Javier Cardenas 2018-03-13 14:40:35 MDT
We are considering restarting slurmctld today to see if that might resolve the issue. I've already tried an scontrol reconfigure.
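
One way to see whether the running slurmctld already knows about a newly created association, short of a restart, might be to inspect its in-memory cache with "scontrol show assoc_mgr flags=assoc". The sketch below rests on assumptions: it assumes each association in that dump appears on a line carrying Account=<name> and UserName=<name>(<uid>) tokens, and the helper name ctld_has_assoc is hypothetical.

```shell
#!/bin/sh
# ctld_has_assoc ACCOUNT USER  (hypothetical helper, not a Slurm command)
# Reads `scontrol show assoc_mgr flags=assoc` style output on stdin and
# succeeds if some line carries both Account=ACCOUNT and a UserName
# token that starts with "USER(". The token layout is an assumption.
ctld_has_assoc() {
    awk -v acct="$1" -v user="$2" '
        {
            a = 0; u = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "Account=" acct) a = 1
                if (index($i, "UserName=" user "(") == 1) u = 1
            }
            if (a && u) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Illustrative input line, made up for this example (not real cluster output):
echo 'ClusterName=cluster Account=swang1 UserName=14028(14028) Partition= ID=42' \
    | ctld_has_assoc swang1 14028 && echo "controller cache has the association"
```

If the database lists the association but the controller cache does not, that would point at slurmctld not having picked up the update, which is consistent with a restart helping.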
Comment 5 Isaac Hartung 2018-03-13 15:36:03 MDT
Ok, swang1 is the account name, not the user name; my mistake. The output you sent looks fine. Could you send the runuser script you used in your initial comment?
Comment 6 Javier Cardenas 2018-03-13 16:19:16 MDT
We restarted the slurmctld service and the 3 newly created accounts began to work.
I created a new account for another user requesting access and that account also works now.

The runuser command wasn't part of any script. I was just trying to run "srun hostname" in our veryshort partition:

runuser -c "/logs/slurm/bin/srun -p veryshort hostname" swang1



The only thing I can think of is that we decommissioned an old LDAP server in the environment a couple of weeks ago. However, the getent output in my initial comment shows that the controller was able to resolve the user, so I don't think the LDAP work was the cause, unless slurmctld was still looking at the old LDAP server, which now has slapd disabled. Even then, I imagine existing users would have failed as well, unless the controller caches them permanently...
Comment 7 Isaac Hartung 2018-03-13 16:24:13 MDT
Ok, well I'll close this ticket for now, but should the problem resurface, please reopen it.

Regards