Ticket 13010

Summary:	sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
Product:	Slurm	Reporter:	Brian Hammond <bhammond>
Component:	Accounting	Assignee:	Ben Roberts <ben>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	3 - Medium Impact
Priority:	---
Version:	20.11.8
Hardware:	Linux
OS:	Linux
Site:	Albert Einstein	Slinky Site:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---

Description Brian Hammond 2021-12-10 07:34:50 MST

Hello

I am getting the message
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

for user bhammond (me). I think this is due to having the wrong UID associated with the user. scontrol returns this:
User Records

UserName=bhammond(1007) DefAccount=general DefWckey= AdminLevel=None

UID 1007 belongs to my account on the headnode, not my account on the cluster.

How do I clear this value?

Thanks,
Brian

Comment 1 Ben Roberts 2021-12-10 12:21:46 MST

Hi Brian,

The user records stored in the database are keyed by user name rather than UID, but for authentication purposes Slurm uses the UID it gets from the system.  The accounting documentation covers the need for consistent UIDs across the cluster:
https://slurm.schedmd.com/accounting.html#infrastructure

Is it just your user that has a different UID on the headnode than on the compute nodes, or are other users in a similar situation?
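If it helps, here is a minimal sketch of one way to compare the UID that Slurm shows for a user with the UID the local system reports.  The function names are made up for illustration, and the sketch assumes the record line has the same `UserName=name(uid)` form as the scontrol output you pasted.

```shell
# Extract the numeric UID from a record line such as
#   UserName=bhammond(1007) DefAccount=general DefWckey= AdminLevel=None
slurm_uid_from_line() {
    printf '%s\n' "$1" | sed -n 's/.*UserName=[^(]*(\([0-9][0-9]*\)).*/\1/p'
}

# Compare that UID with the one the local system resolves for the user.
# Arguments: user name, record line containing UserName=name(uid).
check_uid() {
    user=$1
    line=$2
    slurm_uid=$(slurm_uid_from_line "$line")
    system_uid=$(id -u "$user" 2>/dev/null)
    if [ -n "$slurm_uid" ] && [ "$slurm_uid" = "$system_uid" ]; then
        echo "OK: $user is UID $slurm_uid in both places"
    else
        echo "MISMATCH: Slurm shows '$slurm_uid', system reports '$system_uid'"
        return 1
    fi
}
```

You could run `check_uid bhammond "<the UserName= line from scontrol>"` on the headnode and again on a compute node; a mismatch on either side would point to the inconsistent UID.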

I would recommend looking at making the UIDs match for your users, but I'm not sure that the mismatched UID is causing this error.  It could be that you're requesting an account or partition that you don't have access to.  Can I have you send the output of the following command:
sacctmgr show assoc tree user=bhammond

I would also like to see the sbatch arguments (or the #SBATCH directives in the script) to see which account and partition are being requested.

Thanks,
Ben

Comment 2 Brian Hammond 2021-12-10 12:38:30 MST

The mismatch arises because we use LDAP for all the user accounts but don't allow LDAP on the headnode, so my account on the headnode has a different UID.

Here is the result:
sacctmgr show assoc tree user=bhammond

   Cluster              Account       User  Partition     Share   Priority GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin 
---------- -------------------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- ------------- 
vhead-702+ general                bhammond                    1                                                                           10                                                                                normal                         


---
Or from scontrol:
ClusterName=vhead-702_cluster Account=general UserName=bhammond(1007) Partition= Priority=0 ID=4
    SharesRaw/Norm/Level/Factor=1/0.50/2/0.25
    UsageRaw/Norm/Efctv=3855.74/0.00/1.00
    ParentAccount= Lft=33 DefAssoc=Yes
    GrpJobs=N(0) GrpJobsAccrue=N(0)
    GrpSubmitJobs=N(0) GrpWall=N(27.32)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0)
    GrpTRESMins=cpu=N(64),mem=N(131609),energy=N(0),node=N(27),billing=N(64),fs/disk=N(0),vmem=N(0),pages=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0)
    MaxJobs=10(0) MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=
    MinPrioThresh=

---
Here is my script:
#!/bin/bash
#SBATCH -p test
#SBATCH --job-name=test
#SBATCH -o test.out
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=2GB

module load intel/tools
/bin/date
/bin/hostname
sleep 120
cd ~/QMC/qmagic/bin
/bin/pwd
mpirun -np 2 ./c3d.x

echo "job " $SLURM_JOB_NAME
echo "ID " $SLURM_JOBID
echo "  "

Comment 3 Ben Roberts 2021-12-10 14:21:22 MST

Hi Brian,

That looks ok so far; I don't see a partition defined for the user association you have.  I don't see which account this job is requesting, though.  It should default to the 'general' account since that's the only one I see for you, but to verify that, can you add this line to your job script?
#SBATCH -A general

I would also like to see some debug information from sbatch to see if it shows anything else about why it is rejecting the job.  Can I have you call sbatch one more time with verbose output and send the results?
sbatch -vvvv <job_script>


I'm also curious if it would be possible to try updating your UID on this system to match the UID you have in LDAP.  Can you use usermod on the controller to try this for your user?
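As a rough sketch of what that might involve, the function below only prints the commands so you can review them before running anything as root.  The function name is made up for illustration, the target UID is a placeholder since I don't know your LDAP UID, and the systemd unit name is an assumption about how slurmctld is managed on your controller.

```shell
# Print (do not run) the steps for moving a local account to a new UID.
# Arguments: user name, current local UID, target UID.
plan_uid_change() {
    user=$1
    old_uid=$2
    new_uid=$3
    # usermod -u updates the account and the files under its home directory.
    echo "usermod -u $new_uid $user"
    # Files elsewhere still owned by the old UID need a manual chown
    # (-xdev limits the search to the root filesystem).
    echo "find / -xdev -user $old_uid -exec chown -h $new_uid {} +"
    # Assumption: slurmctld runs under systemd on the controller.
    echo "systemctl restart slurmctld"
}

# 1007 is the UID shown in this ticket; LDAP_UID stands in for the real
# LDAP value, which is not stated here.
plan_uid_change bhammond 1007 "${LDAP_UID:-LDAP_UID}"
```

Setting `LDAP_UID` in the environment before running it fills in the placeholder in the printed commands.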

Thanks,
Ben

Comment 4 Brian Hammond 2021-12-10 15:47:31 MST

Hi Ben,

So changing my UID on the headnode and then restarting the daemon worked.

Thanks!
Brian

Comment 5 Ben Roberts 2021-12-10 16:08:14 MST

Ok, I'm glad that was it.  Feel free to let us know if anything else comes up.

Thanks,
Ben