Ticket 13010

Summary: sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
Product: Slurm
Reporter: Brian Hammond <bhammond>
Component: Accounting
Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
Version: 20.11.8
Hardware: Linux
OS: Linux
Site: Albert Einstein

Description Brian Hammond 2021-12-10 07:34:50 MST
Hello

I am getting the message
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

for user bhammond (me). I think this is due to having the wrong UID associated with the user. scontrol returns this:
User Records

UserName=bhammond(1007) DefAccount=general DefWckey= AdminLevel=None

UID 1007 was my account on the headnode, not the cluster. 

How do I clear this value?

Thanks,
Brian
Comment 1 Ben Roberts 2021-12-10 12:21:46 MST
Hi Brian,

The user records stored in the database are keyed by user name rather than UID, but for authentication Slurm uses the UID it gets from the system.  The accounting documentation covers the need for consistent UIDs across the cluster for authentication:
https://slurm.schedmd.com/accounting.html#infrastructure

Is it just your user that has a different UID on the headnode than on the compute nodes, or are other users in the same situation?

I would recommend looking at making the UIDs match for your users, but I'm not sure that the mismatched UID is causing this error.  It could be that you're requesting an account or partition that you don't have access to.  Can I have you send the output of the following command:
sacctmgr show assoc tree user=bhammond

I would also like to see the sbatch arguments (or the #SBATCH directives in the script) to see which account and partition are being requested.

Thanks,
Ben
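
The UID comparison discussed above can be sketched in a few lines of bash. This is only an illustration, not something posted in the ticket: the helper names uid_from_record and compare_uids are hypothetical, and the scontrol invocation in the trailing comment is an assumption about how the "User Records" output was produced. The parsing relies on the UserName=name(uid) layout shown in the description.

```shell
#!/bin/bash
# Extract the numeric UID from a controller record line such as
#   UserName=bhammond(1007) DefAccount=general DefWckey= AdminLevel=None
# Prints nothing if no UserName=name(uid) field is found.
uid_from_record() {
    sed -n 's/.*UserName=[^(]*(\([0-9][0-9]*\)).*/\1/p' <<<"$1"
}

# Compare the UID held by the controller with a UID resolved elsewhere
# (for example on a compute node). Returns non-zero on a mismatch.
compare_uids() {
    local record_uid=$1 system_uid=$2
    if [ "$record_uid" = "$system_uid" ]; then
        echo "match"
    else
        echo "mismatch: controller=$record_uid system=$system_uid"
        return 1
    fi
}

# Assumed usage on a compute node (left commented so the helpers can be sourced):
# record=$(scontrol show assoc_mgr users=bhammond flags=users | grep 'UserName=')
# compare_uids "$(uid_from_record "$record")" "$(id -u bhammond)"
```

Running the same comparison for a few other users would answer whether the mismatch is limited to one account.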
Comment 2 Brian Hammond 2021-12-10 12:38:30 MST
The mismatch arises because we use LDAP for all the user accounts but don't allow LDAP on the headnode, so my account on the headnode has a different UID.

Here is the result:
sacctmgr show assoc tree user=bhammond

   Cluster              Account       User  Partition     Share   Priority GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin 
---------- -------------------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- ------------- 
vhead-702+ general                bhammond                    1                                                                           10                                                                                normal                         


---
Or from scontrol:
ClusterName=vhead-702_cluster Account=general UserName=bhammond(1007) Partition= Priority=0 ID=4
    SharesRaw/Norm/Level/Factor=1/0.50/2/0.25
    UsageRaw/Norm/Efctv=3855.74/0.00/1.00
    ParentAccount= Lft=33 DefAssoc=Yes
    GrpJobs=N(0) GrpJobsAccrue=N(0)
    GrpSubmitJobs=N(0) GrpWall=N(27.32)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0)
    GrpTRESMins=cpu=N(64),mem=N(131609),energy=N(0),node=N(27),billing=N(64),fs/disk=N(0),vmem=N(0),pages=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0)
    MaxJobs=10(0) MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=
    MinPrioThresh=

---
Here is my script:
#!/bin/bash
#SBATCH -p test
#SBATCH --job-name=test
#SBATCH -o test.out
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=2GB

module load intel/tools
/bin/date
/bin/hostname
sleep 120
cd ~/QMC/qmagic/bin
/bin/pwd
mpirun -np 2 ./c3d.x

echo "job " $SLURM_JOB_NAME
echo "ID " $SLURM_JOBID
echo "  "
Comment 3 Ben Roberts 2021-12-10 14:21:22 MST
Hi Brian,

That looks OK so far. I don't see a partition defined for the user association you have.  I don't see which account this job is requesting, though.  It should be defaulting to the 'general' account, since that's the only one I see for you, but to verify that, can you add this line to your job script?
#SBATCH -A general

I would also like to see some debug information from sbatch to see if it shows anything else about why it is rejecting the job.  Can I have you call sbatch one more time with verbose output and send the results?
sbatch -vvvv <job_script>


I'm also curious if it would be possible to try updating your UID on this system to match the UID you have in LDAP.  Can you use usermod on the controller to try this for your user?

Thanks,
Ben
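
To see which account and partition a script actually requests, the #SBATCH directives can be pulled out with a small helper. This is a rough sketch rather than anything sbatch provides: sbatch_opt is a hypothetical name, job.sh is a placeholder file name, and the helper scans the whole file, whereas sbatch stops reading directives at the first command line.

```shell
#!/bin/bash
# Print the value of one #SBATCH option, given its short and long names.
# Example: sbatch_opt job.sh A account
# Prints nothing if the script does not set the option.
sbatch_opt() {
    local script=$1 short=$2 long=$3
    sed -n \
        -e "s/^#SBATCH[[:space:]]\{1,\}-${short}[[:space:]]*\([^[:space:]]\{1,\}\).*/\1/p" \
        -e "s/^#SBATCH[[:space:]]\{1,\}--${long}[=[:space:]]\{1,\}\([^[:space:]]\{1,\}\).*/\1/p" \
        "$script" | head -n 1
}

# Assumed usage against the script from Comment 2, saved as job.sh:
# sbatch_opt job.sh p partition   # prints "test"
# sbatch_opt job.sh A account     # prints nothing, so the default account applies
# The requested values can then be compared with the associations, e.g.:
# sacctmgr -nP show assoc user=bhammond format=account,partition
```

An empty result for the account means the job falls back to the user's default account, which is 'general' in the output above.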
Comment 4 Brian Hammond 2021-12-10 15:47:31 MST
Hi Ben,

So changing my UID on the headnode then restarting the daemon worked. 

Thanks!
Brian
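
For reference, the fix described above roughly corresponds to the steps printed by the sketch below. Several details are assumptions, since the ticket does not state them: the LDAP UID (2001 here is a made-up value), which daemon was restarted (slurmctld is assumed), and that the host uses systemd. The function print_uid_fix_plan is hypothetical and only prints commands; it does not run them.

```shell
#!/bin/bash
# Print, without executing, the steps for aligning a local account's UID
# with the UID the same user has in LDAP.
# Arguments: user name, target UID, daemon to restart (default: slurmctld).
print_uid_fix_plan() {
    local user=$1 ldap_uid=$2 daemon=${3:-slurmctld}
    printf '%s\n' \
        "id -u $user" \
        "usermod -u $ldap_uid $user" \
        "systemctl restart $daemon" \
        "id -u $user"
}

# Hypothetical values; the real LDAP UID is not given in the ticket.
# print_uid_fix_plan bhammond 2001
```

Note that usermod only re-owns files inside the user's home directory; files owned by the old UID elsewhere on the headnode have to be fixed by hand.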
Comment 5 Ben Roberts 2021-12-10 16:08:14 MST
Ok, I'm glad that was it.  Feel free to let us know if anything else comes up.

Thanks,
Ben