Hello,

I am getting the following message for user bhammond (me):

    sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

I think this is due to having the wrong UID associated with the user. scontrol returns this:

    User Records
    UserName=bhammond(1007) DefAccount=general DefWckey= AdminLevel=None

UID 1007 was my account on the headnode, not the cluster. How do I clear this value?

Thanks,
Brian
Hi Brian,

The user records stored in the database use the user name rather than the UID, but for authentication purposes Slurm does use the UID that it gets from the system. We talk about the need for consistent UIDs across the cluster for authentication in the accounting documentation:

https://slurm.schedmd.com/accounting.html#infrastructure

Is it just your user that has a different UID on the headnode than on the compute nodes, or are other users in a similar situation? I would recommend making the UIDs match for your users, but I'm not sure that the mismatched UID is causing this error. It could be that you're requesting an account or partition that you don't have access to. Can you send the output of the following command?

    sacctmgr show assoc tree user=bhammond

I would also like to see the sbatch arguments (or the #SBATCH directives in the script) to see which account and partition are being requested.

Thanks,
Ben
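For readers who want to check the UID consistency Ben describes, here is a minimal sketch. `uid_status` is a hypothetical helper (it is not part of Slurm), and the compute node name `node01` in the usage comment is an assumption; substitute a real node from your cluster.

```shell
#!/bin/bash
# uid_status: hypothetical helper that compares two numeric UIDs for the same
# user name and reports whether they agree.
uid_status() {
    local local_uid="$1" remote_uid="$2"
    if [ "$local_uid" = "$remote_uid" ]; then
        echo "match"
    else
        echo "mismatch: local=$local_uid remote=$remote_uid"
    fi
}

# Example usage (assumes ssh access to a compute node called node01):
#   uid_status "$(id -u bhammond)" "$(ssh node01 id -u bhammond)"
```

Run from the headnode, a "mismatch" result would indicate the situation described in this thread, where the local account and the LDAP account share a name but not a UID.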
The mismatch arises because we use LDAP for all the user accounts, but we don't allow LDAP on the headnode, so my account on the headnode has a different UID.

Here is the result:

    sacctmgr show assoc tree user=bhammond
    Cluster Account User Partition Share Priority GrpJobs GrpTRES GrpSubmit GrpWall GrpTRESMins MaxJobs MaxTRES MaxTRESPerNode MaxSubmit MaxWall MaxTRESMins QOS Def QOS GrpTRESRunMin
    ---------- -------------------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
    vhead-702+ general bhammond 1 10 normal

Or from scontrol:

    ClusterName=vhead-702_cluster Account=general UserName=bhammond(1007) Partition= Priority=0 ID=4
        SharesRaw/Norm/Level/Factor=1/0.50/2/0.25
        UsageRaw/Norm/Efctv=3855.74/0.00/1.00
        ParentAccount= Lft=33 DefAssoc=Yes
        GrpJobs=N(0) GrpJobsAccrue=N(0)
        GrpSubmitJobs=N(0) GrpWall=N(27.32)
        GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0)
        GrpTRESMins=cpu=N(64),mem=N(131609),energy=N(0),node=N(27),billing=N(64),fs/disk=N(0),vmem=N(0),pages=N(0)
        GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0)
        MaxJobs=10(0) MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ=
        MaxTRESPJ=
        MaxTRESPN=
        MaxTRESMinsPJ=
        MinPrioThresh=

Here is my script:

    #!/bin/bash
    #SBATCH -p test
    #SBATCH --job-name=test
    #SBATCH -o test.out
    #SBATCH --cpus-per-task=3
    #SBATCH --mem-per-cpu=2GB
    module load intel/tools
    /bin/date
    /bin/hostname
    sleep 120
    cd ~/QMC/qmagic/bin
    /bin/pwd
    mpirun -np 2 ./c3d.x
    echo "job " $SLURM_JOB_NAME
    echo "ID " $SLURM_JOBID
    echo " "
Hi Brian,

That looks ok so far. I don't see a partition defined for the user association you have, so the association is not restricted to a particular partition. I don't see which account this job is requesting, though. It should be defaulting to the 'general' account since that's the only one I see for you, but to verify that, can you add this line to your job script?

    #SBATCH -A general

I would also like to see some debug information from sbatch to see if it shows anything else about why it is rejecting the job. Can you call sbatch one more time with verbose output and send the results?

    sbatch -vvvv <job_script>

I'm also curious whether it would be possible to update your UID on this system to match the UID you have in LDAP. Can you use usermod on the controller to try this for your user?

Thanks,
Ben
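As a rough illustration of the usermod suggestion, the sketch below only prints the commands one might run as root on the controller; it executes nothing. `print_uid_fix_plan` is a hypothetical helper, the new UID is a placeholder for whatever LDAP reports, and restarting `slurmctld` through systemd is an assumption, since the thread does not say which daemon was restarted or how. The `find`/`chown` step reflects that usermod only re-owns files under the user's home directory.

```shell
#!/bin/bash
# print_uid_fix_plan: hypothetical helper that prints (does not run) the
# commands for aligning a local UID with the cluster-wide one.
print_uid_fix_plan() {
    local user="$1" old_uid="$2" new_uid="$3"
    # Change the local account's UID to the cluster-wide value.
    echo "usermod -u $new_uid $user"
    # Re-own files outside the home directory that still carry the old UID.
    echo "find / -xdev -uid $old_uid -exec chown -h $new_uid {} +"
    # Assumption: the controller daemon is slurmctld and is managed by systemd.
    echo "systemctl restart slurmctld"
}

# Example usage (NEW_UID is a placeholder for the UID that LDAP reports):
#   print_uid_fix_plan bhammond 1007 "$NEW_UID"
```

Printing the plan first makes it easy to review each command before running anything as root.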
Hi Ben,

Changing my UID on the headnode and then restarting the daemon worked.

Thanks!
Brian
Ok, I'm glad that was it. Feel free to let us know if anything else comes up.

Thanks,
Ben