| Summary: | sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Brian Hammond <bhammond> |
| Component: | Accounting | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Albert Einstein | Alineos Sites: | --- |
Description
Brian Hammond
2021-12-10 07:34:50 MST
Hi Brian,

The user records stored in the database use the user name rather than the UID, but for authentication purposes we do use the UID that it gets from the system. We talk about the need for consistent UIDs on the cluster for authentication in the accounting documentation.
https://slurm.schedmd.com/accounting.html#infrastructure

Is it just your user that has a different UID on the headnode than you do on the compute nodes or do other users have a similar situation? I would recommend looking at making the UIDs match for your users, but I'm not sure that the mismatched UID is causing this error. It could be that you're requesting an account or partition that you don't have access to. Can I have you send the output of the following command:
sacctmgr show assoc tree user=bhammond

I would also like to see the sbatch arguments (or the #SBATCH directives in the script) to see which account and partition are being requested.

Thanks,
Ben

The mismatch comes from the issue that we use LDAP for all the user accounts, but we don't allow LDAP on the headnode. So my account on the headnode has a different uid.
Here is the result:
sacctmgr show assoc tree user=bhammond
Cluster Account User Partition Share Priority GrpJobs GrpTRES GrpSubmit GrpWall GrpTRESMins MaxJobs MaxTRES MaxTRESPerNode MaxSubmit MaxWall MaxTRESMins QOS Def QOS GrpTRESRunMin
---------- -------------------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
vhead-702+ general bhammond 1 10 normal
---
Or from scontrol:
ClusterName=vhead-702_cluster Account=general UserName=bhammond(1007) Partition= Priority=0 ID=4
SharesRaw/Norm/Level/Factor=1/0.50/2/0.25
UsageRaw/Norm/Efctv=3855.74/0.00/1.00
ParentAccount= Lft=33 DefAssoc=Yes
GrpJobs=N(0) GrpJobsAccrue=N(0)
GrpSubmitJobs=N(0) GrpWall=N(27.32)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0)
GrpTRESMins=cpu=N(64),mem=N(131609),energy=N(0),node=N(27),billing=N(64),fs/disk=N(0),vmem=N(0),pages=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0)
MaxJobs=10(0) MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESMinsPJ=
MinPrioThresh=
---
Here is my script:
#!/bin/bash
#SBATCH -p test
#SBATCH --job-name=test
#SBATCH -o test.out
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=2GB
module load intel/tools
/bin/date
/bin/hostname
sleep 120
cd ~/QMC/qmagic/bin
/bin/pwd
mpirun -np 2 ./c3d.x
echo "job " $SLURM_JOB_NAME
echo "ID " $SLURM_JOBID
echo " "
Hi Brian,

That looks OK so far; I don't see a partition defined for the user association you have. I don't see which account this job is requesting, though. It should be defaulting to the 'general' account since that's the only one I see for you, but to verify that, can you add a line to your job script?
#SBATCH -A general

I would also like to see some debug information from sbatch to see if it shows anything else about why it is rejecting the job. Can I have you call sbatch one more time with verbose output and send the results?
sbatch -vvvv <job_script>

I'm also curious whether it would be possible to update your UID on this system to match the UID you have in LDAP. Can you use usermod on the controller to try this for your user?

Thanks,
Ben

Hi Ben,

So changing my UID on the headnode then restarting the daemon worked.

Thanks!
Brian

Ok, I'm glad that was it. Feel free to let us know if anything else comes up.

Thanks,
Ben
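
As a rough illustration of the UID check this ticket ended up turning on, the sketch below pulls the numeric UID out of a `UserName=name(uid)` field like the one in the scontrol output quoted earlier and compares it with a UID obtained elsewhere. The function names `assoc_uid` and `compare_uids` are invented for this example and are not part of Slurm, and the value 2041 is a placeholder, since the LDAP UID is never stated in the ticket.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not Slurm commands.

# Extract the numeric UID from a "UserName=name(uid)" field, as it appears
# in the scontrol association output quoted in this ticket.
assoc_uid() {
    printf '%s\n' "$1" | sed -n 's/.*UserName=[^(]*(\([0-9][0-9]*\)).*/\1/p'
}

# Compare two UIDs. Prints "match" or "mismatch: <first> vs <second>".
compare_uids() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: $1 vs $2"
    fi
}

# Line copied from the ticket; 2041 is a made-up stand-in for the LDAP UID.
line='ClusterName=vhead-702_cluster Account=general UserName=bhammond(1007) Partition= Priority=0 ID=4'
compare_uids "$(assoc_uid "$line")" 2041   # prints "mismatch: 1007 vs 2041"
```

On a real cluster the second argument might come from something like `id -u bhammond` run on a compute node. A mismatch would point toward the kind of inconsistent-UID situation that was resolved here by changing the UID on the headnode and restarting the daemon.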