Ticket 9793

Summary: [2020-09-10T00:12:35.187] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
Product: Slurm
Reporter: Jimmy Hui <jhui>
Component: Accounting
Assignee: Felip Moll <felip.moll>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: felip.moll, sts
Version: 19.05.1
Hardware: Linux
OS: Linux
Site: Roche/PHCIX

Description Jimmy Hui 2020-09-09 18:46:02 MDT
Hi,

We have a user who can no longer submit jobs without using the "--account" option. Below is the output of the commands that were run.


sacctmgr list assoc -pn tree |grep userx
slurm-master-usw2-hpc-prd|   default|userx||1|||||||10|node=8|||||bronze|||
slurm-master-usw2-hpc-prd|   premium|userx||1|||||||||||||silver|||


This runs fine for the user. 

srun --account=default -p C-16Cpu-30GB uname
Linux

slurmctld.log
[2020-09-10T00:43:39.930] _job_complete: JobId=3174 done
[2020-09-10T00:43:43.117] sched: _slurm_rpc_allocate_resources JobId=3175 NodeList=pphpc-usw2-0004 usec=655
[2020-09-10T00:43:56.825] prolog_running_decr: Configuration for JobId=3175 is complete





This does not work.
srun -N 1 -p C-16Cpu-30GB hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

slurmctld.log
[2020-09-10T00:32:43.550] error: User 756957 not found
[2020-09-10T00:32:43.551] _job_create: invalid account or partition for user 756957, account '(null)', and partition 'C-16Cpu-30GB'
[2020-09-10T00:32:43.551] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified


getent passwd 756957
userx:*:756957:20:userx:/home/userx:/bin/bash
Comment 1 Felip Moll 2020-09-10 04:46:36 MDT
(In reply to Jimmy Hui from comment #0)

Can you try again after running these commands?

sacctmgr show user userx
sacctmgr update user userx set DefaultAccount=default

If that does not work, can you try restarting slurmctld?

This looks like an issue synchronizing with the database.

Are you in a multicluster environment?

Your issue seems a duplicate of bug 8849.



If any of this works, I'd need your slurmctld log at debug2, if possible, capturing a failed srun.
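
Before changing the default account, it can help to confirm which accounts the user is actually associated with. The following is a minimal sketch, not part of Slurm: `list_user_accounts` is a hypothetical helper name, and it assumes the default parsable column order (Cluster|Account|User|...) seen in the `sacctmgr list assoc -pn tree` output from comment #0.

```shell
#!/bin/sh
# Hypothetical helper: print the accounts associated with a user, reading
# `sacctmgr list assoc -pn tree` output on stdin.
# Assumption: default parsable column order, Cluster|Account|User|...
list_user_accounts() {
    awk -F'|' -v user="$1" '
        $3 == user {
            gsub(/^ +/, "", $2)   # strip the indentation added by "tree"
            print $2
        }'
}

# Example using the two association lines from comment #0:
printf '%s\n' \
    'slurm-master-usw2-hpc-prd|   default|userx||1|||||||10|node=8|||||bronze|||' \
    'slurm-master-usw2-hpc-prd|   premium|userx||1|||||||||||||silver|||' |
    list_user_accounts userx
```

On a live cluster the same function could be fed directly: `sacctmgr list assoc -pn tree | list_user_accounts userx`. If the account passed to `DefaultAccount=` is not in that list, the update is unlikely to help.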
Comment 2 Jimmy Hui 2020-09-10 18:09:08 MDT
Hi,

It looks like a restart of slurmctld did the trick. Not sure why the database was out of sync. Is there a way to detect this kind of error?
Comment 3 Felip Moll 2020-09-11 01:57:15 MDT
(In reply to Jimmy Hui from comment #2)
> Hi,
> 
> It looks like a restart of slurmctld did the trick. Not sure why the
> database was out of sync. Is there a way to detect this kind of errors?

Not at the moment. That's definitely a dup of bug 8849.

I would be interested in your slurmctld and slurmdbd logs, and your slurm.conf. I will take a little more time to try to find out why it happened, which may also help with bug 8849.
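
Since Slurm has no built-in check for this, one possible site-side stopgap is to scan slurmctld.log for the "User <uid> not found" error shown in comment #0 and flag UIDs that nevertheless resolve locally, as 756957 did with `getent`. This is only a rough sketch: both function names are hypothetical, and a hit merely hints that slurmctld's view of users may be stale; it does not prove it.

```shell
#!/bin/sh
# Hypothetical helper: extract the unique UIDs from
# "error: User <uid> not found" lines in slurmctld.log text read on stdin.
uids_not_found() {
    sed -n 's/.*error: User \([0-9][0-9]*\) not found.*/\1/p' | sort -u
}

# Hypothetical helper: report UIDs that slurmctld rejected even though the
# local passwd database (via getent) knows them.
report_suspect_uids() {
    uids_not_found | while read -r uid; do
        if getent passwd "$uid" >/dev/null 2>&1; then
            echo "UID $uid resolves locally but slurmctld reported it as not found"
        fi
    done
}

# Example using log lines from comment #0:
printf '%s\n' \
    '[2020-09-10T00:32:43.550] error: User 756957 not found' \
    '[2020-09-10T00:43:39.930] _job_complete: JobId=3174 done' |
    uids_not_found
```

In practice this would be run as `report_suspect_uids < slurmctld.log`, with the log path taken from the site's `SlurmctldLogFile` setting.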
Comment 4 Felip Moll 2020-10-12 04:42:31 MDT
Hi,

I am closing this issue since no further feedback has been received. It seems to be a dup of bug 8849, so I'll assume the work has to be done there.

Thanks.

*** This ticket has been marked as a duplicate of ticket 8849 ***