Ticket 9793

Summary:	[2020-09-10T00:12:35.187] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
Product:	Slurm	Reporter:	Jimmy Hui <jhui>
Component:	Accounting	Assignee:	Felip Moll <felip.moll>
Status:	RESOLVED DUPLICATE	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	felip.moll, sts
Version:	19.05.1
Hardware:	Linux
OS:	Linux
Site:	Roche/PHCIX	Slinky Site:	---

Description Jimmy Hui 2020-09-09 18:46:02 MDT

Hi,

We have a user who can no longer submit jobs without using the "--account" option. Below are the commands that were run and their output.


sacctmgr list assoc -pn tree |grep userx
slurm-master-usw2-hpc-prd|   default|userx||1|||||||10|node=8|||||bronze|||
slurm-master-usw2-hpc-prd|   premium|userx||1|||||||||||||silver|||


This runs fine for the user. 

srun --account=default -p C-16Cpu-30GB uname
Linux

slurmctld.log
[2020-09-10T00:43:39.930] _job_complete: JobId=3174 done
[2020-09-10T00:43:43.117] sched: _slurm_rpc_allocate_resources JobId=3175 NodeList=pphpc-usw2-0004 usec=655
[2020-09-10T00:43:56.825] prolog_running_decr: Configuration for JobId=3175 is complete





This does not work.
srun -N 1 -p C-16Cpu-30GB hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

slurmctld.log
[2020-09-10T00:32:43.550] error: User 756957 not found
[2020-09-10T00:32:43.551] _job_create: invalid account or partition for user 756957, account '(null)', and partition 'C-16Cpu-30GB'
[2020-09-10T00:32:43.551] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified


getent passwd 756957
userx:*:756957:20:userx:/home/userx:/bin/bash

Comment 1 Felip Moll 2020-09-10 04:46:36 MDT

(In reply to Jimmy Hui from comment #0)

Can you try again after running these commands?

sacctmgr show user userx
sacctmgr update user userx set DefaultAccount=default

If that does not work, can you try restarting slurmctld?

This seems to be an issue synchronizing with the database.
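
One way to check for such a mismatch is to print the user record as the database reports it next to the record the controller holds in memory. The following is a minimal sketch, not a verified procedure: `show_user_views` is a hypothetical helper name, and it assumes that `scontrol show assoc_mgr` with `users=` and `flags=users` is available in the installed Slurm version.

```shell
# Hypothetical helper: print both views of a user so they can be compared by eye.
show_user_views() {
    user="$1"
    echo "== database view (sacctmgr) =="
    # -n: no header, -P: pipe-delimited output without a trailing delimiter
    sacctmgr -nP show user "$user" format=User,DefaultAccount
    echo "== controller in-memory view (scontrol) =="
    # Assumption: this scontrol subcommand and its options exist in your version.
    scontrol show assoc_mgr users="$user" flags=users
}
```

Usage would be `show_user_views userx`. If the database view lists a default account and the controller view does not show the user at all, that would be consistent with the "User 756957 not found" error above.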

Are you in a multicluster environment?

Your issue seems to be a duplicate of bug 8849.



If any of this works, I'd need your slurmctld log at debug2 if possible, capturing a failed srun.
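
A rough sketch of how such a capture could be scripted: raise the slurmctld log level with `scontrol setdebug`, run the failing command, then lower the level again. The helper name `with_debug2` and the `RESTORE_LEVEL` variable are hypothetical, and `info` is only an assumed normal level; use whatever level the site normally runs at.

```shell
# Hypothetical helper: run a command while slurmctld logs at debug2.
# Needs to be run by a user allowed to change the slurmctld log level.
with_debug2() {
    scontrol setdebug debug2 || return 1
    "$@"                      # the command expected to fail, e.g. the srun
    rc=$?
    # Assumption: "info" is the usual level; override with RESTORE_LEVEL.
    scontrol setdebug "${RESTORE_LEVEL:-info}"
    return "$rc"
}
```

For example, `with_debug2 sudo -u userx srun -N 1 -p C-16Cpu-30GB hostname` would run the failing submission as the affected user (assuming sudo is permitted), after which the relevant part of slurmctld.log can be attached.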

Comment 2 Jimmy Hui 2020-09-10 18:09:08 MDT

Hi,

It looks like a restart of slurmctld did the trick. Not sure why the database was out of sync. Is there a way to detect this kind of error?

Comment 3 Felip Moll 2020-09-11 01:57:15 MDT

(In reply to Jimmy Hui from comment #2)
> Hi,
> 
> It looks like a restart of slurmctld did the trick. Not sure why the
> database was out of sync. Is there a way to detect this kind of error?

Not at the moment. That's definitely a dup of bug 8849.

I would be interested in your slurmctld and slurmdbd logs, and your slurm.conf. I will take a little more time to try to find out why it happened, which may also help with bug 8849.
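
Although nothing built in detects this, the log signature seen in this ticket suggests a rough heuristic: look for "User N not found" errors in slurmctld.log where the UID does resolve through getent. The sketch below is only that, a heuristic under the assumption that the log format matches the lines quoted in the description; the function names are hypothetical.

```shell
# Print the unique UIDs that slurmctld reported as "not found" (log on stdin).
uids_not_found() {
    sed -n 's/.*error: User \([0-9][0-9]*\) not found.*/\1/p' | sort -u
}

# For each such UID, report it if the OS can still resolve it, which may
# indicate that the controller's view of users is stale.
report_stale_users() {
    logfile="$1"
    uids_not_found < "$logfile" | while read -r uid; do
        if getent passwd "$uid" >/dev/null; then
            echo "uid $uid resolves via getent but slurmctld reported it as not found"
        fi
    done
}
```

Running `report_stale_users /path/to/slurmctld.log` periodically (the path is site-specific) would flag cases like the one in this ticket. A hit does not prove a synchronization problem, it only marks a user worth checking with sacctmgr.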

Comment 4 Felip Moll 2020-10-12 04:42:31 MDT

Hi,

I am closing this issue since no more feedback has been received. It seems to be a dup of bug 8849, so I'll assume the work has to be done there.

Thanks.

*** This ticket has been marked as a duplicate of ticket 8849 ***