Ticket 9793 - [2020-09-10T00:12:35.187] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
Status: RESOLVED DUPLICATE of ticket 8849
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 19.05.1
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
Reported: 2020-09-09 18:46 MDT by Jimmy Hui
Modified: 2021-06-01 11:49 MDT
Site: Roche/PHCIX


Description Jimmy Hui 2020-09-09 18:46:02 MDT
Hi,

We have a user who can no longer submit jobs without the "--account" option. The commands we ran and their output are below.


sacctmgr list assoc -pn tree |grep userx
slurm-master-usw2-hpc-prd|   default|userx||1|||||||10|node=8|||||bronze|||
slurm-master-usw2-hpc-prd|   premium|userx||1|||||||||||||silver|||
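
The tree view pads account names with leading spaces, so a plain grep returns whole rows. As a minimal sketch, assuming the default column order shown above (Cluster|Account|User|...), a small helper can pull out just the accounts a user is associated with. The function name `accounts_for_user` is hypothetical and not part of Slurm.

```shell
# Read `sacctmgr list assoc -pn tree` output on stdin and print the
# accounts associated with the given user, one per line.
# Assumption: column 2 is the account and column 3 is the user,
# as in the output shown in this ticket.
accounts_for_user() {
    awk -F'|' -v user="$1" '$3 == user { sub(/^ +/, "", $2); print $2 }'
}

# Example usage (requires a working sacctmgr):
#   sacctmgr list assoc -pn tree | accounts_for_user userx
```

For the two rows above this would list `default` and `premium`, which confirms that the associations themselves exist in the database.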


This runs fine for the user. 

srun --account=default -p C-16Cpu-30GB uname
Linux

slurmctld.log
[2020-09-10T00:43:39.930] _job_complete: JobId=3174 done
[2020-09-10T00:43:43.117] sched: _slurm_rpc_allocate_resources JobId=3175 NodeList=pphpc-usw2-0004 usec=655
[2020-09-10T00:43:56.825] prolog_running_decr: Configuration for JobId=3175 is complete





This does not work.
srun -N 1 -p C-16Cpu-30GB hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

slurmctld.log
[2020-09-10T00:32:43.550] error: User 756957 not found
[2020-09-10T00:32:43.551] _job_create: invalid account or partition for user 756957, account '(null)', and partition 'C-16Cpu-30GB'
[2020-09-10T00:32:43.551] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified


getent passwd 756957
userx:*:756957:20:userx:/home/userx:/bin/bash
Comment 1 Felip Moll 2020-09-10 04:46:36 MDT
(In reply to Jimmy Hui from comment #0)

Can you try again after running these commands?

sacctmgr show user userx
sacctmgr update user userx set DefaultAccount=default

If that does not work, can you try restarting slurmctld?

This looks like an issue synchronizing with the database.

Are you in a multicluster environment?

Your issue seems to be a duplicate of bug 8849.

If any of this works, I'd need your slurmctld log at debug2, if possible, capturing a failed srun.
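
The steps above can be sketched as a small script. This is only an illustration under stated assumptions: the function names and the `DRY_RUN` switch are hypothetical, the two `sacctmgr` commands are the ones quoted above, and restarting through `systemctl` assumes slurmctld runs as a systemd unit named `slurmctld`, which may not match your site.

```shell
# Print commands instead of running them when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: inspect the user, then set the default account again.
resync_default_account() {
    run sacctmgr show user "$1"
    run sacctmgr update user "$1" set DefaultAccount="$2"
}

# Step 2, only if step 1 did not help.
# Assumption: slurmctld is managed by systemd under this unit name.
restart_controller() {
    run systemctl restart slurmctld
}

# Raise slurmctld logging to debug2 at runtime before reproducing a failed srun.
# The level from slurm.conf (SlurmctldDebug) applies again after a restart.
raise_log_level() {
    run scontrol setdebug debug2
}
```

Running `DRY_RUN=1 resync_default_account userx default` prints the two sacctmgr commands without touching the cluster, which makes it easy to review them first.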
Comment 2 Jimmy Hui 2020-09-10 18:09:08 MDT
Hi,

It looks like restarting slurmctld did the trick. I'm not sure why the database was out of sync. Is there a way to detect this kind of error?
Comment 3 Felip Moll 2020-09-11 01:57:15 MDT
(In reply to Jimmy Hui from comment #2)

Not at the moment. That's definitely a dup of bug 8849.

I would be interested in your slurmctld and slurmdbd logs and your slurm.conf. I will take a little more time to try to find out why it happened, which may also help with 8849.
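
Since there is no built-in check, a site could in the meantime watch for the symptom seen in this ticket: slurmctld logging "User <uid> not found" for a UID that the system can still resolve. The following is a rough sketch of that idea, not a Slurm feature. The function names are hypothetical, and the log path in the usage comment is an assumption that depends on SlurmctldLogFile at your site.

```shell
# Print each distinct UID that appears in "error: User <uid> not found"
# lines of the given slurmctld log file.
not_found_uids() {
    sed -n 's/.*error: User \([0-9][0-9]*\) not found.*/\1/p' "$1" | sort -u
}

# Read UIDs on stdin and print "<uid> <name>" for those that the passwd
# database resolves, i.e. users the system knows but slurmctld rejected.
resolvable_uids() {
    while read -r uid; do
        if entry=$(getent passwd "$uid"); then
            printf '%s %s\n' "$uid" "${entry%%:*}"
        fi
    done
}

# Example usage (log path is an assumption):
#   not_found_uids /var/log/slurm/slurmctld.log | resolvable_uids
```

Any line this prints is only a hint that slurmctld and the accounting database may disagree about that user. It does not explain why they went out of sync.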
Comment 4 Felip Moll 2020-10-12 04:42:31 MDT
Hi,

I am closing this issue since no further feedback has been received. It seems to be a dup of bug 8849, so I'll assume the work has to be done there.

Thanks.

*** This ticket has been marked as a duplicate of ticket 8849 ***