| Summary: | [2020-09-10T00:12:35.187] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jimmy Hui <jhui> |
| Component: | Accounting | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | felip.moll, sts |
| Version: | 19.05.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Roche/PHCIX | Alineos Sites: | --- |
Description
Jimmy Hui
2020-09-09 18:46:02 MDT

(In reply to Jimmy Hui from comment #0)
> Hi,
>
> We have a user that no longer can submit jobs without using "--account"
> option. Below is the information on the commands ran.
>
>
> sacctmgr list assoc -pn tree |grep userx
> slurm-master-usw2-hpc-prd| default|userx||1|||||||10|node=8|||||bronze|||
> slurm-master-usw2-hpc-prd| premium|userx||1|||||||||||||silver|||
>
>
> This runs fine for the user.
>
> srun --account=default -p C-16Cpu-30GB uname
> Linux
>
> slurmctld.log
> [2020-09-10T00:43:39.930] _job_complete: JobId=3174 done
> [2020-09-10T00:43:43.117] sched: _slurm_rpc_allocate_resources JobId=3175
> NodeList=pphpc-usw2-0004 usec=655
> [2020-09-10T00:43:56.825] prolog_running_decr: Configuration for JobId=3175
> is complete
>
>
> This does not work.
> srun -N 1 -p C-16Cpu-30GB hostname
> srun: error: Unable to allocate resources: Invalid account or
> account/partition combination specified
>
> slurmctld.log
> [2020-09-10T00:32:43.550] error: User 756957 not found
> [2020-09-10T00:32:43.551] _job_create: invalid account or partition for user
> 756957, account '(null)', and partition 'C-16Cpu-30GB'
> [2020-09-10T00:32:43.551] _slurm_rpc_allocate_resources: Invalid account or
> account/partition combination specified
>
>
> getent passwd 756957
> userx:*:756957:20:userx:/home/userx:/bin/bash

Can you try again after running this?

sacctmgr show user userx
sacctmgr update user userx set DefaultAccount=default

If that does not work, can you try restarting slurmctld?
This seems to be an issue synchronizing with the database. Are you in a multicluster environment?

Your issue seems to be a duplicate of bug 8849.

If any of this works, I'd need your slurmctld log at debug2 if possible, catching an event of a failed srun.

Hi,

It looks like a restart of slurmctld did the trick. Not sure why the database was out of sync. Is there a way to detect this kind of errors?

(In reply to Jimmy Hui from comment #2)
> Hi,
>
> It looks like a restart of slurmctld did the trick. Not sure why the
> database was out of sync. Is there a way to detect this kind of errors?

Not at the moment. That's definitely a dup of bug 8849.

I would be interested in your slurmctld and slurmdbd logs, and your slurm.conf. I will take a little more time trying to find out why it happened, which may also help with bug 8849.

Hi,

I am closing this issue since no more feedback has been received. It seems to be a dup of bug 8849, so I'll assume the work has to be done there.

Thanks.

*** This ticket has been marked as a duplicate of ticket 8849 ***
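On the question of detecting this kind of error: Slurm has no built-in check according to the reply above, but the symptom in this ticket can be spotted from the outside. The following is a minimal sketch, not an official tool. It assumes the `error: User <uid> not found` line keeps the exact format quoted in this ticket, and it treats a UID that the operating system can resolve (roughly what `getent passwd` checks) as a hint that slurmctld's view of users is stale. The function names are hypothetical, and the script does not confirm that the user actually has an association in the database; that still needs a `sacctmgr` check.

```python
import pwd
import re

# Matches the slurmctld.log line quoted in this ticket, for example:
# "[2020-09-10T00:32:43.550] error: User 756957 not found"
# Assumption: the log format is exactly as shown in the ticket.
USER_NOT_FOUND = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: User (?P<uid>\d+) not found"
)


def find_unknown_uids(log_lines):
    """Return {uid: first_timestamp} for every 'User N not found' error."""
    hits = {}
    for line in log_lines:
        match = USER_NOT_FOUND.match(line.strip())
        if match:
            # Keep only the first occurrence per UID.
            hits.setdefault(int(match.group("uid")), match.group("ts"))
    return hits


def uid_resolves(uid):
    """True if the OS password database knows this UID."""
    try:
        pwd.getpwuid(uid)
        return True
    except KeyError:
        return False


def suspicious_uids(log_lines, resolver=uid_resolves):
    """UIDs slurmctld reported as unknown although the OS resolves them.

    These are candidates for the out-of-sync condition in this ticket,
    not proof of it.
    """
    return {
        uid: ts
        for uid, ts in find_unknown_uids(log_lines).items()
        if resolver(uid)
    }
```

To use it, pass the lines of the controller log, for instance `suspicious_uids(open(path))`, where `path` is whatever `SlurmctldLogFile` points to on the site. Any UID it reports is worth checking with `sacctmgr show user` before deciding whether a slurmctld restart is needed.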