Hi,

We have a user who can no longer submit jobs without the "--account" option. Below are the commands that were run and their output.

sacctmgr list assoc -pn tree |grep userx
slurm-master-usw2-hpc-prd| default|userx||1|||||||10|node=8|||||bronze|||
slurm-master-usw2-hpc-prd| premium|userx||1|||||||||||||silver|||

This runs fine for the user.

srun --account=default -p C-16Cpu-30GB uname
Linux

slurmctld.log
[2020-09-10T00:43:39.930] _job_complete: JobId=3174 done
[2020-09-10T00:43:43.117] sched: _slurm_rpc_allocate_resources JobId=3175 NodeList=pphpc-usw2-0004 usec=655
[2020-09-10T00:43:56.825] prolog_running_decr: Configuration for JobId=3175 is complete

This does not work.

srun -N 1 -p C-16Cpu-30GB hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

slurmctld.log
[2020-09-10T00:32:43.550] error: User 756957 not found
[2020-09-10T00:32:43.551] _job_create: invalid account or partition for user 756957, account '(null)', and partition 'C-16Cpu-30GB'
[2020-09-10T00:32:43.551] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified

getent passwd 756957
userx:*:756957:20:userx:/home/userx:/bin/bash
(In reply to Jimmy Hui from comment #0)
> Hi,
>
> We have a user that no longer can submit jobs without using "--account"
> option. Below is the information on the commands ran.
>
>
> sacctmgr list assoc -pn tree |grep userx
> slurm-master-usw2-hpc-prd| default|userx||1|||||||10|node=8|||||bronze|||
> slurm-master-usw2-hpc-prd| premium|userx||1|||||||||||||silver|||
>
>
> This runs fine for the user.
>
> srun --account=default -p C-16Cpu-30GB uname
> Linux
>
> slurmctld.log
> [2020-09-10T00:43:39.930] _job_complete: JobId=3174 done
> [2020-09-10T00:43:43.117] sched: _slurm_rpc_allocate_resources JobId=3175
> NodeList=pphpc-usw2-0004 usec=655
> [2020-09-10T00:43:56.825] prolog_running_decr: Configuration for JobId=3175
> is complete
>
>
> This does not work.
> srun -N 1 -p C-16Cpu-30GB hostname
> srun: error: Unable to allocate resources: Invalid account or
> account/partition combination specified
>
> slurmctld.log
> [2020-09-10T00:32:43.550] error: User 756957 not found
> [2020-09-10T00:32:43.551] _job_create: invalid account or partition for user
> 756957, account '(null)', and partition 'C-16Cpu-30GB'
> [2020-09-10T00:32:43.551] _slurm_rpc_allocate_resources: Invalid account or
> account/partition combination specified
>
>
> getent passwd 756957
> userx:*:756957:20:userx:/home/userx:/bin/bash

Can you try again after running this?

sacctmgr show user userx
sacctmgr update user userx set DefaultAccount=default

If that does not work, can you try restarting slurmctld? This looks like an issue synchronizing with the db. Are you in a multicluster environment?

Your issue seems to be a duplicate of bug 8849.

If any of this works, I'd need your slurmctld log at debug2 if possible, capturing a failed srun.
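For reference, the steps suggested above could be scripted roughly as follows. This is only a sketch under a few assumptions: the helper names (run, set_default_account, restart_controller) are made up for illustration, slurmctld is assumed to be managed by systemd under the unit name "slurmctld", and raising the log level with "scontrol setdebug debug2" is one way to get the debug2 logging requested. By default the script only prints the commands (DRY_RUN=1) so it can be reviewed before anything is changed.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Slurm.
# DRY_RUN=1 (the default here) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show the user's current record, then set the default account.
# -i makes sacctmgr commit without the interactive confirmation prompt.
set_default_account() {
    user="$1"
    account="$2"
    run sacctmgr show user "$user" format=User,DefaultAccount
    run sacctmgr -i update user "$user" set DefaultAccount="$account"
}

# Raise the controller log level so that a failed srun is captured at debug2.
enable_debug2() {
    run scontrol setdebug debug2
}

# Fallback if the update alone does not help.
# Assumption: slurmctld runs as a systemd unit named "slurmctld".
restart_controller() {
    run systemctl restart slurmctld
}
```

Usage would be along the lines of "enable_debug2; set_default_account userx default", then retrying the failing srun, and calling restart_controller only if the job is still rejected. Set DRY_RUN=0 to actually execute the commands.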
Hi,

It looks like a restart of slurmctld did the trick. Not sure why the database was out of sync. Is there a way to detect this kind of error?
(In reply to Jimmy Hui from comment #2)
> Hi,
>
> It looks like a restart of slurmctld did the trick. Not sure why the
> database was out of sync. Is there a way to detect this kind of errors?

Not at the moment.

That's definitely a dup of bug 8849. I would be interested in your slurmctld and slurmdbd logs, and your slurm.conf.

I will take a little more time trying to find out why it happened, which may also help with bug 8849.
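Although there is no built-in check, the symptom seen in this ticket can be looked for by hand: slurmctld logged "error: User 756957 not found" while "getent passwd 756957" resolved the UID without trouble. The sketch below scans a slurmctld log for that message and reports UIDs that the OS can resolve. It is a heuristic, not a Slurm feature: the function names are hypothetical, the log path depends on SlurmctldLogFile in your slurm.conf, and a hit only indicates that the controller's view of users may be stale.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of Slurm.

# Read slurmctld log lines on stdin and print each UID that was
# reported as "not found", one per line, without duplicates.
missing_uids() {
    sed -n 's/.*error: User \([0-9][0-9]*\) not found.*/\1/p' | sort -u
}

# Report UIDs that slurmctld could not find but that resolve via getent,
# which matches the out-of-sync symptom described in this ticket.
check_log() {
    log="$1"
    missing_uids < "$log" | while read -r uid; do
        if entry=$(getent passwd "$uid"); then
            echo "uid $uid (${entry%%:*}) resolves locally but slurmctld reported it as not found"
        fi
    done
}
```

For example, "check_log /var/log/slurm/slurmctld.log" (path is an assumption) could be run from cron, with any output treated as a hint to verify the user's associations or restart slurmctld.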
Hi,

I am closing this issue since no more feedback has been received. It seems to be a dup of bug 8849, so I'll assume the work has to be done there.

Thanks.

*** This ticket has been marked as a duplicate of ticket 8849 ***