Created attachment 28769 [details]
Logs from controller, slurmdbd and slurmd

headnodevm:/var/log/slurm # scontrol show job 3050
slurm_load_jobs error: Invalid job id specified
dkrdc01-headnodevm:/var/log/slurm # sacct -j 3050
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3050         Mechanical      rack1                   128 CANCELLED+      0:0
3050.batch        batch                               64  CANCELLED     0:15
3050.extern      extern                              128  COMPLETED      0:0
headnodevm:/var/log/slurm #
headnodevm:/var/log/slurm # slurmctld -V
slurm 21.08.8

/var/log/slurm/slurmd.log on the node where the job ran (dkrdc01-computeserver001) has only this debug2 message:

_insert_job_state: we already have a job state for job 3050. No big deal, just an FYI.
This problem happens randomly on job submission. Slurm version: 21.08.8