Summary: | scontrol show job <job_id> shows error: invalid job id specified, while sacct is able to query | ||
---|---|---|---|
Product: | Slurm | Reporter: | William Durairaj <william.durairaj.s> |
Component: | Accounting | Assignee: | Jacob Jenson <jacob> |
Status: | OPEN --- | QA Contact: | |
Severity: | 6 - No support contract | ||
Priority: | --- | CC: | william.durairaj.s |
Version: | 21.08.8 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | -Other- | Alineos Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Attachments: | Logs from controller, slurmdbd and slurmd |
This problem occurs randomly on job submission. Slurm version: 21.08.8 |
Created attachment 28769 [details]
Logs from controller, slurmdbd and slurmd

    headnodevm:/var/log/slurm # scontrol show job 3050
    slurm_load_jobs error: Invalid job id specified
    dkrdc01-headnodevm:/var/log/slurm # sacct -j 3050
    JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    3050         Mechanical      rack1                   128 CANCELLED+      0:0
    3050.batch        batch                               64  CANCELLED     0:15
    3050.extern      extern                              128  COMPLETED      0:0
    headnodevm:/var/log/slurm #
    headnodevm:/var/log/slurm # slurmctld -V
    slurm 21.08.8

/var/log/slurm/slurmd.log on the node where the job ran (dkrdc01-computeserver001) has only this debug2 message:

    _insert_job_state: we already have a job state for job 3050. No big deal, just an FYI.
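For anyone trying to reproduce the comparison above, here is a minimal sketch of a wrapper that asks the controller first and falls back to accounting when `scontrol` does not know the job. The function name `show_job_anywhere` is hypothetical and not part of Slurm. It relies only on `scontrol show job` returning a non-zero exit status for an unknown job ID, and on `sacct -j` with standard `--format` fields. It does not diagnose why the controller has no record of the job.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): look a job up in slurmctld first,
# then fall back to the accounting records if the controller rejects the ID.
show_job_anywhere() {
    job_id=$1
    # scontrol reports only jobs that slurmctld currently has a record for.
    if scontrol show job "$job_id" 2>/dev/null; then
        echo "source: slurmctld"
    else
        # sacct reads from accounting storage, so it can still list the job.
        sacct -j "$job_id" --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode
        echo "source: accounting"
    fi
}
```

Calling `show_job_anywhere 3050` on the head node would print the `scontrol` record if the controller has one, and otherwise the `sacct` rows, followed by a line naming the source that answered.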