Ticket 9444 - Sacct is not recording all jobs
Summary: Sacct is not recording all jobs
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2020-07-21 16:28 MDT by Kamal
Modified: 2020-07-23 09:12 MDT



Description Kamal 2020-07-21 16:28:36 MDT
We are currently seeing an issue with Slurm where some jobs appear not to be recorded in slurmdbd.

For a recent job that ran, nothing is reported:
$ sacct -j 84509296
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

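One thing worth ruling out when sacct comes back empty: by default sacct restricts its query to the invoking user and a recent time window, so a job from 2020-07-17 queried days later can be filtered out even when the record exists in slurmdbd. A minimal sketch that widens the query explicitly (these are standard sacct options; the exact defaults vary by Slurm version):

```shell
# Query the job with an explicit time window and across all users,
# so the default time/user filtering cannot hide the record.
sacct -j 84509296 --allusers \
      --starttime 2020-07-17T00:00:00 --endtime 2020-07-18T00:00:00 \
      --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode
```

If this still returns nothing, the record likely never reached the database.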
But looking in the slurmctld logs, I can see the job was allocated and ran:

$ cat /var/log/slurm-llnl/slurmctld.log | grep 84509296
[2020-07-17T14:32:18.270] sched: _slurm_rpc_allocate_resources JobId=84509296 NodeList=sigma25 usec=3443
[2020-07-17T14:32:27.885] _pick_step_nodes: Configuration for job 84509296 is complete
[2020-07-17T15:14:35.527] Time limit exhausted for JobId=84509296
[2020-07-17T15:49:03.297] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.297] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.297] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
[2020-07-17T15:49:03.305] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.305] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.305] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
[2020-07-17T15:49:03.314] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.314] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.314] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
[2020-07-17T15:49:03.321] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.321] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.321] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified

I found 1,569,533 jobs killed via _slurm_rpc_kill_job2: REQUEST_KILL_JOB in the logs. For the jobs logged with "sig 9 returned Invalid job id specified", the accounting data is missing.
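Counts like that can be double-checked straight from the log with grep and awk. A small self-contained sketch, run here against a sample of the lines quoted above (the log format is copied from the excerpts; on a real system, point the commands at the slurmctld.log path used earlier):

```shell
# Reproduce a couple of the quoted slurmctld lines in a sample file
# so the pipelines below have something to run against.
cat > /tmp/slurmctld.sample.log <<'EOF'
[2020-07-17T15:49:03.297] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.297] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
[2020-07-17T15:49:03.305] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
EOF

# Total kill requests seen:
grep -c 'REQUEST_KILL_JOB' /tmp/slurmctld.sample.log    # prints 2

# Unique job IDs whose kill attempts came back "Invalid job id specified"
# (match the token "job" only when followed by a numeric ID):
grep 'returned Invalid job id specified' /tmp/slurmctld.sample.log \
  | awk '{for (i = 1; i <= NF; i++)
            if ($i == "job" && $(i+1) ~ /^[0-9]+$/) print $(i+1)}' \
  | sort -u                                             # prints 84509296
```

The job IDs printed by the second pipeline are the ones to cross-check against sacct.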

 

|12:26:26|kamals@sigma25:[~]> sacct -j 3954242
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:29:55|kamals@sigma25:[~]> sacct -j 3954287
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:30:07|kamals@sigma25:[~]> sacct -j 3954485
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:30:13|kamals@sigma25:[~]> sacct -j 3954710
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:30:21|kamals@sigma25:[~]> sacct -j 3954717
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
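Checking job IDs one at a time gets tedious; sacct also accepts a comma-separated list, which makes it quick to confirm that all of these are missing in one query (a sketch using the IDs above; --starttime is set wide so default time filtering is not a factor):

```shell
sacct -j 3954242,3954287,3954485,3954710,3954717 \
      --allusers --starttime 2020-01-01 \
      --format=JobID,State,Start,End
```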

There are no suspicious or related errors in the slurmdbd.log files.
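With slurmdbd.log clean, another place to look is the accounting database itself, to see whether the job record ever arrived. A hedged sketch (the database name slurm_acct_db, cluster name sigma, and credentials are assumptions; the <cluster>_job_table / id_job naming follows the standard slurmdbd schema, but verify against the local install):

```shell
# Look for the job row directly in the accounting database (assumed names).
mysql slurm_acct_db -e \
  "SELECT id_job, job_name, state, time_submit, time_start, time_end
     FROM sigma_job_table
    WHERE id_job = 84509296;"

# Also worth checking for jobs the database considers orphaned/unfinished:
sacctmgr show runawayjobs
```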

I found the following on the compute nodes:

|21:50:53|kamals@sigma25:[slurm-llnl]> sudo zcat slurmd.log.*.gz | egrep 84509296
[sudo] password for kamals:
[2020-07-17T14:32:27.895] launch task 84509296.0 request from 9999.9997@10.14.17.77 (port 20184)
[2020-07-17T14:32:27.989] _run_prolog: prolog with lock for job 84509296 ran for 0 seconds
[2020-07-17T14:32:43.366] [84509296.0] done with job
Comment 1 Kamal 2020-07-22 16:04:42 MDT
Can I get some insight into why this is happening?