Ticket 9444

Summary: Sacct is not recording all jobs
Product: Slurm
Reporter: Kamal <kkraju90>
Component: slurmd
Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
QA Contact:
Severity: 6 - No support contract
Priority: ---
CC: ben
Version: - Unsupported Older Versions
Hardware: Linux
OS: Linux
Site: -Other-

Description Kamal 2020-07-21 16:28:36 MDT
We are currently seeing an issue with Slurm where some jobs do not appear to be recorded in slurmdbd.

For a recent job that ran, nothing is reported:
$ sacct -j 84509296
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

But looking in the slurmctld log I can see the job was allocated and ran:

$ cat /var/log/slurm-llnl/slurmctld.log | grep 84509296
[2020-07-17T14:32:18.270] sched: _slurm_rpc_allocate_resources JobId=84509296 NodeList=sigma25 usec=3443
[2020-07-17T14:32:27.885] _pick_step_nodes: Configuration for job 84509296 is complete
[2020-07-17T15:14:35.527] Time limit exhausted for JobId=84509296
[2020-07-17T15:49:03.297] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.297] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.297] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
[2020-07-17T15:49:03.305] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.305] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.305] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
[2020-07-17T15:49:03.314] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.314] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.314] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
[2020-07-17T15:49:03.321] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.321] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.321] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified

I found 1569533 jobs killed via _slurm_rpc_kill_job2: REQUEST_KILL_JOB in the logs. For the jobs that are logged with "sig 9 returned Invalid job id specified", the accounting data is missing.
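
For reference, something like the following should pull the affected job IDs out of the controller log (a rough sketch; the log path and message text are taken from the excerpts above, and missing_jobs.txt is just an illustrative output file):

$ grep 'returned Invalid job id specified' /var/log/slurm-llnl/slurmctld.log \
      | grep -oE 'job [0-9]+' | awk '{print $2}' | sort -u > missing_jobs.txt
$ wc -l < missing_jobs.txt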


|12:26:26|kamals@sigma25:[~]> sacct -j 3954242
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:29:55|kamals@sigma25:[~]> sacct -j 3954287
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:30:07|kamals@sigma25:[~]> sacct -j 3954485
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:30:13|kamals@sigma25:[~]> sacct -j 3954710
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:30:21|kamals@sigma25:[~]> sacct -j 3954717
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
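
In case it matters, I believe sacct's default time window only covers jobs started since midnight of the current day for the current user, so these can also be queried with an explicit window and user scope; a sketch (the dates here are only placeholders, not the actual run dates of these jobs):

$ sacct -a -S 2020-07-17 -E 2020-07-18 -j 3954242 --format=JobID,JobName,State,Start,End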

There are no suspicious/related errors in the slurmdbd.log files.

I found the following in the slurmd logs on the compute node:

|21:50:53|kamals@sigma25:[slurm-llnl]> sudo zcat slurmd.log.*.gz | egrep 84509296
[sudo] password for kamals:
[2020-07-17T14:32:27.895] launch task 84509296.0 request from 9999.9997@10.14.17.77 (port 20184)
[2020-07-17T14:32:27.989] _run_prolog: prolog with lock for job 84509296 ran for 0 seconds
[2020-07-17T14:32:43.366] [84509296.0] done with job
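
One more check I can run if it is useful: whether the controller's slurmdbd agent queue is backing up. A sketch (assuming sdiag on this version reports the DBD agent queue size):

$ sdiag | grep -i 'dbd agent'
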
Comment 1 Kamal 2020-07-22 16:04:42 MDT
Can I get some insight here into why this is happening?