| Summary: | Sacct is not recording all jobs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kamal <kkraju90> |
| Component: | slurmd | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | CC: | ben |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Can I get some insight into why this is happening?
We are currently seeing issues with Slurm where jobs appear not to be recorded in slurmdbd. For a recent job that ran, nothing is reported:

```
$ sacct -j 84509296
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
```

But looking in the slurmctld logs, I can see the job was allocated:

```
$ cat /var/log/slurm-llnl/slurmctld.log | grep 84509296
[2020-07-17T14:32:18.270] sched: _slurm_rpc_allocate_resources JobId=84509296 NodeList=sigma25 usec=3443
[2020-07-17T14:32:27.885] _pick_step_nodes: Configuration for job 84509296 is complete
[2020-07-17T15:14:35.527] Time limit exhausted for JobId=84509296
[2020-07-17T15:49:03.297] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.297] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.297] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
[2020-07-17T15:49:03.305] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.305] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.305] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
[2020-07-17T15:49:03.314] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.314] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.314] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
[2020-07-17T15:49:03.321] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 84509296 uid 9999
[2020-07-17T15:49:03.321] job_str_signal: 2 invalid job id 84509296
[2020-07-17T15:49:03.321] _slurm_rpc_kill_job2: job_str_signal() job 84509296 sig 9 returned Invalid job id specified
```

I found 1,569,533 jobs killed via `_slurm_rpc_kill_job2: REQUEST_KILL_JOB` in the logs, and wherever a job is logged with `sig 9 returned Invalid job id specified`, its accounting data is missing:

```
|12:26:26|kamals@sigma25:[~]> sacct -j 3954242
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:29:55|kamals@sigma25:[~]> sacct -j 3954287
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:30:07|kamals@sigma25:[~]> sacct -j 3954485
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:30:13|kamals@sigma25:[~]> sacct -j 3954710
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
|12:30:21|kamals@sigma25:[~]> sacct -j 3954717
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
```

There are no suspicious or related errors in the slurmdbd.log files.

I found the following on the compute node:

```
|21:50:53|kamals@sigma25:[slurm-llnl]> sudo zcat slurmd.log.*.gz | egrep 84509296
[sudo] password for kamals:
[2020-07-17T14:32:27.895] launch task 84509296.0 request from 9999.9997@10.14.17.77 (port 20184)
[2020-07-17T14:32:27.989] _run_prolog: prolog with lock for job 84509296 ran for 0 seconds
[2020-07-17T14:32:43.366] [84509296.0] done with job
```
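For completeness: `sacct` applies default filters (only the invoking user's jobs, and depending on the Slurm version a default time window), so it is worth ruling those out before concluding a record is truly absent. A minimal sketch of a broader query, reusing the job ID from the report and a time window assumed from the slurmctld log timestamps:

```
# Widen sacct's default filters to rule them out as the reason for empty
# output. --allusers lifts the current-user filter; the explicit window
# below is an assumption taken from the log timestamps above.
$ sacct --allusers --starttime 2020-07-17T00:00:00 --endtime 2020-07-18T00:00:00 \
        --jobs 84509296 --format=JobID,JobName,State,Start,End,ExitCode
```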
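When records never reach the database at all, a minimal triage sketch with standard Slurm tooling (assuming a slurmdbd-backed accounting setup, run on the slurmctld host) is to verify the storage plugin, check whether the controller's DBD agent queue is backing up, and look for runaway jobs:

```
# Confirm accounting is actually routed through slurmdbd
$ scontrol show config | grep -i accountingstorage

# sdiag reports the slurmctld -> slurmdbd agent queue; a large or growing
# "DBD Agent queue size" means accounting records are piling up in the
# controller instead of reaching the database
$ sdiag | grep -i "agent queue"

# List "runaway" jobs: jobs the database still considers running although
# the controller no longer knows about them (sacctmgr offers to fix them)
$ sacctmgr show runawayjobs
```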