We've been looking into identifying, and possibly automatically handling, jobs that seem to take a long time to complete. However, in investigating one of these jobs, I realised it was part of an array job, viz:

JobID        JobIDRaw  JobName     Account  AllocNodes  AllocCPUS  State      Elapsed
-----------  --------  ----------  -------  ----------  ---------  ---------  --------
4172215_24   4172215   obs108013+  mwaeor   1           48         COMPLETED  00:00:54
4172215_24   4172215   batch       mwaeor   1           24         COMPLETED  00:00:54
4172215_24   4172215   extern      mwaeor   1           48         COMPLETED  00:00:56
4172215_1    4172396   obs108013+  mwaeor   1           48         COMPLETED  00:00:52
4172215_1.   4172396   batch       mwaeor   1           24         COMPLETED  00:00:52
4172215_1.   4172396   extern      mwaeor   1           48         COMPLETED  00:00:53
4172215_2    4172397   obs108013+  mwaeor   1           48         COMPLETED  00:00:52
4172215_2.   4172397   batch       mwaeor   1           24         COMPLETED  00:00:52
4172215_2.   4172397   extern      mwaeor   1           48         COMPLETED  00:01:31
4172215_3    4172398   obs108013+  mwaeor   1           48         COMPLETED  00:00:57
4172215_3.   4172398   batch       mwaeor   1           24         COMPLETED  00:00:57
4172215_3.   4172398   extern      mwaeor   1           48         CANCELLED  00:04:12  <==
4172215_4    4172399   obs108013+  mwaeor   1           48         COMPLETED  00:00:50
4172215_4.   4172399   batch       mwaeor   1           24         COMPLETED  00:00:50
4172215_4.   4172399   extern      mwaeor   1           48         COMPLETED  00:00:51
4172215_5    4172400   obs108013+  mwaeor   1           48         COMPLETED  00:00:52
4172215_5.   4172400   batch       mwaeor   1           24         COMPLETED  00:00:52
4172215_5.   4172400   extern      mwaeor   1           48         COMPLETED  00:00:54
4172215_6    4172401   obs108013+  mwaeor   1           48         COMPLETED  00:01:07
4172215_6.   4172401   batch       mwaeor   1           24         COMPLETED  00:01:07
4172215_6.   4172401   extern      mwaeor   1           48         COMPLETED  00:01:10
4172215_7    4172402   obs108013+  mwaeor   1           48         COMPLETED  00:00:52
4172215_7.   4172402   batch       mwaeor   1           24         COMPLETED  00:00:52
4172215_7.   4172402   extern      mwaeor   1           48         COMPLETED  00:00:55
4172215_8    4172403   obs108013+  mwaeor   1           48         COMPLETED  00:00:54
4172215_8.   4172403   batch       mwaeor   1           24         COMPLETED  00:00:54
4172215_8.   4172403   extern      mwaeor   1           48         COMPLETED  00:00:57
4172215_9    4172404   obs108013+  mwaeor   1           48         COMPLETED  00:00:53
4172215_9.   4172404   batch       mwaeor   1           24         COMPLETED  00:00:53
4172215_9.   4172404   extern      mwaeor   1           48         COMPLETED  00:00:55
4172215_10   4172405   obs108013+  mwaeor   1           48         COMPLETED  00:00:53
4172215_10   4172405   batch       mwaeor   1           24         COMPLETED  00:00:53
4172215_10   4172405   extern      mwaeor   1           48         COMPLETED  00:00:55
4172215_11   4172406   obs108013+  mwaeor   1           48         COMPLETED  00:00:55
4172215_11   4172406   batch       mwaeor   1           24         COMPLETED  00:00:55
4172215_11   4172406   extern      mwaeor   1           48         COMPLETED  00:00:58
4172215_12   4172407   obs108013+  mwaeor   1           48         COMPLETED  00:00:52
4172215_12   4172407   batch       mwaeor   1           24         COMPLETED  00:00:52
4172215_12   4172407   extern      mwaeor   1           48         COMPLETED  00:00:54
4172215_13   4172408   obs108013+  mwaeor   1           48         COMPLETED  00:00:51
4172215_13   4172408   batch       mwaeor   1           24         COMPLETED  00:00:51
4172215_13   4172408   extern      mwaeor   1           48         COMPLETED  00:00:54
4172215_14   4172409   obs108013+  mwaeor   1           48         COMPLETED  00:00:52
4172215_14   4172409   batch       mwaeor   1           24         COMPLETED  00:00:52
4172215_14   4172409   extern      mwaeor   1           48         COMPLETED  00:00:55
4172215_15   4172410   obs108013+  mwaeor   1           48         COMPLETED  00:00:55
4172215_15   4172410   batch       mwaeor   1           24         COMPLETED  00:00:55
4172215_15   4172410   extern      mwaeor   1           48         COMPLETED  00:00:58
4172215_16   4172411   obs108013+  mwaeor   1           48         COMPLETED  00:00:52
4172215_16   4172411   batch       mwaeor   1           24         COMPLETED  00:00:52
4172215_16   4172411   extern      mwaeor   1           48         COMPLETED  00:00:54
4172215_17   4172812   obs108013+  mwaeor   1           48         COMPLETED  00:00:50
4172215_17   4172812   batch       mwaeor   1           24         COMPLETED  00:00:50
4172215_17   4172812   extern      mwaeor   1           48         CANCELLED  00:03:32  <==
4172215_18   4172813   obs108013+  mwaeor   1           48         COMPLETED  00:00:53
4172215_18   4172813   batch       mwaeor   1           24         COMPLETED  00:00:53
4172215_18   4172813   extern      mwaeor   1           48         COMPLETED  00:00:54
4172215_19   4172823   obs108013+  mwaeor   1           48         COMPLETED  00:00:52
4172215_19   4172823   batch       mwaeor   1           24         COMPLETED  00:00:52
4172215_19   4172823   extern      mwaeor   1           48         COMPLETED  00:00:54
4172215_20   4172830   obs108013+  mwaeor   1           48         COMPLETED  00:00:52
4172215_20   4172830   batch       mwaeor   1           24         COMPLETED  00:00:52
4172215_20   4172830   extern      mwaeor   1           48         COMPLETED  00:00:54
4172215_21   4172833   obs108013+  mwaeor   1           48         COMPLETED  00:00:55
4172215_21   4172833   batch       mwaeor   1           24         COMPLETED  00:00:55
4172215_21   4172833   extern      mwaeor   1           48         COMPLETED  00:00:57
4172215_22   4172836   obs108013+  mwaeor   1           48         COMPLETED  00:00:53
4172215_22   4172836   batch       mwaeor   1           24         COMPLETED  00:00:53
4172215_22   4172836   extern      mwaeor   1           48         COMPLETED  00:00:55
4172215_23   4172840   obs108013+  mwaeor   1           48         COMPLETED  00:00:55
4172215_23   4172840   batch       mwaeor   1           24         COMPLETED  00:00:55
4172215_23   4172840   extern      mwaeor   1           48         COMPLETED  00:00:58

where the thing to note is that the two extern steps flagged above (array tasks _3 and _17), which ran for several minutes, are listed as CANCELLED, and that one of those has the JobIDRaw of the job that took what we recorded as around 17 minutes to complete. The user says that they are not cancelling jobs.
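For reference, a query along the following lines should reproduce a listing in that shape; the selection options and field list here are illustrative rather than the exact command used (JobID, JobIDRaw, JobName, Account, AllocNodes, AllocCPUS, State and Elapsed are all standard sacct format fields):

  sacct -j 4172215 \
        --format=JobID,JobIDRaw,JobName,Account,AllocNodes,AllocCPUS,State,Elapsed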
If I look at the slurmd log of the node that 4172215_17 (raw JobID 4172812) ran on, the relevant lines are:

[2019-10-28T19:53:36.540] task_p_slurmd_batch_request: 4172812
[2019-10-28T19:53:36.540] task/affinity: job 4172812 CPU input mask for node: 0xFFFFFFFFFFFF
[2019-10-28T19:53:36.540] task/affinity: job 4172812 CPU final HW mask for node: 0xFFFFFFFFFFFF
[2019-10-28T19:53:36.549] _run_prolog: run job script took usec=11321
[2019-10-28T19:53:36.549] _run_prolog: prolog with lock for job 4172812 ran for 0 seconds
[2019-10-28T19:53:36.562] [4172812.extern] core_spec/cray: init
[2019-10-28T19:53:36.591] [4172812.extern] task/cgroup: /slurm/uid_22186/job_4172812: alloc=60000MB mem.limit=57000MB memsw.limit=57000MB
[2019-10-28T19:53:36.591] [4172812.extern] task/cgroup: /slurm/uid_22186/job_4172812/step_extern: alloc=60000MB mem.limit=57000MB memsw.limit=57000MB
[2019-10-28T19:53:39.161] Launching batch job 4172812 for UID 22186
[2019-10-28T19:53:39.173] [4172812.batch] core_spec/cray: init
[2019-10-28T19:53:39.215] [4172812.batch] task/cgroup: /slurm/uid_22186/job_4172812: alloc=60000MB mem.limit=57000MB memsw.limit=57000MB
[2019-10-28T19:53:39.215] [4172812.batch] task/cgroup: /slurm/uid_22186/job_4172812/step_batch: alloc=60000MB mem.limit=57000MB memsw.limit=57000MB
[2019-10-28T19:53:39.289] [4172812.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-10-28T19:54:24.471] [4172812.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2019-10-28T19:54:24.476] [4172812.batch] done with job
[2019-10-28T20:20:17.998] [4172812.extern] done with job

so:

1) There's no evidence of Slurm cancelling anything? In which case, what's likely to be doing the CANCEL-ing here?

2) Can anyone at SchedMD suggest why sacct would believe a job to be in a CANCELLED state after 00:03:32 elapsed, whilst the slurmd didn't consider the extern part of the job "done" until some 26 minutes had elapsed?

Any clues? I doubt we have changed the slurm.conf from the last one we supplied you with.

Kevin M. Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
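P.S. For completeness, this is the sort of sacct query that shows the timestamps recorded for each step of the raw job, which is presumably where the 00:03:32 figure comes from (illustrative only; Start, End and Elapsed are standard sacct format fields):

  sacct -j 4172812 --format=JobID,JobName,State,Start,End,Elapsed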
Hi Kevin - would you please attach the full logs from that day for us to review? Both the slurmctld and the slurmd logs would be helpful in understanding what occurred.

I would be intrigued to know if you also have entries such as the following:

> _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=54 uid 1000
> job_signal: 9 of running JobId=54 successful 0x8004

On the node, at "debug" log level, you should also see entries such as:

> _rpc_terminate_job
> Sent signal
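To save some digging, something along these lines should pull any such entries out; the log paths below are only placeholders for wherever SlurmctldLogFile and SlurmdLogFile point on your systems:

  # On the controller host (path is an assumption; see SlurmctldLogFile in slurm.conf)
  grep -E '_slurm_rpc_kill_job|job_signal' /var/log/slurm/slurmctld.log

  # On nid00170, with SlurmdDebug=debug (path is an assumption; see SlurmdLogFile)
  grep -E '_rpc_terminate_job|Sent signal' /var/log/slurm/slurmd.log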
Created attachment 12153 [details]
Logs as requested
Hi Kevin,

In the controller logs I can see this:

[2019-10-28T19:53:34.546] backfill: Started JobId=4172215_17(4172812) in workq on nid00170
[2019-10-28T19:54:24.472] _job_complete: JobId=4172215_17(4172812) WEXITSTATUS 0
[2019-10-28T19:54:24.474] _job_complete: JobId=4172215_17(4172812) done

But later on a lot of these:

[2019-10-28T19:54:54.557] error: select/cons_res: node nid00170 memory is under-allocated (0-60000) for JobId=4172215_17(4172812)
[2019-10-28T19:54:54.681] error: select/cons_res: node nid00170 memory is under-allocated (0-60000) for JobId=4172215_17(4172812)
[2019-10-28T19:54:54.685] error: select/cons_res: node nid00170 memory is under-allocated (0-60000) for JobId=4172215_17(4172812)

And one of these every 3 minutes:

[2019-10-28T19:57:06.072] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170
[2019-10-28T20:00:06.111] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170
[2019-10-28T20:03:06.873] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170
(...)
[2019-10-28T20:18:06.546] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170

And finally:

[2019-10-28T20:20:17.994] step_partial_comp: JobId=4172215_17(4172812) StepID=4294967295 invalid
[2019-10-28T20:20:25.189] cleanup_completing: JobId=4172215_17(4172812) completion process took 1561 seconds

So the problem seems to be that, once the batch script ends, Slurm tries to finish the .extern step and for some reason it cannot, so the step finally gets marked as "CANCELLED".

I'm not totally sure, but I think the problem may be related to those "memory is under-allocated" messages. Those error messages are fixed in bug 6769, in 19.05.3. I will try to verify whether they are really related.

I'll let you know,
Albert
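P.S. In the meantime, a couple of rough one-liners against the controller log could show how widespread this is on your system; "slurmctld.log" below is just a placeholder for wherever SlurmctldLogFile points, and the awk field index simply matches the "completion process took N seconds" wording above:

  # How often, and against which nodes, the under-allocation error fires
  grep 'memory is under-allocated' slurmctld.log | grep -o 'node [^ ]*' | sort | uniq -c | sort -rn

  # The slowest job completions ("completion process took N seconds")
  grep 'cleanup_completing:' slurmctld.log | awk '{print $(NF-1), $3}' | sort -rn | head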
On 2019/10/31 02:56, bugs@schedmd.com wrote:
> So the problem seems to be that, once the batch script ends, Slurm tries to
> finish the .extern step and for some reason it cannot, so the step finally
> gets marked as "CANCELLED".
>
> I'm not totally sure, but I think the problem may be related to those
> "memory is under-allocated" messages. Those error messages are fixed in
> bug 6769, in 19.05.3. I will try to verify whether they are really related.
>
> I'll let you know,
> Albert

I checked out the master and slurm-19.05 branches from the GitHub repo and, from what I can see, the three commits mentioned in bug 6769 (one of the three is comments only) would apply cleanly to the 18.08.6-2 codebase we are committed to using on our production systems until the next upgrade at the start of Jan 2020, just ahead of the start of our next project accounting period.

I suppose the issue for us would be that, if the file that gets patched,

  src/plugins/select/cons_res/job_test.c

is part of a library that affects many Slurm binaries/operations, then we'd need to ensure nothing else breaks.

Of course, our TDS is currently running 19.05.03-2 as we look to test that release out, ahead of the Jan upgrade.
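For what it's worth, the "applies cleanly" check amounts to a dry run along these lines; the commit IDs are placeholders for the ones referenced in bug 6769, and the tag name for 18.08.6-2 is an assumption:

  # In a clone of https://github.com/SchedMD/slurm
  git checkout slurm-18-08-6-2                 # tag name for 18.08.6-2 is an assumption
  # <commit-sha> stands for each commit referenced in bug 6769
  git show <commit-sha> | git apply --check    # exits non-zero if the patch would not apply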
Maybe answering my own question here, but a closer inspection of what gets deployed suggests that any commit patch(es) to

  src/plugins/select/cons_res/job_test.c

would only end up in one file,

  lib64/slurm/select_cons_res.so

and so a rebuild of 18.08.06-2 and a swap into place of the one "patched" shared-object file might be all that's needed?
Hi Kevin,

> Maybe answering my own question here, but a closer inspection of what gets
> deployed suggests that any commit patch(es) to
>
>   src/plugins/select/cons_res/job_test.c
>
> would only end up in one file,
>
>   lib64/slurm/select_cons_res.so
>
> and so a rebuild of 18.08.06-2 and a swap into place of the one "patched"
> shared-object file might be all that's needed?

Good research; you are basically right, but you will also need to restart slurmctld.

Although what you say is true, and although we don't recommend this kind of cherry-picking, it is also true that doing what you suggest could help us a lot in seeing whether your actual problem (long kill times and cancelled .extern steps) is caused by the "memory is under-allocated" error. I'm not certain of it, but it might be.

If you try that cherry-pick, rebuild Slurm and restart the controller, and the "memory is under-allocated" messages disappear (which is good by itself) but the other errors don't, then we know that they are unrelated.

If you are willing to try it out, it sounds good to me.

Let me know,
Albert
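P.S. As a rough sketch of that rebuild-and-swap; the build-tree layout, install prefix and the way slurmctld is managed are all assumptions about your site, so treat this as illustrative only:

  # After applying the patches to src/plugins/select/cons_res/job_test.c
  # in an already-configured 18.08.6-2 build tree:
  cd src/plugins/select/cons_res
  make            # rebuilds just this plugin
  make install    # installs select_cons_res.so under <prefix>/lib64/slurm/
  # then restart the controller so it picks up the new plugin, e.g.:
  systemctl restart slurmctld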
> I'm not certain of it, but it might be.

Sadly, in terms of making progress on this, that's unlikely to be enough justification for us to sanction a change to our production systems.

If SchedMD's testing had indicated that the "memory is under-allocated" messages WERE RELATED, i.e. were a symptom of the jobs hanging around in the CG state, then we'd probably try swapping the "select_cons_res.so" file, but I can't see it flying here if SchedMD can't say for certain that it's the fix.

In terms of impact on throughput, simply monitoring and putting "long-term CG state" nodes into a "+DRAIN" state, so as to get them offside from the scheduler, is less of a change to the production systems.
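Something like the following is the sort of thing I have in mind, i.e. spot nodes that have been sitting with COMPLETING jobs for a long time and then drain them; the node name and reason string are only examples:

  # Jobs stuck in the COMPLETING (CG) state, with how long they have run
  squeue --states=COMPLETING -o "%.18i %.9P %.8T %.10M %N"

  # Take a stuck node out of scheduling until it has been cleaned up
  scontrol update NodeName=nid00170 State=DRAIN Reason="long-term CG jobs"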
Hi Kevin,

> If SchedMD's testing had indicated that the "memory is under-allocated"
> messages WERE RELATED, i.e. were a symptom of the jobs hanging around in
> the CG state, then we'd probably try swapping the "select_cons_res.so"
> file, but I can't see it flying here if SchedMD can't say for certain
> that it's the fix.

I totally agree. I didn't mean that you should do it. Yet. Let me verify whether they are related or not.

> In terms of impact on throughput, simply monitoring and putting "long-term
> CG state" nodes into a "+DRAIN" state, so as to get them offside from the
> scheduler, is less of a change to the production systems.

If I understand correctly, you are noticing that the nodes which end up with jobs in this "long-term CG state" tend to have more than one job stuck in that state, while the rest of the nodes have none. Is that correct?

If it is, then we are probably heading in the right direction, because it would mean this is not a job-wise or job-random problem but a node-wise one, as the "memory is under-allocated" error is.

I'll keep investigating and I'll let you know,
Albert
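P.S. A quick way to check that would be to count the COMPLETING jobs per node, e.g.:

  # Nodes with more than one job stuck in CG will show a count > 1
  squeue --states=COMPLETING -h -o "%N" | sort | uniq -c | sort -rn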
Hi Kevin,

Sorry that I didn't follow up on this ticket earlier. I'm closing it as timed out, but please don't hesitate to reopen it if you need further support.

Regards,
Albert