| Summary: | Server drains after kill task failed - JOB NOT ENDING WITH SIGNALS | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | slurmstepd | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | csc-slurm-tickets, felip.moll |
| Version: | 17.02.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5068, https://bugs.schedmd.com/show_bug.cgi?id=5485 | | |
| Site: | GSK | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | Unknown | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description    GSK-ONYX-SLURM    2018-06-05 08:29:16 MDT
Comment 1

Hey Mark,

Are you able to isolate this to a single node, just to make it easier to diagnose? In either case, when you see the situation again, can you go to the node and check whether any stray user processes are still running?

If you can isolate the issue, the proctrack/cgroup plugin (which I believe you are using) also prints messages at debug2 showing which processes the stepd is repeatedly trying to kill, e.g.

debug2("killing process %d (%s) with signal %d", pids[i],
       (slurm_task==1)?"slurm_task":"inherited_task", signal);

This will help show whether the PIDs that the stepd is trying to kill are still around.

Let me know if you have any questions.

Thanks,
Brian

Comment 2

Hi Brian.

Yes, this does seem to be affecting a single compute server at present. I've enabled debug2 for SlurmdDebug just for that server and done a reconfigure. I'm starting to see debug2 messages in the slurmd log file, so we'll wait and see if we get a hit tomorrow (today), which I'm sure we will.

Something else odd I noticed. I set

SlurmdDebug=debug2

in slurm.conf and then did

scontrol reconfigure

But when I do

scontrol show config | grep SlurmdDebug

it still shows...

SlurmdDebug = info

Weird.

Thanks.
Mark.

Comment 3

(In reply to GSK-EIS-SLURM from comment #2)
> Yes this does seem to be affecting a single compute server at present. I've
> enabled debug2 for SlurmdDebug just for that server and done a reconfigure.
> I'm starting to see debug2 messages in the slurmd log file. So we'll wait
> and see if we get a hit tomorrow (today) which I'm sure we will.
>
> Something else odd I noticed. I set
>
> SlurmdDebug=debug2
>
> in slurm.conf and then did
>
> scontrol reconfigure
>
> But when I do
>
> scontrol show config | grep SlurmdDebug
>
> it still shows...
>
> SlurmdDebug = info

My guess: since you only changed slurm.conf on your compute node, this is probably showing the slurm.conf on your slurmctld. You might also see messages indicating your configuration files don't match.

Anyway, let us know if you see an unkillable task again, and check what the unkillable processes are, per comment 1. Thanks.

- Marshall

Comment 4

Ok, understood. Still monitoring the situation. Since we set debug2 there's been no repeat. I'll have to see what we can do to reproduce.

Regarding the reference to bug 5068 you sent me:

scontrol show config | grep -i MemLimitEnforce
MemLimitEnforce = Yes

scontrol show config | grep -i JobAcctGatherParams
JobAcctGatherParams = NoOverMemoryKill

grep ConstrainRAMSpace /etc/slurm/cgroup.conf
ConstrainRAMSpace=yes

So that seems to be saying we may have a conflict. I will look into that.

Looking further into bug 5068, I see we have

grep -i TaskPlugin /etc/slurm/slurm.conf
TaskPlugin=task/cgroup,task/affinity

grep ConstrainCores /etc/slurm/cgroup.conf
ConstrainCores=no    # if TaskAffinity=no in cgroup.conf do we need to set ConstrainCores=no explicitly?

grep TaskAffinity /etc/slurm/cgroup.conf
TaskAffinity=no

What's the likely impact of not having ConstrainCores set to yes?

And I will also look more at increasing UnkillableStepTimeout from our default of 60s, although I would like to get the debug info with the setting at 60s first if I can.

Comment 5

(In reply to GSK-EIS-SLURM from comment #4)
> Ok, understood. Still monitoring the situation. Since we set debug2
> there's been no repeat. I'll have to see what we can do to reproduce.
>
> Regarding the reference to bug 5068 you sent me:
>
> scontrol show config | grep -i MemLimitEnforce
> MemLimitEnforce = Yes
>
> scontrol show config | grep -i JobAcctGatherParams
> JobAcctGatherParams = NoOverMemoryKill
>
> grep ConstrainRAMSpace /etc/slurm/cgroup.conf
> ConstrainRAMSpace=yes
>
> So that seems to be saying we may have a conflict. I will look into that.

If you want cgroup memory enforcement, then yes, it's best to disable memory enforcement by the jobacctgather plugin and let only the cgroup plugin take care of it:

MemLimitEnforce=No
JobAcctGatherParams=NoOverMemoryKill

> Looking further into bug 5068 I see we have
>
> grep -i TaskPlugin /etc/slurm/slurm.conf
> TaskPlugin=task/cgroup,task/affinity

This is what we recommend.

> grep ConstrainCores /etc/slurm/cgroup.conf
> ConstrainCores=no    # if TaskAffinity=no in cgroup.conf do we need to
> set ConstrainCores=no explicitly?

No, since ConstrainCores=no is the default (though setting it explicitly doesn't harm anything).

> grep TaskAffinity /etc/slurm/cgroup.conf
> TaskAffinity=no

This is what we recommend when using the task/affinity plugin (which you are).

> What's the likely impact of not having ConstrainCores set to yes?

It just means that tasks won't be constrained to specific cores with the cgroup cpuset subsystem. They will still be bound to specific CPUs by the task/affinity plugin using sched_setaffinity(), but that doesn't stop a program from calling sched_setaffinity() itself to bind to CPUs it shouldn't have access to. ConstrainCores=yes enforces the CPU binding: tasks can't use sched_setaffinity() to access CPUs that aren't part of the job allocation. So we recommend ConstrainCores=yes.

> And I will also look more at increasing UnkillableStepTimeout from our
> default of 60s, although I would like to get the debug info with the setting
> at 60s first if I can.

Sounds good.
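The recommendations in comments 4 and 5 can be collected into two small configuration fragments. The sketch below is assembled only from the parameters named in this ticket; the file paths match the greps shown above, and the inline comments are editorial rather than taken from the site's actual files.

    # /etc/slurm/slurm.conf (fragment)
    TaskPlugin=task/cgroup,task/affinity    # recommended plugin stack
    MemLimitEnforce=No                      # let the cgroup plugin own memory enforcement
    JobAcctGatherParams=NoOverMemoryKill    # no memory-limit kills from jobacct_gather

    # /etc/slurm/cgroup.conf (fragment)
    ConstrainRAMSpace=yes                   # cgroup memory limits (already set at this site)
    ConstrainCores=yes                      # enforce CPU binding via the cpuset subsystem
    TaskAffinity=no                         # task/affinity already handles binding

Changes to slurm.conf can usually be picked up with scontrol reconfigure, while cgroup.conf changes are typically applied by restarting slurmd on the compute nodes. Either way, scontrol show config reports the controller's copy of slurm.conf, which is the mismatch seen in comments 2 and 3.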
Comment 6

Hey Mark,

Are you still seeing the draining issue? Were you able to reproduce it?

Thanks,
Brian

Comment 7

Hi. Since enabling debug we haven't seen a repeat of the issue, but we got hit by other issues which may have masked this.

We have implemented UnkillableStepTimeout at 180s in our test/dev clusters and will be implementing it in our production clusters tomorrow, 10 July. Once we've implemented this successfully in production, I will confirm that we can close the call. If the problem re-occurs with the timeout at 180s, then I'll log another ticket.

I'll confirm in the next two days that this bug can be closed.

Thanks.
Mark.

Comment 8

Thanks Mark. Keep us posted.

Comment 9

Hi. Please go ahead and close this bug. Since increasing the timeout to 180s we've not seen any repeats. If we do, I'll reopen this bug or log a new one.

Thanks for your help.

Cheers.
Mark.

Comment 10

Sounds good. Thanks.
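For completeness, the change that resolved the draining (comments 7 and 9) is a single slurm.conf parameter. A minimal sketch follows, assuming the same /etc/slurm/slurm.conf path used in the greps earlier; the 180-second value is the one the site reported, and the verification commands are the generic scontrol calls already used in this ticket.

    # /etc/slurm/slurm.conf (fragment)
    # Give slurmstepd longer to clean up job processes before the node is drained
    # with "kill task failed"; the site's previous setting was the 60s default.
    UnkillableStepTimeout=180

    # Push the change out and confirm what the controller sees:
    scontrol reconfigure
    scontrol show config | grep UnkillableStepTimeout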