Created attachment 6430 [details] slurmd log file

Hello,

We have a ~300 node cluster that was running without any issues, but recently we have started seeing nodes being set to drained with the following message:

Reason=batch job complete failure [root@2018-03-19T16:38:40]

The only change I am aware of is that we changed the S:C:T counts. I do see the following message in the logs, but I don't think it should be causing this behavior since it is not happening on all nodes:

[2018-03-20T03:37:03.864] Node configuration differs from hardware: CPUs=28:56(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw) CoresPerSocket=14:14(hw) ThreadsPerCore=2:2(hw)

I am attaching the slurmd log file from one of the nodes that is in this state. Any help would be appreciated.

Thanks,
-Simran
Did you restart the slurmctld and all of the slurmd's after changing the configuration? Will you attach the slurmctld logs as well?
Created attachment 6431 [details] slurmctld log from master
Yes, all the Slurm daemons were bounced on the server/clients. My apologies for not attaching the slurmctld log file. It should now be attached to this ticket.

Thanks,
-Simran
No problem. Will you also attach your slurm.conf? A few possible situations come to mind right now:

1. The nodes were rearranged or new nodes were added to the slurm.conf. This would require a restart of the slurmctld and slurmd's -- which you have done. When did you make the config changes? When were the daemons restarted?

2. Network issues? The slurmd logs show that it can't connect to the other nodes, e.g.

[2018-03-20T11:28:41.624] error: forward_thread to nc045: Connection timed out

and the slurmctld logs show that it can't connect to the backup (though this shouldn't be causing the node problems):

[2018-03-20T13:05:07.903] error: _shutdown_backup_controller:send/recv: Connection refused

3. LDAP issues?

[2018-03-20T12:55:49.721] error: batch_stepd_step_rec_create() failed for job 98266.4294967294 on nc277: User not found on host

Do any of these sound possible at the moment?
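If the "User not found on host" errors come back, one quick check is whether the affected node can resolve the user at all through NSS (which covers LDAP/NIS as well as local files). This is just a sketch; "someuser" is a placeholder for the actual job owner, and the UID comes from the cgroup paths seen elsewhere in this ticket:

```shell
# Check that the compute node can resolve the job's user. "someuser" is
# a placeholder username, not one from this cluster.
getent passwd someuser && echo "user resolvable" || echo "user NOT resolvable"

# A numeric UID can be resolved the same way:
getent passwd 48252 || echo "uid 48252 not resolvable on this host"
```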
Created attachment 6432 [details] slurm.conf
I have attached the slurm.conf for this cluster.

The issue with nc277 is expected, since it was accidentally resumed and has its main network disconnected, so it could not reach our NIS DB. It has since been drained again.

The config change for the CPUs was made a few weeks ago, I believe. I am not aware of any network-related issues but will look into this as well. If there were a network issue then we would also see issues with our GPFS filesystems.

Let me know if you see any other issues that could explain this behavior.

Thanks,
-Simran
The reason message in the initial comment shows that it happened at [root@2018-03-19T16:38:40], but the slurmctld logs start at [2018-03-20T03:37:03.700]. Can you attach yesterday's slurmctld logs?

On the node that the slurmd logs are from, can you connect to nc045 on port 6818? e.g.

telnet nc045 6818

On a separate note, it looks like the controller isn't finding the elasticsearch plugin. I would investigate this after this issue is figured out.
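If telnet isn't installed on the compute image, the same reachability check can be sketched with bash's built-in /dev/tcp redirection (nc045 and 6818 are the node and slurmd port from the logs above):

```shell
# Probe the slurmd port on nc045 from the affected node. Uses bash's
# /dev/tcp pseudo-device so no extra tools are needed; the 3-second
# timeout keeps it from hanging on a silently dropped connection.
if timeout 3 bash -c 'exec 3<>/dev/tcp/nc045/6818' 2>/dev/null; then
    echo "nc045:6818 reachable"
else
    echo "nc045:6818 NOT reachable"
fi
```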
The nc045 node is also a known issue. I have upped the debug level for both slurmd/slurmctld to 9 and resumed all the nodes that were drained with the reason "batch job complete failure". Now if it happens again on a specific node, I will have a more detailed log and will re-submit my request. Feel free to close this ticket or keep it open for a few more days in case I capture a failure sooner than expected. I will also attach the previous slurmctld log.

Thanks,
-Simran
Created attachment 6433 [details] rotated slurmctld log from master
Thanks. I see a DRAIN storm starting at [2018-03-19T16:38:39.999]:

[2018-03-19T16:38:39.999] error: slurmd error running JobId=93580 on node(s)=nc066: Kill task failed
[2018-03-19T16:38:39.999] drain_nodes: node nc066 state set to DRAIN
[2018-03-19T16:38:40.000] error: slurmd error running JobId=93582 on node(s)=nc077: Kill task failed
[2018-03-19T16:38:40.000] drain_nodes: node nc077 state set to DRAIN
[2018-03-19T16:38:40.000] error: slurmd error running JobId=93603 on node(s)=nc226: Kill task failed
[2018-03-19T16:38:40.000] drain_nodes: node nc226 state set to DRAIN

The "Kill task failed" message comes from the stepd not being able to kill a process. Slurm will drain the node when it detects an unkillable process.

https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepTimeout

You should be seeing the following log message in the slurmd logs:

error("*** %s STEPD TERMINATED ON %s AT %s DUE TO JOB NOT ENDING WITH SIGNALS ***", entity, job->node_name, time_str);

Do you see these? Can you investigate the node and see if there are any unkillable processes from the user? You may need to reboot the node to clean out the processes.
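To see how widespread the storm is, the affected node names can be pulled straight out of the slurmctld log. A sketch, assuming the message format shown in the excerpt above; adjust the log path for your setup:

```shell
# List the unique node names drained for "Kill task failed", based on
# the "... on node(s)=<node>: Kill task failed" message format.
grep 'Kill task failed' /var/log/slurmctld.log 2>/dev/null \
    | sed -n 's/.*node(s)=\([^:]*\):.*/\1/p' \
    | sort -u
```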
Created attachment 6435 [details] node ng007 logs and slurmctld
Another node just got automatically set to drain, and the job running on it was killed. I have attached the slurmctld and slurmd (ng007) logs. This was a gpu burn job. Can you please take a look and let us know if you find anything? I am also checking the hardware side.

Thanks,
-Simran
I see the stepd trying hard to kill the batch step, e.g.

[2018-03-20T18:23:15.590] [92671] debug: _handle_terminate for step=92671.4294967294 uid=0
[2018-03-20T18:23:15.590] [92671] debug3: xcgroup_set_param: parameter 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/step_batch'
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27341 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27356 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27357 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27358 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27359 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27360 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27361 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27362 (inherited_task) with signal 9

See if this directory still exists on the node:

/sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/step_batch

If it does, then cat the following to see if the pids are still running on the node:

/sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/tasks
/sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/step_batch/tasks

Let me know what you find. I'll look at it more in the morning.
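The same check can be scripted so it is easy to rerun on other jobs. A small sketch (the cgroup path comes from the log lines above) that prints any PIDs still tracked by a job's freezer cgroup:

```shell
# Print the contents of a job's and its batch step's freezer-cgroup
# task lists, flagging files that are empty or gone. Takes the job
# cgroup directory as $1 so it can be pointed at any job.
check_cgroup_tasks() {
    base="$1"    # e.g. /sys/fs/cgroup/freezer/slurm/uid_48252/job_92671
    for f in "$base/tasks" "$base/step_batch/tasks"; do
        if [ -s "$f" ]; then
            echo "$f still tracks:"
            cat "$f"
        else
            echo "$f: empty or missing"
        fi
    done
}

check_cgroup_tasks /sys/fs/cgroup/freezer/slurm/uid_48252/job_92671
```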
That directory does exist on the compute node but nothing in those tasks files:

--
[ghpc1 root@ng007 step_batch]# cd /sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/step_batch
[ghpc1 root@ng007 step_batch]# ls
cgroup.clone_children  cgroup.procs             freezer.self_freezing  notify_on_release
cgroup.event_control   freezer.parent_freezing  freezer.state          tasks
[ghpc1 root@ng007 step_batch]# cat /sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/tasks
[ghpc1 root@ng007 step_batch]# cat /sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/step_batch/tasks
[ghpc1 root@ng007 step_batch]#
--

Regards,
-Simran
Do you see any of these pids still on the node?

27341
27356
27357
27358
27359
27360
27361
27362

The logs show the batch step trying multiple times to SIGKILL these pids.
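A quick way to check is a sketch like the following, using kill -0, which probes for a process's existence without actually delivering a signal:

```shell
# Check whether any of the PIDs the stepd tried to SIGKILL are still
# alive; kill -0 tests for existence without sending a signal.
for pid in 27341 27356 27357 27358 27359 27360 27361 27362; do
    if kill -0 "$pid" 2>/dev/null; then
        echo "$pid still alive: $(ps -o stat=,comm= -p "$pid")"
    fi
done
```

Processes stuck in uninterruptible sleep (state D in ps, often waiting on I/O) are the usual reason SIGKILL has no effect.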
Looks like changing the UnkillableStepTimeout to 120 sec has fixed our issues on the GPU servers for now. Feel free to close this request. I will reach out again if we run into further issues.

Thanks again for your assistance.
-Simran
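For reference, the change described here is a one-line slurm.conf setting (value in seconds; the default is 60), picked up after the usual reconfigure/restart of the daemons:

```
# slurm.conf -- give the stepd more time to finish cleanup before the
# node is drained with "Kill task failed"
UnkillableStepTimeout=120
```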
OK. Glad that is helping. Let us know if you see anything else.

Thanks,
Brian