Created attachment 6430 [details] slurmd log file

Hello,

We have a ~300 node cluster that was running without any issues, but recently we have started seeing nodes being set to drained with the following message:

Reason=batch job complete failure [root@2018-03-19T16:38:40]

The only change I am aware of is that we changed the S:C:T counts. I do see the following message in the logs, but I don't think it should be causing this behavior since it is not happening on all nodes:

[2018-03-20T03:37:03.864] Node configuration differs from hardware: CPUs=28:56(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw) CoresPerSocket=14:14(hw) ThreadsPerCore=2:2(hw)

I am attaching the slurmd log file from one of the nodes that is in this state. Any help would be appreciated.

Thanks,
-Simran
Did you restart the slurmctld and all of the slurmd's after changing the configuration? Will you attach the slurmctld logs as well?
Created attachment 6431 [details] slurmctld log from master
Yes, all the Slurm daemons were bounced on the server/clients. My apologies for not attaching the slurmctld log file. It should now be attached to this ticket.

Thanks,
-Simran
No problem. Will you also attach your slurm.conf? A few possible situations come to mind right now:

1. The nodes were rearranged or new nodes were added to the slurm.conf. This would require a restart of the slurmctld and slurmd's -- which you have done. When did you make the config changes? When were the daemons restarted?

2. Network issues? The slurmd logs show that it can't connect to the other nodes, e.g.

[2018-03-20T11:28:41.624] error: forward_thread to nc045: Connection timed out

and the slurmctld logs show that it can't connect to the backup (though this shouldn't be causing the node problems):

[2018-03-20T13:05:07.903] error: _shutdown_backup_controller:send/recv: Connection refused

3. LDAP issues?

[2018-03-20T12:55:49.721] error: batch_stepd_step_rec_create() failed for job 98266.4294967294 on nc277: User not found on host

Do any of these sound possible at the moment?
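If the "User not found on host" errors come back, one quick check is whether the affected node can resolve the user at all through NSS (which covers LDAP/NIS as well as local files). This is just a sketch; "someuser" is a placeholder for the actual job owner, and the UID comes from the cgroup paths seen elsewhere in this ticket:

```shell
# Check that the compute node can resolve the job's user. "someuser" is
# a placeholder username, not one from this cluster.
getent passwd someuser && echo "user resolvable" || echo "user NOT resolvable"

# A numeric UID can be resolved the same way:
getent passwd 48252 || echo "uid 48252 not resolvable on this host"
```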
Created attachment 6432 [details] slurm.conf
I have attached the slurm.conf for this cluster.

The issue with nc277 is expected, since it was accidentally resumed and has its main network disconnected, so it could not reach our NIS DB. It has since been drained again.

The config change for the CPUs was made a few weeks ago, I believe. I am not aware of any network-related issues but will look into this as well. If there were a network issue then we would also see issues with our GPFS filesystems.

Let me know if you see any other issues that could explain this behavior.

Thanks,
-Simran
The reason message in the initial comment shows that it happened at [root@2018-03-19T16:38:40], but the slurmctld logs start at [2018-03-20T03:37:03.700]. Can you attach yesterday's slurmctld logs?

On the node that the slurmd logs are from, can you connect to nc045 on port 6818? e.g.

telnet nc045 6818

On a separate note, it looks like the controller isn't finding the elasticsearch plugin. I would investigate this after this issue is figured out.
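If telnet isn't installed on the compute image, the same reachability check can be sketched with bash's built-in /dev/tcp redirection (nc045 and 6818 are the node and slurmd port from the logs above):

```shell
# Probe the slurmd port on nc045 from the affected node. Uses bash's
# /dev/tcp pseudo-device so no extra tools are needed; the 3-second
# timeout keeps it from hanging on a silently dropped connection.
if timeout 3 bash -c 'exec 3<>/dev/tcp/nc045/6818' 2>/dev/null; then
    echo "nc045:6818 reachable"
else
    echo "nc045:6818 NOT reachable"
fi
```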
The nc045 node is also a known issue. I have upped the debug level for both slurmd/slurmctld to 9 and resumed all the nodes that were drained with the reason "batch job complete failure". Now if it happens again on a specific node, I will have a more detailed log and will re-submit my request. Feel free to close this ticket or keep it open for a few more days in case I capture a failure sooner than expected. I will also attach the previous slurmctld log.

Thanks,
-Simran
Created attachment 6433 [details] rotated slurmctld log from master
Thanks. I see a DRAIN storm starting at [2018-03-19T16:38:39.999]:

[2018-03-19T16:38:39.999] error: slurmd error running JobId=93580 on node(s)=nc066: Kill task failed
[2018-03-19T16:38:39.999] drain_nodes: node nc066 state set to DRAIN
[2018-03-19T16:38:40.000] error: slurmd error running JobId=93582 on node(s)=nc077: Kill task failed
[2018-03-19T16:38:40.000] drain_nodes: node nc077 state set to DRAIN
[2018-03-19T16:38:40.000] error: slurmd error running JobId=93603 on node(s)=nc226: Kill task failed
[2018-03-19T16:38:40.000] drain_nodes: node nc226 state set to DRAIN

The "Kill task failed" message comes from the stepd not being able to kill a process. Slurm will drain the node when it detects an unkillable process.

https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepTimeout

You should be seeing the following log message in the slurmd logs:

error("*** %s STEPD TERMINATED ON %s AT %s DUE TO JOB NOT ENDING WITH SIGNALS ***", entity, job->node_name, time_str);

Do you see these? Can you investigate the node and see if there are any unkillable processes from the user? You may need to reboot the node to clean out the processes.
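To see how widespread the storm is, the affected node names can be pulled straight out of the slurmctld log. A sketch, assuming the message format shown in the excerpt above; adjust the log path for your setup:

```shell
# List the unique node names drained for "Kill task failed", based on
# the "... on node(s)=<node>: Kill task failed" message format.
grep 'Kill task failed' /var/log/slurmctld.log 2>/dev/null \
    | sed -n 's/.*node(s)=\([^:]*\):.*/\1/p' \
    | sort -u
```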
Created attachment 6435 [details] node ng007 logs and slurmctld
Another node just got automatically set to drain, and the job running on it was killed. I have attached the slurmctld and slurmd (ng007) logs. This was a gpu burn job. Can you please take a look and let us know if you find anything? I am also checking the hardware side.

Thanks,
-Simran
I see the stepd trying hard to kill the batch step, e.g.

[2018-03-20T18:23:15.590] [92671] debug: _handle_terminate for step=92671.4294967294 uid=0
[2018-03-20T18:23:15.590] [92671] debug3: xcgroup_set_param: parameter 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/step_batch'
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27341 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27356 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27357 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27358 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27359 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27360 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27361 (inherited_task) with signal 9
[2018-03-20T18:23:15.590] [92671] debug2: killing process 27362 (inherited_task) with signal 9

See if this directory still exists on the node:

/sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/step_batch

If it does, then cat the following to see if the pids are still running on the node:

/sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/tasks
/sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/step_batch/tasks

Let me know what you find. I'll look at it more in the morning.
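The same check can be scripted so it is easy to rerun on other jobs. A small sketch (the cgroup path comes from the log lines above) that prints any PIDs still tracked by a job's freezer cgroup:

```shell
# Print the contents of a job's and its batch step's freezer-cgroup
# task lists, flagging files that are empty or gone. Takes the job
# cgroup directory as $1 so it can be pointed at any job.
check_cgroup_tasks() {
    base="$1"    # e.g. /sys/fs/cgroup/freezer/slurm/uid_48252/job_92671
    for f in "$base/tasks" "$base/step_batch/tasks"; do
        if [ -s "$f" ]; then
            echo "$f still tracks:"
            cat "$f"
        else
            echo "$f: empty or missing"
        fi
    done
}

check_cgroup_tasks /sys/fs/cgroup/freezer/slurm/uid_48252/job_92671
```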
That directory does exist on the compute node but nothing in those tasks files:

--
[ghpc1 root@ng007 step_batch]# cd /sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/step_batch
[ghpc1 root@ng007 step_batch]# ls
cgroup.clone_children  cgroup.procs             freezer.self_freezing  notify_on_release
cgroup.event_control   freezer.parent_freezing  freezer.state          tasks
[ghpc1 root@ng007 step_batch]# cat /sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/tasks
[ghpc1 root@ng007 step_batch]# cat /sys/fs/cgroup/freezer/slurm/uid_48252/job_92671/step_batch/tasks
[ghpc1 root@ng007 step_batch]#
--

Regards,
-Simran
Do you see any of these pids still on the node?

27341
27356
27357
27358
27359
27360
27361
27362

The logs show the batch step trying multiple times to SIGKILL these pids.
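A quick way to check is a sketch like the following, using kill -0, which probes for a process's existence without actually delivering a signal:

```shell
# Check whether any of the PIDs the stepd tried to SIGKILL are still
# alive; kill -0 tests for existence without sending a signal.
for pid in 27341 27356 27357 27358 27359 27360 27361 27362; do
    if kill -0 "$pid" 2>/dev/null; then
        echo "$pid still alive: $(ps -o stat=,comm= -p "$pid")"
    fi
done
```

Processes stuck in uninterruptible sleep (state D in ps, often waiting on I/O) are the usual reason SIGKILL has no effect.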
Looks like changing the UnkillableStepTimeout to 120 sec has fixed our issues on the GPU servers for now. Feel free to close this request. I will reach out again if we run into further issues.

Thanks again for your assistance.
-Simran
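For reference, the change described here is a one-line slurm.conf setting (value in seconds; the default is 60), picked up after the usual reconfigure/restart of the daemons:

```
# slurm.conf -- give the stepd more time to finish cleanup before the
# node is drained with "Kill task failed"
UnkillableStepTimeout=120
```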
OK. Glad that is helping. Let us know if you see anything else.

Thanks,
Brian