Ticket 10547

Summary: Random nodes fail in "Kill task failed" due to "error: STEP JOBID.0 STEPD TERMINATED ON worker_node"
Product: Slurm Reporter: Misha Ahmadian <misha.ahmadian>
Component: slurmd    Assignee: Tim McMullan <mcmullan>
Status: RESOLVED CANNOTREPRODUCE QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.11.3   
Hardware: Linux   
OS: Linux   
Site: TTU Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Slurm.conf
cgroup.conf

Description Misha Ahmadian 2021-01-04 12:01:01 MST
Hello,

We are recently facing a new type of issue that causes nodes to be drained with the "Kill task failed" reason. I understand that "Kill task failed" happens most of the time due to a node-level failure. However, we are observing strange errors in the slurmd.log files on multiple nodes:

First off, on a few nodes we see "error: Failed to send MESSAGE_TASK_EXIT: Connection refused" for some random jobs (not every job). After looking at tickets 2420, 3084, and 3083, it sounds like some tuning is required in order to avoid this message (https://slurm.schedmd.com/high_throughput.html). However, the cluster has only been used by a handful of test users since we upgraded to 20.11.0, and they are submitting simple MPI jobs rather than large array jobs (far fewer than 500 jobs per second!), so we do not see any reason that this would cause the issue:

>slurmd.log:

[2021-01-01T09:40:56.488] [16564.extern] Considering each NUMA node as a socket
[2021-01-01T09:40:56.500] [16564.extern] task/cgroup: _memcg_initialize: /slurm/uid_98858/job_16564: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2021-01-01T09:40:56.500] [16564.extern] task/cgroup: _memcg_initialize: /slurm/uid_98858/job_16564/step_extern: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2021-01-01T09:40:57.381] launch task StepId=16564.0 request from UID:98858 GID:260 HOST:10.100.24.4 PORT:37120
[2021-01-01T09:40:57.382] task/affinity: lllp_distribution: JobId=16564 manual binding: none,one_thread
[2021-01-01T09:40:57.387] [16564.0] Considering each NUMA node as a socket
[2021-01-01T09:40:57.391] [16564.0] task/cgroup: _memcg_initialize: /slurm/uid_98858/job_16564: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2021-01-01T09:40:57.391] [16564.0] task/cgroup: _memcg_initialize: /slurm/uid_98858/job_16564/step_0: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2021-01-01T09:41:31.341] [16564.extern] done with job
[2021-01-01T09:41:32.551] [16564.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2021-01-01T09:41:32.552] [16564.0] done with job

># sacct -j 16564 --format=jobid,user,jobname,nnodes,ncpus,state,exitcode

       JobID      User    JobName   NNodes      NCPUS      State ExitCode
------------ --------- ---------- -------- ---------- ---------- --------
16564           user1      Re5KSS        8       1024     FAILED     11:0
16564.batch                 batch        1        128     FAILED     11:0
16564.extern               extern        8       1024  COMPLETED      0:0
16564.0                     orted        7        896  COMPLETED      0:0


># sdiag | grep user1
   user1          (   98858) count:12067  ave_time:1238   total_time:14939269

Although one of those tickets suggested ignoring this "Failed to send MESSAGE_TASK_EXIT: Connection refused" error (which I doubt is harmless in the long run), it looks like in some cases it leads to more serious issues, though I'm not sure whether the two are related:

The following example shows the logs of a job from the same user that caused one of the nodes to be drained with "Kill task failed":

>slurmd.log

[2021-01-01T09:47:41.414] [16566.extern] Considering each NUMA node as a socket
[2021-01-01T09:47:41.426] [16566.extern] task/cgroup: _memcg_initialize: /slurm/uid_98858/job_16566: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2021-01-01T09:47:41.426] [16566.extern] task/cgroup: _memcg_initialize: /slurm/uid_98858/job_16566/step_extern: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2021-01-01T09:47:42.326] launch task StepId=16566.0 request from UID:98858 GID:260 HOST:10.100.25.2 PORT:43724
[2021-01-01T09:47:42.326] task/affinity: lllp_distribution: JobId=16566 manual binding: none,one_thread
[2021-01-01T09:47:42.332] [16566.0] Considering each NUMA node as a socket
[2021-01-01T09:47:42.336] [16566.0] task/cgroup: _memcg_initialize: /slurm/uid_98858/job_16566: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2021-01-01T09:47:42.336] [16566.0] task/cgroup: _memcg_initialize: /slurm/uid_98858/job_16566/step_0: alloc=515456MB mem.limit=515456MB memsw.limit=644320MB
[2021-01-01T19:58:05.425] [16566.extern] done with job
[2021-01-01T19:59:36.000] [16566.0] error: *** STEP 16566.0 STEPD TERMINATED ON cpu-26-15 AT 2021-01-01T19:59:35 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2021-01-01T19:59:38.301] [16566.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2021-01-01T19:59:38.303] [16566.0] done with job
[2021-01-01T20:00:30.083] epilog for job 16566 ran for 45 seconds

>slurmctld.log:

[2021-01-01T09:47:40.929] _slurm_rpc_submit_batch_job: JobId=16566 InitPrio=2351 usec=13008
[2021-01-01T09:47:41.228] email msg to use1@xxx: Slurm Job_id=16566 Name=Re5KSS Began, Queued time 00:00:01
[2021-01-01T09:47:41.230] sched: Allocate JobId=16566 NodeList=cpu-25-2,cpu-26-[14-16] #CPUs=512 Partition=nocona

[2021-01-01T19:58:05.411] _job_complete: JobId=16566 WEXITSTATUS 134
[2021-01-01T19:58:05.411] email msg to use1@xxx: Slurm Job_id=16566 Name=Re5KSS Failed, Run time 10:10:24, FAILED, ExitCode 134
[2021-01-01T19:58:05.411] _job_complete: JobId=16566 done
[2021-01-01T19:59:36.001] update_node: node cpu-26-15 reason set to: Kill task failed
[2021-01-01T19:59:36.001] update_node: node cpu-26-15 state set to DRAINING
[2021-01-01T19:59:36.808] Resending TERMINATE_JOB request JobId=16566 Nodelist=cpu-26-15
[2021-01-01T20:00:30.084] cleanup_completing: JobId=16566 completion process took 145 seconds

># sacct -j 16566 --format=jobid,user,jobname,nnodes,ncpus,state,exitcode
       JobID      User    JobName   NNodes      NCPUS      State ExitCode
------------ --------- ---------- -------- ---------- ---------- --------
16566           user1      Re5KSS        4        512     FAILED      6:0
16566.batch                 batch        1        128     FAILED      6:0
16566.extern               extern        4        512  COMPLETED      0:0
16566.0                     orted        3        384  COMPLETED      0:0


As you can see, this job failed due to:

error: *** STEP 16566.0 STEPD TERMINATED ON cpu-26-15 AT 2021-01-01T19:59:35 DUE TO JOB NOT ENDING WITH SIGNALS ***

Then we see the same "error: Failed to send MESSAGE_TASK_EXIT: Connection refused", along with a new line that says "epilog for job 16566 ran for 45 seconds", which is very odd for a straightforward epilog script.

So, these errors are not very helpful in pinpointing the source of the issue, since they happen randomly for the same type of jobs (same user, same MPI application, and almost the same number of nodes). Do you have any recommendations on where to look, and what might be causing all these types of errors?

Best Regards,
Misha
Comment 1 Misha Ahmadian 2021-01-04 12:01:44 MST
Created attachment 17329 [details]
Slurm.conf
Comment 2 Misha Ahmadian 2021-01-04 12:02:10 MST
Created attachment 17330 [details]
cgroup.conf
Comment 3 Tim McMullan 2021-01-04 12:47:42 MST
Hi Misha,

Would you be able to update your slurmctld and slurmd log levels to "debug2" and send a slurmd log from a failed node along with the slurmctld log?  It may give some more insight into what's going on at the time of the failure.

Would you also be able to log the output of sdiag for a while, just to see what the RPC traffic looks like?  I don't really expect it to be crazy, but it would be worth looking at as well.
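A throwaway wrapper along these lines could capture that periodically (a sketch only; the log path, interval, and count below are placeholders to adjust for your site):

```shell
# poll_cmd: append timestamped output of an arbitrary command to a logfile,
# once every <interval> seconds, <count> times.
# Usage: poll_cmd <logfile> <interval_sec> <count> <command...>
poll_cmd() {
    local log=$1 interval=$2 count=$3 i
    shift 3
    for ((i = 0; i < count; i++)); do
        # One timestamped block per sample, appended to the logfile.
        { printf '=== %s ===\n' "$(date)"; "$@"; echo; } >> "$log"
        sleep "$interval"
    done
}

# For example, sampling sdiag every 5 minutes for a day (placeholder values):
#   poll_cmd /var/log/slurm/sdiag.log 300 288 sdiag
```

Running it under nohup or inside a screen/tmux session would keep it sampling after you log out.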

Thanks!
--Tim
Comment 4 Misha Ahmadian 2021-01-04 13:05:32 MST
Hi Tim,

Thanks for your quick response. Actually, we just went into a maintenance downtime today that will last the next few days. I thought we could figure this out by looking at the current logs while the cluster is down, but it looks like I was mistaken. Is there anything we can do while the cluster is free, or is it better to continue working on this ticket once the cluster is back online?

Best,
Misha
Comment 5 Tim McMullan 2021-01-04 14:17:20 MST
Unfortunately, from the logs we have right now it's not clear where the problem is, and some more logs might be very helpful.  I'm still going to look at the possibilities, though.  I see that the job uses 512 CPUs; would you be able to share the "#SBATCH" lines from the submit script?  There might be something else in there I could use.

With a lower user count, I wouldn't *expect* the ctld to be overrun and having connection issues, but it's possible there is a burst of RPCs causing this.

It may make more sense to look at this more deeply once the cluster is out of maintenance, but I'm happy to work with what we have to see what we can find.

Thanks!
--Tim
Comment 6 Misha Ahmadian 2021-01-05 09:16:19 MST
Hi Tim,

Sure. Please find the "#SBATCH" lines from the submit script of job 16566 (more details in my first post):

#!/bin/bash
#SBATCH --job-name=Re5KSS
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --partition=nocona
#SBATCH -t 48:00:00
#SBATCH --mail-user=user1@xxx
#SBATCH --mail-type=all


The output of sdiag for this particular user:

># sdiag | grep user1

   user1          (   98858) count:12067  ave_time:1238   total_time:14939269

We never saw this when we were on Slurm 20.02.03. I'm not sure whether it is something related to this specific version of Slurm or something else.

Please let me know if I can provide more information. Otherwise, we will continue this ticket after we finish the maintenance.

Thank you,
Misha
Comment 7 Tim McMullan 2021-01-21 10:30:54 MST
Hi Misha,

I just wanted to check in and see if the cluster is back up and if you've been able to reproduce the issue?

Unfortunately, the submit script alone wasn't enough to narrow this one down.

Thanks!
--Tim
Comment 8 Misha Ahmadian 2021-01-21 10:47:34 MST
Hi Tim,

I apologize for not getting back to you yet. We expected to have the cluster back online last week, but we ran into some storage-side issues that we just fixed. Meanwhile, I upgraded Slurm to 20.11.3 to overcome the mpirun/srun issue. We expect to have the cluster back online hopefully by tomorrow. I'll get back to you ASAP.

Sorry again for the delay. Please let me know what I can do once the cluster is back, to speed up our investigation.

Thank you,
Misha
Comment 9 Tim McMullan 2021-01-21 12:08:40 MST
Hi Misha,

No worries at all, I just wanted to check in!

When the cluster is back up, if you are able to run with "debug2" for the slurmd and slurmctld, reproduce the issue, and attach those logs, I think that will be the most helpful for identifying the problem!
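For reference, the corresponding slurm.conf settings would look roughly like this (a sketch; debug2 is verbose, so revert once the failure is captured):

```
# slurm.conf -- temporary debug logging, revert after capturing the failure
SlurmctldDebug=debug2
SlurmdDebug=debug2
```

An `scontrol reconfigure` afterwards should apply the new levels without restarting the daemons; `scontrol setdebug debug2` can also raise the slurmctld level on the fly.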

Thanks again, and I'm glad you were able to fix the storage issues!
--Tim
Comment 10 Misha Ahmadian 2021-01-26 11:35:27 MST
Hi Tim,

I just wanted to let you know that the cluster is back online, and we have had a couple of nodes fail due to "Kill task failed". I haven't enabled the debug2 settings on slurmd and slurmctld yet, for two reasons: 1) we're in the middle of adding more nodes to our support contract with SchedMD, and the "Kill task failed" errors have appeared on the new nodes as well; 2) enabling debug2 generates a huge amount of logs, which prevents us from monitoring the scheduler at this moment. I'll enable it as soon as the support contract for all the nodes goes through.

The interesting thing about this "Kill task failed" is that it seems to only happen to specific jobs/users, but I have to confirm that.

I'll get back to you as soon as I have collected the logs for you. I'll also upload an updated slurm.conf file, which includes the settings for the new nodes.

Best,
Misha
Comment 11 Tim McMullan 2021-01-26 12:13:08 MST
Thanks for the update Misha, and glad to hear the cluster is back online!

It is interesting that it seems to just be for some specific users/job types.  Let me know if that seems to pan out, and we can look at it from that angle as well.

Thanks!
--Tim
Comment 12 Misha Ahmadian 2021-02-01 07:45:01 MST
Hi Tim,

I've been watching the cluster all of last week, but it looks like the "Kill task failed" doesn't want to show up anymore!! This is very frustrating, since debug2 has been set to trap the errors. So, since adding the new nodes to the support contract with SchedMD is still in progress and this issue is having a hard time showing its face, I think it's better to close this ticket and reopen it (or create a new one) once everything is set. I really apologize; I didn't expect this issue to go this far without a good chance for us to provide more information. I'll let you know as soon as everything is prepared, and I really appreciate your help and patience.

Best Regards,
Misha
Comment 13 Tim McMullan 2021-02-01 07:57:25 MST
(In reply to Misha Ahmadian from comment #12)

Hi Misha,

No problem at all!  If the issue crops up again, please do let us know and we can work through it.  For now, I'm just glad to hear that the cluster is up and seems to be working.

Thanks for the update!
--Tim