| Summary: | Random nodes fail with "Kill task failed" due to "error: STEP JOBID.0 STEPD TERMINATED ON worker_node" | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Misha Ahmadian <misha.ahmadian> |
| Component: | slurmd | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.11.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | TTU | Alineos Sites: | --- |
| Attachments: | Slurm.conf, cgroup.conf | | |
Description
Misha Ahmadian
2021-01-04 12:01:01 MST

Created attachment 17329 [details]: Slurm.conf
Created attachment 17330 [details]: cgroup.conf

---

Tim McMullan:

Hi Misha,

Would you be able to update your slurmctld and slurmd log levels to "debug2" and send a slurmd log from a failed node along with the slurmctld log? That may give some more insight into what's going on at the time of the failure. Would you also be able to log the output of sdiag for a while, just to see what the RPC traffic looks like? I don't really expect it to be unusual, but it is worth checking as well.

Thanks!
--Tim

---

Misha Ahmadian:

Hi Tim,

Thanks for your quick response. We actually just went into a maintenance downtime today that will last the next few days. I thought we could figure this out by looking at the current logs while the cluster is down, but it looks like I was mistaken. Is there anything we can do while the cluster is idle, or is it better to continue working on this ticket once the cluster is back online?

Best,
Misha

---

Tim McMullan:

Unfortunately, from the logs we have right now it's not clear where the problem is, and more logs would be very helpful. I'm still going to look at the possibilities, though. I see that the job uses 512 CPUs; would you be able to share the "#SBATCH" lines from the submit script? There might be something else in there I could use. With a lower user count I wouldn't *expect* the ctld to be overrun and having connection issues, but it's possible a burst of RPCs is causing this. It may make more sense to dig into this deeper once you are out of maintenance, but I'm happy to look at what we have and see what we can find.

Thanks!
--Tim

---

Misha Ahmadian:

Hi Tim,
Sure. Please find the "#SBATCH" lines from the submit script of job 16566 (more details in my first post):

```bash
#!/bin/bash
#SBATCH --job-name=Re5KSS
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --partition=nocona
#SBATCH -t 48:00:00
#SBATCH --mail-user=user1@xxx
#SBATCH --mail-type=all
```
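As a quick sanity check on Tim's earlier note that the job uses 512 CPUs, the request above works out to nodes × ntasks-per-node × cpus-per-task (a minimal sketch, with the values taken directly from the directives above):

```python
# Resource request from the submit script above.
nodes = 4
ntasks_per_node = 128
cpus_per_task = 1

total_tasks = nodes * ntasks_per_node     # 4 * 128 = 512 tasks
total_cpus = total_tasks * cpus_per_task  # 512 CPUs, matching the "512 cpus" Tim mentioned

print(total_tasks, total_cpus)
```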
The output of sdiag for this particular user:

```
# sdiag | grep user1
user1 ( 98858) count:12067 ave_time:1238 total_time:14939269
```
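For reference, the per-user RPC line from sdiag can be pulled apart like this (a minimal sketch; it assumes sdiag's usual `count:N ave_time:N total_time:N` layout with times in microseconds, and checks that ave_time is total_time divided by count for this line):

```python
import re

# Per-user RPC statistics line as printed by sdiag (copied from above).
line = "user1 ( 98858) count:12067 ave_time:1238 total_time:14939269"

match = re.search(r"count:(\d+)\s+ave_time:(\d+)\s+total_time:(\d+)", line)
count, ave_time, total_time = (int(g) for g in match.groups())

# For this line, ave_time equals total_time // count (microseconds per RPC).
assert ave_time == total_time // count

# Roughly how much cumulative slurmctld time this user's RPCs consumed, in seconds.
print(f"{count} RPCs, ~{total_time / 1_000_000:.1f}s of RPC processing time")
```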
We never saw this when we were on Slurm 20.02.03, so I'm not sure whether it is something related to this specific version of Slurm or something else.
Please let me know if I can provide more information. Otherwise, we will continue this ticket after we finish the maintenance.
Thank you,
Misha
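For anyone following along: Tim's request to raise the slurmd and slurmctld log levels to "debug2" can be done persistently in slurm.conf, or on the fly for slurmctld with scontrol (a sketch of the usual approach; the exact restart/reconfigure steps vary by site):

```
# slurm.conf — persistent setting (distribute to all nodes, then restart/reconfigure)
SlurmctldDebug=debug2
SlurmdDebug=debug2

# Or raise slurmctld verbosity on the fly, without a restart:
#   scontrol setdebug debug2
# and later drop it back down:
#   scontrol setdebug info
```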
---

Tim McMullan:

Hi Misha,

I just wanted to check in and see if the cluster is back up and whether you've been able to reproduce the issue. Unfortunately, the way the job was submitted wasn't enough to narrow this one down.

Thanks!
--Tim

---

Misha Ahmadian:

Hi Tim,

I apologize for not getting back to you yet. We expected to have the cluster back online last week, but we ran into some storage-side issues that we have just fixed. Meanwhile, I upgraded Slurm to 20.11.3 to overcome the mpirun/srun issue. We expect to have the cluster back online hopefully by tomorrow, and I'll get back to you ASAP. Sorry again for the delay. Please let me know what I can do once the cluster is back, to speed up our investigation.

Thank you,
Misha

---

Tim McMullan:

Hi Misha,

No worries at all, I just wanted to check in! When the cluster is back up, if you are able to run with "debug2" for the slurmd and slurmctld, reproduce the issue, and attach those logs, I think that will be the most helpful for identifying the problem. Thanks again, and I'm glad you were able to fix the storage issues!

--Tim

---

Misha Ahmadian:

Hi Tim,

I just wanted to let you know that the cluster is back online, and we have had a couple of nodes fail due to "Kill task failed". I haven't enabled the debug2 settings on slurmd and slurmctld yet, for two reasons:

1) We're in the middle of adding more nodes to our support contract with SchedMD, and the "Kill task failed" errors have appeared on the new nodes as well.
2) Enabling debug2 generates a huge amount of logs, which prevents us from monitoring the scheduler at the moment.

I'll enable it as soon as the support contract for all the nodes goes through. The interesting thing about this "Kill task failed" is that it seems to only happen to specific jobs/users, but I have to verify that. I'll get back to you as soon as I have collected the logs for you. I'll also update the slurm.conf file to include the settings for the new nodes.

Best,
Misha

---

Tim McMullan:

Thanks for the update Misha, and glad to hear the cluster is back online!

It is interesting that it seems to just be for some specific users/job types. Let me know if that seems to pan out and we can look at it from that angle as well.

Thanks!
--Tim

---

Misha Ahmadian:

Hi Tim,

I've been watching the cluster all last week, but it looks like the "Kill task failed" doesn't want to show up anymore! This is very frustrating, since debug2 has been set to trap the errors. Since adding the new nodes to the support contract with SchedMD is still in progress and this issue has been hard to reproduce, I think it's better to close this ticket and reopen it (or create a new one) once everything is set. I really apologize; I didn't expect this issue to go this far without a good chance from our side to provide more information. I'll let you know as soon as everything is prepared, and I really appreciate your help and patience.

Best Regards,
Misha

---

Tim McMullan:

(In reply to Misha Ahmadian from comment #12)

Hi Misha,

No problem at all! If the issue crops up again, please do let us know and we can work through it. For now, I'm just glad to hear that the cluster is up and seems to be working. Thanks for the update!

--Tim