| Summary: | Job termination trouble | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Josko Plazonic <plazonic> |
| Component: | slurmstepd | Assignee: | Carlos Tripiana Montes <tripiana> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bart, csc-slurm-tickets, kilian, marshall |
| Version: | 21.08.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Princeton (PICSciE) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Josko Plazonic
2022-04-14 10:22:38 MDT
Would you please attach your slurm.conf and cgroup.conf, and tell us which flavor of Linux your site uses for compute nodes?

"Kill task failed" normally happens when a task cannot be killed with signal 9 (SIGKILL). These tasks may be hanging on I/O. Slurm provides two options for gathering diagnostics after a timeout period: UnkillableStepProgram and UnkillableStepTimeout. The program is something your site defines; it is used to gather output from, say, dmesg or the mount table, or to check whether any process is hung in the defunct state. Nodes that show these errors should be considered unclean and rebooted.

https://slurm.schedmd.com/slurm.conf.html#SECTION_UNKILLABLE-STEP-PROGRAM-SCRIPT
https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepTimeout

I am not sure that what you are saying makes much sense in this context. It is failing to read the cgroup.procs file, which does not exist at that point. No extra timeout will help here. If anything, this looks like some kind of timing/synchronization problem, and it seems to me that this case should be treated as a non-fatal error. If the file is missing, then that cgroup was probably destroyed in the meantime.

> I am not sure that what you are saying makes much sense in this context. It is
> failing to read cgroup.procs file - it does not exist at that point. No extra
> timeout will help here. If anything this looks like some kind of a
> timing/synchronization problem and it looks to me like it should treat this case
> as a non fatal error. If the file is missing then that cgroup was probably
> destroyed in the meantime.
There are a few recent fixes that may address the "timing/synchronization problem"
and I will have Carlos look into this for you.
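
The behavior the reporter asks for, treating a vanished cgroup.procs as "nothing left to kill" rather than as a failure, can be illustrated with a minimal Python sketch. This is not slurmstepd's code (slurmstepd is written in C) and it does not reproduce the linked fixes; the function name `read_cgroup_pids` and the choice to return an empty list are assumptions made purely for illustration.

```python
import os


def read_cgroup_pids(cgroup_dir):
    """Return the PIDs listed in <cgroup_dir>/cgroup.procs.

    Hypothetical helper: a missing file is treated as an already-removed
    cgroup and reported as an empty PID list instead of raising.
    """
    path = os.path.join(cgroup_dir, "cgroup.procs")
    try:
        with open(path) as fh:
            # One PID per line; skip blank lines defensively.
            return [int(line) for line in fh if line.strip()]
    except FileNotFoundError:
        # Assumption: the cgroup was destroyed between the caller's
        # decision to read it and the open() call, so nothing is left.
        return []
```

A caller that loops "read PIDs, send SIGKILL, repeat until empty" would then terminate cleanly when the cgroup disappears mid-loop, which is the race described in the quoted text.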
Josko,

I have a couple of commits that address potential causes of this:

https://github.com/SchedMD/slurm/commit/1ddef9a0dd8
https://github.com/schedMD/slurm/commit/91bd26c4817

I'd encourage you to cherry-pick those and see whether the issue is still reproducible with them applied. If you still have the same issue, please let me know and we'll ask you for additional information so we can better understand this flaw (if it's truly new).

Cheers,
Carlos.

Josko,

Have you had time to take a look at this?

Thanks,
Carlos.

Josko,

We need to close the issue as "info given" for now, assuming the information provided in Comment 6 was enough. Please let us know if you need further assistance.

Regards,
Carlos.

Closing now. Please reopen if needed.
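
For sites that do want to follow the earlier UnkillableStepProgram suggestion, the diagnostic program might include a check like the sketch below. It scans a Linux /proc tree for processes in uninterruptible sleep (state D, typically blocked on I/O) or in the defunct/zombie state (Z). The function name `find_stuck_processes` and the tuple layout are hypothetical; nothing here comes from Slurm itself, and a real site script would likely also capture dmesg and mount output.

```python
import os


def find_stuck_processes(proc_root="/proc"):
    """Return sorted (pid, comm, state) tuples for processes in state D or Z.

    Hypothetical helper for a site-defined diagnostic script; proc_root is
    a parameter only so the function can be pointed at a test directory.
    """
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
        except OSError:
            continue  # process exited while we were scanning
        # The command name is wrapped in parentheses and may itself contain
        # spaces or ')', so split on the last ')' rather than on whitespace.
        end = stat.rfind(")")
        comm = stat[stat.find("(") + 1:end]
        fields = stat[end + 1:].split()
        if fields and fields[0] in ("D", "Z"):
            stuck.append((int(entry), comm, fields[0]))
    return sorted(stuck)
```

A wrapper script could print these tuples to a log file and be referenced from slurm.conf through UnkillableStepProgram, with UnkillableStepTimeout controlling how long Slurm waits before running it; the script path and log location are site choices.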