| Summary: | Prolog stuck in completing state | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Institut Pasteur HPC Admin <hpc> |
| Component: | slurmd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 22.05.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=16126 https://bugs.schedmd.com/show_bug.cgi?id=17481 | | |
| Site: | Institut Pasteur | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 23.02.3 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurmctld.log, slurm.conf, prolog script, slurmd log of maestro-1032, Backtrace_slurmd | | |
Description
Institut Pasteur HPC Admin
2023-02-27 15:43:36 MST
Created attachment 29071 [details]
slurm.conf
Created attachment 29072 [details]
prolog script
Created attachment 29073 [details]
slurmd log of maestro-1032
Hi,

Thanks for your research and the logs and information you posted. I believe this is a duplicate of bug 16126; I posted a comment on that bug. Feel free to CC yourself on that ticket. I'll keep this ticket open until we are sure that it is a duplicate and you have a resolution.

Are you able to reliably reproduce this issue? I uploaded a patch to bug 16126 - would you be interested in trying it on one slurmd?

Hi,

I am not able to reproduce the issue in our dev environment. The issue seems to occur randomly in our production environment, so to see whether the patch is successful I would need to patch the entire production environment, and at the moment I am reluctant to patch the whole cluster. I will keep watching the other ticket; depending on how it turns out, I will decide whether or not to apply the patch.

Thanks for your understanding,
Brice

Hi,

I have not yet gotten feedback from the site in bug 16126, but I wanted to get some more information from you. Are you using any spank plugins?

Hi,

We don't use any spank plugins. For your information, for the past 3 days we have had some jobs blocked as in the initial issue. This time, it seems they are not requeued.

Brice

Can you upload a backtrace of the stuck slurmd process?

Created attachment 30602 [details]
Backtrace_slurmd
I don't know if it can help you, but all of those jobs are stuck in the completing state:
25785880
25785870
25785865
25785864
25785863
25785862
25785861
25785859
25785858
25785857
25785855
Brice
Are there any slurmd processes that are a child of the main slurmd process? Can you get a backtrace of one of those, too?

That does not seem to be the case; the main slurmd process is 12800:
pstree -p 12800 -l
slurmd(12800)─┬─gpu_accounting_(722986)
[...]
├─gpu_accounting_(723869)
├─{slurmd}(722105)
[...]
└─{slurmd}(847476)
Or am I missing something?
Brice
I agree, it looks like those are just threads of the main slurmd process. Are there any defunct processes on the compute node?

Would you be interested in setting PrologEpilogTimeout in slurm.conf? This should kill the prolog/epilog process when the timeout is reached, to prevent nodes from being stuck indefinitely. I would also like to see whether that actually cleans up all of the prolog/epilog processes.

All the "gpu_accounting_" processes are defunct, in zombie state.
We already set a value for this parameter:
>PrologEpilogTimeout = 65534
Brice
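As an aside, the "defunct" state described above is easy to reproduce outside of Slurm. The sketch below (illustrative only, not Slurm source) forks a child that exits immediately; until the parent calls waitpid(), the child lingers as a zombie, and a non-blocking waitpid(WNOHANG) reaps it at once — the behavior the patch discussed here gives slurmd for prolog/epilog children. It assumes a Linux /proc filesystem.

```python
import os
import time

# Illustrative sketch, not Slurm code: how a prolog/epilog child ends up
# "defunct", and how a non-blocking waitpid() reaps it immediately.
pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits right away

time.sleep(0.2)          # parent has not wait()ed yet -> child is a zombie

with open(f"/proc/{pid}/stat") as f:
    # the state letter is the first field after the parenthesized comm
    state = f.read().rsplit(")", 1)[1].split()[0]
print("state before reaping:", state)       # 'Z' means defunct on Linux

reaped, _status = os.waitpid(pid, os.WNOHANG)   # returns without blocking
print("reaped:", reaped == pid)
```

Once every child has been reaped this way, no "defunct" entries remain in ps/pstree output.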
(In reply to Institut Pasteur HPC Admin from comment #14)
> All the "gpu_accounting_" processes are defunct and zombie state.

Considering this, would you be interested in trying a patch? This patch fixes a case where slurmd waits for already-defunct processes until they time out and are killed. Considering that you have PrologEpilogTimeout set to 65534, it will be such a long time (about 18 hours) until they time out that it might as well not be set at all. This 18-hour time is similar to the >15 hours that you mentioned in comment 0. This patch makes slurmd immediately reap defunct prolog/epilog processes: attachment 29142 [details]

> We already set a value to this parameter :
>
> >PrologEpilogTimeout = 65534
>
> Brice

I meant, would you be interested in setting PrologEpilogTimeout such that stuck prologs/epilogs will actually time out in a reasonable amount of time? Or, if you would rather try the above patch, then apply that patch to all of the slurmd's.

Hi,
This time the completing state seems to never end.
sacct -j 25785880 --format=submit,start,end,state
Submit Start End State
------------------- ------------------- ------------------- ----------
2023-05-30T02:28:19 2023-05-30T02:28:23 2023-05-31T10:49:24 CANCELLED
2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49 CANCELLED
2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49 CANCELLED
I didn't check this before: I used 'scontrol show config' and didn't check in the slurm.conf, but the value of "PrologEpilogTimeout" is the default one. The documentation says:
>The default behavior is to wait indefinitely
I will try setting a small value for PrologEpilogTimeout, like 10 minutes, and check what the new behavior is before thinking about a patch.
Do I only need to do a "scontrol reconfigure" to change the value properly?
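For reference, the 10-minute timeout being considered would look like this in slurm.conf (the parameter takes seconds; the value here is illustrative):

```
# slurm.conf -- kill stuck prolog/epilog scripts after 10 minutes
PrologEpilogTimeout=600
```

followed by `scontrol reconfigure` on the controller to pick up the change.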
(In reply to Institut Pasteur HPC Admin from comment #16)
> Hi,
>
> This time the completing state seems to never end.
>
> sacct -j 25785880 --format=submit,start,end,state
> Submit Start End State
> ------------------- ------------------- ------------------- ----------
> 2023-05-30T02:28:19 2023-05-30T02:28:23 2023-05-31T10:49:24 CANCELLED
> 2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49 CANCELLED
> 2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49 CANCELLED
>
> I didn't check this before: I used 'scontrol show config' and didn't check
> in the slurm.conf, but the value of "PrologEpilogTimeout" is the default
> one. The documentation says:
> >The default behavior is to wait indefinitely

My mistake: I misremembered the default value (65534). I thought it was 65535 or zero.

> I will try setting a small value for PrologEpilogTimeout, like 10 minutes,
> and check what the new behavior is before thinking about a patch.
>
> Do I only need to do a "scontrol reconfigure" to change the value properly?

scontrol reconfigure is sufficient.

We pushed commit 5a3f79271b upstream - it will be part of 23.02.3. It fixes waiting indefinitely for defunct script processes. This is the same patch as attachment 29142 [details], except it is written for Slurm 23.02. I think that this will solve the defunct-script-process issue that you are seeing. You can apply this patch in 22.05 (attachment 29142 [details]) or upgrade to 23.02.3 after it is released.

Are you interested in testing the patch in 22.05? If not, then I can resolve this bug as fixed, assuming that the commit does fix the issue. You can always reopen the bug if it is not solved after upgrading.

Hi,

We will wait for the next Slurm upgrade to test the fix.

Thanks for your support,
Brice

We had to revert this change in commit 790b4a4738 ahead of 23.02.5. See bug 17481 for details. In short, Slurm has to wait for all processes in the script's process group to complete. This places an increased importance on setting timeout values for scripts, such as PrologEpilogTimeout (which by default has an unlimited timeout).

Hi,

I understand the reason for reverting the commit. But I didn't notice that the initial state (before requeue) of the job was "NODE_FAILURE"; I am wondering if you are still thinking about a fix for my initial ticket? I know that setting a small value for the parameter "PrologEpilogTimeout" will speed up the requeue, but it is still not clean to get a "NODE_FAILURE" state for a job. Do you recommend we do something to avoid the race condition?

Thanks for your support,
Brice

(In reply to Institut Pasteur HPC Admin from comment #25)
> But I didn't notice that the initial state (before requeue) of the job was
> "NODE_FAILURE"; I am wondering if you are still thinking about a fix for my
> initial ticket?

I don't see anything about NODE_FAILURE in this ticket except for this comment. What ticket are you referring to? If the issue is separate from the "prolog stuck in completing state" issue in comment 0, open a new ticket.
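The revert rationale above — that Slurm must wait for every process in the script's process group, not just the direct child — can be sketched outside of Slurm as follows. This is an illustrative example, not Slurm source: a script launched in its own process group may leave grandchildren behind, and signaling the whole group with killpg() is what catches them.

```python
import os
import signal
import subprocess

# Illustrative sketch, not Slurm code: start a stand-in "prolog" in its
# own process group (os.setsid), then terminate the entire group at
# once. Reaping only the direct child would leave any grandchildren in
# the same group running, which is why the earlier patch was reverted.
proc = subprocess.Popen(["sleep", "30"], preexec_fn=os.setsid)  # new group
pgid = os.getpgid(proc.pid)

os.killpg(pgid, signal.SIGTERM)    # signal every process in the group
proc.wait()
print("terminated by SIGTERM:", proc.returncode == -signal.SIGTERM)
```

With PrologEpilogTimeout set, this group-wide kill is what Slurm performs when the timeout expires, so choosing a reasonable value keeps nodes from sitting in COMPLETING indefinitely.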