| Summary: | Job notification emails for end of job not getting sent after upgrade to 23.02.4 (from 23.02.2) | run_command doesn't wait for background processes | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Chris Samuel (NERSC) <csamuel> |
| Component: | slurmscriptd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | dmjacobsen, felip.moll, vinodkumar.tana |
| Version: | 23.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: |
https://bugs.schedmd.com/show_bug.cgi?id=17426 https://bugs.schedmd.com/show_bug.cgi?id=17123 https://bugs.schedmd.com/show_bug.cgi?id=16126 https://bugs.schedmd.com/show_bug.cgi?id=16143 |
||
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 23.02.5 23.11.0rc1 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | First draft of a patch to use setsid when sending email | ||
Did you update smail when you upgraded to 23.02.4? The fix was in smail. Do you have this change? https://github.com/SchedMD/slurm/commit/cae63c6356 Here's the strace with the sleep in place and a successful email being sent:
[pid 738798] execve("/usr/bin/seff", ["/usr/bin/seff", "397120"], 0x5623c34d3640 /* 31 vars */ <unfinished ...>
[pid 738799] execve("/bin/mail", ["/bin/mail", "-s", "muller Slurm Job_id=397120 Name=wrap Ended, Run time 00:00:01, COMPLETED, ExitCode 0", "csamuel@nersc.gov"], 0x5623c34d3640 /* 31 vars */ <unfinished ...>
[pid 738798] <... execve resumed>) = 0
[pid 738799] <... execve resumed>) = 0
[pid 738798] +++ exited with 0 +++
strace: Process 738801 attached
[pid 738801] execve("/usr/sbin/sendmail", ["sendmail", "-i", "--", "csamuel@nersc.gov"], 0x7fffef639ae0 /* 31 vars */ <unfinished ...>
[pid 738799] +++ exited with 0 +++
strace: Process 738802 attached
[pid 738801] <... execve resumed>) = 0
[pid 738802] execve("/usr/bin/sleep", ["sleep", "2"], 0x5623c34d3640 /* 31 vars */) = 0
[pid 738801] +++ exited with 0 +++
[pid 738802] +++ exited with 0 +++
[pid 738791] +++ exited with 0 +++
[pid 738790] kill(-738791, SIGTERM) = -1 ESRCH (No such process)
[pid 738790] kill(-738791, SIGKILL) = -1 ESRCH (No such process)
[pid 738790] +++ exited with 0 +++
Hi Marshall, Yup, can confirm this is from our 23.02.4 RPMs. slurmctld:/ # rpm -qf /usr/bin/smail slurm-contribs-23.02.4-20230810052645_muller_7e618125.x86_64 All the best, Chris (In reply to Chris Samuel (NERSC) from comment #0) > I did some debugging and found I could reproduce this on our test system and > with application of strace to slurmscriptd I can see slurmscriptd kill the > process group for the spawned smail command and that in turn kills off the > sendmail that /bin/mail has in turned spawned off to send the email before > it has a chance to complete. > > Sequence seems to be: > > * slurmscript launches smail > * smail pipes seff output into the mail command > * mail command runs sendmail in the background to send email > * mail command returns So, the mail command does not wait for sendmail to complete sending the email? > * smail returns > * slurmscriptd kills process group > * sendmail gets killed before transmitting the email > * user wonders what happened to their job Here's that code (with our sleep at the end):
# If we decide later to seff based on specific status codes,
# we can test against $status.
status=`echo "${array[2]}" | tr -d ' '`
if [ -n "$status" ]; then
sarray=(${array[0]})
IFS="="
if [ "${sarray[1]}" = "Array" ]; then
sarray=(${sarray[3]})
else
sarray=(${sarray[1]})
fi
IFS=" "
jobid="${sarray[1]}"
# Remove the trailing "_*" until seff supports array jobs fully
jobid=${jobid%"_*"}
sleep $JOB_COMPLETION_TIME
$SEFF $jobid | $MAIL -s "$subject" $recipient
else
$MAIL -s "$subject" $recipient
fi
sleep 2
The issue is that it's /bin/mail that's running sendmail in the background, smail can't influence that.
(In reply to Marshall Garey from comment #4) > So, the mail command does not wait for sendmail to complete sending the > email? That's right, it just fires and forgets.. (In reply to Chris Samuel (NERSC) from comment #6) > (In reply to Marshall Garey from comment #4) > > > So, the mail command does not wait for sendmail to complete sending the > > email? > > That's right, it just fires and forgets.. Good to know. The strange thing is that if the mail command does background the sendmail, then the sendmail process should be reparented to the init process. And then when Slurm calls killpg(), that should not kill the sendmail process if it has been reparented. So there's still something here that I do not understand. Why is sendmail not reparented to init? (In reply to Marshall Garey from comment #7) > So there's still something here that I do not understand. Why is sendmail > not reparented to init? Ah, this is inside a container in kubernetes, there is no init there. Or at least, pid 1 is just the shell script entry point for the container. All the best, Chris I did catch it with `ps` just now and this is what it looks like:
slurmctld:/ # ps x --forest
PID TTY STAT TIME COMMAND
737735 pts/2 Ss+ 0:00 bash
737708 pts/1 Ss 0:00 bash
749896 pts/1 R+ 0:00 \_ ps x --forest
430162 pts/0 Ss+ 0:00 bash
1 ? Ss 0:00 /bin/bash /root/run-nscd-and-daemon.sh /usr/sbin/slurmctld -Dvv
203 ? Ss 0:00 /usr/sbin/cron
477442 ? Sl 8:03 /usr/sbin/slurmctld -Dvv
477443 ? Sl 0:00 \_ slurmctld: slurmscriptd
749868 ? S 0:00 \_ /bin/bash /usr/bin/smail -s Slurm Job_id=397238 Name=wrap Began, Queued time 00:00:01 csamuel
749874 ? S 0:00 | \_ sleep 2
749891 ? S 0:00 \_ /bin/bash /usr/bin/smail -s Slurm Job_id=397238 Name=wrap Ended, Run time 00:00:00, COMPLETED, ExitCode 0 csamuel
749895 ? S 0:00 \_ sleep 5
749873 ? S 0:00 sendmail -i -- csamuel@nersc.gov
So that does look like it's getting reparented - and another test here indicates that yes, it's parent is PID 1 (the shell script):
slurmctld:/ # ps x -o pid,ppid,cmd
PID PPID CMD
1 0 /bin/bash /root/run-nscd-and-daemon.sh /usr/sbin/slurmctld -Dvv
203 1 /usr/sbin/cron
430162 0 bash
477442 1 /usr/sbin/slurmctld -Dvv
477443 477442 slurmctld: slurmscriptd
737708 0 bash
737735 0 bash
750554 477443 /bin/bash /usr/bin/smail -s Slurm Job_id=397252 Name=wrap Began, Queued time 00:00:01 csamuel
750559 1 sendmail -i -- csamuel@nersc.gov
750560 750554 sleep 2
750577 477443 /bin/bash /usr/bin/smail -s Slurm Job_id=397252 Name=wrap Ended, Run time 00:00:01, COMPLETED, ExitCode 0 csamuel
750581 750577 sleep 5
750582 737708 ps x -o pid,ppid,cmd
and we don't have any zombie processes left around after.
I forgot that when a process is reparented, its process group id does not change. That is why a backgrounded process still gets killed when you kill its process group. A simple example: $ cat asdf.sh #!/bin/bash ./mysleep.sh & sleep 5 $ cat mysleep.sh #!/bin/bash sleep 300 $ ./asdf.sh Running ps before the parent script exits: $ ps --forest -e -o s,sid,uid,pid,ppid,pgid,cmd|egrep -i "PPID|asdf|mysleep" S SID UID PID PPID PGID CMD ... S 29229 1000 43245 29229 43245 | \_ /bin/bash ./asdf.sh S 29229 1000 43246 43245 43245 | \_ /bin/bash ./mysleep.sh Running ps after the parent script exits: $ ps --forest -e -o s,sid,uid,pid,ppid,pgid,cmd|egrep -i "PPID|asdf|mysleep" S SID UID PID PPID PGID CMD ... S 29229 1000 43246 1 43245 /bin/bash ./mysleep.sh (In reply to Chris Samuel (NERSC) from comment #6) > (In reply to Marshall Garey from comment #4) > > > So, the mail command does not wait for sendmail to complete sending the > > email? > > That's right, it just fires and forgets.. Ugh. It seems that there is a long history of simply backgrounding a process, exiting, and allowing the kernel to reparent the process so that its new parent will reap it when it is done. But this is tricky when you want to responsibly clean up after yourself, which Slurm does want to do. Should Slurm clean up stray processes or not? How is Slurm supposed to know if a stray process is intentional or not? When Slurm runs a script, it opens a pipe (which is a file) to get the output of the script. When the script forks (or backgrounds) a process, that child process is holding the files of the parent. So when the parent exits, the file (pipe) is still open. So, if Slurm only relies on the state of the file to know if the process is done, then Slurm will wait for the backgrounded process to complete. This is why, prior to 23.02.3, Slurm actually waited for backgrounded processes in a script to complete before considering the script itself complete. This can sometimes be a bad thing. For example, in bug 16143 and bug 16126 comment 0 (another NERSC bug), Slurm was waiting for backgrounded processes of a prolog or epilog to complete, but something went wrong and those backgrounded processes were not completing. This resulted in jobs stuck in the completing state. This problem was fixed by commit 5a3f79271b in 23.02.3: https://github.com/SchedMD/slurm/commit/5a3f79271b This fix is simply to reap the script immediately when it exits, instead of waiting for the file to close. But now we see that because of commit 5a3f79271b, /bin/mail exits before actually sending the email. We could revert commit 5a3f79271b, but then would be left with the problem in bug 16143 and bug 16126 comment 0. (I know that bug 16126 has a couple of different issues in it.) We could make Slurm not kill the process group of scripts anymore. We could limit this to just specific script types (like the mail script). I don't love either of these solutions, so I'm trying to think of other options. But we might have to settle for whatever option is the least bad. (In reply to Marshall Garey from comment #10) > I forgot that when a process is reparented, its process group id does not > change. That is why a backgrounded process still gets killed when you kill > its process group. Oh that jogged a memory! Excuse my hand nroff'ing. SETSID(1) User Commands SETSID(1) NAME setsid - run a program in a new session SYNOPSIS setsid [options] program [arguments] So I tried this change: #$SEFF $jobid | $MAIL -s "$subject" $recipient $SEFF $jobid | /usr/bin/setsid -w $MAIL -s "$subject" $recipient else /usr/bin/setsid -w $MAIL -s "$subject" $recipient and this appears to work! The -w option tells setsid to wait and return the same exit code that the command that was executed returned. [pid 753946] execve("/usr/bin/seff", ["/usr/bin/seff", "397272"], 0x564266ad8640 /* 31 vars */ <unfinished ...> [pid 753947] execve("/usr/bin/setsid", ["/usr/bin/setsid", "-w", "/bin/mail", "-s", "muller Slurm Job_id=397272 Name=wrap Ended, Run time 00:00:01, COMPLETED, ExitCode 0", "csamuel@nersc.gov"], 0x564266ad8640 /* 31 vars */ <unfinished ...> [pid 753946] <... execve resumed>) = 0 [pid 753947] <... execve resumed>) = 0 [pid 753947] execve("/bin/mail", ["/bin/mail", "-s", "muller Slurm Job_id=397272 Name=wrap Ended, Run time 00:00:01, COMPLETED, ExitCode 0", "csamuel@nersc.gov"], 0x7ffe652574d0 /* 31 vars */) = 0 [pid 753946] +++ exited with 0 +++ strace: Process 753949 attached [pid 753949] execve("/usr/sbin/sendmail", ["sendmail", "-i", "--", "csamuel@nersc.gov"], 0x7ffd11728700 /* 31 vars */ <unfinished ...> [pid 753947] +++ exited with 0 +++ [pid 753949] <... execve resumed>) = 0 [pid 753934] +++ exited with 0 +++ [pid 753933] kill(-753934, SIGTERM) = -1 ESRCH (No such process) [pid 753933] kill(-753934, SIGKILL) = -1 ESRCH (No such process) [pid 753933] +++ exited with 0 +++ [pid 753949] +++ exited with 0 +++ ...and I do get the emails I expect. In fact first time around I fixed the seff|mail one and forgot the second and so I got the end of job email but not the start of job, which does strongly imply to me that this does what is necessary. What do you think? All the best! Chris Also handily it appears that `setsid` is in a frequently used RPM (at least on SLES): slurmctld:~ # rpm -qf /usr/bin/setsid util-linux-2.36.2-150300.4.28.1.x86_64 That includes things like "more", "mount", "script", "renice" etc so I think it's likely a safe bet to assume will be present? Created attachment 31870 [details]
First draft of a patch to use setsid when sending email
I think that's a promising approach, though we will still discuss the other problems internally. I am contemplating how to document this behavior of Slurm killing a script's process group when the primary script exits. We began discussing this in 17123 but never actually made documentation changes. Here's some more background about this change: Arguably this was a change in behavior that maybe should have waited for a major release of Slurm. At time I was focused on the bug that it fixed, and I naively thought that it was a fairly harmless change. However, this behavior was inconsistent. The script could close its stdout and stderr file descriptors, in which case Slurm immediately killed the process group after the primary script process completed. Although this is uncommon, I have seen this in a customer script with the exec command (redirected stdout and stderr to a file and closed them in the process). The change made the behavior consistent, regardless of how the script treated stdout and stderr. Another problem: if someone sets MailProg=/bin/mail, then they'll just most likely run into this problem. And /bin/mail is the default value. So although I like a fix in smail, we may still need to do something on the Slurm side of things. Hey Marshall, Thanks so much for the background and a good catch about it not helping the default case! That is indeed a tricky one to think about and I can understand why having consistent behaviour here is good. FWIW I can confirm that having built new RPMs with that patch that at least for us smail now works as expected, and I've just put that same change on to Perlmutter by copying the script over. Thanks so much again! All the best, Chris Because we cannot control how all the scripts work and people's expectations for them, we have to wait for all processes in the script's process group to complete. We reverted the change in commit 790b4a4738 ahead of 23.02.5. Slurm will wait for all processes in the script's process group to complete, as long as at least one of them is still holding the stdout or stderr file descriptor open. In a future release of Slurm, we will look at making this consistent so that Slurm does not depend on the script holding the stdout or stderr file descriptor open to wait for the script's processes to complete. I'm closing this as fixed in 23.02.5. *** Ticket 17690 has been marked as a duplicate of this ticket. *** |
Hi there, Not sure if this is a duplicate of #17123 (which is closed at present), so opening a new bug in case it's different. In the last maintenance on Perlmutter (last Wednesda) from 23.02.2 to 23.02.4 users started complaining about not getting emails about job completions. I did some debugging and found I could reproduce this on our test system and with application of strace to slurmscriptd I can see slurmscriptd kill the process group for the spawned smail command and that in turn kills off the sendmail that /bin/mail has in turned spawned off to send the email before it has a chance to complete. Sequence seems to be: * slurmscript launches smail * smail pipes seff output into the mail command * mail command runs sendmail in the background to send email * mail command returns * smail returns * slurmscriptd kills process group * sendmail gets killed before transmitting the email * user wonders what happened to their job Here's an example strace showing sendmail getting killed: [pid 738005] execve("/bin/mail", ["/bin/mail", "-s", "muller Slurm Job_id=397103 Name=wrap Ended, Run time 00:00:01, COMPLETED, ExitCode 0", "csamuel@nersc.gov"], 0x55eac8a87640 /* 31 vars */ <unfinished ...> [pid 738004] <... execve resumed>) = 0 [pid 738005] <... execve resumed>) = 0 [pid 738004] +++ exited with 0 +++ strace: Process 738007 attached [pid 738007] execve("/usr/sbin/sendmail", ["sendmail", "-i", "--", "csamuel@nersc.gov"], 0x7ffcd46a7760 /* 31 vars */ <unfinished ...> [pid 738005] +++ exited with 0 +++ [pid 738007] <... execve resumed>) = 0 [pid 737997] +++ exited with 0 +++ [pid 737996] kill(-737997, SIGTERM) = 0 [pid 738007] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=477443, si_uid=0} --- [pid 738007] +++ killed by SIGTERM +++ [pid 737996] kill(-737997, SIGKILL) = -1 ESRCH (No such process) [pid 737996] +++ exited with 0 +++ Adding a short "sleep 2" to the end of smail does seem to have at least worked around this, but I wanted to let you know that this is still an issue in 23.02.4. All the best, Chris