Ticket 17481

Summary:	Job notification emails for end of job not getting sent after upgrade to 23.02.4 (from 23.02.2) \| run_command doesn't wait for background processes
Product:	Slurm	Reporter:	Chris Samuel (NERSC) <csamuel>
Component:	slurmscriptd	Assignee:	Marshall Garey <marshall>
Status:	RESOLVED FIXED	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	dmjacobsen, felip.moll, vinodkumar.tana
Version:	23.02.4
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=17426 https://bugs.schedmd.com/show_bug.cgi?id=17123 https://bugs.schedmd.com/show_bug.cgi?id=16126 https://bugs.schedmd.com/show_bug.cgi?id=16143
Site:	NERSC	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	23.02.5 23.11.0rc1
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	First draft of a patch to use setsid when sending email

Description Chris Samuel (NERSC) 2023-08-21 13:29:44 MDT

Hi there,

Not sure if this is a duplicate of #17123 (which is closed at present), so opening a new bug in case it's different.

In the last maintenance on Perlmutter (last Wednesda) from 23.02.2 to 23.02.4 users started complaining about not getting emails about job completions.

I did some debugging and found I could reproduce this on our test system and with application of strace to slurmscriptd I can see slurmscriptd kill the process group for the spawned smail command and that in turn kills off the sendmail that /bin/mail has in turned spawned off to send the email before it has a chance to complete.

Sequence seems to be:

* slurmscript launches smail
* smail pipes seff output into the mail command
* mail command runs sendmail in the background to send email
* mail command returns
* smail returns
* slurmscriptd kills process group
* sendmail gets killed before transmitting the email
* user wonders what happened to their job

Here's an example strace showing sendmail getting killed:

[pid 738005] execve("/bin/mail", ["/bin/mail", "-s", "muller Slurm Job_id=397103 Name=wrap Ended, Run time 00:00:01, COMPLETED, ExitCode 0", "csamuel@nersc.gov"], 0x55eac8a87640 /* 31 vars */ <unfinished ...>
[pid 738004] <... execve resumed>)      = 0
[pid 738005] <... execve resumed>)      = 0
[pid 738004] +++ exited with 0 +++
strace: Process 738007 attached
[pid 738007] execve("/usr/sbin/sendmail", ["sendmail", "-i", "--", "csamuel@nersc.gov"], 0x7ffcd46a7760 /* 31 vars */ <unfinished ...>
[pid 738005] +++ exited with 0 +++
[pid 738007] <... execve resumed>)      = 0
[pid 737997] +++ exited with 0 +++
[pid 737996] kill(-737997, SIGTERM)     = 0
[pid 738007] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=477443, si_uid=0} ---
[pid 738007] +++ killed by SIGTERM +++
[pid 737996] kill(-737997, SIGKILL)     = -1 ESRCH (No such process)
[pid 737996] +++ exited with 0 +++


Adding a short "sleep 2" to the end of smail does seem to have at least worked around this, but I wanted to let you know that this is still an issue in 23.02.4.

All the best,
Chris

Comment 1 Marshall Garey 2023-08-21 13:32:40 MDT

Did you update smail when you upgraded to 23.02.4? The fix was in smail. Do you have this change?

https://github.com/SchedMD/slurm/commit/cae63c6356

Comment 2 Chris Samuel (NERSC) 2023-08-21 13:33:40 MDT

Here's the strace with the sleep in place and a successful email being sent:

[pid 738798] execve("/usr/bin/seff", ["/usr/bin/seff", "397120"], 0x5623c34d3640 /* 31 vars */ <unfinished ...>
[pid 738799] execve("/bin/mail", ["/bin/mail", "-s", "muller Slurm Job_id=397120 Name=wrap Ended, Run time 00:00:01, COMPLETED, ExitCode 0", "csamuel@nersc.gov"], 0x5623c34d3640 /* 31 vars */ <unfinished ...>
[pid 738798] <... execve resumed>)      = 0
[pid 738799] <... execve resumed>)      = 0
[pid 738798] +++ exited with 0 +++
strace: Process 738801 attached
[pid 738801] execve("/usr/sbin/sendmail", ["sendmail", "-i", "--", "csamuel@nersc.gov"], 0x7fffef639ae0 /* 31 vars */ <unfinished ...>
[pid 738799] +++ exited with 0 +++
strace: Process 738802 attached
[pid 738801] <... execve resumed>)      = 0
[pid 738802] execve("/usr/bin/sleep", ["sleep", "2"], 0x5623c34d3640 /* 31 vars */) = 0
[pid 738801] +++ exited with 0 +++
[pid 738802] +++ exited with 0 +++
[pid 738791] +++ exited with 0 +++
[pid 738790] kill(-738791, SIGTERM)     = -1 ESRCH (No such process)
[pid 738790] kill(-738791, SIGKILL)     = -1 ESRCH (No such process)
[pid 738790] +++ exited with 0 +++

Comment 3 Chris Samuel (NERSC) 2023-08-21 13:34:26 MDT

Hi Marshall,

Yup, can confirm this is from our 23.02.4 RPMs.

slurmctld:/ # rpm -qf /usr/bin/smail
slurm-contribs-23.02.4-20230810052645_muller_7e618125.x86_64

All the best,
Chris

Comment 4 Marshall Garey 2023-08-21 13:35:12 MDT

(In reply to Chris Samuel (NERSC) from comment #0)
> I did some debugging and found I could reproduce this on our test system and
> with application of strace to slurmscriptd I can see slurmscriptd kill the
> process group for the spawned smail command and that in turn kills off the
> sendmail that /bin/mail has in turned spawned off to send the email before
> it has a chance to complete.
> 
> Sequence seems to be:
> 
> * slurmscript launches smail
> * smail pipes seff output into the mail command
> * mail command runs sendmail in the background to send email
> * mail command returns

So, the mail command does not wait for sendmail to complete sending the email?

> * smail returns
> * slurmscriptd kills process group
> * sendmail gets killed before transmitting the email
> * user wonders what happened to their job

Comment 5 Chris Samuel (NERSC) 2023-08-21 13:35:53 MDT

Here's that code (with our sleep at the end):

# If we decide later to seff based on specific status codes, 
# we can test against $status.
status=`echo "${array[2]}" | tr -d ' '`
if [ -n "$status" ]; then
    sarray=(${array[0]})
    IFS="="
    if [ "${sarray[1]}" = "Array" ]; then
        sarray=(${sarray[3]})
    else
        sarray=(${sarray[1]})
    fi
    IFS=" "
    jobid="${sarray[1]}"
    # Remove the trailing "_*" until seff supports array jobs fully
    jobid=${jobid%"_*"}
    sleep $JOB_COMPLETION_TIME
    $SEFF $jobid | $MAIL -s "$subject" $recipient
else
    $MAIL -s "$subject" $recipient
fi

sleep 2

The issue is that it's /bin/mail that's running sendmail in the background, smail can't influence that.

Comment 6 Chris Samuel (NERSC) 2023-08-21 13:36:33 MDT

(In reply to Marshall Garey from comment #4)

> So, the mail command does not wait for sendmail to complete sending the
> email?

That's right, it just fires and forgets..

Comment 7 Marshall Garey 2023-08-21 14:54:25 MDT

(In reply to Chris Samuel (NERSC) from comment #6)
> (In reply to Marshall Garey from comment #4)
> 
> > So, the mail command does not wait for sendmail to complete sending the
> > email?
> 
> That's right, it just fires and forgets..

Good to know. The strange thing is that if the mail command does background the sendmail, then the sendmail process should be reparented to the init process. And then when Slurm calls killpg(), that should not kill the sendmail process if it has been reparented.

So there's still something here that I do not understand. Why is sendmail not reparented to init?

Comment 8 Chris Samuel (NERSC) 2023-08-21 15:01:57 MDT

(In reply to Marshall Garey from comment #7)

> So there's still something here that I do not understand. Why is sendmail
> not reparented to init?

Ah, this is inside a container in kubernetes, there is no init there.

Or at least, pid 1 is just the shell script entry point for the container.

All the best,
Chris

Comment 9 Chris Samuel (NERSC) 2023-08-21 16:05:14 MDT

I did catch it with `ps` just now and this is what it looks like:

slurmctld:/ # ps x --forest
    PID TTY      STAT   TIME COMMAND
 737735 pts/2    Ss+    0:00 bash
 737708 pts/1    Ss     0:00 bash
 749896 pts/1    R+     0:00  \_ ps x --forest
 430162 pts/0    Ss+    0:00 bash
      1 ?        Ss     0:00 /bin/bash /root/run-nscd-and-daemon.sh /usr/sbin/slurmctld -Dvv
    203 ?        Ss     0:00 /usr/sbin/cron
 477442 ?        Sl     8:03 /usr/sbin/slurmctld -Dvv
 477443 ?        Sl     0:00  \_ slurmctld: slurmscriptd
 749868 ?        S      0:00      \_ /bin/bash /usr/bin/smail -s Slurm Job_id=397238 Name=wrap Began, Queued time 00:00:01 csamuel
 749874 ?        S      0:00      |   \_ sleep 2
 749891 ?        S      0:00      \_ /bin/bash /usr/bin/smail -s Slurm Job_id=397238 Name=wrap Ended, Run time 00:00:00, COMPLETED, ExitCode 0 csamuel
 749895 ?        S      0:00          \_ sleep 5
 749873 ?        S      0:00 sendmail -i -- csamuel@nersc.gov

So that does look like it's getting reparented - and another test here indicates that yes, it's parent is PID 1 (the shell script):

slurmctld:/ # ps x -o pid,ppid,cmd
    PID    PPID CMD
      1       0 /bin/bash /root/run-nscd-and-daemon.sh /usr/sbin/slurmctld -Dvv
    203       1 /usr/sbin/cron
 430162       0 bash
 477442       1 /usr/sbin/slurmctld -Dvv
 477443  477442 slurmctld: slurmscriptd
 737708       0 bash
 737735       0 bash
 750554  477443 /bin/bash /usr/bin/smail -s Slurm Job_id=397252 Name=wrap Began, Queued time 00:00:01 csamuel
 750559       1 sendmail -i -- csamuel@nersc.gov
 750560  750554 sleep 2
 750577  477443 /bin/bash /usr/bin/smail -s Slurm Job_id=397252 Name=wrap Ended, Run time 00:00:01, COMPLETED, ExitCode 0 csamuel
 750581  750577 sleep 5
 750582  737708 ps x -o pid,ppid,cmd

and we don't have any zombie processes left around after.

Comment 10 Marshall Garey 2023-08-21 16:32:27 MDT

I forgot that when a process is reparented, its process group id does not change. That is why a backgrounded process still gets killed when you kill its process group.

A simple example:

$ cat asdf.sh
#!/bin/bash
./mysleep.sh &
sleep 5

$ cat mysleep.sh 
#!/bin/bash
sleep 300

$ ./asdf.sh

Running ps before the parent script exits:

$ ps --forest -e -o s,sid,uid,pid,ppid,pgid,cmd|egrep -i "PPID|asdf|mysleep"
S     SID   UID     PID    PPID    PGID CMD
...
S   29229  1000   43245   29229   43245              |           \_ /bin/bash ./asdf.sh
S   29229  1000   43246   43245   43245              |               \_ /bin/bash ./mysleep.sh


Running ps after the parent script exits:

$ ps --forest -e -o s,sid,uid,pid,ppid,pgid,cmd|egrep -i "PPID|asdf|mysleep"
S     SID   UID     PID    PPID    PGID CMD
...
S   29229  1000   43246       1   43245 /bin/bash ./mysleep.sh

Comment 11 Marshall Garey 2023-08-21 16:42:33 MDT

(In reply to Chris Samuel (NERSC) from comment #6)
> (In reply to Marshall Garey from comment #4)
> 
> > So, the mail command does not wait for sendmail to complete sending the
> > email?
> 
> That's right, it just fires and forgets..

Ugh. It seems that there is a long history of simply backgrounding a process, exiting, and allowing the kernel to reparent the process so that its new parent will reap it when it is done. But this is tricky when you want to responsibly clean up after yourself, which Slurm does want to do.

Should Slurm clean up stray processes or not? How is Slurm supposed to know if a stray process is intentional or not?

When Slurm runs a script, it opens a pipe (which is a file) to get the output of the script. When the script forks (or backgrounds) a process, that child process is holding the files of the parent. So when the parent exits, the file (pipe) is still open. So, if Slurm only relies on the state of the file to know if the process is done, then Slurm will wait for the backgrounded process to complete. This is why, prior to 23.02.3, Slurm actually waited for backgrounded processes in a script to complete before considering the script itself complete.

This can sometimes be a bad thing. For example, in bug 16143 and bug 16126 comment 0 (another NERSC bug), Slurm was waiting for backgrounded processes of a prolog or epilog to complete, but something went wrong and those backgrounded processes were not completing. This resulted in jobs stuck in the completing state.

This problem was fixed by commit 5a3f79271b in 23.02.3:

https://github.com/SchedMD/slurm/commit/5a3f79271b

This fix is simply to reap the script immediately when it exits, instead of waiting for the file to close.

But now we see that because of commit 5a3f79271b, /bin/mail exits before actually sending the email.


We could revert commit 5a3f79271b, but then would be left with the problem in bug 16143 and bug 16126 comment 0. (I know that bug 16126 has a couple of different issues in it.)

We could make Slurm not kill the process group of scripts anymore. We could limit this to just specific script types (like the mail script).

I don't love either of these solutions, so I'm trying to think of other options. But we might have to settle for whatever option is the least bad.

Comment 12 Chris Samuel (NERSC) 2023-08-21 16:50:18 MDT

(In reply to Marshall Garey from comment #10)

> I forgot that when a process is reparented, its process group id does not
> change. That is why a backgrounded process still gets killed when you kill
> its process group.

Oh that jogged a memory!  Excuse my hand nroff'ing.

SETSID(1)                  User Commands                  SETSID(1)

NAME
       setsid - run a program in a new session

SYNOPSIS
       setsid [options] program [arguments]


So I tried this change:

    #$SEFF $jobid | $MAIL -s "$subject" $recipient
    $SEFF $jobid | /usr/bin/setsid -w $MAIL -s "$subject" $recipient
else
    /usr/bin/setsid -w $MAIL -s "$subject" $recipient

and this appears to work!  The -w option tells setsid to wait and return the same exit code that the command that was executed returned.

[pid 753946] execve("/usr/bin/seff", ["/usr/bin/seff", "397272"], 0x564266ad8640 /* 31 vars */ <unfinished ...>
[pid 753947] execve("/usr/bin/setsid", ["/usr/bin/setsid", "-w", "/bin/mail", "-s", "muller Slurm Job_id=397272 Name=wrap Ended, Run time 00:00:01, COMPLETED, ExitCode 0", "csamuel@nersc.gov"], 0x564266ad8640 /* 31 vars */ <unfinished ...>
[pid 753946] <... execve resumed>)      = 0
[pid 753947] <... execve resumed>)      = 0
[pid 753947] execve("/bin/mail", ["/bin/mail", "-s", "muller Slurm Job_id=397272 Name=wrap Ended, Run time 00:00:01, COMPLETED, ExitCode 0", "csamuel@nersc.gov"], 0x7ffe652574d0 /* 31 vars */) = 0
[pid 753946] +++ exited with 0 +++
strace: Process 753949 attached
[pid 753949] execve("/usr/sbin/sendmail", ["sendmail", "-i", "--", "csamuel@nersc.gov"], 0x7ffd11728700 /* 31 vars */ <unfinished ...>
[pid 753947] +++ exited with 0 +++
[pid 753949] <... execve resumed>)      = 0
[pid 753934] +++ exited with 0 +++
[pid 753933] kill(-753934, SIGTERM)     = -1 ESRCH (No such process)
[pid 753933] kill(-753934, SIGKILL)     = -1 ESRCH (No such process)
[pid 753933] +++ exited with 0 +++
[pid 753949] +++ exited with 0 +++


...and I do get the emails I expect.

In fact first time around I fixed the seff|mail one and forgot the second and so I got the end of job email but not the start of job, which does strongly imply to me that this does what is necessary.

What do you think?

All the best!
Chris

Comment 13 Chris Samuel (NERSC) 2023-08-21 17:08:27 MDT

Also handily it appears that `setsid` is in a frequently used RPM (at least on SLES):

slurmctld:~ # rpm -qf /usr/bin/setsid 
util-linux-2.36.2-150300.4.28.1.x86_64

That includes things like "more", "mount", "script", "renice" etc so I think it's likely a safe bet to assume will be present?

Comment 14 Chris Samuel (NERSC) 2023-08-21 17:10:30 MDT

Created attachment 31870 [details]
First draft of a patch to use setsid when sending email

Comment 15 Marshall Garey 2023-08-21 17:49:11 MDT

I think that's a promising approach, though we will still discuss the other problems internally.

I am contemplating how to document this behavior of Slurm killing a script's process group when the primary script exits. We began discussing this in 17123 but never actually made documentation changes.


Here's some more background about this change:

Arguably this was a change in behavior that maybe should have waited for a major release of Slurm. At time I was focused on the bug that it fixed, and I naively thought that it was a fairly harmless change.

However, this behavior was inconsistent. The script could close its stdout and stderr file descriptors, in which case Slurm immediately killed the process group after the primary script process completed. Although this is uncommon, I have seen this in a customer script with the exec command (redirected stdout and stderr to a file and closed them in the process).

The change made the behavior consistent, regardless of how the script treated stdout and stderr.

Comment 16 Marshall Garey 2023-08-21 18:33:13 MDT

Another problem: if someone sets MailProg=/bin/mail, then they'll just most likely run into this problem. And /bin/mail is the default value. So although I like a fix in smail, we may still need to do something on the Slurm side of things.

Comment 17 Chris Samuel (NERSC) 2023-08-21 18:41:53 MDT

Hey Marshall,

Thanks so much for the background and a good catch about it not helping the default case! That is indeed a tricky one to think about and I can understand why having consistent behaviour here is good.

FWIW I can confirm that having built new RPMs with that patch that at least for us smail now works as expected, and I've just put that same change on to Perlmutter by copying the script over.

Thanks so much again!

All the best,
Chris

Comment 23 Marshall Garey 2023-08-30 16:34:44 MDT

Because we cannot control how all the scripts work and people's expectations for them, we have to wait for all processes in the script's process group to complete. We reverted the change in commit 790b4a4738 ahead of 23.02.5. Slurm will wait for all processes in the script's process group to complete, as long as at least one of them is still holding the stdout or stderr file descriptor open. In a future release of Slurm, we will look at making this consistent so that Slurm does not depend on the script holding the stdout or stderr file descriptor open to wait for the script's processes to complete.

I'm closing this as fixed in 23.02.5.

Comment 25 Marshall Garey 2023-09-14 08:52:26 MDT

*** Ticket 17690 has been marked as a duplicate of this ticket. ***