Hi,

Users alerted us that they don't seem to be receiving email notifications for their jobs after our 23.02 upgrade. It looks like notifications for END and FAIL (maybe others?) are indeed no longer sent. For instance, the following job:

$ sbatch --mail-type=ALL --wrap='sleep 500' -p test

results in a single email notification being sent when the job begins, but nothing when it ends (confirmed via the SMTP logs on the Slurm controller).

Has anything changed in 23.02 with regards to email notifications?

Thanks!
-- Kilian
Hi Kilian,

I took a look at the relevant code and it doesn't look like much has changed related to e-mail notifications. I ran some quick tests and seem to be getting messages in my basic setup. E.g.:

> sbatch --mail-type=ALL,TIME_LIMIT,TIME_LIMIT_90,TIME_LIMIT_80,TIME_LIMIT_50 -t 6 --wrap="sleep 300"

resulted in 4 messages, which is what I would have expected:

> U  1 slurm@debslurm-pri  Fri Jul  7 13:16  18/752  Slurm Job_id=385428 Name=wrap Began, Queued time 00:00:01
> U  2 slurm@debslurm-pri  Fri Jul  7 13:20  17/744  Slurm Job_id=385428 Name=wrap Reached 50% of time limit, Run time 00:03:18
> N  3 slurm@debslurm-pri  Fri Jul  7 13:21  14/724  Slurm Job_id=385428 Name=wrap Reached 80% of time limit, Run time 00:04:48
> N  4 slurm@debslurm-pri  Fri Jul  7 13:21  14/727  Slurm Job_id=385428 Name=wrap Ended, Run time 00:05:00, COMPLETED, ExitCode 0

What log level are you running the slurmctld at? At the "debug" level, you should be able to see logs like this one when it sends mail:

> debug: email msg to mcmult: Slurm Job_id=385428 Name=wrap Ended, Run time 00:05:00, COMPLETED, ExitCode 0

If you can, please run a test job with debug logging and see if you can find those messages in the log, or attach the slurmctld log so I can take a look.

Thanks!
--Tim
Hi Tim,

(In reply to Tim McMullan from comment #1)
> I took a look at the relevant code and it doesn't look like much has changed
> related to e-mail notifications.

Thanks, that's helpful! Using your exact job example, and after having enabled debug logging, I was able to find log entries matching the start, the 50% time limit, as well as the job end:

$ sbatch --mail-type=ALL,TIME_LIMIT,TIME_LIMIT_90,TIME_LIMIT_80,TIME_LIMIT_50 -t 6 -p test --wrap="sleep 300"
Submitted batch job 23711749

# grep -E 'email.*23711749' /var/log/slurm/ctld/sh03-sl01.log
2023-07-10T15:41:44-07:00 sh03-sl01 slurmctld[8160]: debug: email msg to kilian: Slurm Job_id=23711749 Name=wrap Began, Queued time 00:00:02
2023-07-10T15:45:13-07:00 sh03-sl01 slurmctld[8160]: debug: email msg to kilian: Slurm Job_id=23711749 Name=wrap Reached 50% of time limit, Run time 00:03:29
2023-07-10T15:46:45-07:00 sh03-sl01 slurmctld[8160]: debug: email msg to kilian: Slurm Job_id=23711749 Name=wrap Ended, Run time 00:05:01, COMPLETED, ExitCode 0

But that only translated into two emails being sent:

# grep kilian /var/log/maillog
2023-07-10T15:41:45.233503-07:00 sh03-sl01 postfix/smtp[26635]: 309411807DB03: to=<kilian@sh03-sl01.int>, orig_to=<kilian>, relay=smtp.int[10.16.0.1]:25, delay=0.03, delays=0/0/0/0.03, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 31DD920116FFD)
2023-07-10T15:45:16.503439-07:00 sh03-sl01 postfix/smtp[449]: 651DF1807DB03: to=<kilian@sh03-sl01.int>, orig_to=<kilian>, relay=smtp.int[10.16.0.2]:25, delay=3, delays=0/0/0.01/3, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 69853FEAF7)
#

The last email notification, for when the job ends, doesn't seem to be sent at all.

Now, we're using `smail` from the slurm-contribs package as our MailProg. Is it possible that something changed in the format of completed jobs that would make `smail` fail?

Thanks!
-- Kilian
> Now, we're using `smail` from the slurm-contribs package as our MailProg. Is
> it possible that something changed in the format of completed jobs that
> would make `smail` fail?

Looking into it more, I suspect the issue is related to this part of smail:

```
# Fork a child so sleep is asynchronous.
{
    sleep $JOB_COMPLETION_TIME
    $SEFF $jobid | $MAIL -s "$subject" $recipient
} &
```

It seems that running this sub-script in the background doesn't work anymore in 23.02, maybe because the mail program is now executed via slurmscriptd? (https://github.com/SchedMD/slurm/commit/f576ab779d99a165915b4b15488ab00254f0cbc5)

Cheers,
-- Kilian
Hey Kilian,

Thanks for mentioning that you are using smail; with that information I can easily reproduce this if I use smail for MailProg.

Dropping the sub-process does indeed fix the issue. run_command kills the whole process group when the main process ends, so it's likely the forked process is getting killed before it can send anything out.

I'll propose a patch for this shortly.

Thanks again!
--Tim
Thanks Tim!

> Dropping the sub-process does indeed fix the issue. run_command kills the
> whole process group when the main process ends, so it's likely the forked
> process is getting killed before it can send anything out.
>
> I'll propose a patch for this shortly.

Looking forward to the patch! In the meantime, I'll remove the background process from the `smail` script.

Cheers,
-- Kilian
Hi Kilian,

We've pushed the following fix for this. It should be available in 23.02.4!

https://github.com/SchedMD/slurm/commit/cae63c6356

Thanks for reporting this! I'll resolve this now, but let us know if you have any questions!

Thanks!
--Tim
> Thanks for reporting this! I'll resolve this now, but let us know if you
> have any questions!

Thank you! I was rather expecting a fix on the slurmscriptd side, though. As I understand things, this means that any process forked from a script executed under slurmscriptd will be killed and fail, correct? Especially now that more and more programs are moved under the slurmscriptd umbrella (ResumeProgram, ResumeFailProgram, SuspendProgram, ResvProlog, ResvEpilog, RebootProgram, to only list the new ones in 23.02), won't that be a problem going forward?

Cheers,
-- Kilian
Hi Kilian!

I don't think that allowing these scripts to spawn children that potentially run forever is desirable. It just feels like another way to unintentionally blow things up.

There was a recent change in 23.02.3 (commit https://github.com/SchedMD/slurm/commit/5a3f79271b) which is likely the specific culprit here. This was done specifically to kill this kind of process, because things like the prolog/epilog could run forever if they spawned a child process, and cause problems.

Is there a legitimate situation where you would want to spawn a child process from one of these places and then have Slurm not care about it going forward?
Hey Tim,

(In reply to Tim McMullan from comment #14)
> I don't think that allowing these scripts to spawn children that potentially
> run forever is desirable. It just feels like another way to unintentionally
> blow things up.
>
> There was a recent change in 23.02.3 (commit
> https://github.com/SchedMD/slurm/commit/5a3f79271b) which is likely the
> specific culprit here. This was done specifically to kill this kind of
> process, because things like the prolog/epilog could run forever if they
> spawned a child process, and cause problems.

Ah right, thanks for pointing that out. I think it makes good sense, actually, and is probably for the better, stability-wise.

> Is there a legitimate situation where you would want to spawn a child
> process from one of these places and then have Slurm not care about it going
> forward?

Yes, like an email script waiting to get information from sacct for ended jobs? :)

I guess that without it, each `smail` instance will be hanging around for JOB_COMPLETION_TIME seconds inside the slurmscriptd process. I don't think that's a big issue, unless maybe there's a huge number of processes running, or the slurmscriptd process crashes, in which case the email notifications that were waiting will be lost.

So I agree that it's probably best not to let background processes run after the main script ends. BUT it should definitely be made very clear in the documentation that background processes in scripts won't work, because right now it's completely hidden and quite puzzling when you don't know where to look.

Thanks!
-- Kilian
(In reply to Kilian Cavalotti from comment #15)
> Ah right, thanks for pointing that out. I think it makes good sense,
> actually, and is probably for the better, stability-wise.

Awesome! I'm glad we agree there!

> Yes, like an email script waiting to get information from sacct for ended
> jobs? :)

Touche :) The slurmscriptd handling this is probably less bad than having it spawn a process and exit, then hoping seff + mail completes in a reasonable amount of time, though.

> I guess that without it, each `smail` instance will be hanging around for
> JOB_COMPLETION_TIME seconds inside the slurmscriptd process. I don't think
> that's a big issue, unless maybe there's a huge number of processes running,
> or the slurmscriptd process crashes, in which case the email notifications
> that were waiting will be lost.

Spawning and then keeping track of running scripts is essentially what that process is for, so even if it ends up managing a bunch of them, it is just doing its job. In the event of a crash, right now you would likely just lose the slurmscriptd management of those processes and they would (likely) just complete on their own.

> So I agree that it's probably best not to let background processes run after
> the main script ends. BUT it should definitely be made very clear in the
> documentation that background processes in scripts won't work, because right
> now it's completely hidden and quite puzzling when you don't know where to
> look.

That's not entirely true... it's just background processes that you don't then wait for when you clean up. The issue with smail was that it forked and then just exited. I could have changed it to wait for the fork to complete, but at that point why fork at all?

I can certainly see the argument for documenting that, if you do spawn a process in a script, the script is expected to wait until it completes before exiting. I'm chatting about the best way to approach documenting that expectation.

Thanks Kilian!
--Tim
> > So I agree that it's probably best not to let background processes run after
> > the main script ends. BUT it should definitely be made very clear in the
> > documentation that background processes in scripts won't work, because right
> > now it's completely hidden and quite puzzling when you don't know where to
> > look.
>
> That's not entirely true... it's just background processes that you don't
> then wait for when you clean up.

Right. What I meant is that admins will create their prolog/epilog & co. scripts, test them on their own, figure out that this one task doesn't need to run synchronously, make it a background process, test the script again directly to make sure it works, and then scratch their heads for a while when they realize it doesn't work anymore when executed through Slurm. Pretty much like what happened with smail. Hence the need for more documentation. :)

> The issue with smail was that it forked and then just exited. I could have
> changed it to wait for the fork to complete, but at that point why fork at
> all?

Exactly.

> I can certainly see the argument for documenting that, if you do spawn a
> process in a script, the script is expected to wait until it completes
> before exiting. I'm chatting about the best way to approach documenting that
> expectation.

That'd be great, thanks a lot!

Cheers,
-- Kilian
I've run into an issue that seems similar with 23.02.4 - not sure if it's a duplicate so I opened #17481 just in case.