Ticket 17690

Summary: mailprog - emails not sent for jobs with non zero exit code
Product: Slurm
Reporter: Tana Vinod <vinodkumar.tana>
Component: Other
Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: vinodkumar.tana
Version: 22.05.3   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=17769
Site: Cerence AI
Version Fixed: 23.02.6, 23.11.0rc1
Attachments: slurmctld log file

Description Tana Vinod 2023-09-14 04:58:00 MDT
Hi,

We recently migrated our OS from CentOS to Ubuntu.

On Ubuntu, smail is not sending alerts for jobs that failed or were cancelled, that is, for jobs that exit with a non-zero exit code.

Slurm version: 22.05.3


The command below works fine:
srun --verbose --mail-type=END --mail-user=vinodkumar.tana@cerence.com bash -c "sleep 5; echo 'hello from slurm'"

This doesn't work on ubuntu, but works fine on Centos:
srun --verbose --mail-type=END --mail-user=vinodkumar.tana@cerence.com bash -c "sleep 5; echo 'hello from slurm'; exit 42"

If I set --mail-type=ALL on Ubuntu, it works, but I don't want to receive emails for every event.

Any idea?
Comment 1 Marshall Garey 2023-09-14 08:52:26 MDT
There was a regression in 23.02.3 where smail sometimes failed to send the emails due to a race condition. The details are in bug 17481 and it is fixed in 23.02.5. Can you upgrade slurmctld to 23.02.5 and check if it is fixed for you?

*** This ticket has been marked as a duplicate of ticket 17481 ***
Comment 3 Marshall Garey 2023-09-14 08:57:06 MDT
I unmarked this ticket as a duplicate; I misread your Slurm version number as 23.02.3, but you are actually on 22.05.3. Can you enable the "script" debug flag, set SlurmctldDebug=debug, run your test again, and then upload the slurmctld log?

scontrol setdebug debug
scontrol setdebugflags +script

<run your test>

scontrol setdebug <whatever you have in slurm.conf>
scontrol setdebugflags -script



> This doesn't work on ubuntu, but works fine on Centos:

This is very odd; the OS shouldn't make a difference.
Comment 4 Tana Vinod 2023-09-14 09:36:22 MDT
Created attachment 32252 [details]
slurmctld log file

Adding the slurmctld log after enabling debug.
Comment 7 Marshall Garey 2023-09-14 14:30:15 MDT
I reproduced this. I found the following:

For a non-array job, we send mail for type END if the base state is COMPLETE or CANCELLED.


For an array job, we send mail for type END if none of the tasks (array sub-jobs) failed, or if any one task failed and mail type != FAIL. Basically, we always send an email when the array job ends, no matter how it ended.
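
The non-array check described above can be sketched as a small shell model. This is only an illustration and not Slurm's source code: the function name end_mail_would_send is hypothetical, and FAILED is an assumed name for the base state of a job that exits with a non-zero code.

```shell
# Hypothetical model of the non-array behavior described above
# (not Slurm source code).
# Usage: end_mail_would_send <base_state> <comma-separated mail types>
# Returns 0 if a mail would be sent when the job ends, 1 otherwise.
end_mail_would_send() {
    em_state="$1"
    # Upper-case the mail types so "end,fail" and "END,FAIL" match alike,
    # and wrap them in commas so each type can be matched as ",TYPE,".
    em_types=",$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]'),"

    # END: per the description above, only COMPLETE or CANCELLED qualify.
    case "$em_types" in
        *,END,*)
            case "$em_state" in
                COMPLETE|CANCELLED) return 0 ;;
            esac
            ;;
    esac

    # FAIL: assumed here to cover a job whose base state is FAILED.
    case "$em_types" in
        *,FAIL,*)
            if [ "$em_state" = "FAILED" ]; then
                return 0
            fi
            ;;
    esac

    return 1
}
```

In this model, `end_mail_would_send FAILED END` returns non-zero, which matches the reported behavior, while `end_mail_would_send FAILED end,fail` returns zero.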


This looks like a regression in 22.05. An easy workaround is to specify the "FAIL" mail type as well:


--mail-type=end,fail

Is this workaround sufficient until we get this bug fixed? A fix will be targeted for 23.02.
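
For sites that wrap job submission in a script, the workaround could be applied automatically. Below is a minimal sketch, assuming a hypothetical helper named with_fail_mail_type; it only rewrites the --mail-type value and does not call Slurm itself. The user@example.com address in the commented example is a placeholder.

```shell
# Hypothetical helper: given a --mail-type value, append FAIL when END is
# requested without FAIL or ALL, so that failed jobs also trigger a mail.
with_fail_mail_type() {
    wf_types="$1"
    # Upper-case a comma-wrapped copy for case-insensitive matching.
    wf_upper=",$(printf '%s' "$wf_types" | tr '[:lower:]' '[:upper:]'),"
    case "$wf_upper" in
        *,FAIL,*|*,ALL,*) printf '%s\n' "$wf_types" ;;      # already covered
        *,END,*)          printf '%s,FAIL\n' "$wf_types" ;; # add the workaround
        *)                printf '%s\n' "$wf_types" ;;      # END not requested
    esac
}

# Example (requires a Slurm cluster):
# srun --mail-type="$(with_fail_mail_type END)" --mail-user=user@example.com \
#     bash -c "sleep 5; exit 42"
```

The helper leaves values that already contain FAIL or ALL untouched, since the report above notes that --mail-type=ALL already sends the mail.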


I still don't understand why your OS type makes a difference. It should not matter if the Slurm version is the same. Because you say that the OS type makes a difference, I suspect that your Slurm version is different on those two nodes.
Comment 8 Tana Vinod 2023-09-14 14:45:17 MDT
Yes, I forgot to mention: the Slurm version on CentOS is different, 21.05. For now we'll move on with --mail-type=end,fail.
Comment 12 Marshall Garey 2023-09-22 14:23:06 MDT
While working on a fix for this, I noticed it was actually broken in 21.08.2. You said that your other cluster is running 21.05. That is not a Slurm version. Is it running one of the maintenance versions of 21.08? Which maintenance version is it running? 21.08.0 or 21.08.1?
Comment 21 Marshall Garey 2023-10-05 13:54:41 MDT
We pushed a fix and a documentation update upstream. They will be included in 23.02.6.

456afc6965 Docs - update --mail-type=ARRAY_TASKS for sbatch
021e0df77b Run mailprog for mail-type=end for jobs with non-zero exit codes

Closing as fixed.