Hi,

We recently migrated our OS from CentOS to Ubuntu. On Ubuntu, smail is not sending alerts for jobs that failed or were cancelled, that is, for jobs that exit with a non-zero code.

Slurm version: 22.05.03

The command below works fine:

    srun --verbose --mail-type=END --mail-user=vinodkumar.tana@cerence.com bash -c "sleep 5; echo 'hello from slurm'"

This doesn't work on Ubuntu, but works fine on CentOS:

    srun --verbose --mail-type=END --mail-user=vinodkumar.tana@cerence.com bash -c "sleep 5; echo 'hello from slurm'; exit 42"

If I set --mail-type=ALL on Ubuntu, it works, but I don't want to receive emails for every event. Any idea?
There was a regression in 23.02.3 where smail sometimes failed to send emails due to a race condition. The details are in bug 17481, and it is fixed in 23.02.5. Can you upgrade slurmctld to 23.02.5 and check whether that fixes it for you?

*** This ticket has been marked as a duplicate of ticket 17481 ***
I unmarked this ticket as a duplicate; I misread your Slurm version as 23.02.3, but you are on 22.05.3.

Can you enable the "script" debug flag, set SlurmctldDebug=debug, run your test again, and then upload the slurmctld log?

    scontrol setdebug debug
    scontrol setdebugflags +script
    <run your test>
    scontrol setdebug <whatever you have in slurm.conf>
    scontrol setdebugflags -script

> This doesn't work on ubuntu, but works fine on Centos:

This is very odd; the OS shouldn't make a difference.
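For the restore step above, it can help to note the current values before changing them. A minimal sketch, assuming `scontrol show config` prints lines in its usual `Key = value` layout; the helper name `config_value` is made up for this example and is not part of Slurm.

```shell
# Hypothetical helper (not a Slurm command): print the value of one key
# from `scontrol show config` output read on stdin.
# Assumes each line looks like "Key        = value".
config_value() {
    awk -F' *= *' -v key="$1" '$1 == key { print $2; exit }'
}

# Usage on a cluster (requires scontrol, so it is left commented out here):
#   scontrol show config | config_value SlurmctldDebug
#   scontrol show config | config_value DebugFlags
```

The printed values are what would go back into `scontrol setdebug` and `scontrol setdebugflags` once the test run is finished.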
Created attachment 32252: slurmctld log file

Adding the slurmctld log after enabling debug.
I reproduced this. I found the following:

For a non-array job, we send mail for type END only if the base state is COMPLETE or CANCELLED. A job that exits with a non-zero code ends in a failed state, so it does not match.

For an array job, we send mail for type END if no task (array sub-job) failed, or if any task failed and the mail type is not FAIL. Basically, we always send an email when the array job ends, no matter how it ended.

This looks like a regression in 22.05. An easy workaround is to specify the "FAIL" mail type as well:

    --mail-type=end,fail

Is this workaround sufficient until we get this bug fixed? A fix will be targeted for 23.02.

I still don't understand why your OS type makes a difference. It should not matter if the Slurm version is the same. Because you say that the OS type makes a difference, I suspect that the Slurm version differs between those two systems.
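The non-array rule described above, and why the workaround helps, can be sketched as a small shell function. This is only an illustrative model of the behaviour reported in this ticket, not slurmctld source code; the function name `mail_sent` and the state strings passed to it are assumptions made for the example.

```shell
# Hypothetical model of the reported 22.05 behaviour for non-array jobs.
# Not Slurm code: the function name and argument format are assumptions.
#
# mail_sent MAIL_TYPES BASE_STATE
#   MAIL_TYPES: comma-separated list, e.g. "END" or "end,fail"
#   BASE_STATE: COMPLETE, CANCELLED or FAILED
# Returns 0 if mail would be sent, 1 otherwise.
mail_sent() {
    # Upper-case the list and wrap it in commas so whole names can be matched.
    types=",$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]'),"
    state=$2
    case "$state" in
        COMPLETE|CANCELLED)
            # END mail is only sent for these two base states.
            case "$types" in *,END,*|*,ALL,*) return 0 ;; esac ;;
        FAILED)
            # A non-zero exit code ends in FAILED, which END alone misses.
            case "$types" in *,FAIL,*|*,ALL,*) return 0 ;; esac ;;
    esac
    return 1
}

mail_sent "END" "FAILED"      || echo "END alone: no mail for a failed job"
mail_sent "end,fail" "FAILED" && echo "end,fail: mail sent for a failed job"
```

In this model a job that runs `exit 42` gets no mail with `--mail-type=END` but does with `--mail-type=end,fail`, which matches what was observed above.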
Yes, I forgot to mention: the Slurm version on CentOS is different, 21.05. For now we'll adjust and move on with --mail-type=end,fail.
While working on a fix for this, I noticed it was actually broken in 21.08.2. You said that your other cluster is running 21.05. That is not a Slurm version. Is it running one of the maintenance versions of 21.08? Which maintenance version is it running? 21.08.0 or 21.08.1?
We pushed a fix and a documentation update upstream. They will be included in 23.02.6.

    456afc6965 Docs - update --mail-type=ARRAY_TASKS for sbatch
    021e0df77b Run mailprog for mail-type=end for jobs with non-zero exit codes

Closing as fixed.