Ticket 17821

Summary:	Federation Email Notifications issues
Product:	Slurm	Reporter:	kmckenzie <kmckenzie>
Component:	Federation	Assignee:	Jacob Jenson <jacob>
Status:	OPEN ---	QA Contact:
Severity:	6 - No support contract
Priority:	---	CC:	kmckenzie
Version:	23.02.2
Hardware:	Linux
OS:	Linux
Site:	-Other-	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	Slurm emails tested working

Description kmckenzie@4dmedical.com 2023-10-02 20:30:34 MDT

Created attachment 32540
Slurm emails tested working

Hi Team,

I am not entirely sure whether this is a bug or a configuration issue, but I have been unable to find much documentation on email notifications within a Federation. We are using postfix and smail on our primary SlurmCTLD machine (which also hosts DBD) to forward Slurm workload emails to our users.

We have tested this and it works correctly on one partition, but the feature does not appear to work when any partition other than the local one is selected.
In our setup we have a physical HPC partition on the same network as the SlurmDBD/CTLD machine.
When jobs are submitted to this cluster, it works as intended and we get Began and Ended emails, as shown in the attached image.

However, when we run the same test on any of our cloud-based HPCs, we do not get any email notifications, and the expected entries do not appear in /var/log/mail or /mail.info.
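For anyone trying to reproduce the comparison, a minimal sketch of the test is below. The cluster name and recipient address passed to the functions are hypothetical placeholders, and the second helper assumes that mail is sent by the controller of the cluster the job runs on, which is an assumption here and not something confirmed in this ticket.

```shell
# Submit a short test job to one cluster with begin/end mail requested.
# $1 = cluster name (placeholder), $2 = recipient address (placeholder)
submit_mail_test() {
    sbatch --clusters="$1" --mail-type=BEGIN,END --mail-user="$2" \
           --job-name=mailtest --wrap="sleep 10"
}

# Show which mail program the given cluster's controller is configured to use.
# $1 = cluster name (placeholder)
show_mailprog() {
    scontrol --clusters="$1" show config | grep -i '^MailProg'
}
```

Running `submit_mail_test` once against the local cluster and once against a cloud cluster, then comparing `show_mailprog` for each, should show whether the two controllers are configured with the same mail program.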


The following is from mail.info (with some information de-identified) after running on the local partition first and the cloud partition second. For the cloud run, it appears that no entry is made and no mail is attempted:

C10492DC226: message-id=<20231003021602.C10492DC226@slurmcentral>
2023-10-03T13:16:02.793218+11:00 slurmcentral postfix/qmgr[95304]: C10492DC226: from=<slurm@4dmedical.com>, size=535, nrcpt=1 (queue active)
2023-10-03T13:16:02.798345+11:00 slurmcentral postfix/smtp[76196]: error: unsupported dictionary type: hash
2023-10-03T13:16:06.248053+11:00 slurmcentral postfix/smtp[76196]: C10492DC226: to=<******@4dmedical.com>, relay=*****-com.mail.protection.outlook.com[104.47.71.202]:25, delay=3.5, delays=0/0.01/0.85/2.6, dsn=2.6.0, status=sent (250 2.6.0 <20231003021602.C10492DC226@slurmcentral> [InternalId=******.apcprd06.prod.outlook.com] 9665 bytes in 0.229, 41.139 KB/sec Queued mail for delivery)
2023-10-03T13:16:06.248487+11:00 slurmcentral postfix/qmgr[95304]: C10492DC226: removed
2023-10-03T13:16:06.279463+11:00 slurmcentral postfix/smtp[76156]: 856612DC224: to=<******@4dmedical.com>, relay=******-com.mail.protection.outlook.com[104.47.71.138]:25, delay=4.8, delays=0.05/0.02/2.1/2.6, dsn=2.6.0, status=sent (250 2.6.0 <20231003021601.856612DC224@slurmcentral> [InternalId=446676606213, Hostname=******.apcprd06.prod.outlook.com] 9722 bytes in 0.251, 37.776 KB/sec Queued mail for delivery)
2023-10-03T13:16:06.279657+11:00 slurmcentral postfix/qmgr[95304]: 856612DC224: removed
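To make it easier to compare runs, a small sketch like the following can reduce the log to one line per delivery attempt. It assumes the Postfix line layout shown above (an uppercase hexadecimal queue ID followed by a colon, and a `status=` field); the function name is made up for illustration.

```shell
# Print "<queue-id> <status>" for each Postfix smtp delivery line on stdin.
# Lines without a status= field (such as the dictionary error) are skipped.
summarise_deliveries() {
    awk '/postfix\/smtp\[/ && /status=/ {
        qid = ""; status = ""
        for (i = 1; i <= NF; i++) {
            # The queue ID is the first field made of hex digits plus a colon.
            if (qid == "" && $i ~ /^[0-9A-F]+:$/) { qid = $i; sub(/:$/, "", qid) }
            if ($i ~ /^status=/) { status = $i; sub(/^status=/, "", status) }
        }
        if (qid != "") print qid, status
    }'
}
```

For example, `summarise_deliveries < mail.info` would list only the delivery attempts, so a job that produced no mail shows up as missing queue IDs for its time window.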


Any information or help on configuring the Federation email settings, or on resolving this if it is a bug, would be most appreciated, as most of our workload runs on these cloud partitions.