Ticket 7705

Summary: slurmctld keeps crashing after recent upgrade
Product: Slurm
Reporter: whong
Component: Build System and Packaging
Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
Severity: 2 - High Impact
Priority: ---
CC: alex
Version: 19.05.2
Hardware: Linux
OS: Linux
Site: Swinburne

Description whong 2019-09-08 22:31:01 MDT
We upgraded Slurm from 18.08.7 to 19.05.2.1 last Thursday. The upgrade process itself went well: we built the new Slurm version, performed a live database migration, and restarted slurmdbd and slurmctld. We then rebuilt all MPIs against the new version of Slurm and restarted all slurmd daemons on the compute and login nodes.

Then we ran into issues with srun jobs. It seems some environment handling changed in the new release. The change is not documented and was hard to figure out. The following environment variables are required to prevent the job environment from being polluted by the shell that jobs are submitted from, and also to fix the problem of srun not inheriting the sbatch environment:

SBATCH_EXPORT=none
SBATCH_EXPORT_ENV=none
SRUN_EXPORT_ENV=all
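
For reference, one way to apply these settings to every login shell is a small profile script. This is only a sketch: the file location (something like /etc/profile.d/slurm_env.sh) is an assumption, and the variable names and values are copied verbatim from the list above rather than checked against the Slurm documentation.

```shell
# Hypothetical profile snippet; the file name and location are assumptions.
# Variable names and values are taken as-is from the report above.
export SBATCH_EXPORT=none
export SBATCH_EXPORT_ENV=none
export SRUN_EXPORT_ENV=all
```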

We also noticed that X11 forwarding was broken due to permission problems with XAUTHORITY in the TmpFS directory, but we managed to fix that.

Over the weekend, however, slurmctld crashed twice. No core dump file was created.

[root@transom1 slurm]# grep fatal /var/log/slurm/slurmctld.log
[2019-09-06T16:33:02.458] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
[2019-09-06T16:33:02.464] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
[2019-09-07T06:03:17.227] fatal: prolog_slurmctld: pthread_create error Resource temporarily unavailable
[2019-09-08T13:24:34.125] fatal: agent: pthread_create error Resource temporarily unavailable
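
To see which code paths hit the thread-creation failure, the fatal lines can be grouped by the function named in each message. The helper below is a rough sketch; the function name count_pthread_fatals is made up for illustration, and it assumes the log line layout shown above.

```shell
# Count "pthread_create error" fatals per reporting function in a slurmctld log.
# $1: path to the log file (for example /var/log/slurm/slurmctld.log).
count_pthread_fatals() {
    grep 'fatal: .*pthread_create error' "$1" \
        | sed -E 's/^\[[^]]*\] fatal: ([^:]+):.*/\1/' \
        | sort | uniq -c
}
```

Usage: count_pthread_fatals /var/log/slurm/slurmctld.log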

Since then, we have added LimitNPROC=20000 to slurmctld.service, although we are not sure whether it will help.

Please advise.

Thanks,
Wei
Comment 1 whong 2019-09-08 22:52:47 MDT
We are on RHEL 7.4

Currently we also have the following configured in /etc/systemd/system/slurmctld.service under the [Service] section:

LimitNOFILE=262144
LimitNPROC=20000
TasksMax=infinity
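
A quick way to check whether these limits actually reached the running daemon is to read them back from /proc and compare them with the current thread count. This is a minimal sketch for Linux; the helper name show_limits_and_threads is hypothetical. Note that TasksMax is enforced through the cgroup pids controller, so it does not appear in /proc/<pid>/limits.

```shell
# Print the process and open-file limits the kernel enforces on a running
# process, followed by its current thread count. $1 is a PID.
show_limits_and_threads() {
    pid="$1"
    grep -E '^Max (processes|open files)' "/proc/$pid/limits"
    # Each entry under /proc/<pid>/task is one thread of the process.
    printf 'Threads: %s\n' "$(ls "/proc/$pid/task" | wc -l | tr -d ' ')"
}
```

Usage, assuming slurmctld is running and pidof is available: show_limits_and_threads "$(pidof slurmctld)"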
Comment 2 Alejandro Sanchez 2019-09-09 01:23:06 MDT
Hi,

The pthread_create error has been tracked and solved in this other bug:

https://bugs.schedmd.com/show_bug.cgi?id=7360#c54

So I'm gonna mark this as a duplicate of that one.

With regards to the env and x11 issues, please open separate bugs for each.

*** This ticket has been marked as a duplicate of ticket 7360 ***
Comment 3 whong 2019-09-11 23:49:12 MDT
Our system is on RHEL 7.4

We used to have just "export SBATCH_EXPORT=none", and that worked. Now we need the other two environment variables as well to get the same behaviour.