Ticket 7705 - slurmctld keeps crashing after recent upgrade
Summary: slurmctld keeps crashing after recent upgrade
Status: RESOLVED DUPLICATE of ticket 7360
Alias: None
Product: Slurm
Classification: Unclassified
Component: Build System and Packaging
Version: 19.05.2
Hardware: Linux Linux
Severity: 2 - High Impact
Assignee: Director of Support
Reported: 2019-09-08 22:31 MDT by whong
Modified: 2019-09-11 23:49 MDT
Site: Swinburne


Description whong 2019-09-08 22:31:01 MDT
We upgraded Slurm from 18.08.7 to 19.05.2.1 last Thursday. The upgrade process itself went well: we built the new Slurm version, performed a live database migration, and restarted slurmdbd and slurmctld. We then rebuilt all MPI libraries against the new version of Slurm and restarted all slurmd daemons on the compute and login nodes.

Then we ran into issues with srun jobs. It seems some environment handling changed in the new release; the change is not documented and was hard to figure out. The following environment variables are now required to prevent the job environment from being polluted by the shell that jobs are submitted from, and to fix the problem of srun not inheriting the sbatch environment:

SBATCH_EXPORT=none
SBATCH_EXPORT_ENV=none
SRUN_EXPORT_ENV=all
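
A minimal sketch of how these settings might be applied in a login shell before submitting jobs. The variable names and values are exactly those listed above; whether each one is honoured depends on the Slurm version in use, so treat this as an illustration of the workaround rather than a recommended configuration.

```shell
# Values as reported in this ticket (assumption: set in the submitting shell).
export SBATCH_EXPORT=none
export SBATCH_EXPORT_ENV=none
export SRUN_EXPORT_ENV=all

# Show what a subsequently launched sbatch or srun would inherit from this shell.
env | grep -E '^(SBATCH_EXPORT|SBATCH_EXPORT_ENV|SRUN_EXPORT_ENV)='
```

Placing the exports in a site-wide profile script would apply them to every user session; that choice is site-specific and is not described in this ticket.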

We also noticed that X11 forwarding was broken due to permission problems with XAUTHORITY in the TmpFS directory, but we managed to fix it.

However, over the weekend slurmctld crashed twice. No core dump file was created.

[root@transom1 slurm]# grep fatal /var/log/slurm/slurmctld.log
[2019-09-06T16:33:02.458] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
[2019-09-06T16:33:02.464] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
[2019-09-07T06:03:17.227] fatal: prolog_slurmctld: pthread_create error Resource temporarily unavailable
[2019-09-08T13:24:34.125] fatal: agent: pthread_create error Resource temporarily unavailable
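
"Resource temporarily unavailable" is the message for EAGAIN, which pthread_create returns when a thread or process limit has been reached. To see which code paths hit the limit, the fatal lines can be grouped by calling function. The sketch below assumes the log format shown above; the function name summarize_thread_failures is hypothetical.

```shell
# Count pthread_create failures per calling function.
# Reads slurmctld log text on stdin; assumes the "fatal: <caller>: pthread_create error" format.
summarize_thread_failures() {
    grep 'fatal: .*pthread_create error' |
        sed -E 's/^.*fatal: ([A-Za-z_]+): pthread_create error.*$/\1/' |
        sort | uniq -c | sort -rn
}

# Example using the four lines quoted above.
summarize_thread_failures <<'EOF'
[2019-09-06T16:33:02.458] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
[2019-09-06T16:33:02.464] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
[2019-09-07T06:03:17.227] fatal: prolog_slurmctld: pthread_create error Resource temporarily unavailable
[2019-09-08T13:24:34.125] fatal: agent: pthread_create error Resource temporarily unavailable
EOF
```

On the controller host the same function could be fed the real log, for example `summarize_thread_failures < /var/log/slurm/slurmctld.log`.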

Since then, we have added LimitNPROC=20000 to slurmctld.service, although we are not sure whether it will help.

Please advise.

Thanks,
Wei
Comment 1 whong 2019-09-08 22:52:47 MDT
We are on RHEL 7.4

Currently we also have the following configured in /etc/systemd/system/slurmctld.service under the [Service] section:

LimitNOFILE=262144
LimitNPROC=20000
TasksMax=infinity
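
One way to check whether limits like these are actually in effect for a running daemon is to read its entries under /proc. The sketch below is Linux-specific; show_thread_limits is a hypothetical helper, and it is demonstrated on the current shell because the slurmctld PID is not given in this ticket.

```shell
# Print the current thread count and selected soft limits for a PID (Linux /proc only).
show_thread_limits() {
    pid="$1"
    # Number of threads currently in use by the process.
    awk '/^Threads:/ {print "threads_in_use=" $2}' "/proc/$pid/status"
    # RLIMIT_NPROC (set by LimitNPROC=): soft value is the third field.
    awk '/^Max processes/ {print "max_processes_soft=" $3}' "/proc/$pid/limits"
    # RLIMIT_NOFILE (set by LimitNOFILE=): soft value is the fourth field.
    awk '/^Max open files/ {print "max_open_files_soft=" $4}' "/proc/$pid/limits"
}

# Demonstrated on the current shell; on the controller, pass the slurmctld PID instead.
show_thread_limits "$$"
```

TasksMax is enforced through the cgroup pids controller rather than an rlimit, so it does not appear in /proc/<pid>/limits and has to be checked separately.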
Comment 2 Alejandro Sanchez 2019-09-09 01:23:06 MDT
Hi,

The pthread_create error has been tracked and solved in this other bug:

https://bugs.schedmd.com/show_bug.cgi?id=7360#c54

So I'm gonna mark this as a duplicate of that one.

With regards to the env and x11 issues, please open separate bugs for each.

*** This ticket has been marked as a duplicate of ticket 7360 ***
Comment 3 whong 2019-09-11 23:49:12 MDT
Our system is on RHEL 7.4

We used to have just "export SBATCH_EXPORT=none" and that worked. Now we need the two additional environment variables to get the same behaviour.