We upgraded Slurm from 18.08.7 to 19.05.2.1 last Thursday. The upgrade itself went well: we built the new Slurm version, performed a live database migration, and restarted slurmdbd and slurmctld. We then rebuilt all MPIs against the new version of Slurm and restarted all slurmd daemons on the compute and login nodes.

Then we ran into issues with srun jobs. It seems some environment handling changed in the new release. The change is not documented and was hard to figure out. The following environment variables are now required to prevent the job environment from being polluted by the shell that jobs are submitted from, and also to fix the problem of srun not inheriting the sbatch environment:

SBATCH_EXPORT=none
SBATCH_EXPORT_ENV=none
SRUN_EXPORT_ENV=all

We also noticed that X11 forwarding was broken due to permission problems with XAUTHORITY in the TmpFS directory, but we managed to fix that.

Over the weekend, however, slurmctld crashed twice. No dump file was created.

[root@transom1 slurm]# grep fatal /var/log/slurm/slurmctld.log
[2019-09-06T16:33:02.458] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
[2019-09-06T16:33:02.464] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
[2019-09-07T06:03:17.227] fatal: prolog_slurmctld: pthread_create error Resource temporarily unavailable
[2019-09-08T13:24:34.125] fatal: agent: pthread_create error Resource temporarily unavailable

Since then, we have added LimitNPROC=20000 to slurmctld.service, although we are not sure whether it will help. Please advise.

Thanks,
Wei
We are on RHEL 7.4. Currently we also have the following configured in /etc/systemd/system/slurmctld.service under [Service]:

LimitNOFILE=262144
LimitNPROC=20000
TasksMax=infinity
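
To confirm that limits like the ones above actually reached the running daemon, they can be read back from /proc. The following is only a minimal sketch, assuming a Linux host; show_limits is a hypothetical helper name and is not part of Slurm or systemd.

```shell
#!/bin/sh
# Print the process/thread and open-file limits that a running process
# actually has, plus its current thread count. Linux-only: reads /proc.
# show_limits is a hypothetical helper name used for illustration.
show_limits() {
    pid="$1"
    # /proc/<pid>/limits lists the effective soft and hard limits.
    grep -E '^Max (processes|open files)' "/proc/$pid/limits"
    # Each entry under /proc/<pid>/task is one thread of the process.
    echo "Threads in use: $(ls "/proc/$pid/task" | wc -l)"
}

# Example (assumes slurmctld is running and pidof is installed):
#   show_limits "$(pidof slurmctld)"
```

If the "Max processes" value shown for slurmctld does not match the LimitNPROC setting, the unit file change has probably not been picked up yet (for example, the daemon was not restarted after the edit).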
Hi,

The pthread_create error has been tracked and solved in this other bug:
https://bugs.schedmd.com/show_bug.cgi?id=7360#c54

So I'm going to mark this as a duplicate of that one. With regard to the environment and X11 issues, please open separate bugs for each.

*** This ticket has been marked as a duplicate of ticket 7360 ***
Our system is on RHEL 7.4. We used to have just "export SBATCH_EXPORT=none", and that worked. Now we need the other two environment variables as well to get the same behaviour.
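
For reference, the workaround described in this ticket could be applied site-wide roughly as follows. This is a sketch only: the variable names and values are exactly those reported above, while the file location (something like /etc/profile.d/slurm_env.sh) is an assumption and not something stated in the ticket.

```shell
# Hypothetical profile snippet, e.g. /etc/profile.d/slurm_env.sh
# (the path is an assumption; the values are the ones reported in this ticket).

# Previously sufficient on 18.08.7, per the report:
export SBATCH_EXPORT=none

# Additionally needed after the upgrade to 19.05.2.1, per the report:
export SBATCH_EXPORT_ENV=none
export SRUN_EXPORT_ENV=all
```

Users would need to start a new login shell (or source the file) before submitting jobs for the settings to take effect.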