Hi Support,

We recently upgraded to 17.11.2 (from 16.05.x, as part of a Bright Cluster Manager upgrade) and, as users have started submitting jobs (1000+), we're seeing slurmctld fail with the following errors:

fatal: prolog_slurmctld: pthread_create error Resource temporarily unavailable

or

fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

These seem to occur just before scheduling, leaving our cluster idle. Any advice on what to look at would be most appreciated.

I've tried reverting to the default "SchedulerParameters" (by commenting out that line in slurm.conf) and it made no difference. After some experimentation, reducing the number of jobs considered for scheduling does help, in that it at least allows a couple of scheduling cycles before the failure. So, as a temporary workaround, I'm limiting all users to 100 running jobs (via MAXJOBS), which lets slurmctld get through a couple of scheduling cycles before failing. As Bright restarts slurmctld upon failure, we're able to continue processing for now.

Will attach slurm.conf, logs & some environment settings.

Cheers
James
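PS: the MAXJOBS cap above is the per-user running-job limit. For reference, if it were set directly on the Slurm side it would look something like the QOS limit below (a sketch only; the QOS name "normal" is an assumption, adjust to suit):

# assumption: the default QOS is named "normal"; caps each user at 100 running jobs
sacctmgr modify qos where name=normal set MaxJobsPerUser=100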
Created attachment 6624 [details] slurm.conf
Created attachment 6625 [details] environment
Created attachment 6626 [details] slurmctld log
Hi James. Looking at the slurmctld limits you reported:

cm01:~ # cat /proc/3438/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max processes             515222               515222               processes
Max open files            4096                 4096                 files
...

Will you try increasing the Max open files limit, then restart the daemon and see if things improve?

Also, looking at your logs I see this a couple of times:

[2018-04-13T14:26:08.768] error: chdir(/var/log): Permission denied

Can you check the permissions there?

There are a bunch of errors too due to slurm.conf not being consistent across all nodes in the cluster. Please make sure it's in sync. Thanks.
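Assuming slurmctld is started via systemd (as with the bundled unit file), the simplest place to raise the limit is the unit itself, along these lines (a sketch; the value is only an example):

# in the [Service] section of slurmctld.service (or a drop-in)
LimitNOFILE=65536

# then reload systemd, restart the daemon and re-check the running process
systemctl daemon-reload
systemctl restart slurmctld
grep 'open files' /proc/$(pidof slurmctld)/limits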
(In reply to Alejandro Sanchez from comment #4)
> Hi James. Looking at the slurmctld limits you reported:
> 
> cm01:~ # cat /proc/3438/limits
> Limit                     Soft Limit           Hard Limit           Units
> ...
> Max processes             515222               515222               processes
> Max open files            4096                 4096                 files
> ...
> 
> Will you try increasing the Max open files limit, then restart the daemon
> and see if things improve?

cm01:~ # systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-04-16 09:12:09 AEST; 32s ago
  Process: 18863 ExecStart=/cm/shared/apps/slurm/17.11.2/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 18868 (slurmctld)
    Tasks: 15 (limit: 512)
   CGroup: /system.slice/slurmctld.service
           └─18868 /cm/shared/apps/slurm/17.11.2/sbin/slurmctld
...

cm01:~ # cat /proc/18868/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max processes             515222               515222               processes
Max open files            65536                65536                files
...

cm01:~ # grep fatal /var/log/slurmctld
[2018-04-16T00:16:00.092] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
[2018-04-16T00:34:00.160] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
...
[2018-04-16T09:00:00.118] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
[2018-04-16T09:12:00.081] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

The fatal errors are less frequent, but I suspect that's a consequence of the low number of queued jobs over the weekend.

> Also, looking at your logs I see this a couple of times:
> 
> [2018-04-13T14:26:08.768] error: chdir(/var/log): Permission denied
> 
> Can you check the permissions there?

cm01:~ # ls -ld /var/log/
drwxr-xr-x 23 root root 12288 Apr 16 00:03 /var/log/
cm01:~ # ls -l /var/log/slurmctld
-rw-r----- 1 slurm slurm 5110471 Apr 16 09:20 /var/log/slurmctld

Changing /var/log to 777 permissions does remove that message from the slurmctld log.

> There are a bunch of errors too due to slurm.conf not being consistent
> across all nodes in the cluster. Please make sure it's in sync. Thanks.

All nodes in the cluster use a link to the same slurm.conf on shared storage. We see this error occasionally & it's puzzling.

Cheers
James
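PS: rather than leaving /var/log at 777, I may instead point the controller log at a slurm-owned directory. A sketch of what I have in mind (the paths here are my own choice, not what's currently configured):

mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm
# and in slurm.conf:
#   SlurmctldLogFile=/var/log/slurm/slurmctld.log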
A couple more suggestions:

Is the slurmctld.service file configured with TasksMax=infinity in the [Service] section?

Check/increase the system-wide /proc/sys/kernel/threads-max.

Increase Max open files even further.
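To check the current values, something along these lines (a sketch; the threads-max figure is only an example):

# current system-wide thread cap
cat /proc/sys/kernel/threads-max
# raise it if it looks low (persist the change via /etc/sysctl.d/ if you adjust it)
sysctl -w kernel.threads-max=1030760

# confirm what TasksMax the unit is actually running with
systemctl show -p TasksMax slurmctld.service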
(In reply to Alejandro Sanchez from comment #6)
> A couple more suggestions:
> 
> Is the slurmctld.service file configured with TasksMax=infinity in the
> [Service] section?

No, added:

cm01:~ # cat /usr/lib/systemd/system/slurmctld.service
...
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/cm/shared/apps/slurm/17.11.2/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmctld.pid
LimitNOFILE=262144
TasksMax=infinity
...

> Check/increase the system-wide /proc/sys/kernel/threads-max.

cm01:~ # cat /proc/sys/kernel/threads-max
1030760

That's sufficient, I think.

> Increase Max open files even further.

Increased from 64k to 256k.

Either the increase in Max open files or the addition of TasksMax=infinity has settled slurmctld: 5h 20min so far without a fatal error. (The earlier systemctl status showed "Tasks: 15 (limit: 512)", so I suspect slurmctld was running into that 512-task limit when spawning threads.)

cm01:~ # systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-04-17 10:46:25 AEST; 5h 20min ago
...

cm01:~ # cat /proc/34530/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max processes             515222               515222               processes
Max open files            262144               262144               files
...

cm01:~ # systemctl show -p TasksMax slurmctld.service
TasksMax=18446744073709551615

I'll increase MAXJOBS to 1000 (where we were before upgrading) and see if we remain stable. Appreciate the help.

Cheers
James
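PS: since /usr/lib/systemd/system/slurmctld.service is the packaged unit, I'm considering moving these overrides into a drop-in so they survive future updates of the unit file. A sketch of what I'd do (not what's in place yet):

mkdir -p /etc/systemd/system/slurmctld.service.d
cat > /etc/systemd/system/slurmctld.service.d/override.conf <<'EOF'
[Service]
LimitNOFILE=262144
TasksMax=infinity
EOF
systemctl daemon-reload
systemctl restart slurmctld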
It's been 24 hours since making the changes & no further fatal errors, so I'm calling it solved. Thanks, Support.

Cheers
James
(In reply to James Powell from comment #8)
> It's been 24 hours since making the changes & no further fatal errors, so
> I'm calling it solved. Thanks, Support.
> 
> Cheers
> James

Glad to see that. Closing the bug; please reopen if needed.