Summary: | slurmctld failing with pthread_create error Resource temporarily unavailable
---|---
Product: | Slurm
Component: | slurmctld
Version: | 17.11.2
Reporter: | James Powell <James.Powell>
Assignee: | Alejandro Sanchez <alex>
CC: | alex
Status: | RESOLVED INFOGIVEN
Severity: | 3 - Medium Impact
Priority: | ---
Hardware: | Linux
OS: | Linux
Site: | CSIRO
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5068
Attachments: | slurm.conf, environment, slurmctld log
Description
James Powell
2018-04-12 22:37:26 MDT
Created attachment 6624 [details]: slurm.conf
Created attachment 6625 [details]: environment
Created attachment 6626 [details]: slurmctld log
Alejandro Sanchez (comment 4)

Hi James. Looking at the slurmctld limits you reported:

    cm01:~ # cat /proc/3438/limits
    Limit            Soft Limit   Hard Limit   Units
    ...
    Max processes    515222       515222       processes
    Max open files   4096         4096         files
    ...

Will you try increasing the Max open files limit, then restart the daemon and see if things improve?

Also, looking at your logs I see this a couple of times:

    [2018-04-13T14:26:08.768] error: chdir(/var/log): Permission denied

Can you check the permissions there?

There are a bunch of errors too due to slurm.conf not being consistent across all nodes in the cluster. Please make sure it's in sync. Thanks.

James Powell

(In reply to Alejandro Sanchez from comment #4)
> Hi James. Looking at the slurmctld limits you reported:
>
>     cm01:~ # cat /proc/3438/limits
>     Limit            Soft Limit   Hard Limit   Units
>     ...
>     Max processes    515222       515222       processes
>     Max open files   4096         4096         files
>     ...
>
> Will you try increasing the Max open files limit, then restart the daemon
> and see if things improve?

    cm01:~ # systemctl status slurmctld
    ● slurmctld.service - Slurm controller daemon
       Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
       Active: active (running) since Mon 2018-04-16 09:12:09 AEST; 32s ago
      Process: 18863 ExecStart=/cm/shared/apps/slurm/17.11.2/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
     Main PID: 18868 (slurmctld)
        Tasks: 15 (limit: 512)
       CGroup: /system.slice/slurmctld.service
               └─18868 /cm/shared/apps/slurm/17.11.2/sbin/slurmctld
    ...

    cm01:~ # cat /proc/18868/limits
    Limit            Soft Limit   Hard Limit   Units
    ...
    Max processes    515222       515222       processes
    Max open files   65536        65536        files
    ...

    cm01:~ # grep fatal /var/log/slurmctld
    [2018-04-16T00:16:00.092] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
    [2018-04-16T00:34:00.160] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
    ...
    [2018-04-16T09:00:00.118] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
    [2018-04-16T09:12:00.081] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

The fatal errors are less frequent, but I suspect that is a consequence of the low number of queued jobs over the weekend.

> Also, looking at your logs I see this a couple of times:
>
>     [2018-04-13T14:26:08.768] error: chdir(/var/log): Permission denied
>
> Can you check the permissions there?

    cm01:~ # ls -ld /var/log/
    drwxr-xr-x 23 root root 12288 Apr 16 00:03 /var/log/
    cm01:~ # ls -l /var/log/slurmctld
    -rw-r----- 1 slurm slurm 5110471 Apr 16 09:20 /var/log/slurmctld

Changing /var/log to 777 permissions does remove that message from the slurmctld log.

> There are a bunch of errors too due to slurm.conf not being consistent
> across all nodes in the cluster. Please make sure it's in sync. Thanks.

All nodes in the cluster use a link to the same slurm.conf on shared storage. We see this error occasionally and it's puzzling.

Cheers
James
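For reference, a minimal sketch of one way to confirm that every node really resolves to the same slurm.conf. The config path is a placeholder, and the loop assumes passwordless ssh from the controller to the compute nodes; none of this is from the bug thread itself.

```bash
# Sketch only: compare slurm.conf checksums as seen from each node.
# CONF is a placeholder path; adjust to the site's real location.
CONF=/etc/slurm/slurm.conf

echo "controller: $(md5sum "$CONF")"
for node in $(sinfo -N -h -o '%N' | sort -u); do
    # Each node resolves the (possibly symlinked) path itself, so a stale
    # mount or broken link on one node shows up as a differing checksum.
    echo "$node: $(ssh "$node" md5sum "$CONF" 2>/dev/null)"
done
```

If the checksums all match, the mismatch messages usually mean some slurmd registered with a hash of an older copy (for example, a daemon started before the last edit); restarting those slurmd daemons or running `scontrol reconfigure` after edits typically clears it.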
Alejandro Sanchez (comment 6)

A couple more suggestions:

Is the slurmctld.service file configured with TasksMax=infinity in the [Service] section?

Check/increase the system-wide /proc/sys/kernel/threads-max.

Increase Max open files even more.

James Powell

(In reply to Alejandro Sanchez from comment #6)
> A couple more suggestions:
>
> Is the slurmctld.service file configured with TasksMax=infinity in the
> [Service] section?

No, added:

    cm01:~ # cat /usr/lib/systemd/system/slurmctld.service
    ...
    [Service]
    Type=forking
    EnvironmentFile=-/etc/sysconfig/slurmctld
    ExecStart=/cm/shared/apps/slurm/17.11.2/sbin/slurmctld $SLURMCTLD_OPTIONS
    ExecReload=/bin/kill -HUP $MAINPID
    PIDFile=/var/run/slurmctld.pid
    LimitNOFILE=262144
    TasksMax=infinity
    ...

> Check/increase the system-wide /proc/sys/kernel/threads-max.

    cm01:~ # cat /proc/sys/kernel/threads-max
    1030760

That's sufficient, I think.

> Increase Max open files even more.

Increased from 64k to 256k.

Either the increase in Max open files or the addition of TasksMax=infinity has settled slurmctld: 5h20m so far without a fatal error.

    cm01:~ # systemctl status slurmctld.service
    ● slurmctld.service - Slurm controller daemon
       Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
       Active: active (running) since Tue 2018-04-17 10:46:25 AEST; 5h 20min ago
    ...

    cm01:~ # cat /proc/34530/limits
    Limit            Soft Limit   Hard Limit   Units
    ...
    Max processes    515222       515222       processes
    Max open files   262144       262144       files
    ...

    cm01:~ # systemctl show -p TasksMax slurmctld.service
    TasksMax=18446744073709551615

I'll increase MAXJOBS to 1000 (where we were before upgrading) and see if we remain stable. Appreciate the help.

Cheers
James

James Powell (comment 8)

It's been 24 hours since making the changes and there have been no further fatal errors. I'm calling it solved. Thanks, Support.

Cheers
James

Alejandro Sanchez

(In reply to James Powell from comment #8)
> It's been 24 hours since making the changes and there have been no further
> fatal errors. I'm calling it solved. Thanks, Support.
>
> Cheers
>
> James

Glad to see that. Closing the bug; please reopen if needed.
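The fix that stuck combines a larger LimitNOFILE with TasksMax=infinity, applied by editing the packaged unit file. Below is a minimal sketch of the same settings expressed as a systemd drop-in instead, so they survive package upgrades; the limit values mirror the thread, while the drop-in file name and the use of a drop-in at all are assumptions, not what the site actually did.

```bash
# Sketch only: apply the same limits via a systemd drop-in for slurmctld.
mkdir -p /etc/systemd/system/slurmctld.service.d
cat > /etc/systemd/system/slurmctld.service.d/limits.conf <<'EOF'
[Service]
# Raise the soft/hard open-files limit for slurmctld.
LimitNOFILE=262144
# Remove the cgroup task ceiling; a low TasksMax is a common cause of
# pthread_create() returning EAGAIN ("Resource temporarily unavailable").
TasksMax=infinity
EOF

systemctl daemon-reload
systemctl restart slurmctld

# Confirm the new limits are in effect for the running daemon.
systemctl show -p TasksMax slurmctld.service
grep -E 'Max (open files|processes)' /proc/"$(pidof slurmctld)"/limits
```

The earlier `systemctl status` output shows a task limit of 512, which is exactly the kind of ceiling that makes pthread_create() fail with EAGAIN once enough RPC threads pile up, so lifting TasksMax is the change most directly tied to the reported fatal errors.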