| Summary: | slurmstepd errors | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Praveen SV <vijayap> |
| Component: | slurmstepd | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | tripiana |
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Roche/PHCIX | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmconf, scontrol show partitions, scontrol show nodes | ||
Description
Praveen SV
2021-09-06 23:42:11 MDT
Take a look first at:

slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory

munge needs to be up on all cluster nodes, compute and management ones. Please report back after checking this.

Thanks,
Carlos.

Hi Praveen,

I'm going to assume the problem was munge not running properly. If you don't mind, let's close the issue for now and reopen it if necessary.

Regards.

Hi Carlos,
Sorry for the late response. I have verified the status of munge on all the nodes.
Master Node
[root@MASTER:~ ] $ systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-09-13 06:25:01 UTC; 1 day 5h ago
Docs: man:munged(8)
Process: 27576 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
Main PID: 27585 (munged)
Tasks: 4 (limit: 4915)
CGroup: /system.slice/munge.service
└─27585 /usr/sbin/munged
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Management Node
root@management:~# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-09-13 06:25:02 UTC; 1 day 5h ago
Docs: man:munged(8)
Main PID: 21664 (munged)
Tasks: 4 (limit: 4915)
Memory: 3.2M
CPU: 51.667s
CGroup: /system.slice/munge.service
└─21664 /usr/sbin/munged
Compute Node
root@compute:~# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2021-09-12 06:25:02 UTC; 2 days ago
Docs: man:munged(8)
Process: 31676 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
Main PID: 31684 (munged)
Tasks: 4 (limit: 5529)
CGroup: /system.slice/munge.service
└─31684 /usr/sbin/munged
Sep 12 06:25:02 spcd-euc1-00538.aws.science.roche.com systemd[1]: Starting MUNGE authentication service...
Sep 12 06:25:02 spcd-euc1-00538.aws.science.roche.com systemd[1]: Started MUNGE authentication service.
But at times we still get this error:
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential
slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential
slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
Regards,
Praveen
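Since the socket errors are intermittent even though systemctl reports munge as active, a small watchdog can record exactly when the socket disappears. This is an illustrative sketch, not from the ticket; the path is munge's default socket location and the helper name is made up:

```shell
# Illustrative watchdog (not from the ticket): report whether the munge
# domain socket exists. The path below is munge's default.
MUNGE_SOCKET="${MUNGE_SOCKET:-/var/run/munge/munge.socket.2}"

# Echo OK if the given path is a socket, MISSING otherwise.
check_munge_socket() {
    if [ -S "$1" ]; then
        echo "OK"
    else
        echo "MISSING"
    fi
}

# Example loop (run in the background on a compute node):
#   while true; do
#       [ "$(check_munge_socket "$MUNGE_SOCKET")" = "MISSING" ] \
#           && echo "$(date): munge socket gone" >> /tmp/munge_gaps.log
#       sleep 5
#   done
```

Left running on an affected compute node, this would show whether munged is being restarted (for example by systemd or the OOM killer) at the times the slurmstepd errors appear.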
As the errors come from slurmstepd, the problem is on the compute nodes. Taking into account:

slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory

we need to see (with an ls command) whether slurmd (is it running as root?) has access to that path, or whether it exists at all. Using AuthInfo (https://slurm.schedmd.com/slurm.conf.html) you can define another path; the default value is this one:

AuthInfo="socket=/var/run/munge/munge.socket.2"

Check this and see if it solves the problem.

Thanks,
Carlos.

Hi Carlos,

I have checked the ownership; please see below.

cd /var/run/
ls -l
drwxr-xr-x 2 munge munge 100 Sep 12 06:25 munge
cd /var/run/munge
srwxrwxrwx 1 munge munge 0 Sep 12 06:25 munge.socket.2
--w------- 1 munge munge 0 Sep 12 06:25 munge.socket.2.lock
-rw-r--r-- 1 munge munge 5 Sep 12 06:25 munged.pid

Our current Slurm configuration (slurm.conf):

#AuthType=auth/mungei
#AuthType=auth/mungei
AuthAltTypes=auth/jwt
AuthAltTypes=auth/jwt

slurm is running as root.

Best regards,
Praveen

This might be related (https://slurm.schedmd.com/slurm.conf.html):

AuthAltTypes
Comma-separated list of alternative authentication plugins that the slurmctld will permit for communication. Acceptable values at present include auth/jwt. NOTE: auth/jwt requires a jwt_hs256.key to be populated in the StateSaveLocation directory for slurmctld only. The jwt_hs256.key should only be visible to the SlurmUser and root. It is not suggested to place the jwt_hs256.key on any nodes but the controller running slurmctld. auth/jwt can be activated by the presence of the SLURM_JWT environment variable. When activated, it will override the default AuthType.

"When activated, it will override the default AuthType." So, if your config is:

#AuthType=auth/mungei
#AuthType=auth/mungei
AuthAltTypes=auth/jwt
AuthAltTypes=auth/jwt

my ideas here are:

1. Lines are duplicated. Just remove the dupes.
2. AuthType has the wrong value auth/mungei; the trailing 'i' should be removed.
3. AuthType is commented out, so it defaults to auth/munge. In the end, the extra 'i' has no effect.
4. Is AuthAltTypes=auth/jwt desirable and properly configured on your system? It may be overriding the expected behaviour of the munge plugin; I need more info to check/reproduce.

Please post the complete slurmctld.log and the slurmd.log for the job's node. Also, please post the slurm.conf file if possible. This information should be enough to fully understand what's happening and reproduce the issue.

Thanks,
Carlos.

Created attachment 21344 [details]
slurmconf
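Carlos's observations amount to a small cleanup of the auth lines in slurm.conf. A sketch of what the corrected section could look like (assuming munge is the intended default and JWT is genuinely needed; this is not the site's actual file):

```ini
# Duplicate lines removed; the stray 'i' in auth/mungei dropped.
# Commented out, AuthType defaults to auth/munge anyway; making it
# explicit documents the intent.
AuthType=auth/munge
# Keep only one AuthAltTypes line, and only if JWT auth is really used.
AuthAltTypes=auth/jwt
```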
Praveen, I can't see anything in your config that would trigger this error, and I have been unable to reproduce it so far either. Please post the log files from the controller and the affected compute node, plus the munge logs from both nodes, so I can dig deeper into the problem.

Thanks,
Carlos.

Hi Team,

Today we got this error again:

slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential

(these three lines repeat several more times)

Regards,
Praveen

Hi Praveen,

After double checking, I can confirm that, as we discussed at the very beginning of this issue, there is only one possibility if you see this on compute nodes:

slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory

At the time of this error, the munge daemon on that node was probably down (munge.socket.2 not on disk). Since munge is controlled with systemd, my bet is on systemd killing the service for some reason (forced restart, dead daemon, ...?). I suggest you check the munge log with journalctl, and check dmesg for OOM events or other events related to munge. Without access to these nodes, my feeling is that the problem is related to memory pressure while running jobs, with the kernel killing processes when it gets into trouble. You may want to take a look at this Slurm feature, https://slurm.schedmd.com/core_spec.html, to ensure there is always some memory reserved for the OS and system processes.

Let us know any discoveries.

Regards,
Carlos.

Hi,

Today we got the same error again.

Error:

slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential

(repeated)

slurmd.log:

root@spcd-euc1-00712:/var/log# cat slurmd.log
[2021-10-14T08:29:39.727] error: Domain socket directory /var/spool/slurm/d: No such file or directory
[2021-10-14T08:29:39.754] error: xcpuinfo_hwloc_topo_load: failed (load will be required after read failures).
[2021-10-14T08:29:39.755] error: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=2:2(hw)
[2021-10-14T08:29:39.774] WARNING: A line in gres.conf for GRES gpu:V100 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2021-10-14T08:29:39.784] CPU frequency setting not configured for this node
[2021-10-14T08:29:39.808] slurmd version 20.11.8 started
[2021-10-14T08:29:39.829] slurmd started on Thu, 14 Oct 2021 08:29:39 +0000
[2021-10-14T08:29:39.830] CPUs=8 Boards=1 Sockets=4 Cores=1 Threads=2 Memory=63523 TmpDisk=99202 Uptime=356 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-10-14T08:30:30.323] reissued job credential for job 192806
[2021-10-14T08:30:31.145] Launching batch job 192806 for UID 2001786

(In reply to Praveen SV from comment #12)
> [...]

The slurmd.log continues:

[2021-10-14T08:31:49.604] error: Failed to load current user environment variables
[2021-10-14T08:31:49.605] error: _get_user_env: Unable to get user's local environment, running only with passed environment
[2021-10-14T08:31:49.605] Launching batch job 192806 for UID 2001786
[2021-10-14T08:31:49.615] [192806.batch] error: couldn't open `/var/spool/slurm/d/job192806/slurm_script': File exists
[2021-10-14T08:31:49.617] [192806.batch] error: batch script setup failed for job 192806 on spcd-euc1-00712: File exists
[2021-10-14T08:31:49.617] [192806.batch] error: _step_setup: no job returned
[2021-10-14T08:31:49.617] error: slurmstepd return code 4010
[2021-10-14T08:31:49.617] [192806.batch] done with job
[2021-10-14T08:31:49.639] [192806.batch] error: *** JOB 192806 ON spcd-euc1-00712 CANCELLED AT 2021-10-14T08:31:49 DUE TO JOB REQUEUE ***
[2021-10-14T08:31:50.640] [192806.batch] error: unlink(/var/spool/slurm/d/job192806/slurm_script): No such file or directory
[2021-10-14T08:31:50.640] [192806.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
[2021-10-14T08:31:50.642] [192806.batch] error: rmdir(/var/spool/slurm/d/job192806): No such file or directory
[2021-10-14T08:31:50.643] [192806.batch] done with job
[2021-10-14T08:33:03.969] Launching batch job 192822 for UID 87848

Praveen - We have been talking about this issue internally, and I do have a few ideas. Oriol will follow up with you once you attach the output requested below.
Please confirm whether you have enabled "--num-threads=10" with munge.

Please also confirm that munge is still running ("systemctl status munge").

> [2021-10-14T08:29:39.727] error: Domain socket directory /var/spool/slurm/d: No such file or directory

SlurmdSpoolDir=/var/spool/slurm/d

Does this directory exist?

> error: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=2:2(hw)

Please send the output of "slurmd -C" from that compute node.

Please also attach the partition config and node config. The slurm.conf you have attached does not include this information.

> WARNING: A line in gres.conf for GRES gpu:V100 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.

Please also attach your gres.conf and the "nvidia-smi" output from that node.

The following errors are the result of the munge failures and the inaccessible spool directory:

> [2021-10-14T08:31:49.604] error: Failed to load current user environment variables
> [2021-10-14T08:31:49.605] error: _get_user_env: Unable to get user's local environment, running only with passed environment
> [2021-10-14T08:31:49.615] [192806.batch] error: couldn't open `/var/spool/slurm/d/job192806/slurm_script': File exists
> [2021-10-14T08:31:49.617] [192806.batch] error: batch script setup failed for job 192806 on spcd-euc1-00712: File exists
> [2021-10-14T08:31:49.617] [192806.batch] error: _step_setup: no job returned

Hi Jason,

(In reply to Jason Booth from comment #14)
> [...]

Answers to your queries:

> Please confirm if you have enabled "--num-threads=10" with munge.

How do I find this out?

> Please also confirm if munge is still running "systemctl status munge".

Yes:

root@spcd-euc1-xxxx:/var/log# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2021-10-17 06:25:01 UTC; 5 days ago
Docs: man:munged(8)
Main PID: 17997 (munged)
Tasks: 4 (limit: 4915)
CGroup: /system.slice/munge.service
└─17997 /usr/sbin/munged

> Does this directory exist?

Yes:

root@spcd-euc1-xxxx:/var/spool/slurm/d# ls
cred_state cred_state.old hwloc_topo_whole.xml

> Please send the output of "slurmd -C" from that compute node.

root@spcd-euc1-00712:~# slurmd -C
Command 'slurmd' not found, but can be installed with:
apt install slurm-wlm-emulator
apt install slurmd

but slurmd is running:

root@spcd-euc1-xxxx:~# systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2021-10-14 08:29:39 UTC; 1 weeks 1 days ago
Main PID: 9588 (slurmd)
Tasks: 1
CGroup: /system.slice/slurmd.service
└─9588 /shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/sbin/slurmd

Oct 14 08:29:39 spcd-euc1-xxxx systemd[1]: Starting Slurm node daemon...

> Please also attach the partition config and node config. The slurm.conf you
> have attached does not include this information.

root@spcd-euc1-xxxx:/shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/etc# cat gres.conf
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia0
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-3]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-7]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia0
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-3]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-7]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia0
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-3]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-7]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia0
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-3]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-7]

root@spcd-euc1-00712:/shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/etc# cat range_ip.conf
10.174.xx.xx-10.174.xxx.xx
subnet-0xxxxxxxx c5.4xlarge ""
subnet-0xxxxxxxx c5.9xlarge ""
subnet-0xxxxxxxx c5.18xlarge ""
subnet-0xxxxxxxx r5.4xlarge ""
subnet-0xxxxxxxx r5.12xlarge ""
subnet-0xxxxxxxx r5.24xlarge ""
subnet-0xxxxxxxx g4dn.4xlarge ""
subnet-0xxxxxxxx g4dn.12xlarge ""
subnet-0xxxxxxxx p3.16xlarge

For your information, we have configured Slurm to run on Amazon Web Services cloud servers. The above are the server types, subnet settings and IPs. For confidentiality reasons I am hiding the IP and subnet details.

Thanks,
Praveen

(In reply to Praveen SV from comment #17)

Hi Praveen,

I'll reply between the lines.

> > Please confirm if you have enabled "--num-threads=10" with munge.
>
> how to find this ?
You can see how the service is started by looking at the unit file:

[root@centos munge]# systemctl cat munge
# /usr/lib/systemd/system/munge.service
[Unit]
Description=MUNGE authentication service
Documentation=man:munged(8)
After=network.target
After=time-sync.target

[Service]
Type=forking
ExecStart=/usr/sbin/munged
PIDFile=/var/run/munge/munged.pid
User=munge
Group=munge
Restart=on-abort

[Install]
WantedBy=multi-user.target

If no flag is specified (as in my example), munged starts with 2 threads by default:

[root@centos munge]# munged --help
Usage: munged [OPTIONS]
...
--num-threads=INT Specify number of threads to spawn [2]
...

So what is needed here is to modify the unit file (in this example /usr/lib/systemd/system/munge.service) and add --num-threads=10 after munged:

ExecStart=/usr/sbin/munged --num-threads=10

After modifying a unit file, remember to reload systemd with:

systemctl daemon-reload

> root@spcd-euc1-00712:~# slurmd -C
> Command 'slurmd' not found, but can be installed with:
> [...]

It seems slurmd is not in your PATH; you can use the full path:

/shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/sbin/slurmd -C

> > Please also attach the partition config and node config. The slurm.conf you
> > have attached does not include this information.

What you attached is good, but we will also need the node and partition information, which you can get with the following commands:

scontrol show nodes
scontrol show partitions

You can also further confirm that munge is working properly by issuing the following commands from the management node (substitute compute-node with the name of one of your compute nodes):

munge -n | ssh compute-node unmunge
ssh compute-node munge -n | unmunge

Regards.

Hi,

Please find the requested details.

Output of systemctl cat munge (I have added --num-threads=10):

root@xxxx:~# systemctl cat munge
# /lib/systemd/system/munge.service
[Unit]
Description=MUNGE authentication service
Documentation=man:munged(8)
After=network.target
After=time-sync.target

[Service]
Type=forking
ExecStart=/usr/sbin/munged --num-threads=10
PIDFile=/var/run/munge/munged.pid
User=munge
Group=munge
Restart=on-abort

[Install]
WantedBy=multi-user.target

munge -n | ssh compute-node unmunge:

xxxx@xxxx:~$ munge -n | ssh xxxxx unmunge
Password:
STATUS: Success (0)
ENCODE_HOST: xxxx (10.174.242.200)
ENCODE_TIME: 2021-10-25 19:09:25 +0000 (1635188965)
DECODE_TIME: 2021-10-25 19:09:36 +0000 (1635188976)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: username (93343)
GID: dialout (20)
LENGTH: 0

ssh compute-node munge -n | unmunge:

xxxx@xxxx:~$ ssh xxxxxx munge -n | unmunge
Password:
STATUS: Success (0)
ENCODE_HOST: xxxx (10.174.242.200)
ENCODE_TIME: 2021-10-25 19:10:51 +0000 (1635189051)
DECODE_TIME: 2021-10-25 19:10:51 +0000 (1635189051)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: username (93343)
GID: dialout (20)
LENGTH: 0
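Beyond a single munge/unmunge round-trip, the --num-threads change can be exercised with many concurrent round-trips. A hedged sketch: the helper and job count are invented for illustration, and the actual munge commands are only shown in the usage comment:

```shell
# Run the same shell command N times in parallel and wait for all of them.
# Intended use: hammer munged with concurrent encode/decode round-trips
# after raising --num-threads, e.g. on a compute node:
#   run_parallel 'munge -n | unmunge > /dev/null' 50
# Note: a plain `wait` does not propagate child failures, so check
# munged's log afterwards for errors.
run_parallel() {
    cmd="$1"
    n="$2"
    i=0
    while [ "$i" -lt "$n" ]; do
        sh -c "$cmd" &
        i=$((i + 1))
    done
    wait
}
```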
/shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/sbin/slurmd -C:

NodeName=xxxxx CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=63523 UpTime=11-10:50:27

The outputs of scontrol show nodes and scontrol show partitions are attached as text files.

Created attachment 21923 [details]
scontrol show partitions
Created attachment 21924 [details]
scontrol show nodes
Hi Praveen,

The munge | unmunge test checks that munge works as expected under normal circumstances; we added the num-threads option to prepare it for heavier loads. Is there a way you can test with jobs similar to the ones that failed, to check whether load was the culprit?

Regarding the other errors:

error: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=2:2(hw)

This means the node is configured with different resources than it actually has. That is no big deal when you configure fewer resources than the machine has, but for SocketsPerBoard you are doing the opposite (4 configured vs. 1 on the machine). You should fix that; what we normally recommend is putting the output of slurmd -C into the node definition:

NodeName=xxxxx CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=63523

WARNING: A line in gres.conf for GRES gpu:V100 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.

This means there is an inconsistency between your node configuration and the gres.conf file: you have one more GPU configured in gres.conf than declared on the node.

Please tell me as soon as you have results with the new munge parameters. Thanks.

Hi Praveen,

I've lowered the severity to 3 in accordance with the support webpage, and also because we are in the phase of testing whether the munge change is enough to solve the issue.

Greetings.

Hi Praveen,

Do you have any news regarding the change to --num-threads=10 on munge?

Greetings.

We are timing out this issue. If you have any update regarding comment#24, replying to this bug will re-open the issue.
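Carlos's earlier suggestion, checking journalctl and dmesg for OOM activity around the failure times, can be scripted. A sketch; the grep patterns are assumptions and may need adjusting to the kernel's exact OOM message wording:

```shell
# Commands to run on an affected compute node (shown as comments because
# they need the real system logs; adjust the time window as needed):
#   journalctl -u munge --since "1 hour ago"
#   dmesg -T | find_oom_events

# Filter stdin for typical kernel OOM-killer messages.
find_oom_events() {
    grep -i -E 'out of memory|oom-killer|killed process'
}
```

If munged shows up in the filtered output, that would confirm the theory that the kernel is killing it under memory pressure, and reserving memory for system processes (core specialization) would be the fix to pursue.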