Ticket 12436

Summary: slurmstepd errors
Product: Slurm Reporter: Praveen SV <vijayap>
Component: slurmstepd Assignee: Oriol Vilarrubi <jvilarru>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: tripiana
Version: 20.11.8   
Hardware: Linux   
OS: Linux   
Site: Roche/PHCIX
Attachments: slurmconf
scontrol show partitions
scontrol show nodes

Description Praveen SV 2021-09-06 23:42:11 MDT
Hi Team,

We are currently facing intermittent errors when we submit jobs; submissions fail with the error messages below. The issue started after we upgraded Slurm from 20.02.5 to 20.11.8. Can you please help us resolve this?

slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential
slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential
slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory


Thanks
Praveen
Comment 1 Carlos Tripiana Montes 2021-09-07 02:22:19 MDT
Take a look first at:

slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory

munged needs to be running on all cluster nodes, both compute and management.

Please report back after checking this.

Thanks,
Carlos.
Comment 2 Carlos Tripiana Montes 2021-09-13 00:24:39 MDT
Hi Praveen,

I'm going to assume the problem was munge not running properly.

If you don't mind, let's close the issue for now, and reopen it if necessary.

Regards.
Comment 3 Praveen SV 2021-09-14 06:13:06 MDT
Hi Carlos,

Sorry for the late response. I have verified the status of munge on all the nodes.


Master Node
[root@MASTER:~ ] $ systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-09-13 06:25:01 UTC; 1 day 5h ago
     Docs: man:munged(8)
  Process: 27576 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 27585 (munged)
    Tasks: 4 (limit: 4915)
   CGroup: /system.slice/munge.service
           └─27585 /usr/sbin/munged

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Management Node
root@management:~# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-09-13 06:25:02 UTC; 1 day 5h ago
     Docs: man:munged(8)
 Main PID: 21664 (munged)
    Tasks: 4 (limit: 4915)
   Memory: 3.2M
      CPU: 51.667s
   CGroup: /system.slice/munge.service
           └─21664 /usr/sbin/munged

Compute Node
root@compute:~# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2021-09-12 06:25:02 UTC; 2 days ago
     Docs: man:munged(8)
  Process: 31676 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 31684 (munged)
    Tasks: 4 (limit: 5529)
   CGroup: /system.slice/munge.service
           └─31684 /usr/sbin/munged

Sep 12 06:25:02 spcd-euc1-00538.aws.science.roche.com systemd[1]: Starting MUNGE authentication service...
Sep 12 06:25:02 spcd-euc1-00538.aws.science.roche.com systemd[1]: Started MUNGE authentication service.



But we still get this error at times:
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential
slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential
slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory

Regards,
Praveen
Comment 4 Carlos Tripiana Montes 2021-09-14 08:48:19 MDT
Since the errors come from slurmstepd, the problem is on the compute nodes. Taking into account:

slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory

We need to check (with the ls command) whether that socket exists and whether slurmd (is it running as root?) has access to it. Using AuthInfo="socket=/var/run/munge/munge.socket.2" (https://slurm.schedmd.com/slurm.conf.html) you can define another path; the value shown is the default.
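For reference, this is a minimal sketch of how the socket path is set explicitly in slurm.conf; since the path shown is the default, the AuthInfo line only matters if munged is configured to create its socket somewhere else:

```
# slurm.conf (sketch): tell Slurm where the munge socket lives.
# The default is already /var/run/munge/munge.socket.2, so this
# line is only needed when munged uses a non-default socket path.
AuthType=auth/munge
AuthInfo=socket=/var/run/munge/munge.socket.2
```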

Check this, and see if this solves the problem.

Thanks,
Carlos.
Comment 5 Praveen SV 2021-09-16 06:18:50 MDT
Hi Carlos,

I have checked the ownership; please see below.

cd /var/run/
ls -l
drwxr-xr-x  2 munge   munge    100 Sep 12 06:25 munge

cd /var/run/munge

srwxrwxrwx 1 munge munge 0 Sep 12 06:25 munge.socket.2
--w------- 1 munge munge 0 Sep 12 06:25 munge.socket.2.lock
-rw-r--r-- 1 munge munge 5 Sep 12 06:25 munged.pid


Our current Slurm configuration:

slurm.conf

#AuthType=auth/mungei
#AuthType=auth/mungei
AuthAltTypes=auth/jwt
AuthAltTypes=auth/jwt


Slurm is running as root.


Best regards,
Praveen
Comment 6 Carlos Tripiana Montes 2021-09-17 00:58:19 MDT
This might be related (https://slurm.schedmd.com/slurm.conf.html):

AuthAltTypes
Comma-separated list of alternative authentication plugins that the slurmctld will permit for communication. Acceptable values at present include auth/jwt.
NOTE: auth/jwt requires a jwt_hs256.key to be populated in the StateSaveLocation directory for slurmctld only. The jwt_hs256.key should only be visible to the SlurmUser and root. It is not suggested to place the jwt_hs256.key on any nodes but the controller running slurmctld. auth/jwt can be activated by the presence of the SLURM_JWT environment variable. When activated, it will override the default AuthType.


"When activated, it will override the default AuthType"

So, if your config is:

#AuthType=auth/mungei
#AuthType=auth/mungei
AuthAltTypes=auth/jwt
AuthAltTypes=auth/jwt

my ideas here are:

1. Lines duplicated. Just remove dupes.
2. AuthType has wrong value auth/mungei. 'i' should be removed.
3. AuthType is commented. So it defaults to auth/munge. In the end, this extra 'i' has no effect.
4. Is AuthAltTypes=auth/jwt desirable and properly configured on your system? It may be overriding the expected behaviour of the munge plugin; I need more info to check/reproduce.
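For clarity, the deduplicated auth section would look like the sketch below; whether auth/jwt should remain at all is exactly the open question in point 4:

```
# slurm.conf (sketch): one line per setting, typo removed
AuthType=auth/munge      # the default value; stray 'i' dropped
AuthAltTypes=auth/jwt    # keep only if JWT auth is really intended
```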

Please post the complete slurmctld.log and the slurmd.log for the job's node. Also, please post the slurm.conf file if possible.

This information should be enough to fully understand what's happening and to reproduce the issue.

Thanks,
Carlos.
Comment 7 Praveen SV 2021-09-17 12:24:25 MDT
Created attachment 21344 [details]
slurmconf
Comment 8 Carlos Tripiana Montes 2021-09-20 00:48:15 MDT
Praveen,

I can't see anything relevant in your config that would trigger this error, and I've been unable to reproduce it so far either.

Please post the log files from the controller and the affected compute node, as well as the munge logs from both, so I can dig deeper into the problem.

Thanks,
Carlos.
Comment 9 Praveen SV 2021-10-05 10:16:49 MDT
Hi Team,

Today we got this error again:

slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential
slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential
slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential
slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential

Regards,
Praveen
Comment 11 Carlos Tripiana Montes 2021-10-05 23:56:46 MDT
Hi Praveen,

After double-checking, I can confirm that, as we discussed at the very beginning of this issue, there's only one possibility if you see this on compute nodes:

slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory

At the time of this error, the munge daemon on this node was probably down (munge.socket.2 not on disk). Since munge is controlled by systemd, my bet is that systemd killed or restarted the service for some reason (forced restart, crashed daemon, ...?).

I suggest you check the munge log with journalctl, and check dmesg for OOM events or anything related to munge.
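That check can be sketched like this; the kernel log line below is a fabricated sample for illustration only, and on a real node you would pipe `dmesg` or `journalctl -k` output through the same grep instead:

```shell
# Sample kernel log line standing in for real `dmesg` output:
log='Oct 14 08:31:00 node kernel: Out of memory: Killed process 17997 (munged)'

# Case-insensitive scan for OOM-killer activity:
if printf '%s\n' "$log" | grep -Eiq 'out of memory|oom-kill'; then
    echo "OOM event found"
fi
```

If munged's PID ever shows up in such a line, the "socket not found" errors and the kernel killing the daemon line up exactly.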

Without access to these nodes, my feeling is that the problem is related to memory pressure while running jobs, with the kernel possibly killing processes when it runs into trouble.

You may want to take a look at core specialization in Slurm (https://slurm.schedmd.com/core_spec.html) to ensure there is always some CPU and memory reserved for the OS and system processes.
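As a sketch (the node name and sizes below are placeholders, not taken from your configuration), such a reservation is declared per node in slurm.conf:

```
# slurm.conf (sketch): keep 1 core and 2 GiB of RAM out of Slurm's
# hands so the OS, munged and slurmd always have some headroom.
NodeName=compute[01-10] CPUs=16 RealMemory=63523 CoreSpecCount=1 MemSpecLimit=2048
```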

Let us know any discoveries.

Regards,
Carlos.
Comment 12 Praveen SV 2021-10-14 02:34:47 MDT
Hi,

Today again we got the same error.


Error

slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
slurmstepd: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_COMPLETE_BATCH_SCRIPT has authentication error: Invalid authentication credential
slurmstepd: error: If munged is up, restart with --num-threads=10
slurmstepd: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory


slurmd.log

root@spcd-euc1-00712:/var/log# cat slurmd.log
[2021-10-14T08:29:39.727] error: Domain socket directory /var/spool/slurm/d: No such file or directory
[2021-10-14T08:29:39.754] error: xcpuinfo_hwloc_topo_load: failed (load will be required after read failures).
[2021-10-14T08:29:39.755] error: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=2:2(hw)
[2021-10-14T08:29:39.774] WARNING: A line in gres.conf for GRES gpu:V100 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2021-10-14T08:29:39.784] CPU frequency setting not configured for this node
[2021-10-14T08:29:39.808] slurmd version 20.11.8 started
[2021-10-14T08:29:39.829] slurmd started on Thu, 14 Oct 2021 08:29:39 +0000
[2021-10-14T08:29:39.830] CPUs=8 Boards=1 Sockets=4 Cores=1 Threads=2 Memory=63523 TmpDisk=99202 Uptime=356 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-10-14T08:30:30.323] reissued job credential for job 192806
[2021-10-14T08:30:31.145] Launching batch job 192806 for UID 2001786
Comment 13 Praveen SV 2021-10-14 02:37:38 MDT
(In reply to Praveen SV from comment #12)
> [quote of comment #12 trimmed]

------

Full slurmd.log:

root@spcd-euc1-00712:/var/log# cat slurmd.log
[2021-10-14T08:29:39.727] error: Domain socket directory /var/spool/slurm/d: No such file or directory
[2021-10-14T08:29:39.754] error: xcpuinfo_hwloc_topo_load: failed (load will be required after read failures).
[2021-10-14T08:29:39.755] error: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=2:2(hw)
[2021-10-14T08:29:39.774] WARNING: A line in gres.conf for GRES gpu:V100 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2021-10-14T08:29:39.784] CPU frequency setting not configured for this node
[2021-10-14T08:29:39.808] slurmd version 20.11.8 started
[2021-10-14T08:29:39.829] slurmd started on Thu, 14 Oct 2021 08:29:39 +0000
[2021-10-14T08:29:39.830] CPUs=8 Boards=1 Sockets=4 Cores=1 Threads=2 Memory=63523 TmpDisk=99202 Uptime=356 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-10-14T08:30:30.323] reissued job credential for job 192806
[2021-10-14T08:30:31.145] Launching batch job 192806 for UID 2001786
[2021-10-14T08:31:49.604] error: Failed to load current user environment variables
[2021-10-14T08:31:49.605] error: _get_user_env: Unable to get user's local environment, running only with passed environment
[2021-10-14T08:31:49.605] Launching batch job 192806 for UID 2001786
[2021-10-14T08:31:49.615] [192806.batch] error: couldn't open `/var/spool/slurm/d/job192806/slurm_script': File exists
[2021-10-14T08:31:49.617] [192806.batch] error: batch script setup failed for job 192806 on spcd-euc1-00712: File exists
[2021-10-14T08:31:49.617] [192806.batch] error: _step_setup: no job returned
[2021-10-14T08:31:49.617] error: slurmstepd return code 4010
[2021-10-14T08:31:49.617] [192806.batch] done with job
[2021-10-14T08:31:49.639] [192806.batch] error: *** JOB 192806 ON spcd-euc1-00712 CANCELLED AT 2021-10-14T08:31:49 DUE TO JOB REQUEUE ***
[2021-10-14T08:31:50.640] [192806.batch] error: unlink(/var/spool/slurm/d/job192806/slurm_script): No such file or directory
[2021-10-14T08:31:50.640] [192806.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
[2021-10-14T08:31:50.642] [192806.batch] error: rmdir(/var/spool/slurm/d/job192806): No such file or directory
[2021-10-14T08:31:50.643] [192806.batch] done with job
[2021-10-14T08:33:03.969] Launching batch job 192822 for UID 87848
Comment 14 Jason Booth 2021-10-14 09:43:06 MDT
Praveen - We have been talking about this issue internally, and I do have a few ideas. 

Oriol will follow up with you once you attach the output requested below.

Please confirm if you have enabled "--num-threads=10" with munge.

Please also confirm if munge is still running "systemctl status munge".


> [2021-10-14T08:29:39.727] error: Domain socket directory /var/spool/slurm/d: No such file or directory
> SlurmdSpoolDir=/var/spool/slurm/d

Does this directory exist? 


> error: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=2:2(hw)
Please send the output of "slurmd -C" from that compute node.

Please also attach the partition config and node config. The slurm.conf you have attached does not include this information.

>  WARNING: A line in gres.conf for GRES gpu:V100 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
Please also attach your gres.conf and the "nvidia-smi" output from that node.

The following errors are the result of the munge failures and the inaccessible spool directory:

> [2021-10-14T08:31:49.604] error: Failed to load current user environment variables
> [2021-10-14T08:31:49.605] error: _get_user_env: Unable to get user's local environment, running only with passed environment

> [2021-10-14T08:31:49.615] [192806.batch] error: couldn't open `/var/spool/slurm/d/job192806/slurm_script': File exists
> [2021-10-14T08:31:49.617] [192806.batch] error: batch script setup failed for job 192806 on spcd-euc1-00712: File exists[2021-10-14T08:31:49.617] [192806.batch] error: _step_setup: no job returned
Comment 17 Praveen SV 2021-10-22 12:06:02 MDT
Hi Jason,

Please see my answers to your queries below.

(In reply to Jason Booth from comment #14)

> Please confirm if you have enabled "--num-threads=10" with munge.

How do I find this?
> 
> Please also confirm if munge is still running "systemctl status munge".

yes 
root@spcd-euc1-xxxx:/var/log# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2021-10-17 06:25:01 UTC; 5 days ago
     Docs: man:munged(8)
 Main PID: 17997 (munged)
    Tasks: 4 (limit: 4915)
   CGroup: /system.slice/munge.service
           └─17997 /usr/sbin/munged


>> Does this directory exist ?

yes
root@spcd-euc1-xxxx:/var/spool/slurm/d# ls
cred_state  cred_state.old  hwloc_topo_whole.xml

> > error: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=2:2(hw)
> Please send the output of "slurmd -C" from that compute node.

root@spcd-euc1-00712:~# slurmd -C

Command 'slurmd' not found, but can be installed with:

apt install slurm-wlm-emulator
apt install slurmd

but slurmd is running
root@spcd-euc1-xxxx:~#  systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-10-14 08:29:39 UTC; 1 weeks 1 days ago
 Main PID: 9588 (slurmd)
    Tasks: 1
   CGroup: /system.slice/slurmd.service
           └─9588 /shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/sbin/slurmd

Oct 14 08:29:39 spcd-euc1-xxxx systemd[1]: Starting Slurm node daemon...


> Please also attach the partition config and node config. The slurm.conf you
> have attached does not include this information.

root@spcd-euc1-xxxx:/shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/etc# cat gres.conf
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia0
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-3]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-7]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia0
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-3]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-7]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia0
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-3]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-7]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia0
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-3]
NodeName=spcd-euc1-[XXXX-XXXX] Name=gpu Type=V100 File=/dev/nvidia[0-7]



root@spcd-euc1-00712:/shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/etc# cat range_ip.conf
10.174.xx.xx-10.174.xxx.xx subnet-0xxxxxxxx c5.4xlarge
"" subnet-0xxxxxxxx c5.9xlarge
"" subnet-0xxxxxxxx c5.18xlarge
"" subnet-0xxxxxxxx r5.4xlarge
"" subnet-0xxxxxxxx r5.12xlarge
"" subnet-0xxxxxxxx r5.24xlarge
"" subnet-0xxxxxxxx g4dn.4xlarge
"" subnet-0xxxxxxxx g4dn.12xlarge
"" subnet-0xxxxxxxx p3.16xlarge

For your information, we have configured Slurm to run on Amazon Web Services cloud servers. The above are the server types, subnet settings and IPs. For confidentiality reasons I'm hiding the IP and subnet details.


Thanks 
Praveen
Comment 18 Oriol Vilarrubi 2021-10-25 04:40:30 MDT
(In reply to Praveen SV from comment #17)
Hi Praveen, I'll reply inline.

> > Please confirm if you have enabled "--num-threads=10" with munge.
> 
> how to find this ?

You can see how the service is started by looking at the unit file:

[root@centos munge]# systemctl cat munge
# /usr/lib/systemd/system/munge.service
[Unit]
Description=MUNGE authentication service
Documentation=man:munged(8)
After=network.target
After=time-sync.target

[Service]
Type=forking
ExecStart=/usr/sbin/munged
PIDFile=/var/run/munge/munged.pid
User=munge
Group=munge
Restart=on-abort

[Install]
WantedBy=multi-user.target

---

If no flag is specified (as you can see in my example), munged starts with 2 threads by default:

[root@centos munge]# munged --help
Usage: munged [OPTIONS]

...
  --num-threads=INT        Specify number of threads to spawn [2]
...

So what is needed here is to modify the unit file (in this example /usr/lib/systemd/system/munge.service) and add --num-threads=10 after munged, i.e.:

ExecStart=/usr/sbin/munged --num-threads=10

After modifying a unit file, remember to reload systemd with the following command: systemctl daemon-reload
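A variant of the same change that a package upgrade cannot silently revert is a systemd drop-in override instead of editing the shipped unit file (a sketch; `systemctl edit munge` creates the override file for you):

```
# /etc/systemd/system/munge.service.d/override.conf
# Created via: systemctl edit munge
[Service]
# Clear the packaged ExecStart first, then set the new one.
ExecStart=
ExecStart=/usr/sbin/munged --num-threads=10
```

Afterwards, `systemctl daemon-reload` and `systemctl restart munge` apply it.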


> > > error: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=2:2(hw)
> > Please send the output of "slurmd -C" from that compute node.
> 
> root@spcd-euc1-00712:~# slurmd -C
> 
> Command 'slurmd' not found, but can be installed with:
> 
> apt install slurm-wlm-emulator
> apt install slurmd
> 
> but slurmd is running
> root@spcd-euc1-xxxx:~#  systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor
> preset: enabled)
>    Active: active (running) since Thu 2021-10-14 08:29:39 UTC; 1 weeks 1
> days ago
>  Main PID: 9588 (slurmd)
>     Tasks: 1
>    CGroup: /system.slice/slurmd.service
>            └─9588 /shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/sbin/slurmd
> 
> Oct 14 08:29:39 spcd-euc1-xxxx systemd[1]: Starting Slurm node daemon...
 
It seems slurmd is not in your PATH; you can use the full path:

/shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/sbin/slurmd -C
 
> > Please also attach the partition config and node config. The slurm.conf you
> > have attached does not include this information.
> 
What you attached is good, but we will also need the node and partition information; you can get it with the following commands:

scontrol show nodes
scontrol show partitions

You can also further confirm that munge is working properly by issuing the following commands from the management node (substitute compute-node with the name of one of your compute nodes):

munge -n | ssh compute-node unmunge
ssh compute-node munge -n | unmunge

Regards.
Comment 19 Praveen SV 2021-10-25 13:17:43 MDT
Hi,

Please find the details requested

Output of "systemctl cat munge":

I have added --num-threads=10

root@xxxx:~# systemctl cat munge
# /lib/systemd/system/munge.service
[Unit]
Description=MUNGE authentication service
Documentation=man:munged(8)
After=network.target
After=time-sync.target

[Service]
Type=forking
ExecStart=/usr/sbin/munged --num-threads=10
PIDFile=/var/run/munge/munged.pid
User=munge
Group=munge
Restart=on-abort

[Install]
WantedBy=multi-user.target

---------------------------------------

munge -n | ssh compute-node unmunge
ssh compute-node munge -n | unmunge

xxxx@xxxx:~$ munge -n | ssh xxxxx unmunge

Password:
STATUS:           Success (0)
ENCODE_HOST:      xxxx (10.174.242.200)
ENCODE_TIME:      2021-10-25 19:09:25 +0000 (1635188965)
DECODE_TIME:      2021-10-25 19:09:36 +0000 (1635188976)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              username (93343)
GID:              dialout (20)
LENGTH:           0

xxxx@xxxx:~$ ssh xxxxxx munge -n | unmunge
Password:
STATUS:           Success (0)
ENCODE_HOST:      xxxx (10.174.242.200)
ENCODE_TIME:      2021-10-25 19:10:51 +0000 (1635189051)
DECODE_TIME:      2021-10-25 19:10:51 +0000 (1635189051)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              username (93343)
GID:              dialout (20)
LENGTH:           0


----------------------------------------------------------------------

/shared/slurm_SLURM-MASTER-EUC1-HPC-PRD/sbin/slurmd -C
NodeName=xxxxx CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=63523
UpTime=11-10:50:27



---------------------------------------------------------------------

scontrol show nodes
scontrol show partitions

attached as txt file
Comment 20 Praveen SV 2021-10-25 13:19:10 MDT
Created attachment 21923 [details]
scontrol show partitions
Comment 21 Praveen SV 2021-10-25 13:20:01 MDT
Created attachment 21924 [details]
scontrol show nodes
Comment 22 Oriol Vilarrubi 2021-10-26 12:12:02 MDT
Hi Praveen,

The munge | unmunge test checks that munge works as expected under normal circumstances.
We added --num-threads to prepare it for heavier loads. Is there a way you can run jobs similar to the ones that failed, to check whether that was the culprit?

Regarding the other errors:

error: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=4:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=2:2(hw)

This basically means that you have configured the node with different resources than it actually has. That is no big deal if you configure fewer resources than the machine has, but in your case SocketsPerBoard goes the other way around (4 configured vs. 1 on the machine). You should fix that; what we normally recommend is to put the output of slurmd -C into the node definition:

NodeName=xxxxx CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=63523
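So, taking that slurmd -C output verbatim (node name redacted as in your paste), the node definition in slurm.conf would become:

```
# slurm.conf (sketch): node line built from the 'slurmd -C' output
NodeName=spcd-euc1-xxxx CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=63523
```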


WARNING: A line in gres.conf for GRES gpu:V100 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.

This means that you have an inconsistency between your node configuration and the gres.conf file: you have one more GPU configured in gres.conf than is stated in the node declaration.
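For example (a sketch with placeholder counts, not your real values), the Gres count on the node line has to match the device files listed for that node in gres.conf:

```
# slurm.conf (sketch): the node declares 4 GPUs...
NodeName=spcd-euc1-xxxx Gres=gpu:V100:4 CPUs=16 RealMemory=63523

# gres.conf (sketch): ...and exactly 4 device files back that up
NodeName=spcd-euc1-xxxx Name=gpu Type=V100 File=/dev/nvidia[0-3]
```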

Please let me know as soon as you are able to test the new munge parameters, thanks.
Comment 23 Oriol Vilarrubi 2021-10-28 11:27:01 MDT
Hi Praveen,

I've lowered the severity to 3 in accordance with the support webpage, and also because we are in the phase of testing whether the munge change is enough to solve the issue.


Greetings.
Comment 24 Oriol Vilarrubi 2021-11-03 13:06:14 MDT
Hi Praveen,

Do you have any news regarding the change to --num-threads=10 on munge?

Greetings.
Comment 25 Jason Booth 2021-11-10 12:32:41 MST
We are timing out this issue. If you have any update regarding comment #24, replying to this ticket will re-open it.