Ticket 11727

Summary: "srun: error: mpi/pmi2: failed to send temp kvs to compute nodes" after upgrade to 20.11.7
Product: Slurm
Reporter: yitp.support
Component: slurmd
Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---
CC: felip.moll, tripiana
Version: 20.11.7
Hardware: Linux
OS: Linux
Site: Kyoto University
Attachments: config files
slurmd logs
slurmctld.log

Description yitp.support 2021-05-28 23:39:58 MDT
Some users' jobs fail with the message "srun: error: mpi/pmi2: failed to send temp kvs to compute nodes" after upgrading from 20.02.6 to 20.11.7.
I didn't see the message in the 20.02.6 environment.

I increased the log level of slurmd and collected the logs.
I'll attach them later.

Could you tell me how to fix this or work around it?
Until then, users cannot run their jobs.
Comment 1 yitp.support 2021-05-28 23:42:43 MDT
Created attachment 19711 [details]
config files

slurm.conf and topology.conf
OS is CentOS 8.
Comment 2 yitp.support 2021-05-28 23:46:54 MDT
Created attachment 19712 [details]
slurmd logs

JOBID 51023: debug2
JOBID 51024: debug5

cn0001: BatchHost
cn0002: one of the compute nodes
Comment 3 Felip Moll 2021-05-31 02:06:05 MDT
(In reply to yitp.support from comment #2)
> Created attachment 19712 [details]
> slumd logs
> 
> JOBID 51023: debug2
> JOBID 51024: debug5
> 
> cn0001: BatchHost
> cn0002: one of the compute nodes

Hi,

Normally debug2 is enough; higher levels just add too much verbosity.
Can you please upload the slurmctld log too?

I'd also need to know how these jobs are launched (number of tasks, ranks, nodes) and to see the batch script.

At first glance I see that the job receives a signal RPC, which could interrupt the KVS operations and trigger the error you see:

[2021-05-28T01:02:48.209] debug2: Start processing RPC: REQUEST_SIGNAL_TASKS
[2021-05-28T01:02:48.209] debug2: Processing RPC: REQUEST_SIGNAL_TASKS
[2021-05-28T01:02:48.210] [51023.0] debug:  Handling REQUEST_STEP_UID
[2021-05-28T01:02:48.210] debug:  _rpc_signal_tasks: sending signal 9 to StepId=51023.0 flag 0
[2021-05-28T01:02:48.210] debug2: container signal 9 to StepId=51023.0
[2021-05-28T01:02:48.210] [51023.0] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2021-05-28T01:02:48.210] [51023.0] debug:  _handle_signal_container for StepId=51023.0 uid=0 signal=9
[2021-05-28T01:02:48.210] [51023.0] error: *** STEP 51023.0 ON cn0001 CANCELLED AT 2021-05-28T01:02:48 ***
[2021-05-28T01:02:48.210] [51023.0] debug2: proctrack/cgroup: proctrack_p_signal: killing process 80874 (slurm_task) with signal 9
[2021-05-28T01:02:48.210] [51023.0] Sent signal 9 to StepId=51023.0

Does it happen to all jobs?

Thanks!
Comment 4 yitp.support 2021-05-31 02:16:31 MDT
I'll attach the slurmctld.log later, though it was collected at the "info" level.

This problem occurred with 130 nodes, though it didn't with 120 nodes (JOBID: 51022).


The job script is as follows:

#!/bin/sh

#SBATCH -N 130
#SBATCH -n 130
#SBATCH -c 1

srun ./IMB-MPI1 AlltoAll -npmin 120
Comment 5 yitp.support 2021-05-31 02:17:00 MDT
Created attachment 19713 [details]
slurmctld.log
Comment 6 Felip Moll 2021-05-31 05:32:45 MDT
(In reply to yitp.support from comment #5)
> Created attachment 19713 [details]
> slurmctld.log

Okay, this log is not useful, but no problem:

Can you grep the logs of the 130 nodes and check whether any of them contains this string?

unpackmem_xmalloc


If you find it, please send me that node's log.
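The grep above can be run over the collected per-node logs in one shot (the directory layout and file names below are assumptions for illustration; adjust to wherever the node logs were gathered):

```shell
# List every collected slurmd log that contains the truncated-unpack error.
# "logs/" and the *.slurmd.log naming are illustrative, not a Slurm default.
grep -l 'unpackmem_xmalloc' logs/*.slurmd.log
```

`grep -l` prints only the names of matching files, which directly identifies the node whose log should be uploaded.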
Comment 7 yitp.support 2021-05-31 07:13:05 MDT
Hi,

I checked slurmd.log on all nodes; there was no line containing "unpackmem_xmalloc".
Comment 8 Felip Moll 2021-05-31 07:17:44 MDT
(In reply to yitp.support from comment #7)
> Hi,
> 
> I checked slurmd.log on all node, then there was no line that includes
> "unpackmem_xmalloc".

OK. Then please increase the debug level of slurmctld to debug2, run the job until it fails, and send me back the logs.

Also, run the job with the -v flag ('srun -v ...') to get verbose output from srun.

Thanks
Comment 9 yitp.support 2021-05-31 07:22:02 MDT
It will be difficult to get the debug logs until August.
I must wait for the next maintenance window.
Comment 10 Felip Moll 2021-05-31 08:35:51 MDT
This bug is very close to bug 10735 (also opened by Dell).

If you cannot provide any more information until August, is it OK to CC you on bug 10735 and close this one?
Comment 11 yitp.support 2021-05-31 08:50:06 MDT
I disagree.

I've already increased the array sizes as follows, but the problem persists.

-#define MAX_ARRAY_LEN_SMALL    10000
-#define MAX_ARRAY_LEN_MEDIUM   1000000
-#define MAX_ARRAY_LEN_LARGE    100000000
+#define MAX_ARRAY_LEN_SMALL    100000
+#define MAX_ARRAY_LEN_MEDIUM   10000000
+#define MAX_ARRAY_LEN_LARGE    1000000000
Comment 12 Felip Moll 2021-05-31 11:54:26 MDT
(In reply to yitp.support from comment #11)
> I disagree.
> 
> I've already increased the array size as follows, but the problem is not
> fixed.
> 
> -#define MAX_ARRAY_LEN_SMALL    10000
> -#define MAX_ARRAY_LEN_MEDIUM   1000000
> -#define MAX_ARRAY_LEN_LARGE    100000000
> +#define MAX_ARRAY_LEN_SMALL    100000
> +#define MAX_ARRAY_LEN_MEDIUM   10000000
> +#define MAX_ARRAY_LEN_LARGE    1000000000

Have you recompiled all of Slurm and installed/deployed it with these code changes?

If you have reproduced the issue this way, then I'd guess getting the debug logs wouldn't be a problem. Why can't we get the logs?

I may have misunderstood something.
Comment 13 yitp.support 2021-05-31 16:48:50 MDT
I created bug 10735.
I modified the code and compiled/installed 20.11.7 during the previous maintenance window, before collecting the debug logs I sent you.

The cluster consists of 137 compute nodes, so using 130 of them is difficult during the production window.
Therefore I need to wait for the next maintenance.
Comment 14 Felip Moll 2021-07-05 04:44:17 MDT
(In reply to yitp.support from comment #13)
> I created the bug 10735.
> I modified the code and compiled/installed 20.11.7 during the previous
> maintenance window before getting debug logs I sent you.
> 
> The cluster consists of 137 compute nodes, so using 130 nodes is difficult
> during production window.
> Therefore I need to wait for next maintenance.

Hello, do you already have a fixed date for the next maintenance?
Is the issue still happening?

Thank you
Comment 15 yitp.support 2021-07-05 10:47:51 MDT
The next maintenance will likely be held in mid-August, but the date is not fixed yet.
I must wait for the maintenance to check whether this problem still happens, because I need 130 of the 135 nodes in the cluster.
Comment 16 Felip Moll 2021-08-20 04:11:18 MDT
(In reply to yitp.support from comment #15)
> Next maintenance would be held in the middle of Aug, but it's not fixed yet.
> I must wait for the maintenance to check if this problem still happens
> because I need 130/135 nodes in the cluster.

Hi, has the maintenance already happened?

Do you have any feedback for me?
Comment 17 yitp.support 2021-08-22 20:47:16 MDT
Hi,
The maintenance will be held on Aug. 25-26 JST.
I'll let you know the result.
Comment 18 yitp.support 2021-08-26 22:30:52 MDT
This problem is resolved.

There was a node still running the old version (20.02.6).
Jobs failed only when that node was included in the node list.

Please close this bug.
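A version mismatch like this can be spotted without waiting for a maintenance window by comparing the slurmd version each node reports. A sketch, assuming sinfo's `%v` format field (per-node slurmd version) and an expected version string supplied by the admin:

```shell
# Flag any node whose reported slurmd version differs from the expected one.
# EXPECTED is an assumption for this site; adjust as needed.
EXPECTED=20.11.7
sinfo -N -h -o '%N %v' | awk -v want="$EXPECTED" '$2 != want {print $1, "runs", $2}'
```

Any line printed names a node that should be upgraded before it is returned to the node list.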
Comment 19 Felip Moll 2021-08-27 03:01:04 MDT
(In reply to yitp.support from comment #18)
> This problem is resolved.
> 
> There was a node that was running old version (20.02.6).
> The jobs failed only when the node was included in nodelist.
> 
> Please close this bug.

That's really good to know.

Thank you very much for this information.