| Summary: | "srun: error: mpi/pmi2: failed to send temp kvs to compute nodes" after upgrade to 20.11.7 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | yitp.support |
| Component: | slurmd | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | felip.moll, tripiana |
| Version: | 20.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Kyoto University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | config files, slurmd logs, slurmctld.log | | |
Description
yitp.support
2021-05-28 23:39:58 MDT
Created attachment 19711 [details]
config files
slurm.conf and topology.conf
OS is CentOS 8.
Created attachment 19712 [details]
slurmd logs
JOBID 51023: debug2
JOBID 51024: debug5
cn0001: BatchHost
cn0002: one of the compute nodes
(In reply to yitp.support from comment #2)
> Created attachment 19712 [details]
> slurmd logs
>
> JOBID 51023: debug2
> JOBID 51024: debug5
>
> cn0001: BatchHost
> cn0002: one of the compute nodes

Hi,

Normally debug2 is enough; higher levels add too much verbosity.

Can you please upload the slurmctld log too? I would also need to know how these jobs are launched (number of tasks, ranks, nodes) and to see the batch script.

At first glance I see the job receiving a signal RPC, which could interrupt the KVS operations and trigger the error you see:

```
[2021-05-28T01:02:48.209] debug2: Start processing RPC: REQUEST_SIGNAL_TASKS
[2021-05-28T01:02:48.209] debug2: Processing RPC: REQUEST_SIGNAL_TASKS
[2021-05-28T01:02:48.210] [51023.0] debug: Handling REQUEST_STEP_UID
[2021-05-28T01:02:48.210] debug: _rpc_signal_tasks: sending signal 9 to StepId=51023.0 flag 0
[2021-05-28T01:02:48.210] debug2: container signal 9 to StepId=51023.0
[2021-05-28T01:02:48.210] [51023.0] debug: Handling REQUEST_SIGNAL_CONTAINER
[2021-05-28T01:02:48.210] [51023.0] debug: _handle_signal_container for StepId=51023.0 uid=0 signal=9
[2021-05-28T01:02:48.210] [51023.0] error: *** STEP 51023.0 ON cn0001 CANCELLED AT 2021-05-28T01:02:48 ***
[2021-05-28T01:02:48.210] [51023.0] debug2: proctrack/cgroup: proctrack_p_signal: killing process 80874 (slurm_task) with signal 9
[2021-05-28T01:02:48.210] [51023.0] Sent signal 9 to StepId=51023.0
```

Does it happen to all jobs?

Thanks!

---

I'll attach the slurmctld.log later, though it was collected with "info" level.

This problem occurred with 130 nodes, but it did not with 120 nodes (JOBID 51022).

The job script is as follows:

```shell
#!/bin/sh
#SBATCH -N 130
#SBATCH -n 130
#SBATCH -c 1

srun ./IMB-MPI1 AlltoAll -npmin 120
```

---

Created attachment 19713 [details]
slurmctld.log
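The per-node log search requested below can be scripted across the fleet. A minimal, self-contained sketch follows; the sample log line is fabricated for illustration, and in practice the grep would target each node's real slurmd log (the path is site-specific) via a parallel shell such as pdsh or clush:

```shell
# Self-contained demonstration: the log line below is a fabricated
# sample. In production, point "$LOG" at the real slurmd log on each
# node (the path is site-specific) and fan the grep out with pdsh or
# clush over all 130 nodes.
LOG=$(mktemp)
printf '[2021-05-28T01:02:48.210] error: unpackmem_xmalloc\n' > "$LOG"

# -c prints the match count; a non-zero count flags an affected node.
grep -c 'unpackmem_xmalloc' "$LOG"   # prints: 1

rm -f "$LOG"
```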
(In reply to yitp.support from comment #5)
> Created attachment 19713 [details]
> slurmctld.log

Okay, this log is not useful, but no problem. Can you grep the logs of the 130 nodes and check whether any of them contains this string?

```
unpackmem_xmalloc
```

If you find it, please send me that node's log.

---

Hi,

I checked slurmd.log on all nodes; there was no line containing "unpackmem_xmalloc".

---

(In reply to yitp.support from comment #7)
> Hi,
>
> I checked slurmd.log on all nodes; there was no line containing
> "unpackmem_xmalloc".

OK. Then please increase the debug level of slurmctld to debug2, run the job, make it fail, and send me back the logs. Also run the job with the -v flag ('srun -v ...') to get verbose information from srun.

Thanks

---

It is difficult to get the debug logs until August; I must wait for the next maintenance window.

---

This bug is very close to bug 10735 (also opened by Dell). If you cannot provide any more information until August, is it OK to CC you on bug 10735 and close this one?

---

I disagree.

I have already increased the array sizes as follows, but the problem is not fixed:

```
-#define MAX_ARRAY_LEN_SMALL 10000
-#define MAX_ARRAY_LEN_MEDIUM 1000000
-#define MAX_ARRAY_LEN_LARGE 100000000
+#define MAX_ARRAY_LEN_SMALL 100000
+#define MAX_ARRAY_LEN_MEDIUM 10000000
+#define MAX_ARRAY_LEN_LARGE 1000000000
```

---

(In reply to yitp.support from comment #11)
> I disagree.
>
> I have already increased the array sizes as follows, but the problem is
> not fixed:
>
> -#define MAX_ARRAY_LEN_SMALL 10000
> -#define MAX_ARRAY_LEN_MEDIUM 1000000
> -#define MAX_ARRAY_LEN_LARGE 100000000
> +#define MAX_ARRAY_LEN_SMALL 100000
> +#define MAX_ARRAY_LEN_MEDIUM 10000000
> +#define MAX_ARRAY_LEN_LARGE 1000000000

Have you recompiled all of Slurm and installed/deployed it with these changes in the code? If you have reproduced the issue this way, then I would guess getting the debug logs shouldn't be a problem. Why can't we get the logs? I may have misunderstood something.

---

I created bug 10735.
I modified the code and compiled/installed 20.11.7 during the previous maintenance window, before collecting the debug logs I sent you.

The cluster consists of 137 compute nodes, so occupying 130 of them is difficult during the production window. Therefore I need to wait for the next maintenance.

---

(In reply to yitp.support from comment #13)
> I created bug 10735.
> I modified the code and compiled/installed 20.11.7 during the previous
> maintenance window, before collecting the debug logs I sent you.
>
> The cluster consists of 137 compute nodes, so occupying 130 of them is
> difficult during the production window.
> Therefore I need to wait for the next maintenance.

Hello, do you already have a fixed date for the next maintenance? Is the issue still happening?

Thank you

---

The next maintenance will be held in the middle of August, but the exact date is not fixed yet. I must wait for the maintenance to check whether this problem still happens, because I need 130/135 nodes in the cluster.

---

(In reply to yitp.support from comment #15)
> The next maintenance will be held in the middle of August, but the exact
> date is not fixed yet. I must wait for the maintenance to check whether
> this problem still happens, because I need 130/135 nodes in the cluster.

Hi, has the maintenance already happened? Do you have any feedback for me?

---

Hi,

The maintenance will be held on Aug. 25-26 JST. I'll let you know the result.

---

This problem is resolved.

There was a node that was still running an old version (20.02.6). Jobs failed only when that node was included in the nodelist.

Please close this bug.

---

(In reply to yitp.support from comment #18)
> This problem is resolved.
>
> There was a node that was still running an old version (20.02.6).
> Jobs failed only when that node was included in the nodelist.
>
> Please close this bug.

That's really good to know. Thank you very much for this information.
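As a postscript, a mixed-version node like the one found here can usually be flagged from the controller. A sketch follows, assuming sinfo's %v format field reports each node's slurmd version (verify against the local man page); sample sinfo output is inlined so the filter itself is runnable anywhere:

```shell
# In production, feed real data with:  sinfo -h -N -o '%N %v'
# (%v as the per-node slurmd version is an assumption to verify locally).
# The awk filter prints every node whose version differs from the majority.
printf 'cn0001 20.11.7\ncn0002 20.02.6\ncn0003 20.11.7\n' |
awk '{ count[$2]++; line[NR] = $0; ver[NR] = $2 }
     END {
         # Pick the majority version, then print the outliers.
         max = 0
         for (v in count) if (count[v] > max) { max = count[v]; best = v }
         for (i = 1; i <= NR; i++) if (ver[i] != best) print line[i]
     }'
# prints: cn0002 20.02.6
```

Any outlier printed this way (here the sample node still on 20.02.6) is a candidate for the kind of version mismatch that caused the failures in this ticket.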