Created attachment 8633 [details]
slurm.conf

Hi,

We are having random users experiencing the following error messages. The problem is intermittent. I will attach the slurm.conf.

This is what the client sees:

salloc: Pending job allocation 31077281
salloc: job 31077281 queued and waiting for resources
salloc: error: Security violation, slurm message from uid 6281
salloc: Granted job allocation 31077281
salloc: Waiting for resource configuration
salloc: error: Security violation, slurm message from uid 6281
salloc: error: Job allocation 31077281 has been revoked
salloc: Relinquishing job allocation 31077281

This is what the controller sees:

[2018-12-13T12:10:43.491] sched: _slurm_rpc_allocate_resources JobId=31077281 NodeList=(null) usec=4377
[2018-12-13T12:10:43.558] sched: Allocate JobId=31077281 NodeList=nodef299 #CPUs=1 Partition=campus
[2018-12-13T12:10:43.674] error: slurm_receive_msgs: Zero Bytes were transmitted or received
[2018-12-13T12:10:43.684] Killing interactive JobId=31077281: Communication connection failure
[2018-12-13T12:10:43.684] _job_complete: JobId=31077281 WEXITSTATUS 1
[2018-12-13T12:10:43.844] _job_complete: JobId=31077281 done
[2018-12-13T12:10:46.684] _slurm_rpc_complete_job_allocation: JobId=31077281 error Job/step already completing or completed

These are the error messages in the slurmd logs:

[2018-12-13T12:10:43.788] _run_prolog: prolog with lock for job 31077281 ran for 0 seconds
[2018-12-13T12:10:55.155] _run_prolog: run job script took usec=115182

We have verified that the user id is the same on both the controller and the daemon. Please let me know what else I can provide.

Thanks,
Scott
Scott,

> I will attach the slurm.conf

Is your Slurm compiled with "--enable-multiple-slurmd"? If not, there is no need for the %n in your slurm.conf for the pidfile.

Is there a specific reason why MessageTimeout is set explicitly instead of using the default?

> MessageTimeout=45

Can you please run remunge on the client and the slurmctld nodes?

> remunge

> Here is what the controller sees.
> [2018-12-13T12:10:43.674] error: slurm_receive_msgs: Zero Bytes were transmitted or received

Have you had any configuration changes recently? The problem may be similar to bug #6147.

Could you please make sure your slurm.conf is synced on all nodes and restart all of your slurmctld, slurmdbd, and slurmd daemons?

--Nate
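For reference, a minimal sketch of the munge round-trip this check exercises ("slurmctld-host" below is a placeholder for your controller's hostname): encode a credential on one host and decode it on another. The STATUS must be "Success", and the UID/GID that unmunge reports must match on every host.

```shell
# Encode a null-payload credential and decode it on the same host;
# STATUS must be "Success (0)" and UID/GID must be the expected user.
munge -n | unmunge

# Decode a locally encoded credential on the controller instead.
# "slurmctld-host" is a placeholder for your controller's hostname.
munge -n | ssh slurmctld-host unmunge | grep -E '^(STATUS|UID|GID):'
```

remunge additionally stress-tests the munged daemon with many credentials per second; the single round-trip above is already enough to expose a UID/GID mismatch between hosts.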
1) Yes, Slurm is compiled with "--enable-multiple-slurmd". However, we plan to remove this soon. The config is maybe 10 years old at this point.

2) MessageTimeout is set explicitly to 45. If we did have a reason, we don't remember it, so we could reset it to the default if needed. However, we are not reserving memory for slurmd, and there is a fear that slurmd would get paged out.

3) Yes, we will remunge.

4) Yes, we have been messing around with the configuration, but the dates do not seem to line up. However, it could be that the changes are what is causing the issues. As a result, we are going to re-sync the configuration to all of our nodes, run remunge on all our nodes, and give the slurmd on all our nodes a restart.

Thanks,
Scott
(In reply to Scott Sisco from comment #2)

> 1) Yes, slurm is compiled with "--enable-multiple-slurmd". However, we plan
> to remove this soon. The config is maybe 10 years old at this point.

It shouldn't be needed for a production cluster but it also shouldn't hurt anything.

> 2) Message time is set explicitly to 45. If we do have reason we don't
> remember why we could reset to default if needed. However, we are not
> reserving memory for slurmd and there is a fear that slurmd would get paged
> out.

Then there should be no need to change it.

> 4) Yes, we have been messing around with the configuration, but the dates do
> not seem to line up. However, it could be that the changes are what is
> causing the issues. As a result, we are going to re-sync the configuration
> to all of our nodes. We will run remunge on all our nodes and we are going
> to give all the slurmd's a restart on all our nodes.

Please make sure to restart slurmctld and slurmdbd too.

--Nate
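One way to verify the re-sync actually took is to compare a hash of slurm.conf across the cluster. A sketch, assuming ClusterShell's clush and the default config path (both are assumptions; adjust for your site):

```shell
# Hash slurm.conf on every node, suppress per-node labels (-N), and
# collapse identical lines; more than one distinct hash in the output
# means the configuration is out of sync somewhere.
clush -a -N 'md5sum /etc/slurm/slurm.conf' | sort -u
```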
Here is an update. I have remunged all of our nodes.

Additionally, I checked whether we were running the most recent version of libslurmdb33 and found that many of our nodes were not. This means chef, which we use for configuration management, has been failing to update our nodes, leaving them in a pretty wonky state. I have now resolved that issue, and all nodes are currently running the most recent version of libslurmdb33.

Next up is scheduling a restart of slurmctld and slurmdbd.

Thanks,
Scott
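For anyone auditing the same kind of package skew, a sketch of a version check (the nodes.txt node list is hypothetical; libslurmdb33 is the Debian package name from this thread): collect one version string per node over ssh, then flag any node that disagrees with the majority.

```shell
# Collect "node version" pairs over ssh, then use awk to find the
# majority version and print every node that deviates from it.
while read -r node; do
    printf '%s %s\n' "$node" \
        "$(ssh "$node" dpkg-query -W -f '\${Version}' libslurmdb33)"
done < nodes.txt | awk '
    { count[$2]++; node[NR] = $1; ver[NR] = $2 }
    END {
        best = ""; bestc = 0
        for (v in count)
            if (count[v] > bestc) { best = v; bestc = count[v] }
        for (i = 1; i <= NR; i++)
            if (ver[i] != best) print "outlier:", node[i], ver[i]
    }'
```

An empty output means every node reports the same version; each "outlier:" line names a node that chef failed to update.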
I have restarted slurmctld and slurmdbd. I will continue to monitor the situation.

Scott
Hi Nate,

Unfortunately, re-munging, making sure all our nodes are running the most recent Slurm package, and restarting the slurmctld and slurmdbd services have not resolved the issue. Any idea what we should try next?

Thanks,
Scott
(In reply to Scott Sisco from comment #6)

> restarting the slurmctld and slurmdbd services has
> not resolved the issue. Any idea what we should try next?

Did you also restart slurmd on every node?

Can you please attach recent slurmctld logs and slurmd logs from one affected node?

--Nate
Hi Nate,

I have restarted slurmd across all 500 of our nodes. I will monitor the logs over the next few days to see if the error comes back.

Thanks,
Scott
(In reply to Scott Sisco from comment #8)

> Hi Nate,
>
> I have restarted slurmd across all 500 of our nodes. I will monitor the logs
> over the next few days to see if the error comes back.
>
> Thanks,
> Scott

If the issue comes back, please check the version on all the binaries being executed:

> $ srun -V
> $ sbatch -V
> $ salloc -V
> $ slurmctld -V

--Nate
Hi Nate,

Unfortunately, the error is still occurring. At your request, here are the versions.

All 500 nodes report:

srun -V = slurm-wlm 18.08.3
sbatch -V = slurm-wlm 18.08.3
salloc -V = slurm-wlm 18.08.3

The Slurm controller reports:

slurmctld -V = slurm-wlm 18.08.3
srun -V = slurm-wlm 18.08.3
sbatch -V = slurm-wlm 18.08.3
salloc -V = slurm-wlm 18.08.3

Thanks,
Scott
Scott,

Is it possible that a user is running slurmd instead of SlurmUser? The error in your logs shows a successful authentication, but as the wrong user to report privileged information to the controller (or vice versa).

Is it possible that your SlurmUser ("slurm" from the attached config) has a different uid on some of the nodes?

--Nate
Hi Nate,

This ticket can be marked as resolved.

One of the three servers that our scientists ssh into to launch jobs on our cluster had the wrong UID for the slurm user in /etc/passwd. The issue only occurred intermittently because the server would sometimes get the correct UID from LDAP and other times pull the incorrect UID from /etc/passwd. Once we fixed the UID in /etc/passwd on that server, the issue went away.

Thanks so much for your help!

-Scott
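For anyone hitting the same symptom, a minimal sketch of the check that would have caught this (the user name "slurm" is the SlurmUser from the attached config): compare the UID in the local /etc/passwd against the UID the name service switch resolves, which may come from LDAP instead.

```shell
# UID of the slurm user according to the local /etc/passwd file only.
local_uid=$(awk -F: '$1 == "slurm" {print $3}' /etc/passwd)

# UID of the slurm user as resolved through NSS (files, LDAP, etc.).
nss_uid=$(getent passwd slurm | awk -F: '{print $3}')

# Any disagreement between the two sources can surface as the
# intermittent "Security violation" errors seen in this ticket.
if [ "$local_uid" != "$nss_uid" ]; then
    echo "UID mismatch for slurm: /etc/passwd=$local_uid NSS=$nss_uid"
fi
```

Running this on every submit host and compute node would have pinpointed the one server whose /etc/passwd entry disagreed with LDAP.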
Scott,

Closing ticket per your response.

--Nate