Ticket 7729

Summary: running reconfigure kills GPU jobs on multi-GPU nodes
Product: Slurm    Reporter: darrellp
Component: GPU    Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue    
Priority: --- CC: marcm
Version: 19.05.0   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7727
Site: Allen AI
Version Fixed: 19.05.3
Attachments: slurm.conf
gres.conf
slurmdbd.conf

Description darrellp 2019-09-11 10:53:52 MDT
Created attachment 11544 [details]
slurm.conf

We have run into an issue where running scontrol reconfigure aborts running GPU jobs, but only on multi-GPU nodes.

First the details:
This is a 4-node cluster; each node has either 4 or 8 GPUs.
All running Ubuntu 18.04.2 LTS.
All running 19.05.0 compiled from source.
All running NVIDIA GPUs (RTX Titans, Quadro RTX 8000s, and GTX 1080s).
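For reference, gres.conf on nodes like these enumerates the GPU device files per node. A minimal sketch (hostnames taken from the squeue output below; the Type values are hypothetical, and the real config is in attachment 11545):

```
# Hypothetical sketch only -- the actual config is attached (attachment 11545).
# 4-GPU node
NodeName=allennlp-server1 Name=gpu Type=titanrtx File=/dev/nvidia[0-3]
# 8-GPU node
NodeName=allennlp-server4 Name=gpu Type=rtx8000  File=/dev/nvidia[0-7]
```

slurm.conf would pair this with GresTypes=gpu and a matching Gres=gpu:... on each NodeName line.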

If we issue the command scontrol reconfigure to make a config change (changing AccountingStorageType in this example), running GPU jobs get aborted with an 'unsupported GRES options' error. We have reproduced this many times on this cluster. Example:

allennlp-server1.corp ~ # squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               671  allennlp     bash  sanjays  R   12:21:57      1 allennlp-server1
               698  allennlp sbatch_w     roys  R   10:28:47      1 allennlp-server4
               711  allennlp gqa_bala  sanjays  R    9:48:02      1 allennlp-server4
               783  allennlp allennlp  sanjays  R      45:26      1 allennlp-server4
allennlp-server1.corp ~ # sudo scontrol reconfigure
allennlp-server1.corp ~ # squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               671  allennlp     bash  sanjays CG   12:22:10      1 allennlp-server1
               783  allennlp allennlp  sanjays CG      45:39      1 allennlp-server4
allennlp-server1.corp ~ #


And in the logs, this is what we see:

Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: Processing RPC: REQUEST_RECONFIGURE from uid=0
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: No memory enforcing mechanism configured.
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: layouts: no layout to initialize
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: restoring original state of nodes
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: select/cons_tres: select_p_node_init
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: select/cons_tres: preparing for 2 partitions
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=671 due to use of unsupported GRES options
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=698 due to use of unsupported GRES options
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: email msg to roys@allenai.org: Slurm Job_id=698 Name=sbatch_wrapper.sh Failed, Run time 10:29:00, NODE_FAIL, ExitCode 0
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=711 due to use of unsupported GRES options
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=783 due to use of unsupported GRES options


We believe that this is some kind of race condition with the loading of plugins. 

We were also unable to reproduce this on a single-GPU test node that we spun up to investigate. The fact that one job seems to survive leads us to suspect that a job on GPU 0 survives but the others do not.
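One way to test the GPU 0 hypothesis would be to record which device indices each job held before the reconfigure; `scontrol show job -d <jobid>` reports this in a GRES=gpu(IDX:...) field on the per-node detail lines (exact format varies by Slurm version). A sketch against an illustrative line, since the real detailed output was not captured here:

```shell
# Extract the GPU index allocation from saved `scontrol show job -d` output.
# The sample line below is illustrative, not taken from this cluster.
sample='Nodes=allennlp-server4 CPU_IDs=0-3 Mem=0 GRES=gpu(IDX:1,3)'
echo "$sample" | grep -oE 'GRES=gpu\(IDX:[^)]*\)'
# prints: GRES=gpu(IDX:1,3)
```

On the live cluster this would be `scontrol show job -d 671 | grep 'GRES='`, and likewise for the other job IDs, run before issuing the reconfigure.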

Finally, I rated this as High impact because this issue will prevent adoption here at AI2: we have many users who run multiple long-running jobs for their research, and these kinds of interruptions are not acceptable to them. Please feel free to reprioritize according to your standards.
Comment 1 darrellp 2019-09-11 10:54:10 MDT
Created attachment 11545 [details]
gres.conf
Comment 2 darrellp 2019-09-11 10:54:37 MDT
Created attachment 11547 [details]
slurmdbd.conf
Comment 3 darrellp 2019-09-11 10:58:30 MDT
This appears to be related, based on the description and what we are seeing:

https://bugs.schedmd.com/show_bug.cgi?id=7727
Comment 4 Dominik Bartkiewicz 2019-09-12 05:13:57 MDT
Hi

I can reproduce this easily.
It looks like the patch from bug 7727 is correct and fixes this issue.
I'll let you know when it is in the repo.

Dominik
Comment 5 Marc 2019-09-16 12:22:24 MDT
When will this be released? We are debating whether to integrate our own patch or wait for yours.


Thanks!
Comment 6 Dominik Bartkiewicz 2019-09-16 12:46:06 MDT
Hi

As you have probably already noticed, the fix is committed as:
https://github.com/SchedMD/slurm/commit/2abd2a3d8d6bdc

It will be included in 19.05.3.
We plan to release 19.05.3 before the end of the month, but we have no firm date yet.

Let me know if we can close this ticket now.

Dominik
Comment 7 Dominik Bartkiewicz 2019-09-18 06:29:48 MDT
Hi

Did you apply this patch?
Please let me know when we can close this ticket.

Dominik
Comment 9 Marc 2019-09-18 07:23:26 MDT
Patch applied. Feel free to close. 

marc

> On Sep 18, 2019, at 5:29 AM, bugs@schedmd.com wrote:
> 
> Dominik Bartkiewicz changed bug 7729
> Severity: 2 - High Impact -> 4 - Minor Issue