| Summary: | topo gres count underflow | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | slurmctld | Assignee: | Director of Support <support> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | davide.vanzo, felip.moll |
| Version: | 18.08.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=1589, https://bugs.schedmd.com/show_bug.cgi?id=6500, https://bugs.schedmd.com/show_bug.cgi?id=7401 | | |
| Site: | Stanford | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | Sherlock | CLE Version: | |
| Version Fixed: | 18.08.6 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description

Kilian Cavalotti 2019-01-16 13:07:48 MST

Michael Hinton (comment #4):

Hey Kilian,
I am able to reproduce the error by simply restarting the slurmctld while a job with multiple GPUs allocated to it is running.
Here are the steps:
# Get a job with > 1 GPU
$ srun --gres=gpu:4 sleep 15 &
# Restart the slurmctld
$ slurmctld
Once the job finishes, I get an error in slurmctld.log like this:
[2019-01-28T10:16:14.294] error: gres/gpu: job 2622 dealloc node hintron topo gres count underflow (1 4)
Interestingly, if I did an `scontrol reconfigure` instead of a slurmctld restart, I was not able to reproduce the error.
Also of note: if I specified only 1 GPU (e.g. `srun --gres=gpu sleep 15 &`), there was no error.
Can you think of any other situations where this error is triggered for you? If you are seeing it all the time, maybe it’s not limited to when the slurmctld gets restarted.
I’ll dive into the code to see how this happens in the first place.
-Michael
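For readers hitting this message: the two numbers in "(1 4)" appear to be the per-topology allocated count the controller has on record and the amount the finishing job tries to release. Slurm's actual accounting is done in C inside the GRES plugin code; the sketch below is only a hypothetical Python model (the function name `dealloc_topo` and all details are illustrative, not Slurm's implementation) of the clamped-decrement pattern that produces this warning when state rebuilt after a restart disagrees with the job record:

```python
def dealloc_topo(topo_alloc_cnt, job_cnt, job_id, node, gres_name="gpu"):
    """Clamped decrement, modeling the underflow guard.

    topo_alloc_cnt: GRES currently recorded as allocated in this topology group
    job_cnt: GRES the finishing job is releasing from that group
    Returns (new_count, error_message_or_None).
    """
    if topo_alloc_cnt < job_cnt:
        # Counts disagree (e.g. state rebuilt after a slurmctld restart):
        # report the discrepancy and clamp at zero instead of wrapping around.
        err = (f"error: gres/{gres_name}: job {job_id} dealloc node {node} "
               f"topo gres count underflow ({topo_alloc_cnt} {job_cnt})")
        return 0, err
    return topo_alloc_cnt - job_cnt, None

# Healthy case: counts agree, no error.
print(dealloc_topo(4, 4, 2622, "hintron"))   # (0, None)

# After a restart that rebuilt the topo count as 1 for a 4-GPU job:
cnt, err = dealloc_topo(1, 4, 2622, "hintron")
print(cnt)   # 0
print(err)
```

The guard keeps the counter from going negative (or wrapping, for an unsigned type); the log line is the visible symptom of the mismatch, not its cause.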
Hi Michael,

(In reply to Michael Hinton from comment #4)
> I am able to reproduce the error by simply restarting the slurmctld while a
> job with multiple gpus allocated to it is running.

Thanks for the update, and great to hear you've been able to reproduce the error.

> Can you think of any other situations where this error is triggered for you?
> If you are seeing it all the time, maybe it's not limited to when the
> slurmctld gets restarted.

Restarting slurmctld while multi-GPU jobs are running is a pretty common scenario for us, so that sounds completely plausible and very likely the source of the issue. I can't really think of any other specific trigger, but the number of occurrences doesn't seem to decrease over time after a slurmctld restart. For instance, the last restart was on Jan 24:

2019-01-24T18:35:54-08:00 sh-sl01 slurmctld[166108]: slurmctld version 18.08.4 started on cluster sherlock

and the daily occurrence counts of that error message since then are:

2019-01-24: 4,009,605
2019-01-25: 4,012,105
2019-01-26: 393,597
2019-01-27: 516,347
2019-01-28: 592,146 (so far)

So there may be something else at play, but I don't know.

> I'll dive into the code to see how this happens in the first place.

Thank you!

Cheers,
--
Kilian

Michael Hinton (comment #27):

Hey Kilian,

We have a solution for this that is currently under review.

Interestingly, this problem does not affect 19.05+, and the solution was actually a bit of a backport.

I'll keep you posted when the patch lands.

Thanks,
Michael

Kilian Cavalotti:

Hi Michael,

(In reply to Michael Hinton from comment #27)
> We have a solution for this that is currently under review.
>
> Interestingly, this problem does not affect 19.05+, and the solution was
> actually a bit of a backport.
>
> I'll keep you posted when the patch lands.

Thanks for the update! I figured that could have been the case, with all the work around GPUs in 19.05.

Cheers,
--
Kilian

Michael Hinton:

Hey Kilian,

The patch just landed, slated for 18.08.6:
https://github.com/SchedMD/slurm/commit/6f8cd92e1091e3439352311a79449d7930d0870b

If this doesn't fix the problem, feel free to reopen this ticket.

Thanks!
-Michael

Michael Hinton:

Hey Kilian,

Just a heads up: 6f8cd92 had a regression that caused slurmctld to abort when GRES are added to or removed from a node's configuration and all the daemons are restarted. It was fixed here:
https://github.com/SchedMD/slurm/commit/69d78159c33051f2f6a3cb3ef2bf97c31288df02

Since we caught this before 18.08.6, it shouldn't affect anyone.

Thanks,
Michael

*** Ticket 6729 has been marked as a duplicate of this ticket. ***
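As an aside, daily tallies like the ones Kilian posted can be extracted from slurmctld.log with a short script. This is a sketch under the assumption that each line carries the `[YYYY-MM-DDTHH:MM:SS.mmm]` prefix shown in the sample log message earlier in the ticket (the helper name `underflow_counts_by_day` is made up for illustration):

```python
import re
from collections import Counter

def underflow_counts_by_day(log_lines):
    """Count 'topo gres count underflow' errors per day in slurmctld.log lines.

    Assumes the [YYYY-MM-DDTHH:MM:SS.mmm] timestamp prefix seen in this ticket.
    """
    pat = re.compile(
        r"^\[(\d{4}-\d{2}-\d{2})T[^\]]*\].*topo gres count underflow")
    counts = Counter()
    for line in log_lines:
        m = pat.match(line)
        if m:
            counts[m.group(1)] += 1
    return dict(counts)

sample = [
    "[2019-01-28T10:16:14.294] error: gres/gpu: job 2622 dealloc node hintron "
    "topo gres count underflow (1 4)",
    "[2019-01-28T10:17:02.001] error: gres/gpu: job 2623 dealloc node hintron "
    "topo gres count underflow (1 4)",
    "[2019-01-27T09:00:00.000] error: gres/gpu: job 2600 dealloc node hintron "
    "topo gres count underflow (2 4)",
    "[2019-01-27T09:01:00.000] slurmctld version 18.08.4 started",
]
print(underflow_counts_by_day(sample))  # {'2019-01-28': 2, '2019-01-27': 1}
```

In practice the same tally can be done with `grep` and `sort | uniq -c`; the script form just makes the date extraction explicit.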