Ticket 5645 - New GRES size not being picked up when increased from 350G to 1900G
Summary: New GRES size not being picked up when increased from 350G to 1900G
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 17.11.7
Hardware: Linux
OS: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
 
Reported: 2018-08-30 00:34 MDT by Christopher Samuel
Modified: 2019-08-20 12:53 MDT

See Also:
Site: Swinburne
Version Fixed: 19.05.0


Description Christopher Samuel 2018-08-30 00:34:55 MDT
Hi there,

We use GRES to control access to temp disk space (the submit filter sets it for the user based on their --tmp request) and it's worked nicely up until now.
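The submit filter itself is not shown in the ticket (in Slurm this would typically be a job_submit plugin). As a purely hypothetical sketch of the mapping it performs — the function name and MB units here are assumptions, not the site's actual code:

```python
def tmp_request_to_gres(tmp_mb: int) -> str:
    """Hypothetical sketch: turn a job's --tmp request (in MB) into a
    gres/tmp specification string, as the site's submit filter does.
    Name and units are assumed, not taken from the site's filter."""
    if tmp_mb <= 0:
        raise ValueError("--tmp request must be positive")
    return f"tmp:{tmp_mb}M"

print(tmp_request_to_gres(350000))  # tmp:350000M
```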

However, we've just upgraded our 8 high memory nodes from 350GB temp to 2TB local temp, and the gres.conf now looks like:

# Temporary directory space
NodeName=john[1-107],gina[1-4] Name=tmp Count=350G
NodeName=bryan[1-8] Name=tmp Count=1900G

We can see that slurmd picks that up fine:

slurmd: Gres Name=tmp Type=(null) Count=2040109465600

and:

[root@bryan7 ~]# scontrol show slurmd
Active Steps             = NONE
Actual CPUs              = 36
Actual Boards            = 1
Actual sockets           = 2
Actual cores             = 18
Actual threads per core  = 1
Actual real memory       = 772475 MB
Actual temp disk space   = 1906795 MB
Boot time                = 2018-08-30T16:09:02
Hostname                 = bryan7
Last slurmctld msg time  = 2018-08-30T16:18:28
Slurmd PID               = 4336
Slurmd Debug             = 8
Slurmd Logfile           = (null)
Version                  = 17.11.7


but this does not seem to get reflected in the control daemon:

[root@bryan7 ~]# scontrol show node bryan7
NodeName=bryan7 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUErr=0 CPUTot=36 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:p100:2,tmp:350G
   NodeAddr=bryan7 NodeHostName=bryan7 Version=17.11
   OS=Linux 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 
   RealMemory=768000 AllocMem=0 FreeMem=766687 Sockets=2 Boards=1
   State=DOWN+DRAIN ThreadsPerCore=1 TmpDisk=1900000 Weight=2000 Owner=N/A MCS_label=N/A
   Partitions=skylake,skylake-gpu,debug 
   BootTime=2018-08-30T15:36:15 SlurmdStartTime=2018-08-30T16:09:02
   CfgTRES=cpu=36,mem=750G,billing=36,gres/gpu=2,gres/tmp=375809638400
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=270 LowestJoules=569132 ConsumedJoules=152867
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=SW_nvme [rhumble@2018-08-30T14:41:29]

You can see the CfgTRES line still shows the old value, even after multiple slurmd restarts, a reboot of the compute node, and a restart of slurmctld:

   CfgTRES=cpu=36,mem=750G,billing=36,gres/gpu=2,gres/tmp=375809638400

Looking at the web page we can see:

https://slurm.schedmd.com/gres.html

# By default a node has no generic resources and its maximum
# count is 4,294,967,295.

Our count is now 2,040,109,465,600, which is far larger.

However, the manual page says instead:

# By default a node has no generic resources and its maximum
# count is that of an unsigned 64bit integer.

which should be large enough for just under 2TB.
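For reference, the two counts in the output above follow directly from binary-prefix arithmetic — a quick check, assuming Slurm's "G" suffix means GiB (which the numbers confirm):

```python
GIB = 1024 ** 3  # Slurm's "G" suffix is a binary gigabyte (GiB)

old_count = 350 * GIB    # the gres/tmp value still shown in CfgTRES
new_count = 1900 * GIB   # the count slurmd now reports

print(old_count)   # 375809638400
print(new_count)   # 2040109465600

# Both fit comfortably within an unsigned 64-bit integer:
print(new_count <= 2 ** 64 - 1)  # True
```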

I'm not seeing any messages in our slurmctld logs that would be relevant to this.

All the best,
Chris
Comment 1 Christopher Samuel 2018-08-31 18:52:25 MDT
Hi folks,

This was user error!  I had forgotten that GRES has to be specified both in the node definition in slurm.conf and in gres.conf, so whilst I had updated gres.conf, and had also updated slurm.conf for the size of the temp area, I had not updated the Gres=tmp entry in slurm.conf as well.

I keep forgetting you have to keep the two in step - it might be useful for slurmctld to warn you if the node reports a different GRES to that configured in slurm.conf.

So the change my colleague who spotted this made when I was away was:

-NodeName=bryan[1-4]  RealMemory=384000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:350G Weight=1000 TmpDisk=1900000
-NodeName=bryan[5-8]  RealMemory=768000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:350G Weight=2000 TmpDisk=1900000
+NodeName=bryan[1-4]  RealMemory=384000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:1860G Weight=1000 TmpDisk=1904640
+NodeName=bryan[5-8]  RealMemory=768000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:1860G Weight=2000 TmpDisk=1904640
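Drift like this between the two files can in principle be caught before slurmctld drains a node. The following is only a hypothetical sketch — naive single-line parsing, no node-range expansion, and helper names invented for illustration; it is not anything Slurm ships:

```python
import re

# Hedged sketch: compare the gres/tmp count declared for a node in
# slurm.conf against the Count in gres.conf.
SUFFIX = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

def to_bytes(spec: str) -> int:
    """Expand a count like '350G' into bytes (binary prefixes)."""
    m = re.fullmatch(r"(\d+)([KMGT]?)", spec)
    return int(m.group(1)) * SUFFIX.get(m.group(2), 1)

slurm_conf_line = ("NodeName=bryan[5-8] Gres=gpu:p100:2,tmp:350G "
                   "TmpDisk=1900000")
gres_conf_line = "NodeName=bryan[1-8] Name=tmp Count=1900G"

slurm_tmp = re.search(r"tmp:(\S+?)(?:,|\s|$)", slurm_conf_line).group(1)
gres_tmp = re.search(r"Count=(\S+)", gres_conf_line).group(1)

if to_bytes(slurm_tmp) != to_bytes(gres_tmp):
    print(f"mismatch: slurm.conf says {slurm_tmp}, "
          f"gres.conf says {gres_tmp}")
```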

All the best,
Chris
Comment 2 Christopher Samuel 2018-08-31 19:01:49 MDT
Looking back in my logs I think 2 of the 8 nodes did report something about gres/tmp, presumably when slurmd was restarted on them:

[2018-08-30T14:21:37.381] gres/tmp: count changed for node bryan1 from 375809638400 to 2040109465600
[2018-08-30T14:21:37.381] error: Setting node bryan1 state to DRAIN
[2018-08-30T14:21:37.381] drain_nodes: node bryan1 state set to DRAIN
[2018-08-30T14:21:37.381] error: _slurm_rpc_node_registration node=bryan1: Invalid argument

and

[2018-08-30T15:32:57.090] gres/tmp: count changed for node bryan7 from 375809638400 to 2040109465600

But a warning about the GRES mismatch from slurmctld would be handy!

All the best,
Chris
Comment 3 Dominik Bartkiewicz 2018-09-03 10:18:50 MDT
Hi

Thanks for this update.
Last Friday I tried to recreate this, without success :).
I will try to prepare a patch with better logging for this case.

Dominik
Comment 4 Dominik Bartkiewicz 2019-08-20 06:29:25 MDT
Hi

Apologies for not responding sooner.
The gres/gpu code was heavily modified between the last versions, so today it looks completely different.

These commits add additional logging for when slurmctld notices an inconsistent GRES count:

https://github.com/SchedMD/slurm/commit/6b6c00438ab
https://github.com/SchedMD/slurm/commit/fbc540a46d3

I'll go ahead and close out the bug.

Dominik
Comment 5 Christopher Samuel 2019-08-20 12:53:09 MDT
Hi Dominik,

Thanks so much for letting me know, I'll pass that on to the folks at Swinburne.

All the best,
Chris