Hi there,

We use GRES to control access to temp disk space (the submit filter sets it for the user based on their --tmp request) and it's worked nicely up until now.

However, we've just upgraded our 8 high memory nodes from 350GB temp to 2TB local temp, and the gres.conf now looks like:

    # Temporary directory space
    NodeName=john[1-107],gina[1-4] Name=tmp Count=350G
    NodeName=bryan[1-8] Name=tmp Count=1900G

We can see that slurmd picks that up fine:

    slurmd: Gres Name=tmp Type=(null) Count=2040109465600

and:

    [root@bryan7 ~]# scontrol show slurmd
    Active Steps = NONE
    Actual CPUs = 36
    Actual Boards = 1
    Actual sockets = 2
    Actual cores = 18
    Actual threads per core = 1
    Actual real memory = 772475 MB
    Actual temp disk space = 1906795 MB
    Boot time = 2018-08-30T16:09:02
    Hostname = bryan7
    Last slurmctld msg time = 2018-08-30T16:18:28
    Slurmd PID = 4336
    Slurmd Debug = 8
    Slurmd Logfile = (null)
    Version = 17.11.7

but this does not seem to get reflected in the control daemon:

    [root@bryan7 ~]# scontrol show node bryan7
    NodeName=bryan7 Arch=x86_64 CoresPerSocket=18
       CPUAlloc=0 CPUErr=0 CPUTot=36 CPULoad=0.02
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=gpu:p100:2,tmp:350G
       NodeAddr=bryan7 NodeHostName=bryan7 Version=17.11
       OS=Linux 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
       RealMemory=768000 AllocMem=0 FreeMem=766687 Sockets=2 Boards=1
       State=DOWN+DRAIN ThreadsPerCore=1 TmpDisk=1900000 Weight=2000 Owner=N/A MCS_label=N/A
       Partitions=skylake,skylake-gpu,debug
       BootTime=2018-08-30T15:36:15 SlurmdStartTime=2018-08-30T16:09:02
       CfgTRES=cpu=36,mem=750G,billing=36,gres/gpu=2,gres/tmp=375809638400
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=270 LowestJoules=569132 ConsumedJoules=152867
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=SW_nvme [rhumble@2018-08-30T14:41:29]

You can see the CfgTRES line still has the old version, even after multiple slurmd restarts and a reboot of the compute node as well as a restart of slurmctld:

    CfgTRES=cpu=36,mem=750G,billing=36,gres/gpu=2,gres/tmp=375809638400

Looking at the web page we can see:

https://slurm.schedmd.com/gres.html

    # By default a node has no generic resources and its maximum
    # count is 4,294,967,295.

Now our count is 2,040,109,465,600 which is a lot larger. However, the manual page says instead:

    # By default a node has no generic resources and its maximum
    # count is that of an unsigned 64bit integer.

which should be large enough for just under 2TB.

I'm not seeing any messages in our slurmctld logs that would be relevant to this.

All the best,
Chris
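For anyone checking the numbers above, here is a minimal Python sketch of the arithmetic. It assumes the G suffix is a binary multiplier (1024^3), which is what the byte counts in the slurmd and scontrol output imply; the helper name `gres_count_to_int` is made up for illustration and is not a Slurm function.

```python
# Suffix multipliers, assuming power-of-two units (inferred from the logged counts).
SUFFIX = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

UINT32_MAX = 2**32 - 1   # 4,294,967,295, the limit quoted on the web page
UINT64_MAX = 2**64 - 1   # the limit quoted in the manual page


def gres_count_to_int(text: str) -> int:
    """Convert a count such as '1900G' to a plain integer (hypothetical helper)."""
    text = text.strip()
    if text and text[-1].upper() in SUFFIX:
        return int(text[:-1]) * SUFFIX[text[-1].upper()]
    return int(text)


old = gres_count_to_int("350G")    # value still shown in CfgTRES
new = gres_count_to_int("1900G")   # value slurmd reports

print(old)  # 375809638400
print(new)  # 2040109465600
print(old > UINT32_MAX, new > UINT32_MAX, new < UINT64_MAX)  # True True True
```

Note that the old 350G count is already well above 4,294,967,295, so the 32-bit figure on the web page cannot be the limit in effect here.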
Hi folks,

This was user error!

I had forgotten you had to specify GRES both in the node config in slurm.conf and in gres.conf, so whilst I had updated gres.conf and also updated slurm.conf for the size of the temp area, I had not updated the GRES for tmp in slurm.conf as well.

I keep forgetting you have to keep the two in step - it might be useful for slurmctld to warn you if the node reports a different GRES to that configured in slurm.conf.

So the change my colleague who spotted this made when I was away was:

    -NodeName=bryan[1-4] RealMemory=384000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:350G Weight=1000 TmpDisk=1900000
    -NodeName=bryan[5-8] RealMemory=768000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:350G Weight=2000 TmpDisk=1900000
    +NodeName=bryan[1-4] RealMemory=384000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:1860G Weight=1000 TmpDisk=1904640
    +NodeName=bryan[5-8] RealMemory=768000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:1860G Weight=2000 TmpDisk=1904640

All the best,
Chris
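Until slurmctld warns about this itself, a site could check the two files against each other before restarting anything. The following is only a rough sketch under several assumptions: all function names are hypothetical, the hostlist expander handles only simple `prefix[a-b,c]` forms (no zero padding or nesting), and counts are compared as literal strings rather than normalised byte values. The sample lines are the gres.conf entry from the original report and the pre-fix slurm.conf lines from the diff above.

```python
import re


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'john[1-3],gina[1-2]' (sketch only)."""
    hosts = []
    for m in re.finditer(r"([^,\[\]]+)(?:\[([^\]]+)\])?", expr):
        prefix, ranges = m.group(1), m.group(2)
        if ranges is None:
            hosts.append(prefix)
            continue
        for part in ranges.split(","):
            lo, _, hi = part.partition("-")
            for i in range(int(lo), int(hi or lo) + 1):
                hosts.append(f"{prefix}{i}")
    return hosts


def fields(line):
    """Split a config line into a dict of Key=Value tokens."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def tmp_in_gres_conf(lines):
    """Map host -> tmp count string as declared in gres.conf-style lines."""
    out = {}
    for line in lines:
        f = fields(line)
        if f.get("Name") == "tmp" and "NodeName" in f:
            for host in expand_hostlist(f["NodeName"]):
                out[host] = f.get("Count")
    return out


def tmp_in_slurm_conf(lines):
    """Map host -> tmp count string from the Gres= field of NodeName lines."""
    out = {}
    for line in lines:
        f = fields(line)
        if "NodeName" not in f or "Gres" not in f:
            continue
        for gres in f["Gres"].split(","):
            name, _, count = gres.rpartition(":")
            if name == "tmp":
                for host in expand_hostlist(f["NodeName"]):
                    out[host] = count
    return out


def mismatches(gres_lines, slurm_lines):
    """Return {host: (gres.conf value, slurm.conf value)} where the two differ."""
    g, s = tmp_in_gres_conf(gres_lines), tmp_in_slurm_conf(slurm_lines)
    return {h: (g.get(h), s.get(h))
            for h in sorted(set(g) | set(s)) if g.get(h) != s.get(h)}


GRES_CONF = ["NodeName=bryan[1-8] Name=tmp Count=1900G"]
SLURM_CONF = [
    "NodeName=bryan[1-4] RealMemory=384000 Sockets=2 CoresPerSocket=18 "
    "ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:350G Weight=1000 TmpDisk=1900000",
    "NodeName=bryan[5-8] RealMemory=768000 Sockets=2 CoresPerSocket=18 "
    "ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:350G Weight=2000 TmpDisk=1900000",
]

for host, (in_gres, in_slurm) in mismatches(GRES_CONF, SLURM_CONF).items():
    print(f"{host}: gres.conf tmp={in_gres} but slurm.conf tmp={in_slurm}")
```

With these sample lines every bryan node is flagged, which matches the stale CfgTRES value seen in the original report.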
Looking back in my logs I think 2 of the 8 nodes did report something about gres/tmp, I suspect when slurmd was restarted on them:

    [2018-08-30T14:21:37.381] gres/tmp: count changed for node bryan1 from 375809638400 to 2040109465600
    [2018-08-30T14:21:37.381] error: Setting node bryan1 state to DRAIN
    [2018-08-30T14:21:37.381] drain_nodes: node bryan1 state set to DRAIN
    [2018-08-30T14:21:37.381] error: _slurm_rpc_node_registration node=bryan1: Invalid argument

and

    [2018-08-30T15:32:57.090] gres/tmp: count changed for node bryan7 from 375809638400 to 2040109465600

But a warning about the GRES mismatch from slurmctld would be handy!

All the best,
Chris
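As a stopgap, messages like these can be pulled out of the slurmctld log with a small script. This is a sketch that assumes the message format is exactly as in the lines quoted above (other Slurm versions may word it differently); the function name `gres_count_changes` is made up for illustration.

```python
import re

# Pattern built from the 17.11 log lines quoted above; treat it as an assumption
# for any other version.
COUNT_CHANGED = re.compile(
    r"\[(?P<ts>[^\]]+)\] gres/(?P<gres>\S+): count changed for node "
    r"(?P<node>\S+) from (?P<old>\d+) to (?P<new>\d+)"
)


def gres_count_changes(lines):
    """Yield (timestamp, node, gres name, old count, new count) per matching line."""
    for line in lines:
        m = COUNT_CHANGED.search(line)
        if m:
            yield (m["ts"], m["node"], m["gres"], int(m["old"]), int(m["new"]))


LOG = [
    "[2018-08-30T14:21:37.381] gres/tmp: count changed for node bryan1 from 375809638400 to 2040109465600",
    "[2018-08-30T14:21:37.381] error: Setting node bryan1 state to DRAIN",
    "[2018-08-30T14:21:37.381] drain_nodes: node bryan1 state set to DRAIN",
    "[2018-08-30T14:21:37.381] error: _slurm_rpc_node_registration node=bryan1: Invalid argument",
    "[2018-08-30T15:32:57.090] gres/tmp: count changed for node bryan7 from 375809638400 to 2040109465600",
]

for ts, node, gres, old, new in gres_count_changes(LOG):
    print(f"{ts} {node}: gres/{gres} {old} -> {new}")
```

In practice the `LOG` list would be replaced by the lines of the real slurmctld log file.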
Hi,

Thanks for this update. Last Friday I tried to recreate this without success :). I will try to prepare a patch with better logging for that case.

Dominik
Hi,

Apologies for not responding sooner. Between the last versions, the gres/gpu code was heavily modified, so today it looks completely different. These commits add an additional log message when slurmctld notices an inconsistency in the GRES count:

https://github.com/SchedMD/slurm/commit/6b6c00438ab
https://github.com/SchedMD/slurm/commit/fbc540a46d3

I'll go ahead and close out the bug.

Dominik
Hi Dominik, Thanks so much for letting me know, I'll pass that on to the folks at Swinburne. All the best, Chris