| Summary: | New GRES size not being picked up when increased from 350G to 1900G | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Christopher Samuel <chris> |
| Component: | Configuration | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 17.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Swinburne | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 19.05.0 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
|
Description
Christopher Samuel
2018-08-30 00:34:55 MDT
Hi folks,

This was user error! I had forgotten that you have to specify GRES both in the node configuration in slurm.conf and in gres.conf. So whilst I had updated gres.conf, and also updated slurm.conf for the size of the temp area, I had not updated the GRES for tmp in slurm.conf as well. I keep forgetting you have to keep the two in step - it might be useful for slurmctld to warn you if a node reports a different GRES to that configured in slurm.conf.

So the change my colleague who spotted this made while I was away was:

```diff
-NodeName=bryan[1-4] RealMemory=384000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:350G Weight=1000 TmpDisk=1900000
-NodeName=bryan[5-8] RealMemory=768000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:350G Weight=2000 TmpDisk=1900000
+NodeName=bryan[1-4] RealMemory=384000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:1860G Weight=1000 TmpDisk=1904640
+NodeName=bryan[5-8] RealMemory=768000 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:p100:2,tmp:1860G Weight=2000 TmpDisk=1904640
```

All the best,
Chris

Looking back in my logs, I think 2 of the 8 nodes did report something about gres/tmp, I suspect when slurmd was restarted on them:

```
[2018-08-30T14:21:37.381] gres/tmp: count changed for node bryan1 from 375809638400 to 2040109465600
[2018-08-30T14:21:37.381] error: Setting node bryan1 state to DRAIN
[2018-08-30T14:21:37.381] drain_nodes: node bryan1 state set to DRAIN
[2018-08-30T14:21:37.381] error: _slurm_rpc_node_registration node=bryan1: Invalid argument
```

and

```
[2018-08-30T15:32:57.090] gres/tmp: count changed for node bryan7 from 375809638400 to 2040109465600
```

But a warning about the GRES mismatch from slurmctld would be handy!

All the best,
Chris

Hi,

Thanks for this update. Last Friday I tried to recreate this, without success :). I will try to prepare a patch with better logging for that case.

Dominik

Hi,

Apologies for not responding sooner.
Between the last versions, the gres/gpu code was intensively modified, so today it looks completely different. These commits add additional logging for when slurmctld notices an inconsistent GRES count:

https://github.com/SchedMD/slurm/commit/6b6c00438ab
https://github.com/SchedMD/slurm/commit/fbc540a46d3

I'll go ahead and close out the bug.

Dominik

Hi Dominik,

Thanks so much for letting me know, I'll pass that on to the folks at Swinburne.

All the best,
Chris
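For reference, the underlying fix is keeping slurm.conf and gres.conf in step: the count in the node's `Gres=` line must match what slurmd derives from gres.conf. A minimal sketch of matching entries, assuming a site-defined `tmp` GRES as in this ticket (the gres.conf lines are illustrative, not taken from this report):

```
# slurm.conf -- the tmp count in Gres= must match what the node reports
GresTypes=gpu,tmp
NodeName=bryan[1-4] Gres=gpu:p100:2,tmp:1860G TmpDisk=1904640 ...

# gres.conf (on each node) -- the source of the reported count
Name=gpu Type=p100 Count=2
Name=tmp Count=1860G
```

If the two disagree, slurmctld drains the node with the "count changed" error seen in the logs above.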
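The byte counts in the "count changed" log lines correspond exactly to the configured sizes when the `G` suffix is read as GiB; a quick check (illustrative, not part of the ticket):

```python
GIB = 1024 ** 3  # Slurm's "G" suffix is binary gigabytes (GiB)

# Old and new gres/tmp counts from the slurmctld log
old_count = 375_809_638_400
new_count = 2_040_109_465_600

print(old_count // GIB)  # 350  -> the old tmp:350G setting
print(new_count // GIB)  # 1900 -> the 1900G the nodes now report
```

This also shows the nodes were reporting 1900G, not the 1860G eventually set in slurm.conf above, consistent with gres.conf and slurm.conf having been updated at different times.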