Created attachment 3677 [details]
Output of various logs and config files

I recently upgraded from 14.11 to 16.05.4.1 and am having issues. I've set MemSpecLimit to 6144 for all nodes, but some either go to 5887 or never change from their original setting. sinfo shows a lot of nodes "Not responding", and slurmctld reports mismatched slurm.conf files. This file is centralized and should be the same after a slurmd restart. The "not responding" and slurm.conf mismatch are only happening on some of our nodes. Any idea why this may be? I have attached a file showing outputs from various logs and config files.
Paul - can you upload the whole slurm.conf? I'm curious to see if you have task/cgroup enabled. There was this fix introduced in 15.08.12:

 -- Fix MemSpecLimit to explicitly require TaskPlugin=task/cgroup and
    ConstrainRAMSpace set in cgroup.conf.

And I see no ConstrainRAMSpace in your cgroup.conf either.
Created attachment 3680 [details] Main settings
Created attachment 3681 [details] node classifications
Alejandro,

Ah, but why would this seem to work on others?

See attached. Slurm_common.conf is included in slurm.conf.

So do I need to add

TaskPlugin=task/cgroup to slurm.conf

and

ConstrainRAMSpace=yes to cgroup.conf?
(In reply to paull from comment #4)
> Alejandro,
>
> Ah, but why would this seem to work on others?
>
> See attached. Slurm_common.conf is included in slurm.conf.
>
> So do I need to add
>
> TaskPlugin=task/cgroup to slurm.conf
>
> and
>
> ConstrainRAMSpace=yes cgroup.conf?

Exactly. Commit 588ce8bd9 introduced these requirements for MemSpecLimit to work. Otherwise a message like the following should be logged to your slurmd.log:

+		error("Resource spec: cgroup job confinement not configured. "
+		      "MemSpecLimit requires TaskPlugin=task/cgroup and "
+		      "ConstrainRAMSpace=yes in cgroup.conf");

Could you try changing these values and restarting the daemons?
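To make the requirement above concrete, here is a minimal sketch of the two settings (assuming the default config file locations; since your site includes Slurm_common.conf from slurm.conf, the TaskPlugin line may belong in that included file instead):

```
# slurm.conf (or an included file such as Slurm_common.conf)
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainRAMSpace=yes
```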
I did that, restarted slurmctld, ran scontrol reconfig, resumed a node, and restarted slurmd on that node. Now I see this:

NodeName=jnod0057 CoresPerSocket=10 CPUAlloc=0 CPUErr=0 CPUTot=40 CPULoad=0.00
   AvailableFeatures=localdisk,gpu,nogpu,intel,4gpu
   ActiveFeatures=localdisk,gpu,nogpu,intel,4gpu
   Gres=(null)
   NodeAddr=nod0057 NodeHostName=nod0057 Version=16.05
   RealMemory=129163 AllocMem=0 FreeMem=121658 Sockets=2 Boards=1
   MemSpecLimit=123017
   State=IDLE# ThreadsPerCore=2 TmpDisk=425702 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2016-11-04T17:31:10 SlurmdStartTime=2016-11-05T00:11:36
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s
   ExtSensorsWatts=0 ExtSensorsTemp=n/s

I know the "#" means it's configuring, but what's configuring, and why is the MemSpecLimit still incorrect?
Paul would you mind sending in your slurmd as well as your slurmctld log? From the snip it would appear the slurm.conf on your nodes is different than the one running on your slurmctld node. But I am supposing that is before you altered the slurm.conf with the new cgroup setting. My guess is there is a disconnect between the daemons. Please verify the slurm.conf and cgroup.conf files are the same and the daemons have restarted and send the logs from that process.
For what it is worth Paul, I am unable to reproduce this in 16.05.4 or 16.05.6. It is interesting, though, that RealMemory (129163) - MemSpecLimit (123017) = 6146, which is close to what you have in your settings. Not sure how that value is getting there, though. Is anything running on the system? Are you able to get anything going? Is this problem happening on all your nodes?
Hi Alejandro,

Due to jobs not being able to run and the upcoming need for jobs to be running as soon as possible, I was forced to roll back to 14.11. For now, we will need to do more testing to make sure the next rollout goes smoother. For understanding: both slurmctld and slurmd share the exact same file. It is an NFS-mounted file, so restarting either daemon will/should point to the same file. For some reason, that was not happening. I'm going to attempt to extend our test environment and try to replicate this issue, at which time I can send you the logs requested.

When I rolled back to 14.11, the slurmctld log produced this error:

[2016-11-05T03:19:23.761] slurmctld version 14.11.10 started on cluster cluster
[2016-11-05T03:19:24.401] error: ***********************************************
[2016-11-05T03:19:24.401] error: Can not recover usage_mgr state, incompatible version, got 7680 need 1
[2016-11-05T03:19:24.401] error: ***********************************************
[2016-11-05T03:19:24.401] error: ***********************************************
[2016-11-05T03:19:24.401] error: Can not recover usage_mgr state, incompatible version, got 7680 need 1
[2016-11-05T03:19:24.401] error: ***********************************************
[2016-11-05T03:19:24.405] layouts: no layout to initialize
[2016-11-05T03:19:24.405] error: read_slurm_conf: default partition not set.
[2016-11-05T03:19:24.416] layouts: loading entities/relations information
[2016-11-05T03:19:24.416] error: gres_plugin_node_state_unpack: unpack error from node jnod0001
[2016-11-05T03:19:24.416] error: Incomplete node data checkpoint file
[2016-11-05T03:19:24.416] Recovered state of 0 nodes
[2016-11-05T03:19:24.416] error: unpackmem_xmalloc: Buffer to be unpacked is too large (825306218 > 67108864)
[2016-11-05T03:19:24.416] error: Incomplete job record
[2016-11-05T03:19:24.416] error: Incomplete job data checkpoint file
[2016-11-05T03:19:24.416] Recovered information about 0 jobs

We run slurmdbd with a MySQL database, which was backed up prior to the upgrade. To roll back, we simply restore the backup. All jobs were also lost in the process. Any idea why, and how can we avoid this in the event a rollback is necessary again?
Hi Danny,

For the MemSpecLimit, please confirm that I was supposed to put the floor value and not the ceiling in version 16.05.4.1.

RealMem=120000
Required reserved space=6000

MemSpecLimit=140000 (how it was in version 14.11)
or
MemSpecLimit=6000 (how we had it configured)
(In reply to paull from comment #10)
> Hi Danny,
>
> For the MemSpecLimit, please confirm that I was supposed to put the floor
> value and not the ceiling in version 16.05.4.1.
>
> RealMem=120000
>
> Required reserved space= 6000
>
> MemSpecLimit=140000 (How it was in version 14.11)
> or
> MemSpecLimit=6000 (How we had it configured)

MemSpecLimit should never exceed RealMemory. (RealMemory - MemSpecLimit) is the upper limit on how much memory Slurm will assign to jobs on the node. As an alternative approach, assuming you're not using cgroups, you can just intentionally under-report the RealMemory value for the node as a way to set aside space for the OS and non-user jobs.
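Using the example numbers from this comment, a quick sketch of the relationship (MemSpecLimit is the amount reserved for system use, not a ceiling on job memory):

```shell
# Values taken from the example above, in MiB.
real_memory=120000      # RealMemory as configured for the node
mem_spec_limit=6000     # memory reserved for the OS and Slurm daemons

# Upper bound on memory Slurm will allocate to jobs on this node.
job_limit=$(( real_memory - mem_spec_limit ))
echo "Jobs can use at most ${job_limit} MiB"   # 114000 MiB
```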
Hi Tim,

We do use cgroups. Prior to updating the configs with the TaskPlugin and ConstrainRAMSpace settings, some nodes were missing the slurm sub-directory in the memory cgroup directory. Once I changed the configs they were updated; it was just weird that it worked for some nodes and not for others prior to the config change.

This is how we have reconfigured MemSpecLimit for update 16.05.4.1 (RealMem - MemSpecLimit). Are there any other parameters related to MemSpecLimit that would need to be updated in our configs?

Thanks,
Paul
(In reply to paull from comment #12)
> Hi Tim,
>
> We do use cgroups. Prior to updated the configs with the TaskPlugin and
> ConstrainRAMSpace, some nodes were missing the slurm sub-directory in the
> memory cgroup directory. Once I changed the configs they were updated, it
> was just weird that it worked from some and not for others prior to the
> config change.
>
> This is how we have reconfigured MemSpecLimit for update 16.05.4.1 ( RealMem
> - MemSpecLimit ). Are there any other parameters with MemSpecLimit that
> would need to be updated in our configs?

The only other thing I'd note is that if you move to 16.05.5 or later you would no longer need the ReleaseAgent setting. That setting can be problematic on systemd-based distributions, as systemd prefers the release_agent mount option to point to something of its own devising. But as-is I believe you should be okay.

Is there anything else I can answer on this?
No that will be all. Will do some more testing to be sure this is the extent of the changes needing to be made and try to push this out again soon. For now we can close this.
Hi,

I have built a test environment and have jobs running. The thing that bothers me is that I have set the required options (TaskPlugin=task/cgroup in slurm.conf and ConstrainRAMSpace=yes in cgroup.conf), but the MemSpecLimit in the scontrol output is incorrect.

From slurm.conf:

NodeName=htst0101 CPUs=1 RealMemory=10019 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN Feature=amd MemSpecLimit=6114

From scontrol output:

[root@htst0001 slurm-test_logs]# scontrol show node=htst0101
NodeName=htst0101 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.00
   AvailableFeatures=amd ActiveFeatures=amd
   Gres=(null)
   NodeAddr=htst0101 NodeHostName=htst0101 Version=16.05
   OS=Linux RealMemory=10019 AllocMem=0 FreeMem=1618 Sockets=1 Boards=1
   MemSpecLimit=3875
   State=IDLE ThreadsPerCore=1 TmpDisk=750 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2016-11-16T11:26:58 SlurmdStartTime=2016-11-16T11:42:58
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s
   ExtSensorsWatts=0 ExtSensorsTemp=n/s

Why isn't it updating when I restart slurmctld and run scontrol reconfig? Is there a new way of reconfiguring things after making changes to slurm.conf?
Paul - are you actually restarting slurmctld or just scontrol reconfig? scontrol reconfig will not suffice for this change and an actual restart is required.
Just a heads up, the cgroup on the node is correct: [root@htst0101 slurm]# awk '{print int($1/1024/1024)}' '/cgroup/memory/slurm_htst0101/system/memory.limit_in_bytes' 6144
I always restart the slurmctld daemon then scontrol reconfig.
Are all the Slurm components (clients: scontrol, sinfo, plus all the daemons) running the same 16.05.4.1 version? What's the output of scontrol -V?
Yes they are.

On the server:

[root@htst0001 slurm-test_logs]# scontrol -V
slurm 16.05.4
[root@htst0001 slurm-test_logs]# /etc/init.d/slurm-test status
slurmctld (pid 30451) is running...
[root@htst0001 slurm-test_logs]# readlink -f /proc/30451/exe
/d/sw/slurm-test/20160909-16050401/sbin/slurmctld
[root@htst0001 slurm-test_logs]# /etc/init.d/slurmdbd-test status
slurmdbd (pid 20561) is running...
[root@htst0001 slurm-test_logs]# readlink -f /proc/20561/exe
/d/sw/slurm-test/20160909-16050401/sbin/slurmdbd
[root@htst0001 slurm-test_logs]# which scontrol
/d/sw/slurm-test/latest/bin/scontrol
[root@htst0001 slurm-test_logs]# which sinfo
/d/sw/slurm-test/latest/bin/sinfo
[root@htst0001 slurm-test_logs]# ll /d/sw/slurm-test/latest
lrwxrwxrwx 1 root root 17 Oct 3 21:19 /d/sw/slurm-test/latest -> 20160909-16050401

On the node:

[root@htst0101 slurm]# /etc/init.d/slurm status
slurmd (pid 4567) is running...
[root@htst0101 slurm]# readlink -f /proc/4567/exe
/d/sw/slurm-test/20160909-16050401/sbin/slurmd
I believe I may have found what's happening. I'm not sure if this is by design or not, but here is what I have found:

[root@htst0001 slurm-test_logs]# scontrol show node=htst0101 | grep -E "RealMem|MemSpec"
   OS=Linux RealMemory=10019 AllocMem=0 FreeMem=1609 Sockets=1 Boards=1
   MemSpecLimit=3875

If you do (RealMemory - MemSpecLimit), it equals 6144, which is what MemSpecLimit is set to in the config. If that's how it's designed we will roll with it, but it doesn't seem consistent, since the same variable name is set differently in the config. Is this how it is supposed to be?
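The arithmetic above can be checked directly (numbers copied from the scontrol output in this comment):

```shell
# scontrol reported MemSpecLimit=3875 while slurm.conf set it to 6144;
# RealMemory minus the displayed value recovers the configured number.
real_memory=10019
displayed_memspec=3875
configured_memspec=6144

[ $(( real_memory - displayed_memspec )) -eq "$configured_memspec" ] \
  && echo "displayed value is RealMemory minus the configured MemSpecLimit"
```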
(In reply to paull from comment #21)
> I believe I may have found whats happening. Not sure if this is the design
> or not but here is what I have found:
>
> [root@htst0001 slurm-test_logs]# scontrol show node=htst0101 | grep -E
> "RealMem|MemSpec"
>    OS=Linux RealMemory=10019 AllocMem=0 FreeMem=1609 Sockets=1 Boards=1
>    MemSpecLimit=3875
>
> If you do (RealMemory - MemSpecLimit) this equals 6144 which is what I see
> the MemSpecLimit to in the config. If thats how its designed we will roll
> with it but it doesn't seem consistent since the same variable name is set
> differently in the config.
>
> Is this how it is supposed to be?

No, scontrol show node should output the MemSpecLimit value as configured. In fact, I tried changing MemSpecLimit values on 16.05.4, and after restarting slurmctld I see the same configured value. This is strange. What happens if you restart slurmd on node htst0101? Maybe there's an issue saving/loading the node_state?
Also please upload the slurmd.log on htst0101 after the slurmd restart.
I had to make some hostname changes, so htst0101 is now htst0701. I changed its MemSpecLimit in slurm.conf from 6144 to 5144, restarted slurmctld, did an scontrol reconfig on the controller, and then restarted the slurmd on the node. After:

NodeName=htst0701 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.23
   AvailableFeatures=amd ActiveFeatures=amd
   Gres=(null)
   NodeAddr=htst0701 NodeHostName=htst0701 Version=16.05
   OS=Linux RealMemory=10019 AllocMem=0 FreeMem=657 Sockets=1 Boards=1
   MemSpecLimit=6144
   State=IDLE ThreadsPerCore=1 TmpDisk=750 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2016-11-18T09:29:54 SlurmdStartTime=2016-11-18T09:39:49
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s
   ExtSensorsWatts=0 ExtSensorsTemp=n/s

[root@htst0701 ~]# awk '{print int($1/1024/1024)}' '/cgroup/memory/slurm_htst0701/system/memory.limit_in_bytes'
5144

I have also attached the logs from htst0701 here. As you can see, the node changes its limit; it is just the controller that doesn't present MemSpecLimit the correct way.
Created attachment 3724 [details] slurmd logs after restart
Oh, I can finally reproduce it. The key is to set FastSchedule=0. We're going to work on it and come back to you.
Paul - since nodes are responding and MemSpecLimit is set properly in the cgroups (just scontrol is not printing the right value), I'm switching severity to 3 and we'll continue with this next week.
Ok thanks for all your help Alejandro. I will await your response next week. Have a great weekend!
Hi Alejandro, Any update? Do you believe this will affect anything further? Thanks, Paul
(In reply to paull from comment #29)
> Hi Alejandro,
>
> Any update? Do you believe this will affect anything further?
>
> Thanks,
> Paul

Hi Paul - I still have to work more on this bug; hopefully tomorrow will do. My guess is that it's just a display issue on the client side, though, but I have to confirm. Will come back to you with updates asap.
Paul - just as an update: I've set some breakpoints on the server side while a reconfig happens, and it seems the new mem_spec_limit value is pushed properly to the node hash table, which is responsible for keeping the node info in memory. Then I started setting breakpoints on the client side, and it seems that scontrol_load_nodes calls slurm_load_node, which returns the old info instead of the new. I'm trying to figure out why, but I'm guessing it is an issue with a last_update member or the last_node_update variable: if these variables do not change when mem_spec_limit is changed, then scontrol will always retrieve the old information instead of the new. I'll continue working on this and come back to you.
Thanks Alejandro for the update.
Paul, the following commit, slated for 16.05.7, fixes the issue:

https://github.com/SchedMD/slurm/commit/1eeb9e457e7

mem_spec_limit should never have been packed/unpacked and saved/loaded to node_state in the first place. We don't need to keep state, since the value is always read from slurm.conf; if we used the saved value it would overwrite the configured one, so we now throw it away when unpacked.

In 17.02 we don't even pack/unpack it, with this other commit:

https://github.com/SchedMD/slurm/commit/b8aec60b3d6

Marking as resolved/fixed. Please reopen if you encounter further issues with this.
Thank you Alejandro for your support. I will be testing 16.05.7 as you recommended in 3287. Thanks, Paul