Ticket 3239 - Upgraded to Slurm 16.05.4.1, nodes not responding and MemSpecLimit not set correctly
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 16.05.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
 
Reported: 2016-11-04 10:22 MDT by paull
Modified: 2016-12-06 09:01 MST

Site: DownUnder GeoSolutions
Version Fixed: 16.05.7, 17.02


Attachments
Output of various logs and config files (4.47 KB, text/plain)
2016-11-04 10:22 MDT, paull
Details
Main settings (1.71 KB, text/plain)
2016-11-04 10:48 MDT, paull
Details
node classifications (3.92 KB, text/plain)
2016-11-04 10:49 MDT, paull
Details
slurmd logs after restart (21.55 KB, application/x-trash)
2016-11-18 08:51 MST, paull
Details

Description paull 2016-11-04 10:22:55 MDT
Created attachment 3677 [details]
Output of various logs and config files

I have upgraded from 14.11 to 16.05.4.1 recently and am having issues. 

I've set MemSpecLimit to 6144 for all nodes but some either go to 5887 or never change from their original setting. 

sinfo shows a lot of nodes as Not Responding, and slurmctld reports mismatched slurm.conf files. The file is centralized and should be identical after a slurmd restart. The not-responding and slurm.conf-mismatch problems are only happening on some of our nodes. 

Any idea why this may be? 

I have attached a log showing outputs from various logs and config files.
Comment 1 Alejandro Sanchez 2016-11-04 10:30:53 MDT
Paul - can you upload the whole slurm.conf? I'm curious to see if you have task/cgroup enabled. There was this fix introduced in 15.08.12:

 -- Fix MemSpecLimit to explicitly require TaskPlugin=task/cgroup and
    ConstrainRAMSpace set in cgroup.conf.

And I see no ConstrainRAMSpace in your cgroup.conf either.
Comment 2 paull 2016-11-04 10:48:41 MDT
Created attachment 3680 [details]
Main settings
Comment 3 paull 2016-11-04 10:49:10 MDT
Created attachment 3681 [details]
node classifications
Comment 4 paull 2016-11-04 10:50:09 MDT
Alejandro,

Ah, but why would this seem to work on others?

See attached. Slurm_common.conf is included in slurm.conf.

So do I need to add

TaskPlugin=task/cgroup to slurm.conf

and 

ConstrainRAMSpace=yes to cgroup.conf?
Comment 5 Alejandro Sanchez 2016-11-04 10:54:06 MDT
(In reply to paull from comment #4)
> Alejandro,
> 
> Ah, but why would this seem to work on others?
> 
> See attached. Slurm_common.conf is included in slurm.conf.
> 
> So do I need to add
> 
> TaskPlugin=task/cgroup to slurm.conf
> 
> and 
> 
> ConstrainRAMSpace=yes to cgroup.conf?

Exactly. Commit 588ce8bd9 introduced these requirements for MemSpecLimit to work. Otherwise a message like the following should be logged to your slurmd.log:

+		error("Resource spec: cgroup job confinement not configured. "
+		      "MemSpecLimit requires TaskPlugin=task/cgroup and "
+		      "ConstrainRAMSpace=yes in cgroup.conf");

Could you try to change these values and restart the daemons?
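[Editor's sketch] Concretely, the two additions amount to the following (file locations are whatever your install uses; only the two option lines come from this ticket):

```
# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainRAMSpace=yes
```

Per the commit message quoted above, MemSpecLimit only takes effect with both options in place.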
Comment 6 paull 2016-11-04 13:16:55 MDT
I did that, restarted slurmctld, ran scontrol reconfig, resumed a node, and restarted slurmd on that node.

Now I see this:
NodeName=jnod0057 CoresPerSocket=10
   CPUAlloc=0 CPUErr=0 CPUTot=40 CPULoad=0.00
   AvailableFeatures=localdisk,gpu,nogpu,intel,4gpu
   ActiveFeatures=localdisk,gpu,nogpu,intel,4gpu
   Gres=(null)
   NodeAddr=nod0057 NodeHostName=nod0057 Version=16.05
   RealMemory=129163 AllocMem=0 FreeMem=121658 Sockets=2 Boards=1
   MemSpecLimit=123017
   State=IDLE# ThreadsPerCore=2 TmpDisk=425702 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2016-11-04T17:31:10 SlurmdStartTime=2016-11-05T00:11:36
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


I know the "#" means it's configuring, but what's configuring, and why is the MemSpecLimit still incorrect?
Comment 7 Danny Auble 2016-11-04 15:21:28 MDT
Paul, would you mind sending in your slurmd log as well as your slurmctld log?  From the snippet it would appear the slurm.conf on your nodes is different from the one running on your slurmctld node.  But I am supposing that was before you altered the slurm.conf with the new cgroup setting.

My guess is there is a disconnect between the daemons.  Please verify the slurm.conf and cgroup.conf files are the same and the daemons have restarted and send the logs from that process.
Comment 8 Danny Auble 2016-11-04 15:45:08 MDT
For what it is worth, Paul, I am unable to reproduce this in 16.05.4 or 16.05.6.

It is interesting, though: RealMemory=129163 - MemSpecLimit=123017 = 6146, which is close to what you have in your settings.  Not sure how that is getting there, though.

Is anything running on the system?  Are you able to get anything going?  Is this problem happening on all your nodes?
Comment 9 paull 2016-11-04 15:47:39 MDT
Hi Alejandro, 

Due to jobs not being able to run, and the upcoming need for jobs to be running as soon as possible, I was forced to roll back to 14.11. For now, we will need to do more testing to make sure the next rollout goes more smoothly. 

To clarify: both slurmctld and slurmd share the exact same file. It is an NFS-mounted file, so restarting either daemon will/should pick up the same file. For some reason, that was not happening. 

I'm going to attempt to extend our test environment and try to replicate this issue. At which time I can send you the logs requested. 

When I rolled back to 14.11, the slurmctld log produced this error:

[2016-11-05T03:19:23.761] slurmctld version 14.11.10 started on cluster cluster
[2016-11-05T03:19:24.401] error: ***********************************************
[2016-11-05T03:19:24.401] error: Can not recover usage_mgr state, incompatible version, got 7680 need 1
[2016-11-05T03:19:24.401] error: ***********************************************
[2016-11-05T03:19:24.401] error: ***********************************************
[2016-11-05T03:19:24.401] error: Can not recover usage_mgr state, incompatible version, got 7680 need 1
[2016-11-05T03:19:24.401] error: ***********************************************
[2016-11-05T03:19:24.405] layouts: no layout to initialize
[2016-11-05T03:19:24.405] error: read_slurm_conf: default partition not set.
[2016-11-05T03:19:24.416] layouts: loading entities/relations information
[2016-11-05T03:19:24.416] error: gres_plugin_node_state_unpack: unpack error from node jnod0001
[2016-11-05T03:19:24.416] error: Incomplete node data checkpoint file
[2016-11-05T03:19:24.416] Recovered state of 0 nodes
[2016-11-05T03:19:24.416] error: unpackmem_xmalloc: Buffer to be unpacked is too large (825306218 > 67108864)
[2016-11-05T03:19:24.416] error: Incomplete job record
[2016-11-05T03:19:24.416] error: Incomplete job data checkpoint file
[2016-11-05T03:19:24.416] Recovered information about 0 jobs


We run slurmdbd backed by a MySQL database, which was backed up prior to the upgrade. To roll back, we simply restore the backup. All jobs were also lost in the process. Any idea why, and how can we avoid this in the event a rollback is necessary again?
Comment 10 paull 2016-11-04 15:50:04 MDT
Hi Danny,

For the MemSpecLimit, please confirm that I was supposed to put the floor value and not the ceiling in version 16.05.4.1.

RealMem=120000

Required reserved space= 6000


MemSpecLimit=140000 (How it was in version 14.11)
or
MemSpecLimit=6000 (How we had it configured)
Comment 11 Tim Wickberg 2016-11-07 16:13:28 MST
(In reply to paull from comment #10)
> Hi Danny,
> 
> For the MemSpecLimit, please confirm that I was supposed to put the floor
> value and not the ceiling in version 16.05.4.1.
> 
> RealMem=120000
> 
> Required reserved space= 6000
> 
> 
> MemSpecLimit=140000 (How it was in version 14.11)
> or
> MemSpecLimit=6000 (How we had it configured)



MemSpecLimit should never exceed RealMemory. ( RealMemory - MemSpecLimit ) is the upper limit on how much memory Slurm will assign to jobs on the node.

As an alternative approach, assuming you're not using cgroups, you can just intentionally under-report the RealMemory value for the node as a way to set aside space for the OS and non-user jobs.
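[Editor's sketch] Tim's arithmetic can be checked directly with the numbers from comment 10; a trivial illustration (plain Python restating the relationship, not Slurm code):

```python
# MemSpecLimit reserves memory for the OS and Slurm daemons; Slurm will
# only allocate (RealMemory - MemSpecLimit) to jobs on the node.
real_memory = 120000     # MB, RealMem from comment 10
mem_spec_limit = 6000    # MB, the reserved space actually wanted

job_limit = real_memory - mem_spec_limit
print(job_limit)         # 114000 MB usable by jobs

# The 14.11-era value of 140000 would exceed RealMemory, which is
# exactly what Tim warns against: MemSpecLimit must stay below RealMemory.
assert mem_spec_limit < real_memory
```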
Comment 12 paull 2016-11-08 08:38:01 MST
Hi Tim,

We do use cgroups. Prior to updating the configs with TaskPlugin and ConstrainRAMSpace, some nodes were missing the slurm sub-directory in the memory cgroup directory. Once I changed the configs they were created; it was just weird that it worked for some nodes and not for others before the config change.

We have now reconfigured MemSpecLimit for 16.05.4.1 accordingly ( RealMem - MemSpecLimit ). Are there any other parameters related to MemSpecLimit that would need to be updated in our configs?

Thanks,
Paul
Comment 13 Tim Wickberg 2016-11-09 14:17:07 MST
(In reply to paull from comment #12)
> Hi Tim,
> 
> We do use cgroups. Prior to updated the configs with the TaskPlugin and
> ConstrainRAMSpace, some nodes were missing the slurm sub-directory in the
> memory cgroup directory. Once I changed the configs they were updated, it
> was just weird that it worked from some and not for others prior to the
> config change.
> 
> This is how we have reconfigured MemSpecLimit for update 16.05.4.1 ( RealMem
> - MemSpecLimit ). Are there any other parameters with MemSpecLimit that
> would need to be updated in our configs?

The only other thing I'd note is that if you move to 16.05.5 or later you would no longer need the ReleaseAgent setting. That setting can be problematic on systemd-based distributions, as systemd prefers the release_agent mount option to point to something of its own devising.

But as-is I believe you should be okay.

Is there anything else I can answer on this?
Comment 14 paull 2016-11-09 15:10:08 MST
No, that will be all. I will do some more testing to be sure this is the extent of the changes needed, and try to push this out again soon. For now we can close this.
Comment 15 paull 2016-11-16 11:06:15 MST
Hi,

I have built a test environment and have jobs running. The thing that bothers me is that I have set the required options (TaskPlugin=task/cgroup in slurm.conf and ConstrainRAMSpace=yes in cgroup.conf), but the MemSpecLimit in the scontrol output is incorrect:

From slurm.conf:

NodeName=htst0101 CPUs=1 RealMemory=10019 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN Feature=amd MemSpecLimit=6114


From scontrol output:

[root@htst0001 slurm-test_logs]# scontrol show node=htst0101
NodeName=htst0101 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.00
   AvailableFeatures=amd
   ActiveFeatures=amd
   Gres=(null)
   NodeAddr=htst0101 NodeHostName=htst0101 Version=16.05
   OS=Linux RealMemory=10019 AllocMem=0 FreeMem=1618 Sockets=1 Boards=1
   MemSpecLimit=3875
   State=IDLE ThreadsPerCore=1 TmpDisk=750 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2016-11-16T11:26:58 SlurmdStartTime=2016-11-16T11:42:58
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


Why isn't it updating when I restart slurmctld and run scontrol reconfig? Is there a new way of reconfiguring things after making changes to slurm.conf?
Comment 16 Alejandro Sanchez 2016-11-16 11:15:57 MST
Paul - are you actually restarting slurmctld or just scontrol reconfig? scontrol reconfig will not suffice for this change and an actual restart is required.
Comment 17 paull 2016-11-16 11:19:16 MST
Just a heads up, the cgroup on the node is correct:

[root@htst0101 slurm]# awk '{print int($1/1024/1024)}' '/cgroup/memory/slurm_htst0101/system/memory.limit_in_bytes' 
6144
Comment 18 paull 2016-11-16 11:19:55 MST
I always restart the slurmctld daemon then scontrol reconfig.
Comment 19 Alejandro Sanchez 2016-11-16 11:24:12 MST
Are all the Slurm components (clients: scontrol, sinfo, plus all the daemons) running the same 16.05.4.1 version? What's the output of scontrol -V?
Comment 20 paull 2016-11-16 11:32:45 MST
Yes they are:

On the Server:

[root@htst0001 slurm-test_logs]# scontrol -V
slurm 16.05.4
[root@htst0001 slurm-test_logs]# /etc/init.d/slurm-test status
slurmctld (pid 30451) is running...
[root@htst0001 slurm-test_logs]# readlink -f /proc/30451/exe 
/d/sw/slurm-test/20160909-16050401/sbin/slurmctld
[root@htst0001 slurm-test_logs]# /etc/init.d/slurmdbd-test status
slurmdbd (pid 20561) is running...
[root@htst0001 slurm-test_logs]# readlink -f /proc/20561/exe 
/d/sw/slurm-test/20160909-16050401/sbin/slurmdbd
[root@htst0001 slurm-test_logs]# which scontrol
/d/sw/slurm-test/latest/bin/scontrol
[root@htst0001 slurm-test_logs]# which sinfo
/d/sw/slurm-test/latest/bin/sinfo
[root@htst0001 slurm-test_logs]# ll /d/sw/slurm-test/latest
lrwxrwxrwx 1 root root 17 Oct  3 21:19 /d/sw/slurm-test/latest -> 20160909-16050401


On the node:

[root@htst0101 slurm]# /etc/init.d/slurm status
slurmd (pid 4567) is running...
[root@htst0101 slurm]# readlink -f /proc/4567/exe
/d/sw/slurm-test/20160909-16050401/sbin/slurmd
Comment 21 paull 2016-11-16 12:58:36 MST
I believe I may have found what's happening. Not sure whether this is by design, but here is what I have found:

[root@htst0001 slurm-test_logs]# scontrol show node=htst0101 | grep -E "RealMem|MemSpec"
   OS=Linux RealMemory=10019 AllocMem=0 FreeMem=1609 Sockets=1 Boards=1
   MemSpecLimit=3875

If you compute (RealMemory - MemSpecLimit) it equals 6144, which is what I set MemSpecLimit to in the config. If that's how it's designed we will roll with it, but it doesn't seem consistent, since the same variable name is set differently in the config. 

Is this how it is supposed to be?
Comment 22 Alejandro Sanchez 2016-11-16 14:43:27 MST
(In reply to paull from comment #21)
> I believe I may have found whats happening. Not sure if this is the design
> or not but here is what I have found:
> 
> [root@htst0001 slurm-test_logs]# scontrol show node=htst0101 | grep -E
> "RealMem|MemSpec"
>    OS=Linux RealMemory=10019 AllocMem=0 FreeMem=1609 Sockets=1 Boards=1
>    MemSpecLimit=3875
> 
> If you do (RealMemory - MemSpecLimit) this equals 6144 which is what I see
> the MemSpecLimit to in the config. If thats how its designed we will roll
> with it but it doesn't seem consistent since the same variable name is set
> differently in the config. 
> 
> Is this how it is supposed to be?

No, scontrol show node should output the MemSpecLimit value as configured. In fact, I tried changing MemSpecLimit values on 16.05.4, and after restarting slurmctld I see the same configured value. This is strange. What happens if you restart slurmd on node htst0101? Maybe there's an issue saving/loading the node_state?
Comment 23 Alejandro Sanchez 2016-11-16 15:05:19 MST
Also please upload the slurmd.log on htst0101 after the slurmd restart.
Comment 24 paull 2016-11-18 08:50:32 MST
I had to make some hostname changes so htst0101 is now htst0701.

I changed its MemSpecLimit in slurm.conf from 6144 to 5144. I restarted slurmctld, did an scontrol reconfig on the controller, and then restarted the slurmd on the node.

After:

NodeName=htst0701 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.23
   AvailableFeatures=amd
   ActiveFeatures=amd
   Gres=(null)
   NodeAddr=htst0701 NodeHostName=htst0701 Version=16.05
   OS=Linux RealMemory=10019 AllocMem=0 FreeMem=657 Sockets=1 Boards=1
   MemSpecLimit=6144
   State=IDLE ThreadsPerCore=1 TmpDisk=750 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2016-11-18T09:29:54 SlurmdStartTime=2016-11-18T09:39:49
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[root@htst0701 ~]# awk '{print int($1/1024/1024)}' '/cgroup/memory/slurm_htst0701/system/memory.limit_in_bytes'                                     
5144

I have also attached the logs from htst0701 here. As you can see, the node changes its limits; just the controller doesn't present the MemSpecLimit correctly.
Comment 25 paull 2016-11-18 08:51:04 MST
Created attachment 3724 [details]
slurmd logs after restart
Comment 26 Alejandro Sanchez 2016-11-18 09:59:34 MST
Oh, I can finally reproduce this. The key is to set FastSchedule=0. We're going to work on it and get back to you.
Comment 27 Alejandro Sanchez 2016-11-18 10:31:10 MST
Paul - since nodes are responding and MemSpecLimit is set properly in the cgroups (just scontrol is not printing the right value), I'm switching severity to 3 and we'll continue with this next week.
Comment 28 paull 2016-11-18 11:24:05 MST
Ok thanks for all your help Alejandro. I will await your response next week. Have a great weekend!
Comment 29 paull 2016-11-22 12:40:37 MST
Hi Alejandro,

Any update? Do you believe this will affect anything further?

Thanks,
Paul
Comment 30 Alejandro Sanchez 2016-11-22 13:40:54 MST
(In reply to paull from comment #29)
> Hi Alejandro,
> 
> Any update? Do you believe this will affect anything further?
> 
> Thanks,
> Paul

Hi Paul - I still have to work more on this bug. Hopefully tomorrow I will. My guess is that it's just a display issue on the client side, but I have to confirm. Will come back to you with updates ASAP.
Comment 31 Alejandro Sanchez 2016-11-23 09:05:20 MST
Paul - just as an update. I've set some breakpoints on the server side while a reconfig happens, and it seems the new mem_spec_limit value is pushed properly to the node hash table, which is responsible for keeping the node info in memory. Then I started setting breakpoints on the client side, and it seems that scontrol_load_nodes calls slurm_load_node, which returns the old info instead of the new. I'm trying to figure out why, but I'm guessing it is an issue with a last_update member or the last_node_update variable: if these do not change when mem_spec_limit changes, scontrol will always retrieve the old information instead of the new. I'll continue working on this and come back to you again.
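[Editor's sketch] The mechanism Alejandro describes — clients only re-fetching node info when a server-side timestamp advances — can be illustrated with a minimal toy model (hypothetical Python, not Slurm's actual C implementation; the names NodeTable and bump are invented for illustration):

```python
class NodeTable:
    """Toy model of the server-side node table guarded by last_node_update."""

    def __init__(self):
        self._clock = 1                      # stand-in for wall-clock time
        self.last_node_update = self._clock
        self.nodes = {"htst0101": {"mem_spec_limit": 6144}}

    def set_mem_spec_limit(self, node, value, bump=True):
        self.nodes[node]["mem_spec_limit"] = value
        self._clock += 1
        if bump:
            # If this bump is skipped (the suspected bug), clients holding
            # an older snapshot are told "no change" and keep stale data.
            self.last_node_update = self._clock

    def load_nodes(self, client_last_update):
        """Mirrors the idea behind slurm_load_node: return fresh data only
        when the table changed after the client's snapshot."""
        if client_last_update >= self.last_node_update:
            return None                      # client keeps its cached copy
        return {n: dict(v) for n, v in self.nodes.items()}

table = NodeTable()
snap_time = table.last_node_update
table.set_mem_spec_limit("htst0101", 5144, bump=False)   # buggy path
print(table.load_nodes(snap_time))                       # None: stale 6144 still shown
table.set_mem_spec_limit("htst0101", 5144, bump=True)    # correct path
print(table.load_nodes(snap_time)["htst0101"]["mem_spec_limit"])  # 5144
```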
Comment 32 paull 2016-11-23 09:11:14 MST
Thanks Alejandro for the update.
Comment 59 Alejandro Sanchez 2016-12-06 02:16:37 MST
Paul, the following commit, addressed for 16.05.7, fixes the issue:

https://github.com/SchedMD/slurm/commit/1eeb9e457e7

mem_spec_limit should never have been packed/unpacked and saved/loaded to node_state in the first place. We don't need to keep state, as the value is always read from slurm.conf. If we used the saved value it would overwrite the configured one, so we now throw it away when unpacked. In 17.02 we don't even pack/unpack it, per this other commit:

https://github.com/SchedMD/slurm/commit/b8aec60b3d6

Marking as resolved/fixed. Please reopen if you encounter further issues with this.
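[Editor's sketch] The fix described above — still consuming the field from old state files so the buffer stays aligned, but discarding it in favour of the slurm.conf value — follows a common pattern; a generic illustration (Python with struct, not Slurm's actual pack/unpack code; the record layout is invented):

```python
import struct

def pack_node_state(real_memory, mem_spec_limit):
    # Old-style state record: both fields saved, simplified here
    # to two network-order 32-bit integers.
    return struct.pack("!II", real_memory, mem_spec_limit)

def unpack_node_state(buf, conf_mem_spec_limit):
    real_memory, _saved_limit = struct.unpack("!II", buf)
    # The fix: read _saved_limit to keep older state files parseable,
    # but throw it away -- MemSpecLimit always comes from slurm.conf,
    # so restored state must never overwrite the configured value.
    return {"real_memory": real_memory,
            "mem_spec_limit": conf_mem_spec_limit}

state = pack_node_state(10019, 3875)    # stale value saved on disk
node = unpack_node_state(state, 6144)   # slurm.conf says 6144
print(node["mem_spec_limit"])           # 6144, not the stale 3875
```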
Comment 60 paull 2016-12-06 09:01:17 MST
Thank you, Alejandro, for your support. I will be testing 16.05.7 as you recommended in ticket 3287.

Thanks,
Paul