Ticket 3818

Summary: Utilization issues with constraints
Product: Slurm    Reporter: NASA JSC Aerolab <JSC-DL-AEROLAB-ADMIN>
Component: Scheduling    Assignee: Danny Auble <da>
Status: RESOLVED FIXED    QA Contact:
Severity: 2 - High Impact
Priority: ---
Version: 16.05.7
Hardware: Linux
OS: Linux
Site: Johnson Space Center
Version Fixed: 16.05.8 17.02.0
Attachments:
examples
scontrol output
Latest slurm.conf
slurmctld log
Another example of the issue
more examples
L1 cluster slurm.conf for a single node
script to start multiple slurmds for the L1 cluster

Description NASA JSC Aerolab 2017-05-17 10:55:51 MDT
Created attachment 4577 [details]
conf files

The overall problem is described here:

https://groups.google.com/forum/#!topic/slurm-devel/4tfWMDuJ6ww

In short, high-priority jobs block other jobs from running even when the jobs request different features.  We have a very strong desire to see this fixed, as it would really help overall throughput on our cluster.

Other people on the list seem to run into the same problem - see "Priority blocking jobs despite idle machines" for one example.  

I'll attach a tar file with:

* our conf files
* some example output of when this occurs
Comment 1 NASA JSC Aerolab 2017-05-17 11:06:43 MDT
Attaching several examples.  Focus on *.2017-04-26.0803 for now.  In this example there are three jobs requesting bro (Broadwell) nodes that are not running, even though there are idle bro nodes.
Comment 2 NASA JSC Aerolab 2017-05-17 11:07:37 MDT
Created attachment 4578 [details]
examples
Comment 3 Tim Wickberg 2017-05-17 12:06:47 MDT
I'm sending this over to Dominik to see what he can work out.

This does seem to be a side-effect of how the constraints are handled internally; I'm asking him to work out whether there may be a simple path to resolve this, or if we may need to move this into a longer-term enhancement request for a future release. (There's a limit on how extensive of a change we can accept on the stable releases.)

- Tim
Comment 4 Danny Auble 2017-05-22 11:54:14 MDT
Darby, sorry for the delay on this.  Would it be possible to get the exact submit line for the job that is holding up the queue, as well as for the jobs pending behind it?

I would also be interested in 'scontrol show job' output for those jobs; the q output isn't as helpful.

I'll try to reproduce on my own but these extra items would be helpful.
Comment 5 NASA JSC Aerolab 2017-05-22 13:52:15 MDT
All jobs are submitted via sbatch and have the same basic form:

#SBATCH -n 124
#SBATCH --time=1:00:00
#SBATCH --constraint=[wes|san|has|bro]


We have an epilog script that records the output of "scontrol -o show job ${SLURM_JOB_ID}" for each job.  Is that enough for you?  Otherwise, we also have slurmdbd running with a SQL backend and can query more information that way if you'd like; I'm not very familiar with sacct, so I'd need some help with the exact commands.  I'm attaching a file with the show job output for the jobs around the 4/26/17 time frame.  Let me know what else you'd like to see.
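
For illustration, a minimal sketch of the kind of epilog script in question (the log path here is just a placeholder, not our real location):

#!/bin/bash
# Slurm epilog: append the final job record for the completing job to a log.
# slurmd sets SLURM_JOB_ID in the epilog environment.
LOG=/var/log/slurm/job_records.log
scontrol -o show job "${SLURM_JOB_ID}" >> "${LOG}" 2>&1
exit 0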
Comment 6 NASA JSC Aerolab 2017-05-22 13:56:37 MDT
Created attachment 4609 [details]
scontrol output
Comment 7 Danny Auble 2017-05-22 16:37:21 MDT
Thanks for the output.  I haven't been able to recreate the scenario just yet though.

Out of curiosity would you be able to switch from

select/linear to select/cons_res

and add

OverSubscribe=Exclusive

to your PartitionName definitions (which will give you the same kind of exclusive behavior as select/linear)?
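
Roughly, the relevant slurm.conf lines would look something like this (a sketch only; the node list and partition name are placeholders for yours, and note cons_res also needs a SelectTypeParameters value):

SelectType=select/cons_res
SelectTypeParameters=CR_CPU
# OverSubscribe=EXCLUSIVE keeps whole-node allocation, like select/linear did
PartitionName=normal Nodes=r1n[001-100] Default=YES State=UP OverSubscribe=EXCLUSIVE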

I am curious whether this makes a difference, since cons_res is the more commonly used code path.

Ideally you would recreate the problem on a test system first and then try the different SelectType there.

Let me know if this is possible.
Comment 8 Danny Auble 2017-05-23 13:56:30 MDT
Did this help the situation?  Have you been able to test?
Comment 9 NASA JSC Aerolab 2017-05-23 14:04:58 MDT
No, I haven't had a chance to test yet.  Unfortunately, we don't have a good test system so I'd have to use our production system.  Are you pretty confident I can make this change to our production system without too much impact?
Comment 10 Danny Auble 2017-05-23 14:13:18 MDT
I am fairly sure you can do this on a production system.  The change really only applies to the slurmctld, though you will probably want to restart all your slurmds as well.  But if you want to be ultra safe you can always schedule the change for your next maintenance window.  You can also send your updated slurm.conf so I can double-check that you didn't miss anything.

On a related note, you can also install Slurm on a single Linux box and emulate your environment fairly easily.  That way you can change options in your slurm.conf, try them out, and move them to the production system once you are confident the change is what you want.

See https://slurm.schedmd.com/faq.html#multi_slurmd and just use the --enable-multiple-slurmd option when configuring the build.
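
A rough sketch of the build step (the install prefix here is just an example):

./configure --prefix=/opt/slurm-test --enable-multiple-slurmd
make -j
make install

With that build you can start several slurmds on one box, one per emulated node, using slurmd -N <nodename>.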

If you would like more help on that please let me know.
Comment 11 NASA JSC Aerolab 2017-05-23 14:37:15 MDT
Changes made.  I will upload our new slurm.conf.  

BUT... I didn't have SelectTypeParameters=CR_CPU set, and by the time I got this fixed all the jobs on our system had died...  Is there any way to easily requeue them?  And how do I prevent all jobs dying in the future?!
Comment 12 NASA JSC Aerolab 2017-05-23 14:38:09 MDT
Created attachment 4624 [details]
Latest slurm.conf
Comment 13 Danny Auble 2017-05-23 14:44:39 MDT
Well, that is strange.  If the jobs were still alive on the nodes you could reload the old state files, but it sounds like things are all gone.

Are you sure the jobs got killed?

When I start the slurmctld without a SelectTypeParameter I get a fatal

May 23 14:39:47.286147 26277 slurmctld    0x7f76db0a6b40: fatal: Invalid SelectTypeParameters: NONE (0), You need at least CR_(CPU|CORE|SOCKET)*

That shouldn't kill anything.

The best way to prevent jobs from dying is to test changes before applying them.

Could you send your slurmctld log from when the jobs died?  It seems strange that this change would have caused that.

I don't think there is an easy way to requeue jobs unless you ask the original submitter to do so.

The slurm.conf you sent is the same as before (from what I can tell).
Comment 14 Danny Auble 2017-05-23 14:49:11 MDT
Darby, have you restarted your slurmds yet?
Comment 15 Danny Auble 2017-05-23 14:54:19 MDT
From what I can tell it sounds like you haven't restarted your slurmds yet.  This is good; I think I can help you recover.

Switch your SelectType back to select/linear.

Then go to your StateSaveLocation (/software/x86_64/slurm/state).

Copy that directory to a safe location, then go into it.  There you will find a job_state.old file; overwrite the job_state file with it.

Now restart your slurmctld and you should be back in business.
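
In shell terms, roughly (a sketch; stop the slurmctld first so nothing rewrites the state while you copy it):

# after switching SelectType back to select/linear in slurm.conf
cd /software/x86_64/slurm
cp -a state state.backup        # safe copy of the whole state directory
cd state
cp -p job_state.old job_state   # restore the previous job state
# now restart the slurmctld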
Comment 16 Danny Auble 2017-05-23 14:56:20 MDT
You will need to comment out SelectTypeParameters as well (as CR_CPU isn't valid with select/linear).

Let me know if that gets you back up or not.
Comment 17 NASA JSC Aerolab 2017-05-23 15:08:02 MDT
I'm trying to upload the log file but it won't let me – I keep getting access denied.  The log file is only 1.7 MB.  

I had restarted Slurm with the new configuration, so I think my state files got overwritten.


[root@service1 state]# ll
total 228
-rw------- 1 slurm eg3  9236 May 23 15:43 assoc_mgr_state
-rw------- 1 slurm eg3  9236 May 23 15:39 assoc_mgr_state.old
-rw------- 1 slurm eg3  2328 May 23 15:43 assoc_usage
-rw------- 1 slurm eg3  2328 May 23 15:39 assoc_usage.old
-rw-r--r-- 1 slurm eg3     2 Dec 30 15:45 clustername
-rw------- 1 slurm eg3     0 May 23 15:43 dbd.messages
drwx------ 2 slurm eg3     6 May 23 15:29 hash.0
drwx------ 2 slurm eg3     6 May 23 15:29 hash.1
drwx------ 2 slurm eg3     6 May 23 15:29 hash.2
drwx------ 2 slurm eg3     6 May 23 15:29 hash.3
drwx------ 2 slurm eg3     6 May 23 15:29 hash.4
drwx------ 2 slurm eg3     6 May 23 15:29 hash.5
drwx------ 2 slurm eg3     6 May 23 15:29 hash.6
drwx------ 2 slurm eg3     6 May 23 15:29 hash.7
drwx------ 2 slurm eg3     6 May 23 15:29 hash.8
drwx------ 2 slurm eg3     6 May 23 15:29 hash.9
-rw------- 1 slurm eg3    35 May 23 15:43 job_state
-rw------- 1 slurm eg3    35 May 23 15:39 job_state.old
-rw------- 1 slurm eg3    42 May 23 15:29 last_config_lite
-rw------- 1 slurm eg3    42 May 23 15:24 last_config_lite.old
-rw------- 1 slurm eg3 10240 May 23 15:43 layouts_state_base
-rw------- 1 slurm eg3 10240 May 23 15:24 layouts_state_base.old
-rw------- 1 slurm eg3 56317 May 23 15:43 node_state
-rw------- 1 slurm eg3 56317 May 23 15:39 node_state.old
-rw------- 1 slurm eg3   615 May 23 15:43 part_state
-rw------- 1 slurm eg3   615 May 23 15:39 part_state.old
-rw------- 1 slurm eg3    16 May 23 15:39 priority_last_decay_ran
-rw------- 1 slurm eg3    16 May 23 15:34 priority_last_decay_ran.old
-rw------- 1 slurm eg3   101 May 23 15:43 qos_usage
-rw------- 1 slurm eg3   101 May 23 15:39 qos_usage.old
-rw------- 1 slurm eg3    35 May 23 15:43 resv_state
-rw------- 1 slurm eg3    35 May 23 15:39 resv_state.old
-rw------- 1 slurm eg3    31 May 23 15:43 trigger_state
-rw------- 1 slurm eg3    31 May 23 15:39 trigger_state.old
[root@service1 state]#
Comment 18 Danny Auble 2017-05-23 15:19:15 MDT
Unless we have the state files there isn't much we can do.  I am sorry I gave you bad advice at the start of this.  In my tests I didn't see this issue, but I can easily reproduce it now.  Changing the select type can only be done with no jobs in the system.

There are two unattractive options...

One is to keep the slurmctld down until all the running jobs finish.  If you start the slurmctld back up now, then as the slurmds check in they will kill off the jobs, since the slurmctld doesn't know anything about them.  Keeping it down at least lets the jobs currently running on the system finish, but no new jobs can be scheduled in the meantime.

Two is to start things back up and have everyone start over again.

I understand both of these options are bad, but without state there isn't much we can do.
Comment 19 Danny Auble 2017-05-23 15:20:32 MDT
Created attachment 4625 [details]
slurmctld log
Comment 20 Danny Auble 2017-05-23 16:20:04 MDT
Again, I am sorry for the bad intel and the fallout from it.  I am hoping the changes will at least fix the scenario this ticket is about.  Please report back if the problem remains with the new configuration.
Comment 21 NASA JSC Aerolab 2017-05-23 16:36:20 MDT
It happens - I make mistakes too.  :)  So back to the issue:

Thanks for the info about simulating larger clusters.  We have individual workstations that each run Slurm so we can run small stuff at night; I will look into this.  Especially in retrospect, I should have done this instead of testing on our production system.

We are back up and running again.  We are using select/cons_res now.  Most users had gone back to requesting a single processor type to avoid getting hung up in the queue.  I've asked several to use multiple processor types again.  It will probably take some time to see if this is working any better.
Comment 22 NASA JSC Aerolab 2017-05-24 09:50:56 MDT
OK, we caught another example this morning, this time under select/cons_res.  See job 55800 in the files I'm about to upload.
Comment 23 NASA JSC Aerolab 2017-05-24 09:59:16 MDT
Created attachment 4634 [details]
Another example of the issue
Comment 24 NASA JSC Aerolab 2017-05-24 13:43:28 MDT
Data for two more examples.  See the top of q.txt.* for annotation of what should run.

Please let me know if you have enough info or if you want me to keep uploading more examples.
Comment 25 NASA JSC Aerolab 2017-05-24 13:43:51 MDT
Created attachment 4640 [details]
more examples
Comment 26 Danny Auble 2017-05-24 13:59:37 MDT
Thanks Darby.  I'll see if I can get to the root of this.  I'll let you know if I need anything else from you.
Comment 27 Danny Auble 2017-05-24 16:53:06 MDT
Darby, good news, I have reproduced what you are seeing.  I also found the commit that appears to fix the behavior.

This issue is fixed in 16.05.8+.  Commit 05ce9a523fa25d is what fixes it, and the commit is only one line.  With this commit applied I am unable to get jobs to stay stuck in the queue when they should be scheduled.

I would advise an upgrade to 16.05.10 if possible, but if you are unable to upgrade applying the commit will also fix this particular problem.

You can get patches from github by just putting ".patch" on the end of a commit reference.

In this case

https://github.com/schedmd/slurm/commit/05ce9a523fa25d

would be

https://github.com/schedmd/slurm/commit/05ce9a523fa25d.patch
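
So, roughly, from the top of your 16.05.7 source tree (a sketch):

curl -LO https://github.com/schedmd/slurm/commit/05ce9a523fa25d.patch
patch -p1 < 05ce9a523fa25d.patch
# then rebuild and reinstall as usual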

I can guarantee no job loss on a dot-release upgrade (16.05.7 -> 16.05.10), but this patch is also very safe to apply until you can do that.

Let me know how it goes and if it indeed fixes your problem.
Comment 28 NASA JSC Aerolab 2017-05-24 17:21:01 MDT
Great!  I will try this very soon - either tonight or tomorrow.
Comment 29 NASA JSC Aerolab 2017-05-24 21:54:20 MDT
I just installed this update.  I had to apply the change manually to the 16.05.7 source, as the 16.05.10 source gave a compile error.

make[4]: Leaving directory `/tmp/slurm-16.05.10/src/api'
/bin/sh ../../libtool  --tag=CC   --mode=link gcc  -DNUMA_VERSION1_COMPATIBILITY -g -O2 -pthread -Wno-deprecated-declarations -Wall -g -O0 -fno-strict-aliasing -export-dynamic   -o slurmctld acct_policy.o agent.o backup.o burst_buffer.o controller.o front_end.o gang.o groups.o job_mgr.o job_scheduler.o job_submit.o licenses.o locks.o node_mgr.o node_scheduler.o partition_mgr.o ping_nodes.o port_mgr.o power_save.o powercapping.o preempt.o proc_req.o read_config.o reservation.o sched_plugin.o slurmctld_plugstack.o srun_comm.o state_save.o statistics.o step_mgr.o trigger_mgr.o ../../src/common/libdaemonize.la ../../src/api/libslurm.o -ldl 
libtool: link: gcc -DNUMA_VERSION1_COMPATIBILITY -g -O2 -pthread -Wno-deprecated-declarations -Wall -g -O0 -fno-strict-aliasing -o slurmctld acct_policy.o agent.o backup.o burst_buffer.o controller.o front_end.o gang.o groups.o job_mgr.o job_scheduler.o job_submit.o licenses.o locks.o node_mgr.o node_scheduler.o partition_mgr.o ping_nodes.o port_mgr.o power_save.o powercapping.o preempt.o proc_req.o read_config.o reservation.o sched_plugin.o slurmctld_plugstack.o srun_comm.o state_save.o statistics.o step_mgr.o trigger_mgr.o ../../src/api/libslurm.o -Wl,--export-dynamic  ../../src/common/.libs/libdaemonize.a -ldl -pthread
agent.o: In function `_agent_init':
/tmp/slurm-16.05.10/src/slurmctld/agent.c:1282: undefined reference to `clock_gettime'
collect2: ld returned 1 exit status

This is a little odd: I had previously compiled and installed the 17.02.2 source in preparation for an upgrade and it compiled fine.  Anyway, the upgrade to 16.05.7 plus this patch went fine.  Unfortunately, there weren't any jobs with multiple features hung up in the queue tonight, so I can't confirm right away that our issue is fixed, but I'll keep an eye on it tomorrow.

Any thoughts on staying with select/cons_res vs going back to select/linear?  What are the pros/cons of the two?
Comment 30 NASA JSC Aerolab 2017-05-25 08:21:50 MDT
Time will tell for sure, but I think this is working as intended now.  Job 56522 was in the queue this morning requesting only wes.  We changed it to all processor types and it started.  Before the patch, the higher-priority job 56509 would have blocked 56522, since they were both requesting wes nodes.  I'll continue to keep an eye on this, but it looks promising!

Job ID   Username Queue   Jobname              N:ppn Proc Wall  S Elap    Prio     Reason Features  
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
56515    breddell idle    Shield20              15:1   15 08:00 Q 00:00   2258  Resources wes       
56516    breddell idle    Shield20              15:1   15 08:00 Q 00:00   2258  Resources wes       
56517    breddell idle    process09             1:12   12 02:00 R 00:58   2289       None wes       
56520    breddell idle    process10             1:12   12 02:00 R 00:20   2292       None wes       
56511    pjang    normal  m4.0a165pp_from_off  27:16  432 08:00 Q 00:00  11870   Priority bro|has|san
56492    pjang    normal  m4.0a165pp           18:24  432 08:00 Q 00:00  11895   Priority bro       
56485    pjang    normal  m12a155_5sp          18:24  432 08:00 Q 00:00  11900   Priority has       
56493    bstewart normal  D1-D4T2              32:24  768 08:00 Q 00:00  11914   Priority has       
56466    flumpkin long    FBH_heat             36:24  864 24:00 R 01:15  11934       None bro       
56513    pjang    normal  m4.0a165np_from_ny   18:24  432 08:00 Q 00:00  11939   Priority bro       
56514    pjang    normal  m4.0a165np_from_Off  27:16  432 08:00 Q 00:00  11939   Priority bro|has|san
56484    bstewart normal  D1T1                 35:16  560 08:00 R 01:14  11947       None san       
56427    flumpkin normal  D1T3_300_trunk       36:16  576 08:00 R 07:21  11956       None san       
56472    bstewart normal  D1T1                 20:24  480 08:00 R 06:13  11961       None has       
56499    jpowell  normal  m0.35a0.0_C170_xml_  20:12  240 08:00 R 03:19  11962       None wes       
56504    jpowell  normal  m0.35a0.0_itime0_dt  20:12  240 08:00 R 03:06  11962       None wes       
56505    jpowell  normal  m0.35a0.0_C165_xml_  20:12  240 08:00 R 02:57  11962       None wes       
56494    jpowell  normal  m0.35a0.0_dtphys0p5  20:12  240 08:00 R 03:29  11963       None wes       
56469    pjang    normal  m4.0a165ny           18:24  432 08:00 R 06:52  11972       None has       
56510    jpowell  normal  m0.35a0.0_C175_xml_  20:12  240 08:00 Q 00:00  11973   Priority wes       
55829    flumpkin normal  D1T3_600             24:24  576 08:00 R 07:17  11975       None has       
56521    dgs      normal  m0.0a0.0_DTPHYS=0.2   4:24   96 08:00 R 00:23  11988       None [bro|has|san]
56522    jgreat   normal  m0.50a155_itime1_cpu  8:24  192 08:00 R 00:11  12024       None [wes|san|has|bro]
56519    ahyatt   normal  CST_TT_RE10_28DEG    40:11  473 08:00 Q 00:00  12043   Priority wes       
56509    ahyatt   normal  CST_CUBRC_24DEG      40:11  473 04:00 Q 00:00  12051  Resources wes       
56506    ahyatt   normal  CST_CUBRC_28DEG      40:12  480 08:00 R 02:53  12060       None wes       
56508    ahyatt   normal  CST_CUBRC_32DEG      40:12  480 04:00 R 02:35  12060       None wes       
56512    aschwing normal  PEG5                  3:12   36 02:00 R 01:22  12635       None [wes]     
56502    nchampag normal  handrail_0640_node_  17:24  408 08:00 R 01:55  16614       None bro&mem10 

Stats:
 total (406/6888) || ppn=12 (190/2280) | ppn=16 ( 72/1152) | ppn=24 (144/3456) 
S Node   CPU  Job || S Node   CPU  Job | S Node   CPU  Job | S Node   CPU  Job 
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- | - ---- ----- ---- 
Q  270  4144   11 || Q  130  1216    5 | Q    0     0    0 | Q  140  2928    6 
R  363  6164   18 || R  173  2172   10 | R   71  1136    2 | R  119  2856    6 
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- | - ---- ----- ---- 
Q  66%   60%      || Q  68%   53%      | Q   0%    0%      | Q  97%   84%      
R  89%   89%      || R  91%   95%      | R  98%   98%      | R  82%   82%      
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- | - ---- ----- ---- 
F   35   628      || F   17   204      | F    1    16      | F   17   408
Comment 31 Danny Auble 2017-05-25 10:14:04 MDT
Created attachment 4647 [details]
L1 cluster slurm.conf for a single node

Excellent, this is what I saw as well.

Attached you will find the slurm.conf that I used to emulate your system on a single node (snowflake).

You can use it to do the same for your own testing.  You will obviously need to change the directory locations and replace snowflake with the node you are running on, but for the most part it should be ready to go.

I was able to recreate your scenario on my one node using this slurm.conf and a script that loaded the system up with your 'scontrol show job' output.  After that it was fairly straightforward to find when things started working correctly.

Hopefully you will find this helpful.
Comment 32 Danny Auble 2017-05-25 10:16:03 MDT
Created attachment 4648 [details]
script to start multiple slurmds for the L1 cluster

This script will start a slurmd for each of the nodes in your system when emulating the cluster on a single node.
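
The idea is along these lines (a simplified sketch; the node names are placeholders for whatever is defined in the slurm.conf):

#!/bin/bash
# Start one slurmd per emulated node; requires a build with --enable-multiple-slurmd
for i in $(seq -w 1 20); do
    slurmd -N "node${i}"
done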

Hopefully this will make testing easier as well.

Don't forget to raise your ulimits (open files and such) for things to work correctly.

My current settings are

ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63625
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 63536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 63625
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Comment 33 Danny Auble 2017-05-25 10:17:13 MDT
Would you be OK with closing this bug?  We can always reopen it if you find things still aren't working after some time.
Comment 34 NASA JSC Aerolab 2017-05-25 11:24:19 MDT
Thanks a lot for the inputs needed to emulate L1 - very useful.  

Sure, go ahead and close the bug.
Comment 35 Danny Auble 2017-05-25 11:28:04 MDT
No problem.  Please reopen this bug if you need anything else on it.

If you need help with the emulation, please open a new bug.
Comment 36 NASA JSC Aerolab 2017-05-31 09:47:20 MDT
Is there any reason not to change back to select/linear at our next downtime?  As indicated in bug 3844, it sounds like select/cons_res is more intensive (has more overhead) than select/linear.  Do you expect the --constraint=[wes|san|has|bro] behavior to be the same with select/linear?
Comment 37 Danny Auble 2017-05-31 10:03:54 MDT
I always recommend select/cons_res.  The overhead over select/linear isn't much, and the cons_res plugin gives you much more flexibility for very little cost.  One major benefit is that if you ever decide to run multiple jobs per node (even on a single partition), it is easy to turn on without waiting to drain the cluster.

I would recommend you stay on select/cons_res.  There is very little, if any, benefit to switching back to select/linear.