Ticket 8902

Summary: preemption applied inconsistently
Product: Slurm Reporter: Emanuele Breuza <emanuele.breuza>
Component: Scheduling Assignee: Ben Roberts <ben>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: albert.gil, bart, daniele.gregori
Version: 19.05.5   
Hardware: Linux   
OS: Linux   
Site: E4 Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 20.02.4 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: scontrol show job reporting missed preemption
Scheduler log
Slurmctld log
slurm.conf
Slurmctld log
wnode43
wnode44
config_preMod
config_postMod
job_low_5358
job_high_5359
job_high_5360
job_low_5361
job_low_5361_F
job_high_5362
job_high_5363
cgroup_conf
Slurmctld log during failing tests

Description Emanuele Breuza 2020-04-20 09:15:25 MDT
Hello support,
my customer reported that preemption is applied inconsistently in his cluster. The command executed to submit the low priority job is

sbatch --partition=low --mem=360G --nodes=2 --ntasks-per-node=64 job_script -v

The command for the high priority job is

sbatch --partition=high --mem=360G --nodes=2 --ntasks-per-node=64 job_script -v

The problem seems to be related to resumed jobs; they seem to be immune to preemption.

Do you have any suggestions about how to investigate this issue?

Thank you very much for your help,
Emanuele
Comment 1 Ben Roberts 2020-04-20 09:30:29 MDT
Hi Emanuele,

Can you clarify what you mean when you say the "resumed job"?  Is that the job in the low partition?  Or are these jobs being preempted once correctly and then not preempting a second time?  Can you reproduce this and send the 'scontrol show job <jobid>' output for a job that should be preempted as well as the job that should be able to preempt it?  I would also like to see the slurmctld logs from the time that the example jobs you have are in the system.  One last thing I would like to see is the output of 'scontrol show config'.

Thanks,
Ben
Comment 2 Emanuele Breuza 2020-04-20 10:33:13 MDT
Created attachment 13878 [details]
scontrol show job reporting missed preemption
Comment 3 Emanuele Breuza 2020-04-20 10:34:40 MDT
Created attachment 13879 [details]
Scheduler log
Comment 4 Emanuele Breuza 2020-04-20 10:36:14 MDT
Created attachment 13880 [details]
Slurmctld log
Comment 5 Emanuele Breuza 2020-04-20 10:37:07 MDT
Created attachment 13881 [details]
slurm.conf
Comment 6 Emanuele Breuza 2020-04-20 10:49:32 MDT
Hello Ben,
I attached a few files. Hopefully, they will help in the investigation. Unfortunately, I do not have direct access to the cluster, so I have to wait until the next meeting to collect better files.

- Can you clarify what you mean when you say the "resumed job"?  Is that the job in the low partition?  Or are these jobs being preempted once correctly and then not preempting a second time? 

Yes, this is exactly what I observed (at least once).

- Can you reproduce this and send the 'scontrol show job <jobid>' output for a job that should be preempted as well as the job that should be able to preempt it?

See file .png related to missed preemption: job 5113 (lower partition, started at 14:12:44) and job 5138 (higher partition, submitted at 14:30:12 and started at 14:39:17).

- I would also like to see the slurmctld logs from the time that the example jobs you have are in the system.

I attached the file you requested. I will generate a more verbose one during the next meeting with the customer.

- One last thing I would like to see is the output of 'scontrol show config'.

I forwarded your request to my customer. In the meantime, I attached the slurm.conf.


Thank you very much,
Emanuele
Comment 7 Ben Roberts 2020-04-20 12:25:44 MDT
Hi Emanuele,

Thanks for sending the information you have.  I looked over the logs and unfortunately they don't show much at the current log level.  When you are able to get access to the cluster again and reproduce it, it would be good to have more verbose logs.  Can you raise the log level long enough to submit a job that should be able to preempt other jobs and then set it back down?  If so, here are the commands you would use to do that (assuming your current log level is 'info'):
scontrol setdebug debug2
scontrol setdebug info

Do you see preemption working in other cases?  If so please show an example of a working preemption as well so I can look at the differences.  How soon do you think you would be able to collect this information?

Thanks,
Ben
Comment 8 Emanuele Breuza 2020-04-20 12:43:25 MDT
Hello Ben,
I already asked my customer to
Comment 9 Emanuele Breuza 2020-04-20 12:45:18 MDT
Hello Ben,
I already sent a request to my customer to have a meeting tomorrow. Unfortunately, I haven't received any reply yet. I will update you as soon as possible.

See you soon,
Emanuele
Comment 10 Emanuele Breuza 2020-04-21 06:54:14 MDT
Created attachment 13891 [details]
Slurmctld log

with debug3
Comment 11 Emanuele Breuza 2020-04-21 06:57:20 MDT
Created attachment 13892 [details]
wnode43

scontrol show node wnode43
Comment 12 Emanuele Breuza 2020-04-21 06:57:57 MDT
Created attachment 13893 [details]
wnode44

scontrol show node wnode44
Comment 13 Emanuele Breuza 2020-04-21 07:07:04 MDT
Created attachment 13895 [details]
config_preMod

scontrol show config
partition with gracetime
Comment 14 Emanuele Breuza 2020-04-21 07:09:32 MDT
Created attachment 13896 [details]
config_postMod

scontrol show config
partition without gracetime
Comment 15 Emanuele Breuza 2020-04-21 07:16:05 MDT
Created attachment 13898 [details]
job_low_5358

scontrol show job 5358
job running in partition with lower priority
Comment 16 Emanuele Breuza 2020-04-21 07:18:54 MDT
Created attachment 13899 [details]
job_high_5359

scontrol show job 5359
job running in partition with high priority
preemption succeeded
Comment 17 Emanuele Breuza 2020-04-21 07:22:32 MDT
Created attachment 13900 [details]
job_high_5360

scontrol show job 5360
job pending in partition with high priority
preemption failed
Comment 18 Emanuele Breuza 2020-04-21 07:28:30 MDT
Created attachment 13901 [details]
job_low_5361

scontrol show job 5361
job running in partition with low priority
preemption succeeded
Comment 19 Emanuele Breuza 2020-04-21 07:30:04 MDT
Created attachment 13902 [details]
job_low_5361_F

scontrol show job 5361
job running in partition with low priority
preemption failed
Comment 20 Emanuele Breuza 2020-04-21 07:31:25 MDT
Created attachment 13903 [details]
job_high_5362

scontrol show job 5362
job running in partition with high priority
preemption succeeded
Comment 21 Emanuele Breuza 2020-04-21 07:32:33 MDT
Created attachment 13904 [details]
job_high_5363

scontrol show job 5363
job pending in partition with high priority
preemption failed
Comment 22 Emanuele Breuza 2020-04-21 08:09:36 MDT
Hello Ben,
I finally collected a few more log files. I updated the slurmctld log using a higher verbosity (debug3). The testing took place between 14:00 and 14:30.

Below are the two tests I performed today.

Steps of test 1

- collected config_preMod is the output of "scontrol show config".
- submit job_low_5358 --> job suspended in partition with lower priority
- submit job_high_5359 --> job running in partition with higher priority, it indeed preempted 5358!!!
- cancel 5359, therefore 5358 was resumed
- submit job_high_5360 --> job pending in partition with higher priority, it was unable to preempt 5358!!! I checked again a few minutes later and it was still pending...

Steps of test 2

- collected config_postMod, a new output of "scontrol show config" after I deleted the GraceTime keyword from the partition definition.
- submit job_low_5361 --> job suspended in partition with lower priority
- submit job_high_5362 --> job running in partition with higher priority, it indeed preempted 5361!!!
- submit job_high_5363 --> job pending in partition with higher priority, it was unable to preempt 5361!!! During this state, I collected job_low_5361_F after PreemptTime had already passed

In the end, I collected node status using "scontrol show node wnode43" and "scontrol show node wnode44".

Currently, I'm using "SelectType=select/cons_tres". Should I execute some tests with "SelectType=select/cons_res"?

Hopefully, my comment is not as confused as I am...
Thank you for your help,
Emanuele
Comment 23 Ben Roberts 2020-04-21 11:40:45 MDT
Hi Emanuele,

Thank you for collecting this information for me.  I've been going over it this morning and I can see something that looks like a lead.  After being unable to find a set of parameters that would allow a job to be preempted once and not a second time I was looking more closely at differences between the jobs.  I saw that the second high priority job in both cases has a memory requirement that is almost equal to the full amount of memory available on the two nodes.  

To verify that this memory requirement is preventing this job from preempting I would like a few more bits of information from you.  Can I have you send your cgroup.conf file to me?  Can you verify that if you submit a job that doesn't specify memory a second time that it's able to preempt a job as well?  If you lower the memory requirements of the second type of job is it able to preempt a job to start?  If you're using job scripts for these two types of jobs is that script something you're able to share (at least the #SBATCH arguments at the beginning of the script)?  

Thanks,
Ben
Comment 24 Emanuele Breuza 2020-04-21 12:08:17 MDT
Created attachment 13907 [details]
cgroup_conf

to be confirmed
Comment 25 Emanuele Breuza 2020-04-21 12:33:28 MDT
Hello Ben,
unfortunately I cannot perform any tests until tomorrow.

- Can I have you send your cgroup.conf file to me?

I uploaded the cgroup.conf that I used during the Slurm configuration. Tomorrow, I'll check if the customer modified it.

- Can you verify that if you submit a job that doesn't specify memory a second time that it's able to preempt a job as well?

I will run some tests tomorrow. This being said, I'm using ConstrainRAMSpace=yes in cgroup.conf. Should I turn it off?

-  If you lower the memory requirements of the second type of job is it able to preempt a job to start?

I did perform this test and the second job was unable to preempt. Tomorrow, I'll try once again.

- If you're using job scripts for these two types of jobs is that script something you're able to share (at least the #SBATCH arguments at the beginning of the script)?

I'll ask my customer to extract the info you requested.

Thanks for your help,
Emanuele
Comment 26 Emanuele Breuza 2020-04-21 13:49:00 MDT
Comment on attachment 13907 [details]
cgroup_conf

file content confirmed
Comment 27 Emanuele Breuza 2020-04-21 14:18:06 MDT
A little update with respect to comment #25

> - Can I have you send your cgroup.conf file to me?
> 
> I loaded the cgroup.conf that I used during Slurm configuration. Tomorrow,
> I'll check if the customer modified it.

I got the file from customer and it is identical to what I already uploaded.
 
> - If you're using job scripts for these two types of jobs is that script
> something you're able to share (at least the #SBATCH arguments at the
> beginning of the script)?
> 
> I'll ask my customer to extract the info you requested.
 
As far as I know, the only options passed to sbatch are the following:
--nodes=2
--ntasks-per-node=32
--mem=360G
--partition=tread_n or tread_h

Unfortunately, I did not get the submission script. I will check it tomorrow.

See you soon,
Emanuele
Comment 28 Ben Roberts 2020-04-21 16:13:03 MDT
With those settings I ran some more tests and I was able to reproduce the behavior you're seeing.  There was one major difference though: I had to set my SelectTypeParameters to CR_CPU_Memory instead of CR_CPU, as you have it.  It's interesting that the resource allocation is different in my testing.  I don't know that you would need to disable the 'ConstrainRAMSpace' setting, but changing 'MaxRAMPercent' to '100' rather than '102.5' would be worth trying.  After changing this parameter you would need to run 'scontrol reconfigure' to pick up the change.  
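
For reference, the suggested cgroup.conf change might look like the following sketch (only the two memory-related lines mentioned in this ticket are shown; the rest of your file stays as it is):

```
# cgroup.conf (relevant lines only)
ConstrainRAMSpace=yes
# MaxRAMPercent=102.5      <-- previous value
MaxRAMPercent=100
```

After editing the file, run 'scontrol reconfigure' so slurmctld picks up the change.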

When you are able to run the tests I mentioned before, it should let us know if this is indeed the issue.  When you run them, since this is looking like a resource limitation, it would be good if you can also enable the 'SelectType' debug flag for the duration of the test.  I would like to have you follow this procedure for the test:
1.  Enable debug logging:
    scontrol setdebug debug2
    scontrol setdebugflags +SelectType
2.  Submit a job that can preempt with no memory requirements, verify that it does preempt jobs and runs.  Cancel this job.
3.  Submit a similar job with memory requirements.  I assume this won't run, cancel this job.
4.  Submit a job that can preempt with no memory requirements (like the job from step 2).  Verify that it can still preempt jobs and run.  Cancel this job.
5.  Submit a job that can preempt with lower memory requirements.  This will be similar to the job from step 3, but request '--mem=180G'.  See if this is able to preempt jobs. Cancel this job.
6.  Disable debug logging.
    scontrol setdebug info
    scontrol setdebugflags -SelectType
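
The procedure above could be wrapped in a short script along these lines (a sketch only: the partition name 'high', the --nodes/--ntasks-per-node values, and the 'job_script' name are taken from the earlier examples in this ticket and may need adjusting for your cluster; cancel each job with 'scancel <jobid>' between steps as described):

```
#!/bin/bash
# 1. Raise verbosity and enable the SelectType debug flag
scontrol setdebug debug2
scontrol setdebugflags +SelectType

# 2. Preemptor job with no memory request (should preempt and run)
sbatch --partition=high --nodes=2 --ntasks-per-node=64 job_script

# 3. Preemptor job with the full memory request (expected to stay pending)
sbatch --partition=high --mem=360G --nodes=2 --ntasks-per-node=64 job_script

# 4. No memory request again, to confirm preemption still works
sbatch --partition=high --nodes=2 --ntasks-per-node=64 job_script

# 5. Preemptor job with a lower memory request
sbatch --partition=high --mem=180G --nodes=2 --ntasks-per-node=64 job_script

# 6. Restore normal logging
scontrol setdebug info
scontrol setdebugflags -SelectType
```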

If possible could you run the test above with the configuration the way it is currently and again with the 'MaxRAMPercent' changed to 100?  Let me know how this goes and I'll keep looking into reproducing the behavior you're seeing with the same settings.  

Thanks,
Ben
Comment 29 Emanuele Breuza 2020-04-22 05:09:19 MDT
Created attachment 13920 [details]
Slurmctld log during failing tests

Slurmctld log file collected during tests aiming to replicate preemption issue.
Comment 30 Emanuele Breuza 2020-04-22 05:51:30 MDT
Hello Ben,
I executed some tests and uploaded the slurmctld log file (i.e. "Slurmctld log during failing tests"). The job I used was a simple sleep and I submitted it as user belluda001. The sbatch command was:

sbatch --nodelist=gnode01,gnode02 --ntasks-per-node=36 --mem=720G --partition=nvh_n or nvh_h my_sleep_job_.sh (I don't recall the exact name; it didn't have any #SBATCH directives).

I was the only user submitting jobs to the nvh_n and nvh_h partitions.

I did my best to follow the scheme you requested, i.e.
submit job low, submit job high (preemption OK), cancel job high (automatic resume of job low), submit job high again (preemption FAILED), cancel job high, submit job high again with a different --mem.

After the changes in slurm.conf and cgroup.conf, I restarted the slurmctld and slurmd daemons. You should be able to follow my test sequence starting from "slurmctld version 19.05.5 started on cluster montestella".

Here are the results I obtained:

- Test #1: lower --mem in second high priority job
Time: ~9:48 (I'm sorry, I forgot to execute a date before this test)
Using 1 compute node (gnode01), preemption worked correctly. This means that all high priority jobs were able to suspend the same low priority one. I tested many combinations of --mem value (0, 180G and 720G for both low and high priority) and preemption always worked.
Using 2 nodes (gnode01,gnode02), preemption didn't work for any combinations of --mem (0, 180G and 720G)

- Test #2: change MaxRAMPercent=100 in cgroup.conf
Time: 10:35:25
Using 1 node (gnode01), everything went OK.
Using 2 nodes (gnode01,gnode02), preemption of the low priority job was broken for the second high priority job.

- Test #3: comment all memory related keyword in cgroup.conf
Time: 10:57:03
Same results as Test #2.

- Test #4: removal of "preempt_youngest_first" from SchedulerParameters in slurm.conf
Time: 10:57:03
Same results as Test #2.

- Test #5: removal of "preempt_reorder_count=3" from SchedulerParameters in slurm.conf
Time: 11:27:11
Same results as Test #2.

- Test #6: removal of "salloc_wait_nodes,sbatch_wait_nodes" from SchedulerParameters in slurm.conf
Time: 11:41:24
Same results as test #2.

- Test #7: using "cons_res" for SelectType in slurm.conf (NOTE: I had to remove Gres, otherwise slurmctld didn't start)
Time: 11:59:13
Same results as test #2.

At 12:11:37 I finally restored slurm.conf. Let me know if you need more tests.

Thank you for your help,
Emanuele
Comment 32 Ben Roberts 2020-04-22 12:41:12 MDT
Hi Emanuele,

I wanted to send a quick update.  I've been going over the information you sent this morning and I'm working on tracking down what's happening from the logs.  It looks like the first time you submit a job in the nvh_h partition it does a job test to see if and where the job can be placed.  This happens correctly for the first job, but for the second job it does an initial test and recognizes that it needs two nodes; then it does a second test and the request is somehow getting changed to 1 node.  I'm trying to track down how that is happening and I still haven't been able to reproduce the behavior.  I'll keep looking into it, but I wanted to let you know what I've been seeing.

Thanks,
Ben
Comment 33 Ben Roberts 2020-04-23 16:07:16 MDT
Hi Emanuele,

Thanks for your patience while I tracked this down.  I was able to narrow the problem down to the 'select' parameter for 'MCSParameters'.  If this is set then any node after the first that was allocated to the preempted job doesn't seem to be freed properly when the job completes.  When a new preemptor job is submitted the scheduler does an initial test evaluation and sees all the nodes in the partition as being available.  Then it does an actual evaluation and any nodes beyond the first that were in the previous job don't get fully cleared and don't show as being available until the job that was preempted once completes.  

To work around this you should be able to change the following line in your slurm.conf:
MCSParameters=enforced,select,privatedata

The other available options to replace the 'select' parameter are 'noselect' or 'ondemandselect'.  To be clear, this is just to work around the problem until we can find a solution.  Please let me know if you have questions about this or if this workaround doesn't work for you.
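As a concrete illustration, the workaround amounts to a one-line edit in slurm.conf (using 'ondemandselect' as the replacement here; 'noselect' would be applied the same way):

```
# Before (triggers the stale-node behavior on preempted multi-node jobs):
# MCSParameters=enforced,select,privatedata

# Workaround until a fix is released:
MCSParameters=enforced,ondemandselect,privatedata
```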

Thanks,
Ben
Comment 34 Emanuele Breuza 2020-04-24 02:29:17 MDT
Hello Ben,
your workaround did the trick!!!! I set MCSParameters=enforced,ondemandselect,privatedata in slurm.conf and now preemption is working correctly on my customer's cluster!!!
Thank you very much for your help!!!

Regarding MCS and PrivateData, I'm still a little bit confused about how they work. Could you please send me something to read? Maybe there are some amazing posts in the Slurm forum that I missed...

Thank you again and have a nice day,
Emanuele
Comment 35 Ben Roberts 2020-04-27 09:12:11 MDT
Hi Emanuele,

I'm glad the workaround I provided worked for you too.  We do have a couple places you can read more about the MCS Plugin.  Here is our documentation on it, which you  may have seen already:
https://slurm.schedmd.com/mcs.html

We also have some slides about the feature from the Slurm Users Group in 2016:
https://slurm.schedmd.com/SLUG16/MCS.pdf

I'm still working on how to address the bug.  Since we have a workaround for the issue I'm going to lower the severity of the issue to 3.  I'll keep you updated as I make further progress.

Thanks,
Ben
Comment 36 Emanuele Breuza 2020-04-28 00:25:30 MDT
Hello Ben,
thanks for your update and for sharing those slides on MCS.

Let me know if you need anything else from me to fix this bug.

Have a nice day,
Emanuele
Comment 42 Ben Roberts 2020-05-19 12:12:28 MDT
Hi Emanuele,

Just to provide an update, I have a potential solution put together that will be reviewed internally before proceeding further.  Let me know if you have any questions while waiting for this process to move forward.

Thanks,
Ben
Comment 43 Emanuele Breuza 2020-05-19 12:21:28 MDT
Hello Ben,
thank you very much for the update. As far as I know, my customer has not started using the MCS features in Slurm yet. I will contact you as soon as possible if this issue acquires a higher impact on cluster production.

Have a nice day,
Emanuele
Comment 47 Ben Roberts 2020-07-13 09:19:00 MDT
Hi Emanuele,

I wanted to send you an update that a fix for this issue has been checked in for 20.02.4 and later versions.  You can see the commit here:

https://github.com/SchedMD/slurm/commit/6d0b9f5eb0322c84854ee42e61b19697b3567400

I'll go ahead and close this ticket, but please let me know if you have any questions.

Thanks,
Ben