Ticket 9153 - Jobs infinitely pending with Resources reason
Summary: Jobs infinitely pending with Resources reason
Status: RESOLVED DUPLICATE of ticket 9724
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 19.05.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Duplicates: 9889
Depends on:
Blocks:
 
Reported: 2020-06-03 05:00 MDT by Marc Caubet Serrabou
Modified: 2020-11-13 12:41 MST

See Also:
Site: Paul Scherrer
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld.log file for a specific job (3.13 KB, text/plain)
2020-06-03 05:00 MDT, Marc Caubet Serrabou
Details
slurm.conf file (6.63 KB, text/plain)
2020-06-03 05:02 MDT, Marc Caubet Serrabou
Details
scontrol_show_job output (1.44 KB, text/plain)
2020-06-04 01:36 MDT, Marc Caubet Serrabou
Details
work in progress patch (1.50 KB, patch)
2020-11-10 15:33 MST, Marshall Garey
Details | Diff

Description Marc Caubet Serrabou 2020-06-03 05:00:47 MDT
Created attachment 14493 [details]
slurmctld.log file for a specific job

The following Hybrid jobs:

[root@merlin-slurmctld01 ~]# cat slurm-134882855.sh
#!/bin/bash -l                                                               
#SBATCH --clusters=merlin6
#SBATCH --job-name=V1000-full_core_3D_fine.inp.id     
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=22000
#SBATCH --cpus-per-task=8
#SBATCH --partition=daily
#SBATCH --time=23:59:00
#SBATCH --output=srun_%j.out  
#SBATCH --error=srun_%j.err  
#SBATCH --hint=nomultithread
module purge
module use unstable
module load gcc/7.5.0 openmpi/3.1.6_slurm intel/18.4
# Code Execution
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ...

can never run, and Slurm does not complain about it. Attached are the slurm.conf (node/partition config) and the slurmctld log file for a specific job of this nature.

Nodes have DefMemPerNode=352000, DefMemPerCPU=4000 and hyperthreading enabled (2 sockets with 22 cores each, so 44 cores in total, hence 88 CPUs). No exclusive node usage by default.

The above script is for a Hybrid Job (OpenMP/OpenMPI) running 2 tasks, using 8 cores per task (nomultithread), requiring 176000MB per task. Slurm tries to fit it into a single node, but it will stay forever in PD state (Reason:Resources). When specifying 2 nodes (--ntasks-per-node=2) then it works.

I would like to understand why it cannot fit into a single node (I guess this is because --mem-per-cpu also applies to the CPUs 'disabled' by the 'nomultithread' option). If this is expected, shouldn't Slurm give an error or try to spread the job across 2 different nodes?
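As a quick sanity check, the request in this script already adds up to exactly one node's memory before any hyperthreading effects (a back-of-the-envelope shell sketch; the variable names are illustrative, the numbers come from the script and slurm.conf above):

```shell
#!/bin/bash
# Back-of-the-envelope memory arithmetic (illustrative names, ticket numbers).
ntasks=2            # --ntasks
cpus_per_task=8     # --cpus-per-task
mem_per_cpu=22000   # MB, --mem-per-cpu
node_mem=352000     # MB, DefMemPerNode

requested=$(( ntasks * cpus_per_task * mem_per_cpu ))
echo "job requests ${requested} MB; a node offers ${node_mem} MB"
# 352000 MB requested vs 352000 MB available: the job asks for exactly one
# whole node, so any extra per-CPU accounting pushes it past the limit.
```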

Thanks a lot for your help,
Marc
Comment 1 Marc Caubet Serrabou 2020-06-03 05:02:48 MDT
Created attachment 14494 [details]
slurm.conf file
Comment 2 Marshall Garey 2020-06-03 10:48:05 MDT
Can you upload the output of

scontrol -d show job <jobid>

for that job?
Comment 3 Marc Caubet Serrabou 2020-06-04 01:36:06 MDT
Created attachment 14507 [details]
scontrol_show_job output

I attach the output for another example of an identical job, which stays infinitely queued (PD with reason Resources).
Comment 10 Marshall Garey 2020-06-15 14:43:27 MDT
I reproduced this with your config last week, but today I finally reproduced it with my config. It has something to do with requesting --hint=nomultithread and requesting a number of CPUs less than or equal to the number of cores on the node. If I request a number of CPUs greater than the number of cores on the node but still less than the total number of CPUs (including hardware threads) on the node, Slurm spreads the job across two nodes and runs it. Or, if I don't request --hint=nomultithread, Slurm can run the job on one node.


I'm looking at logs with extra debugging and debug flags turned on so I can see exactly what's going on.
Comment 11 Marc Caubet Serrabou 2020-06-16 01:17:03 MDT
Hi,

thanks for taking care of it. Just one hint, see the example below (same example as before, but only changing the --mem-per-cpu setting):

(base) [caubet_m@merlin-l-001 get_cpu]$ sinfo -p test
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test         up    1:00:00      5   idle merlin-c-[023-024,123-124,223]

(base) [caubet_m@merlin-l-001 get_cpu]$ sbatch bug9153_11GB.batch
Submitted batch job 134943172 on cluster merlin6
(base) [caubet_m@merlin-l-001 get_cpu]$ sbatch bug9153_12GB.batch
Submitted batch job 134943173 on cluster merlin6
(base) [caubet_m@merlin-l-001 get_cpu]$ sbatch bug9153_16GB.batch
Submitted batch job 134943174 on cluster merlin6
(base) [caubet_m@merlin-l-001 get_cpu]$ sbatch bug9153_20GB.batch
Submitted batch job 134943175 on cluster merlin6

(base) [caubet_m@merlin-l-001 get_cpu]$ squeue -u caubet_m -a
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         134943173      test  bug9153 caubet_m PD       0:00      1 (Resources)
         134943174      test  bug9153 caubet_m PD       0:00      1 (Resources)
         134943175      test  bug9153 caubet_m PD       0:00      1 (Resources)

(base) [caubet_m@merlin-l-001 get_cpu]$ sinfo -p test
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test         up    1:00:00      5   idle merlin-c-[023-024,123-124,223]

Only the first job runs; the others get stuck in the queue waiting for "Resources". I observe this problem only when more than half of DefMemPerNode is requested. Note that
  --ntasks=2 * --cpus-per-task=8 * --mem-per-cpu=11000MB = 176000MB
which is half of DefMemPerNode. Any value below that works, but as soon as one sets a higher value the job cannot be allocated. We disable hyperthreading for these pending jobs (with hyperthreading there is no problem at all); for some reason Slurm cannot manage a job requesting the whole memory of a node and can only use half of it, which is not correct (as far as I understand, one should be able to run without multithreading while using the whole memory of the node).
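The observed cutoff matches a doubling effect exactly (a shell sketch, assuming --hint=nomultithread charges each requested CPU as a full core, i.e. 2 hardware threads on these nodes; all numbers are the ones from this ticket):

```shell
#!/bin/bash
# Sketch of the observed threshold under the doubling assumption: with
# --hint=nomultithread, each requested CPU is charged as a whole core
# (2 hardware threads here), so the memory charge is doubled too.
node_mem=352000       # MB, DefMemPerNode
ntasks=2
cpus_per_task=8
threads_per_core=2

for mem_per_cpu in 11000 12000 16000 20000; do
    charged=$(( ntasks * cpus_per_task * threads_per_core * mem_per_cpu ))
    if [ "$charged" -le "$node_mem" ]; then
        echo "--mem-per-cpu=${mem_per_cpu}: ${charged} MB, fits on one node"
    else
        echo "--mem-per-cpu=${mem_per_cpu}: ${charged} MB, pends (Resources)"
    fi
done
```

Under that assumption only the 11000 MB job fits (352000 MB, exactly the node), which matches the four sbatch submissions above.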

Cheers,
Marc
Comment 12 Marc Caubet Serrabou 2020-07-06 10:08:02 MDT
Hi,

any updates about this? Today I upgraded to 19.05.7 and we still see the same problem.

Cheers,
Marc
Comment 13 Marshall Garey 2020-07-06 10:28:28 MDT
I haven't been able to fix this yet, but any fix won't go in 19.05 regardless. For a few months now only major fixes have gone in 19.05, and all the bug fixes are going into 20.02.
Comment 14 Marc Caubet Serrabou 2020-07-06 10:44:58 MDT
Hi,

thanks for your answer. Has the issue been located, then?

A fix only for 20.02 is fine for us: I understand that back-porting code is sometimes a pain and undesirable (otherwise it is difficult to move forward and maintain such an amount of code). In any case, I am planning to upgrade Slurm within the next 2 months.

Thanks a lot for your help and best regards,
Marc
Comment 16 Marshall Garey 2020-08-13 18:01:07 MDT
Just letting you know I'm still working on it and haven't quite identified the underlying issue, though I've made some progress.
Comment 19 Marshall Garey 2020-08-24 18:39:26 MDT
I've determined what is happening and why, but don't have a fix yet. The affected code is indeed in the select plugin, as I thought, and those plugins are among the most complicated parts of Slurm, so a fix might not even go into 20.02. We'll have to see what needs to be done and will decide at that time which versions of Slurm to patch.

To answer your original question about why the job wouldn't run:

--hint=nomultithread means that each CPU requested by the job will be allocated an entire core. On these nodes (2 threads per core) that means the number of CPUs allocated to the job will be double the number of CPUs requested. In the case of your example job:

Without --hint=nomultithread:

#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=22000
#SBATCH --cpus-per-task=8

2 tasks * 22000 MB per CPU * 8 CPUs per task = 352000 MB total

But with --hint=nomultithread, the number of CPUs allocated to the job is doubled (because of 2 threads per core), so the amount of memory that would be allocated is doubled to 704000 MB. The node doesn't have that much memory, so the job isn't scheduled.

So why doesn't Slurm reject this job? When the select plugin determines whether a job can run on one node, it doesn't properly consider --hint=nomultithread, so Slurm thinks the job can run on one node even though it can't. Slurm allocates the correct number of CPUs (correctly handling --hint=nomultithread) - in this job's case, 16 CPUs. Later, Slurm checks whether the node has enough memory for the job; there it takes --mem-per-cpu, multiplies by the number of CPUs (16), finds it's too much memory, and doesn't schedule the job. But it doesn't reject the job, either.

So this is a bug with the interaction between --hint=nomultithread and --mem-per-cpu. It has nothing to do with DefMemPerNode, MemSpecLimit, or anything else. You see that if you request 2 nodes, then Slurm can fit the job on two nodes and schedules the job. You could also replace --mem-per-cpu with --mem and the job will run.
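For reference, the two workarounds described above might look like this in the original batch script (a sketch; each block of directives is an alternative, and set (b), shown commented out with ##, is an assumed way of spelling "request 2 nodes"):

```shell
#!/bin/bash -l
# Sketch of the two workarounds named in this ticket. Pick one set of
# directives, not both.

# (a) Request memory per node with --mem instead of --mem-per-cpu
#     (2 tasks * 8 CPUs/task * 22000 MB/CPU = 352000 MB total):
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
#SBATCH --mem=352000
#SBATCH --hint=nomultithread

# (b) Spread the job across two nodes so each node needs only half
#     (assumed directives; inert here because of the ## prefix):
##SBATCH --nodes=2
##SBATCH --ntasks-per-node=1
##SBATCH --cpus-per-task=8
##SBATCH --mem-per-cpu=22000
##SBATCH --hint=nomultithread
```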


I did find a way to make Slurm properly consider --hint=nomultithread when checking if a job could fit on one node, but if I do that in that particular place Slurm rejects the job. I believe the correct behavior is for Slurm to realize the job can't fit on one node and to span it to two nodes (or however many nodes you need). I haven't figured out a way to make that happen yet.

Do you have any questions about what I've explained?
Comment 20 Marc Caubet Serrabou 2020-08-25 00:35:49 MDT
Hi,

thanks a lot for your detailed answer and for taking care of it; the problem looks non-trivial. I suspected that with nomultithread the CPU numbers were doubled, thanks for confirming it. Then, as you explain, the memory is also doubled and does not fit into a node. It makes sense, and the exact reason is now clear to me.

About a possible solution, I agree that the ideal one is to detect the memory needed by the job and spread the job across multiple nodes. However, returning an error that explains the problem would be at least necessary and probably sufficient. Alternatively, a clear message while the job hangs in the queue would also be useful. It is important to avoid jobs hanging forever in the queue without a clear reason; otherwise it is confusing and hard to diagnose. Any alternative to the current situation is welcome.

Finally, as said, a fix for version v20 is ok for us. We still run v19, but updating would not be a problem as this is scheduled for this year.

Thanks a lot for your help,
Marc
Comment 22 Marshall Garey 2020-09-29 08:29:18 MDT
*** Ticket 9889 has been marked as a duplicate of this ticket. ***
Comment 24 Marshall Garey 2020-11-09 15:49:15 MST
Hi Marc,

I found that bug 9724 is a duplicate of this one (--mem-per-cpu causing problems with threads_per_core greater than 1). I tested a patch for that bug and it fixes the problem. It's still in the middle of our QA process, so it has a chance to change.

Would you like to test this patch? If so, I can upload it here. Otherwise, I'll mark this bug as a duplicate of bug 9724.
Comment 25 Marc Caubet Serrabou 2020-11-10 00:42:26 MST
Hi Marshall,


thanks for the update, that's good news! Yes please, I am interested in testing it. In fact I have to compile Slurm v20 for my test instance, so I would apply it there.

From my understanding, Slurm 20.11 will come with improvements in this regard, is that correct? Will this patch be included there? I want to upgrade our main instance to Slurm v20, so I would wait until then if these problems are fixed in that version.


Thanks a lot,

Marc

Comment 26 Marshall Garey 2020-11-10 15:33:30 MST
Created attachment 16591 [details]
work in progress patch

(In reply to Marc Caubet Serrabou from comment #25)
> Hi Marshall,
> 
> 
> thanks for the update, that's good news! Yes please, I am interested in
> testing it. In fact I have to compile Slurm v20 for my test instance, so I
> would apply it there.

I've uploaded it here. Keep in mind this is a work in progress. From my testing, it fixes the problem you reported in this bug, at least for select/cons_res. There's been some internal discussion about this not working quite right all the time for machines with more than 2 hyperthreads per core (such as machines with 4 hyperthreads per core), but you shouldn't have to worry about that. The final version may look a bit different or have additional patches on top of this one.

Although the patch file says "2011" in its name, this patch applies cleanly to Slurm 20.02. Let me know if you have a problem compiling or testing with this patch.


> From my understanding, Slurm 20.11 will come with improvements in this
> regard, is that correct? Will this patch be included there? I want to
> upgrade our main instance to Slurm v20, so I would wait until then if these
> problems are fixed in that version.

The fixes are likely to be in 20.11, and at least a partial fix is possible in 20.02, but I'm not sure. 20.11.0 will be released this month. But these fixes haven't been checked in yet, and I can't say whether or not they will be before 20.11.0 is released.


Tangent about upgrading:

Keep in mind that Slurm 20.02 and 20.11 are completely different major versions. (We version similarly to Ubuntu: Year.Month of release.) The third number indicates the micro release; micro releases contain only bug fixes. Different major versions usually bring big feature changes. Definitely look at the NEWS and RELEASE_NOTES files before upgrading.

You could upgrade to the latest stable release (20.02.6) or the newest release (20.11.0). This may be obvious, but I'll say it anyway:

20.02.6 has been tested a lot more because it's been out for a while. 20.11.0 will have new features and some bug fixes that we thought were too risky to put into 20.02, but since it's new it's not as heavily tested. In other words, 20.11 is more likely to have unknown bugs, but it will contain more bug fixes; 20.02.6 has bugs, but we know more about them. We're less likely to make big changes to 20.02, so it may not get certain bug fixes, but the fixes it does get are less likely to be breaking.
Comment 27 Marshall Garey 2020-11-13 12:41:51 MST
The patches have been committed ahead of the 20.11 release. I'm marking this as a duplicate of bug 9724. Let me know if you have any further issues.

See commits 49a7d7f9fb and 62546cb0b1de

*** This ticket has been marked as a duplicate of ticket 9724 ***