Created attachment 14493 [details]
slurmctld.log file for a specific job

The following hybrid job:

[root@merlin-slurmctld01 ~]# cat slurm-134882855.sh
#!/bin/bash -l
#SBATCH --clusters=merlin6
#SBATCH --job-name=V1000-full_core_3D_fine.inp.id
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=22000
#SBATCH --cpus-per-task=8
#SBATCH --partition=daily
#SBATCH --time=23:59:00
#SBATCH --output=srun_%j.out
#SBATCH --error=srun_%j.err
#SBATCH --hint=nomultithread

module purge
module use unstable
module load gcc/7.5.0 openmpi/3.1.6_slurm intel/18.4

# Code Execution
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ...

can never run, and Slurm does not complain about it. Attached are slurm.conf with the node/partition configuration and the slurmctld log file for a specific job of this nature.

Nodes have DefMemPerNode=352000, DefMemPerCPU=4000, and hyperthreading enabled (2 sockets with 22 cores each, so 44 cores in total, hence 88 CPUs). There is no exclusive node usage by default.

The above script is for a hybrid job (OpenMP/OpenMPI) running 2 tasks, using 8 cores per task (nomultithread) and requiring 176000 MB per task. Slurm tries to fit it onto a single node, but it stays forever in PD state (Reason: Resources). When specifying 2 nodes (--ntasks-per-node=2), it works.

I would like to understand why it cannot fit onto a single node (I guess this is because --mem-per-cpu also applies to the CPUs 'disabled' by the 'nomultithread' option). If this is expected, shouldn't Slurm give an error or try to spread the job across 2 different nodes?

Thanks a lot for your help,
Marc
Created attachment 14494 [details] slurm.conf file
Can you upload the output of scontrol -d show job <jobid> for that job?
Created attachment 14507 [details]
scontrol_show_job output

I attach the output for another example of an identical job, which stays infinitely queued (PD with reason Resources).
I reproduced this with your config last week, but today finally reproduced it with my config. It has something to do with requesting --hint=nomultithread and requesting a number of CPUs less than or equal to the number of cores on the node. If I request a number of CPUs greater than the number of cores on the node but still less than the total number of CPUs (including hardware threads) on the node, Slurm spreads the job across two nodes and runs it. Or, if I don't request --hint=nomultithread, Slurm can run the job on one node. I'm looking at logs with extra debugging and debug flags turned on so I can see exactly what's going on.
Hi, thanks for taking care of it. Just one hint, see the example below (the same example as before, only changing the --mem-per-cpu setting):

(base) [caubet_m@merlin-l-001 get_cpu]$ sinfo -p test
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
test      up    1:00:00   5     idle  merlin-c-[023-024,123-124,223]
(base) [caubet_m@merlin-l-001 get_cpu]$ sbatch bug9153_11GB.batch
Submitted batch job 134943172 on cluster merlin6
(base) [caubet_m@merlin-l-001 get_cpu]$ sbatch bug9153_12GB.batch
Submitted batch job 134943173 on cluster merlin6
(base) [caubet_m@merlin-l-001 get_cpu]$ sbatch bug9153_16GB.batch
Submitted batch job 134943174 on cluster merlin6
(base) [caubet_m@merlin-l-001 get_cpu]$ sbatch bug9153_20GB.batch
Submitted batch job 134943175 on cluster merlin6
(base) [caubet_m@merlin-l-001 get_cpu]$ squeue -u caubet_m -a
JOBID     PARTITION NAME    USER     ST TIME NODES NODELIST(REASON)
134943173 test      bug9153 caubet_m PD 0:00 1     (Resources)
134943174 test      bug9153 caubet_m PD 0:00 1     (Resources)
134943175 test      bug9153 caubet_m PD 0:00 1     (Resources)
(base) [caubet_m@merlin-l-001 get_cpu]$ sinfo -p test
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
test      up    1:00:00   5     idle  merlin-c-[023-024,123-124,223]

Only the first job runs; the others stay stuck in the queue waiting for "Resources". I observe this problem only when more than half of DefMemPerNode is requested. Observe that --mem-per-cpu=11000 MB * --cpus-per-task=8 * --ntasks=2 = 176000 MB, which is exactly half of DefMemPerNode. Any value below that works, and as soon as one sets a higher value, the job cannot be allocated.

We disable hyperthreading for these pending jobs (when using hyperthreading there is no problem at all); for some reason Slurm is not capable of managing a job requesting the whole memory and can only use half of it, which is not correct (as far as I understand, one should be able to run 1 core without multithreading while using the whole memory of the node).

Cheers,
Marc
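For reference, the cutoff described in this comment can be checked with a few lines of shell arithmetic (a sketch using only the numbers reported here: DefMemPerNode=352000, 2 tasks, 8 CPUs per task; the "half of DefMemPerNode" boundary is the observed behavior, not documented Slurm semantics):

```shell
# Total memory requested per job for each tested --mem-per-cpu value,
# compared against half of DefMemPerNode (the observed cutoff).
def_mem_per_node=352000   # MB, from slurm.conf in this report
ntasks=2
cpus_per_task=8

for mem_per_cpu in 11000 12000 16000 20000; do
    requested=$((mem_per_cpu * cpus_per_task * ntasks))
    if [ "$requested" -le $((def_mem_per_node / 2)) ]; then
        echo "--mem-per-cpu=${mem_per_cpu}: ${requested} MB -> runs"
    else
        echo "--mem-per-cpu=${mem_per_cpu}: ${requested} MB -> pends (Resources)"
    fi
done
```

Only the 11000 MB case lands at or below 176000 MB, matching the transcript above where only job 134943172 left the queue.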
Hi, any updates about this? Today I have upgraded to 19.05.7 and we still see the same problem. Cheers, Marc
I haven't been able to fix this yet, but any fix won't go into 19.05 regardless. For a few months now only major fixes have gone into 19.05, and all the bug fixes are going into 20.02.
Hi, thanks for your answer. Has the issue been located, then? Applying a fix only to 20.02 is fine for us: I understand that back-porting code is sometimes a pain and undesirable (otherwise it is difficult to move forward and to maintain such an amount of code). In any case, I am planning to upgrade Slurm within the next 2 months. Thanks a lot for your help and best regards, Marc
Just letting you know I'm still working on it and haven't quite identified the underlying issue, though I've made some progress.
I've determined what is happening and why, but haven't gotten a fix yet. The relevant code is indeed in the select plugin, as I thought, and those plugins are some of the most complicated parts of Slurm, so a fix might not even go into 20.02. We'll have to see what needs to be done and will determine at that time which versions of Slurm to patch.

To answer your original question about why the job wouldn't run: --hint=nomultithread means that each CPU requested by the job will be allocated an entire core. This means that the number of CPUs allocated to the job will be double the number of CPUs requested.

In the case of your example job, without --hint=nomultithread:

#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=22000
#SBATCH --cpus-per-task=8

2 tasks * 22000 MB per CPU * 8 CPUs per task = 352000 MB total

But with --hint=nomultithread, the number of CPUs allocated to the job is doubled (because of 2 threads per core), so the amount of memory that would be allocated is doubled to 704000 MB. The node doesn't have that much memory, so the job isn't scheduled.

So why doesn't Slurm reject this job? It's because when the select plugin determines whether a job can run on one node, it doesn't properly consider --hint=nomultithread, so Slurm thinks the job can run on one node even though it can't. Slurm allocates the correct number of CPUs (correctly handling --hint=nomultithread); in this job's case that is 16 cores, i.e. 32 CPUs with 2 threads per core. Then later Slurm checks whether the node has enough memory for the job, and there it takes --mem-per-cpu and multiplies it by the number of allocated CPUs (32). That is too much memory, so Slurm doesn't schedule the job. But it doesn't reject the job, either.

So this is a bug in the interaction between --hint=nomultithread and --mem-per-cpu. It has nothing to do with DefMemPerNode, MemSpecLimit, or anything else. You see that if you request 2 nodes, then Slurm can fit the job on two nodes and schedules the job.
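The arithmetic above can be condensed into a short sketch (my own illustration of the explanation, not Slurm source code; the node and job values are the ones from this report):

```shell
# With --hint=nomultithread on a 2-threads-per-core node, each requested CPU
# consumes a whole core, so the allocated CPU count doubles, and --mem-per-cpu
# is charged per *allocated* CPU.
threads_per_core=2
node_memory=352000                 # MB per node in this report
ntasks=2
cpus_per_task=8
mem_per_cpu=22000

requested_cpus=$((ntasks * cpus_per_task))              # 16
allocated_cpus=$((requested_cpus * threads_per_core))   # 32
charged_memory=$((allocated_cpus * mem_per_cpu))        # 704000 MB

echo "charged ${charged_memory} MB vs ${node_memory} MB on the node"
```

Since 704000 MB exceeds the node's 352000 MB, a single node can never satisfy the request, which is why the job pends forever with Reason=Resources.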
You could also replace --mem-per-cpu with --mem and the job will run. I did find a way to make Slurm properly consider --hint=nomultithread when checking whether a job can fit on one node, but if I do that in that particular place, Slurm rejects the job. I believe the correct behavior is for Slurm to realize the job can't fit on one node and to spread it across two nodes (or however many nodes are needed). I haven't figured out a way to make that happen yet. Do you have any questions about what I've explained?
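Based on the workarounds mentioned in this thread, the job header could look like this (a sketch, not a verified fix; the --mem value assumes the job still wants 352000 MB in total on one node, and whether that full amount is schedulable depends on the node's configured RealMemory):

```shell
#!/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
#SBATCH --hint=nomultithread
# Workaround: request memory per node instead of per CPU, so the charge is
# not multiplied by the doubled allocated-CPU count under nomultithread.
# Instead of:  #SBATCH --mem-per-cpu=22000
#SBATCH --mem=352000
```

The alternative mentioned earlier in the thread is to keep --mem-per-cpu and explicitly spread the tasks over more nodes with --ntasks-per-node.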
Hi, thanks a lot for your detailed answer and for taking care of this; the problem does not look trivial. I suspected that with nomultithread the CPU numbers were doubled, thanks for confirming it. Then, as you explain, the memory is also doubled and does not fit into a node. It makes sense, and the exact reason is now clear to me.

About a possible solution, I agree that the ideal one is to detect the memory needed for the job and spread the job across multiple nodes. However, returning an error that explains the problem would be at least necessary and probably sufficient. Alternatively, a clear message while the job is waiting in the queue would also be useful. It is important to avoid jobs hanging forever in the queue without a clear reason; otherwise it is confusing and hard to diagnose. Therefore, any alternative to the current situation is welcome.

Finally, as said, a fix for version 20 is fine for us. We still run v19, but updating is not a problem as it is scheduled for this year.

Thanks a lot for your help,
Marc
*** Ticket 9889 has been marked as a duplicate of this ticket. ***
Hi Marc, I found that bug 9724 is a duplicate of this one (--mem-per-cpu causing problems with threads_per_core greater than 1). I tested a patch for that bug and it fixes the problem. It's still in the middle of our QA process, so it has a chance to change. Would you like to test this patch? If so, I can upload it here. Otherwise, I'll mark this bug as a duplicate of bug 9724.
Hi Marshall, thanks for the update, that's good news! Yes please, I am interested in testing it. In fact I have to compile Slurm v20 for my test instance, so I would apply it there.

From my understanding, Slurm 20.11 will come with improvements regarding this, is that correct? Will this patch be included there? I want to upgrade our main instance to Slurm v20, so I would wait until then if these problems are fixed in that version.

Thanks a lot,
Marc
Created attachment 16591 [details]
work in progress patch

(In reply to Marc Caubet Serrabou from comment #25)
> thanks for the update, that's a good new! Yes please, I am interested in to
> test it. In fact I have to compile Slurm v20 for my test instance so I would
> apply it there.

I've uploaded it here. Keep in mind this is a work in progress. From my testing, it fixes the problem you reported in this bug, at least for select/cons_res. There's been some internal discussion about this not working quite right all the time for machines with more than 2 hyperthreads per core (such as machines with 4 hyperthreads per core), but you shouldn't have to worry about that. The final version may look a bit different or have additional patches on top of this one. Although the patch file says "2011" in its name, this patch applies cleanly to Slurm 20.02. Let me know if you have a problem compiling or testing with this patch.

> From my understanding, Slurm 20.11 will come with improvements regarding to
> this, is it correct? Will this patch be included there? I want to upgrade
> our main instance to Slurm v20 so I would wait until then if these problems
> are fixed in this version.

The fixes are likely to be in 20.11, and at least a partial fix is possible in 20.02, but I'm not sure. 20.11.0 will be released this month. But these fixes haven't been checked in yet, and I can't say whether or not they will be before 20.11.0 is released.

Tangent about upgrading: keep in mind that Slurm 20.02 and 20.11 are completely different major versions. (We do our versioning similar to Ubuntu: Year.Month of release.) The third number indicates the micro release; micro releases just have bug fixes. Different major versions usually have big feature changes. Definitely look at the NEWS and RELEASE_NOTES files before upgrading. You could upgrade to the latest stable release (20.02.6) or the newest release (20.11.0).
This may be obvious, but I'm going to say it anyway: 20.02.6 has been tested a lot more because it's been out for a while. 20.11.0 will have new features and some bug fixes that we thought were too risky to put in 20.02, but since it's new it's not as heavily tested. In other words, 20.11 is more likely to have unknown bugs, but it will contain more bug fixes. 20.02.6 has bugs, but we know more about them. We're less likely to make big changes to 20.02, so it may not get certain bug fixes, but the fixes it does get are less likely to be breaking.
The patches have been committed ahead of the 20.11 release. I'm marking this as a duplicate of bug 9724. Let me know if you have any further issues. See commits 49a7d7f9fb and 62546cb0b1de *** This ticket has been marked as a duplicate of ticket 9724 ***