Summary: | Regression in 20.11.7 for back-to-back sruns | |
---|---|---|---
Product: | Slurm | Reporter: | Luke Yeager <lyeager>
Component: | slurmctld | Assignee: | Marshall Garey <marshall>
Status: | RESOLVED FIXED | QA Contact: |
Severity: | 2 - High Impact | |
Priority: | --- | CC: | bas.vandervlies, dwightman, fabecassis, fullop, hpc-cs-hd, marshall, mcoyne, ndobson, sts
Version: | 20.11.7 | |
Hardware: | Linux | |
OS: | Linux | |
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=11870 | |
Site: | NVIDIA (PSLA) | |
Version Fixed: | 20.11.8, 21.08.0pre1 | |
Description
Luke Yeager 2021-06-17 10:12:22 MDT
Marshall Garey:

Thanks for reporting this, and nice job using git bisect to find the commit that started it! That certainly helped me figure out what was happening faster. I can reproduce this, and I believe I have a fix; I'm submitting my patch to our review queue. Would you like me to make the patch available for you to test?

I discovered this bug can happen without a SPANK plugin, with a simple job like this:

```
#!/bin/bash
#SBATCH -N2
srun -l -N1 bash -c 'hostname; sleep 10' &
sleep 1
srun -l -N2 bash -c 'hostname; sleep 10' &
wait
```

Before commit 6a2c99edbf96e, the second step wasn't rejected; instead, it incorrectly ran in parallel with the first step, overallocating CPUs. So commit 6a2c99edbf actually fixed that bug (in addition to the bug I meant to fix with it). But unfortunately, that then caused this new bug, which you reported.

Luke Yeager:

Thanks for the quick confirmation. It's good to know that this will affect sites and jobs that don't use SPANK plugins, too. I think we will probably just sit at 20.11.6 until 20.11.8 is released rather than patching 20.11.7. But it would be nice to have the patch public for the sake of other sites that might want to patch. As you know, many sites need to be at 20.11.7 because of the CVE fix.

Marshall Garey:

(In reply to Luke Yeager from comment #8)
> Thanks for the quick confirmation. It's good to know that this will affect
> sites and jobs that don't use SPANK plugins, too. I think we will probably
> just sit at 20.11.6 until 20.11.8 is released rather than patching 20.11.7.

Sounds good.

> But it would be nice to have the patch public for the sake of other sites
> that might want to patch. As you know, many sites need to be at 20.11.7
> because of the CVE fix.

As soon as we push the fix, I will post the commit here. If someone wants the patch before it's pushed, just leave a comment here and I will make it public.

Marshall Garey:

Hey Luke,

This turned out to be a little more complicated than I initially thought, but we got it figured out. We've fixed this and other issues in the following commits:

- 48009cbd97 Clear allocated cpus for running steps in a job before handling requested nodes on new step.
- 05f5abda9f Don't reject a step if not enough nodes are available.
- 49b1b2681e Don't reject a step if it requests a node that is allocated.

These will be in 20.11.8. Thanks for reporting this! Closing this bug as resolved/fixed.

*** Ticket 11783 has been marked as a duplicate of this ticket. ***

Luke Yeager:

I've verified the fix in 20.11.8 - thanks.
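As a note for readers who want to exercise Marshall's reproducer themselves, the sketch below shows one way to submit it and watch the step states. It is illustrative only: the file name repro.sh and the exact monitoring commands are my assumptions, and the ticket does not record the error text the failing srun prints on 20.11.7.

```bash
#!/bin/bash
# Illustrative driver for the reproducer above, assuming it is saved as
# repro.sh (the file name is an assumption, not from the ticket).
jobid=$(sbatch --parsable repro.sh)   # submit the two-node batch job
# Give both srun step launches a moment to happen, then look at the steps.
sleep 5
squeue --steps --jobs="$jobid"        # show the job's steps while it runs
# On a fixed build (20.11.8) the -N2 step waits for the -N1 step to free
# its node and then runs; on 20.11.7 the second srun is rejected outright,
# which is the regression this ticket is about.
sacct --jobs="$jobid" --format=JobID,JobName,State,ExitCode
```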
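Similarly, for sites that must stay on 20.11.7 for the CVE fix but want this regression patched before 20.11.8 ships, one plausible route is to cherry-pick the three commits listed above onto a 20.11.7 source tree. This is a sketch under assumptions: the release tag name is guessed from SchedMD's usual slurm-20-11-7-1 pattern, and the commits are not guaranteed to apply cleanly.

```bash
# Sketch of backporting the fixes onto a 20.11.7 source tree.
# The tag name slurm-20-11-7-1 is assumed; adjust to the actual release tag.
git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout -b 20.11.7-backport slurm-20-11-7-1
# Cherry-pick the three fix commits named in this ticket; resolve any
# conflicts by hand if they don't apply cleanly.
git cherry-pick 48009cbd97 05f5abda9f 49b1b2681e
```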