Created attachment 11544 [details]
slurm.conf

We have run into an issue where running "scontrol reconfigure" aborts running GPU jobs, but only on multi-GPU nodes.

First the details:

This is a 4-node cluster; each node has either 4 or 8 GPUs.
All running Ubuntu 18.04.2 LTS
All running 19.05.0 compiled from source
All running Nvidia GPUs (RTX Titans, Quadro RTX 8000s, and GTX 1080s)

If we issue "scontrol reconfigure" to make a config change (changing AccountingStorageType in this example), running GPU jobs get aborted with an "unsupported GRES options" error. We have reproduced this many times on this cluster.

Example:

allennlp-server1.corp ~ # squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   671  allennlp     bash  sanjays  R   12:21:57      1 allennlp-server1
   698  allennlp sbatch_w     roys  R   10:28:47      1 allennlp-server4
   711  allennlp gqa_bala  sanjays  R    9:48:02      1 allennlp-server4
   783  allennlp allennlp  sanjays  R      45:26      1 allennlp-server4
allennlp-server1.corp ~ # sudo scontrol reconfigure
allennlp-server1.corp ~ # squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   671  allennlp     bash  sanjays CG   12:22:10      1 allennlp-server1
   783  allennlp allennlp  sanjays CG      45:39      1 allennlp-server4
allennlp-server1.corp ~ #

And in the logs, this is what we see:

Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: Processing RPC: REQUEST_RECONFIGURE from uid=0
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: No memory enforcing mechanism configured.
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: layouts: no layout to initialize
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: restoring original state of nodes
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: select/cons_tres: select_p_node_init
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: select/cons_tres: preparing for 2 partitions
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=671 due to use of unsupported GRES options
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=698 due to use of unsupported GRES options
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: email msg to roys@allenai.org: Slurm Job_id=698 Name=sbatch_wrapper.sh Failed, Run time 10:29:00, NODE_FAIL, ExitCode 0
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=711 due to use of unsupported GRES options
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=783 due to use of unsupported GRES options

We believe this is some kind of race condition in the loading of plugins. We were also unable to reproduce this on a single-GPU test node that we spun up to investigate, and the fact that one job seems to survive leads us to suspect that a job on GPU0 survives while the others do not.

Finally, I rated this as High impact because this issue will prevent adoption here at AI2; we have many users who run multiple long-running jobs for their research, and these kinds of interruptions are not acceptable to them. Please feel free to reprioritize according to your standards.
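For anyone auditing how many jobs a reconfigure killed, a small sketch that pulls the affected JobIds out of slurmctld log lines like the ones above. This is just a grep/sed pipeline over the abort message; the idea that you would feed it from journalctl or a log file is an assumption about your logging setup, so here it reads the excerpt inline.

```shell
#!/bin/sh
# Sketch: extract the JobIds that slurmctld aborted with the
# "unsupported GRES options" error. On a real controller you would
# pipe in the log (e.g. from journalctl -u slurmctld); here we feed
# the excerpt from the report inline via a here-document.
extract_aborted_jobs() {
    grep 'unsupported GRES options' |
        sed -n 's/.*Aborting JobId=\([0-9]*\).*/\1/p'
}

extract_aborted_jobs <<'EOF'
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=671 due to use of unsupported GRES options
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=698 due to use of unsupported GRES options
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: email msg to roys@allenai.org: Slurm Job_id=698 Name=sbatch_wrapper.sh Failed, Run time 10:29:00, NODE_FAIL, ExitCode 0
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=711 due to use of unsupported GRES options
Sep 10 22:39:28 allennlp-server1 slurmctld[25106]: error: Aborting JobId=783 due to use of unsupported GRES options
EOF
```

Note that the "email msg" line is filtered out because it does not match the error pattern, so the output is exactly the four aborted JobIds.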
Created attachment 11545 [details] gres.conf
Created attachment 11547 [details] slurmdbd.conf
This appears to be related based on the description and what we are seeing https://bugs.schedmd.com/show_bug.cgi?id=7727
Hi

I can reproduce this easily. It looks like the patch from bug 7727 is correct and fixes this issue. I will let you know when it is in the repo.

Dominik
When will this be released? Debating if we should just integrate our own patch or wait for yours. Thanks!
Hi

As you have probably already noticed, the fix is committed as:
https://github.com/SchedMD/slurm/commit/2abd2a3d8d6bdc

It will be included in 19.05.3. We plan to release 19.05.3 before the end of the month, but we have no strict date yet. Let me know if we can close this ticket now.

Dominik
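As a side note for others tracking this fix: a minimal sketch of how one might check whether a locally installed Slurm is at least 19.05.3 (the release said to contain the commit above). The comparison itself is plain dotted-version sorting with GNU `sort -V`; the idea of populating `installed` from `scontrol version` is an assumption about your environment, so it is hard-coded here for illustration.

```shell
#!/bin/sh
# Sketch: decide whether an installed Slurm version is new enough to
# contain the 19.05.3 fix. Assumes GNU sort (for -V version sorting).
version_ge() {
    # Succeeds if $1 >= $2 when compared as dotted numeric versions.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Hypothetical: on a real node you might populate this from
# "scontrol version" or "sinfo --version" instead.
installed="19.05.0"

if version_ge "$installed" "19.05.3"; then
    echo "fix included"
else
    echo "patch still needed"
fi
```

With the hard-coded `19.05.0` above, this prints `patch still needed`.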
Hi Did you apply this patch? Please let me know when we can close this ticket. Dominik
Patch applied. Feel free to close.

marc