Hi! I'd like to revisit bug #1589, which was never really resolved. We've been seeing a continuous stream of error messages in our slurmctld logs, across all the Slurm versions we've used since 14.11, that look like this:

error: gres/gpu: job 35630501 dealloc node sh-15-09 topo gres count underflow (0 1)

They mention pretty much all of our GPU nodes and are logged quite often: there are 332,998 occurrences of that error message, spanning all of our GPU nodes, in our slurmctld.log for the last 2 days. It doesn't seem to prevent GPU jobs from being dispatched or GRES resources from being correctly allocated, but we would like to get to the bottom of this, understand why this error is logged, and hopefully fix whatever causes it.

Conf and logs attached in the next comment.

Thanks!
-- Kilian
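For reference, the occurrence count above can be reproduced with a simple grep. This is just a sketch against a synthetic log excerpt; the real file lives wherever SlurmctldLogFile points in slurm.conf, and the /tmp path here is illustrative only:

```shell
# Synthetic excerpt standing in for the real slurmctld.log (path is made up)
cat > /tmp/slurmctld.sample.log <<'EOF'
[2019-01-28T10:16:14.294] error: gres/gpu: job 35630501 dealloc node sh-15-09 topo gres count underflow (0 1)
[2019-01-28T10:16:15.101] error: gres/gpu: job 35630502 dealloc node sh-15-10 topo gres count underflow (0 1)
EOF

# Count the underflow errors in the log
grep -c 'topo gres count underflow' /tmp/slurmctld.sample.log
```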
Hey Kilian,

I am able to reproduce the error by simply restarting the slurmctld while a job with multiple GPUs allocated to it is running. Here are the steps:

# Get a job with > 1 GPUs
$ srun --gres=gpu:4 sleep 15 &

# Restart the slurmctld
$ slurmctld

Once the job finishes, I get an error in slurmctld.log like this:

[2019-01-28T10:16:14.294] error: gres/gpu: job 2622 dealloc node hintron topo gres count underflow (1 4)

Interestingly, if I did an `scontrol reconfigure` instead of a slurmctld restart, I was not able to reproduce the error. Also of note: if I specified only 1 GPU (e.g. `srun --gres=gpu sleep 15 &`), there was no error.

Can you think of any other situations where this error is triggered for you? If you are seeing it all the time, maybe it's not limited to when the slurmctld gets restarted. I'll dive into the code to see how this happens in the first place.

-Michael
Hi Michael,

(In reply to Michael Hinton from comment #4)
> I am able to reproduce the error by simply restarting the slurmctld while a
> job with multiple gpus allocated to it is running.

Thanks for the update, and great to hear you've been able to reproduce the error.

> Can you think of any other situations where this error is triggered for you?
> If you are seeing it all the time, maybe it's not limited to when the
> slurmctld gets restarted.

Restarting slurmctld while multi-GPU jobs are running is a pretty common scenario for us, so that sounds completely plausible and very likely the source of the issue. I can't think of any other specific situation that would trigger this, but the number of occurrences doesn't seem to decrease over time after a slurmctld restart. For instance, the last restart was on Jan 24:

2019-01-24T18:35:54-08:00 sh-sl01 slurmctld[166108]: slurmctld version 18.08.4 started on cluster sherlock

and yet the daily counts of that error message since then are:

2019-01-24: 4,009,605 occurrences
2019-01-25: 4,012,105
2019-01-26:   393,597
2019-01-27:   516,347
2019-01-28:   592,146 (so far)

So there may be something else at play, but I don't know.

> I'll dive into the code to see how this happens in the first place.

Thank you!

Cheers,
-- Kilian
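For what it's worth, per-day tallies like the ones above can be produced with a small grep/awk pipeline. This is a sketch over a synthetic log excerpt; the real SlurmctldLogFile path and timestamp volume will differ:

```shell
# Synthetic excerpt standing in for the real slurmctld.log (path is made up)
cat > /tmp/slurmctld.daily.log <<'EOF'
[2019-01-24T18:36:00.000] error: gres/gpu: job 1 dealloc node sh-15-09 topo gres count underflow (0 1)
[2019-01-24T18:36:01.000] error: gres/gpu: job 2 dealloc node sh-15-10 topo gres count underflow (0 1)
[2019-01-25T09:00:00.000] error: gres/gpu: job 3 dealloc node sh-15-09 topo gres count underflow (0 1)
EOF

# Tally underflow errors per day: split the timestamp on 'T', strip the
# leading '[', and count occurrences of each date
grep 'topo gres count underflow' /tmp/slurmctld.daily.log \
  | awk -F'T' '{gsub(/^\[/, "", $1); n[$1]++} END {for (d in n) print d": "n[d]}' \
  | sort
```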
Hey Kilian,

We have a solution for this that is currently under review. Interestingly, this problem does not affect 19.05+, and the solution was actually a bit of a backport.

I'll keep you posted when the patch lands.

Thanks,
Michael
Hi Michael,

(In reply to Michael Hinton from comment #27)
> We have a solution for this that is currently under review.
>
> Interestingly, this problem does not affect 19.05+, and the solution was
> actually a bit of a backport.
>
> I'll keep you posted when the patch lands.

Thanks for the update! I figured that might be the case, with all the work around GPUs in 19.05.

Cheers,
-- Kilian
Hey Kilian,

The patch just landed, slated for 18.08.6:
https://github.com/SchedMD/slurm/commit/6f8cd92e1091e3439352311a79449d7930d0870b

If this doesn't fix the problem, feel free to reopen this ticket.

Thanks!
-Michael
Hey Kilian,

Just a heads up: 6f8cd92 had a regression that caused slurmctld to abort when GRES are added to or removed from a node's configuration and all the daemons are restarted. It was fixed here:
https://github.com/SchedMD/slurm/commit/69d78159c33051f2f6a3cb3ef2bf97c31288df02

Since we caught this before 18.08.6, it shouldn't affect anyone.

Thanks,
Michael
*** Ticket 6729 has been marked as a duplicate of this ticket. ***