| Summary: | topo gres count underflow | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | slurmctld | Assignee: | Director of Support <support> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | davide.vanzo, felip.moll |
| Version: | 18.08.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=1589, https://bugs.schedmd.com/show_bug.cgi?id=6500, https://bugs.schedmd.com/show_bug.cgi?id=7401 | | |
| Site: | Stanford | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | Sherlock | CLE Version: | |
| Version Fixed: | 18.08.6 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description

Kilian Cavalotti 2019-01-16 13:07:48 MST

Michael Hinton (comment #4):

Hey Kilian,
I am able to reproduce the error by simply restarting the slurmctld while a job with multiple GPUs allocated to it is running.
Here are the steps:
# Get a job with > 1 GPU
$ srun --gres=gpu:4 sleep 15 &
# Restart the slurmctld
$ slurmctld
Once the job finishes, I get an error in slurmctld.log like this:
[2019-01-28T10:16:14.294] error: gres/gpu: job 2622 dealloc node hintron topo gres count underflow (1 4)
Interestingly, if I did an `scontrol reconfigure` instead of a slurmctld restart, I was not able to reproduce the error.
Also of note: if I specified only 1 GPU (e.g. `srun --gres=gpu sleep 15 &`), there was no error.
Can you think of any other situations where this error is triggered for you? If you are seeing it all the time, maybe it’s not limited to when the slurmctld gets restarted.
I’ll dive into the code to see how this happens in the first place.
-Michael
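For readers hitting this message: the two numbers in "(1 4)" appear to be the per-topology allocated count the controller has on record and the amount the finishing job tries to release. Slurm's actual accounting is done in C inside the GRES plugin code; the sketch below is only a hypothetical Python model (the function name `dealloc_topo` and all details are illustrative, not Slurm's implementation) of the clamped-decrement pattern that produces this warning when state rebuilt after a restart disagrees with the job record:

```python
def dealloc_topo(topo_alloc_cnt, job_cnt, job_id, node, gres_name="gpu"):
    """Clamped decrement, modeling the underflow guard.

    topo_alloc_cnt: GRES currently recorded as allocated in this topology group
    job_cnt: GRES the finishing job is releasing from that group
    Returns (new_count, error_message_or_None).
    """
    if topo_alloc_cnt < job_cnt:
        # Counts disagree (e.g. state rebuilt after a slurmctld restart):
        # report the discrepancy and clamp at zero instead of wrapping around.
        err = (f"error: gres/{gres_name}: job {job_id} dealloc node {node} "
               f"topo gres count underflow ({topo_alloc_cnt} {job_cnt})")
        return 0, err
    return topo_alloc_cnt - job_cnt, None

# Healthy case: counts agree, no error.
print(dealloc_topo(4, 4, 2622, "hintron"))   # (0, None)

# After a restart that rebuilt the topo count as 1 for a 4-GPU job:
cnt, err = dealloc_topo(1, 4, 2622, "hintron")
print(cnt)   # 0
print(err)
```

The guard keeps the counter from going negative (or wrapping, for an unsigned type); the log line is the visible symptom of the mismatch, not its cause.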
Hi Michael,

(In reply to Michael Hinton from comment #4)
> I am able to reproduce the error by simply restarting the slurmctld while a
> job with multiple gpus allocated to it is running.

Thanks for the update, and great to hear you've been able to reproduce the error.

> Can you think of any other situations where this error is triggered for you?
> If you are seeing it all the time, maybe it's not limited to when the
> slurmctld gets restarted.

Restarting slurmctld while multi-GPU jobs are running is a pretty common scenario for us, so that sounds completely plausible and very likely the source of the issue. I can't really think of any other specific trigger, but the number of occurrences doesn't seem to decrease over time after a slurmctld restart. For instance, the last restart was on Jan 24:

2019-01-24T18:35:54-08:00 sh-sl01 slurmctld[166108]: slurmctld version 18.08.4 started on cluster sherlock

and the daily occurrence counts of that error message since then are:

2019-01-24: 4,009,605
2019-01-25: 4,012,105
2019-01-26: 393,597
2019-01-27: 516,347
2019-01-28: 592,146 (so far)

So there may be something else at play, but I don't know.

> I'll dive into the code to see how this happens in the first place.

Thank you!

Cheers,
--
Kilian

Michael Hinton (comment #27):

Hey Kilian,

We have a solution for this that is currently under review.

Interestingly, this problem does not affect 19.05+, and the solution was actually a bit of a backport.

I'll keep you posted when the patch lands.

Thanks,
Michael

Kilian Cavalotti:

Hi Michael,

(In reply to Michael Hinton from comment #27)
> We have a solution for this that is currently under review.
>
> Interestingly, this problem does not affect 19.05+, and the solution was
> actually a bit of a backport.
>
> I'll keep you posted when the patch lands.

Thanks for the update! I figured that could have been the case, with all the work around GPUs in 19.05.

Cheers,
--
Kilian

Michael Hinton:

Hey Kilian,

The patch just landed, slated for 18.08.6:
https://github.com/SchedMD/slurm/commit/6f8cd92e1091e3439352311a79449d7930d0870b

If this doesn't fix the problem, feel free to reopen this ticket.

Thanks!
-Michael

Michael Hinton:

Hey Kilian,

Just a heads up: 6f8cd92 had a regression that caused slurmctld to abort when GRES are added to or removed from a node's configuration and all the daemons are restarted. It was fixed here:
https://github.com/SchedMD/slurm/commit/69d78159c33051f2f6a3cb3ef2bf97c31288df02

Since we caught this before 18.08.6, it shouldn't affect anyone.

Thanks,
Michael

*** Ticket 6729 has been marked as a duplicate of this ticket. ***
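As an aside, daily tallies like the ones Kilian posted can be extracted from slurmctld.log with a short script. This is a sketch under the assumption that each line carries the `[YYYY-MM-DDTHH:MM:SS.mmm]` prefix shown in the sample log message earlier in the ticket (the helper name `underflow_counts_by_day` is made up for illustration):

```python
import re
from collections import Counter

def underflow_counts_by_day(log_lines):
    """Count 'topo gres count underflow' errors per day in slurmctld.log lines.

    Assumes the [YYYY-MM-DDTHH:MM:SS.mmm] timestamp prefix seen in this ticket.
    """
    pat = re.compile(
        r"^\[(\d{4}-\d{2}-\d{2})T[^\]]*\].*topo gres count underflow")
    counts = Counter()
    for line in log_lines:
        m = pat.match(line)
        if m:
            counts[m.group(1)] += 1
    return dict(counts)

sample = [
    "[2019-01-28T10:16:14.294] error: gres/gpu: job 2622 dealloc node hintron "
    "topo gres count underflow (1 4)",
    "[2019-01-28T10:17:02.001] error: gres/gpu: job 2623 dealloc node hintron "
    "topo gres count underflow (1 4)",
    "[2019-01-27T09:00:00.000] error: gres/gpu: job 2600 dealloc node hintron "
    "topo gres count underflow (2 4)",
    "[2019-01-27T09:01:00.000] slurmctld version 18.08.4 started",
]
print(underflow_counts_by_day(sample))  # {'2019-01-28': 2, '2019-01-27': 1}
```

In practice the same tally can be done with `grep` and `sort | uniq -c`; the script form just makes the date extraction explicit.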