Hi,

This is a follow-up on bug #4650.

We're seeing some (but not all) of our GPU job accounting records created with an AllocGRES value of "7696487:x" (x being the number of requested GPUs), exactly as described in bug #4650. We've checked our slurm.conf and gres.conf files, and the node configuration seems to be consistent regarding the GRES config.

I cleaned up the database yesterday by manually updating the records, as suggested in #4650:

MariaDB [slurm_acct_db]> update sherlock_job_table set gres_alloc='gpu:1' where gres_alloc='7696487:1';

And today, I can see that more job records have been created with that GRES id:

MariaDB [slurm_acct_db]> SELECT id_job,gres_alloc,gres_req,nodelist from sherlock_job_table where gres_alloc like '7696487%' LIMIT 10;
+----------+------------+----------+-----------+
| id_job   | gres_alloc | gres_req | nodelist  |
+----------+------------+----------+-----------+
| 35564295 | 7696487:1  | gpu:0    | sh-115-01 |
| 35564298 | 7696487:1  | gpu:0    | sh-115-01 |
| 35669388 | 7696487:1  | gpu:0    | sh-19-02  |
| 35669389 | 7696487:1  | gpu:0    | sh-112-12 |
| 35669423 | 7696487:1  | gpu:0    | sh-112-13 |
| 35669438 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669439 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669440 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669455 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669456 | 7696487:1  | gpu:0    | sh-114-04 |
+----------+------------+----------+-----------+

MariaDB [slurm_acct_db]> SELECT COUNT(*) from sherlock_job_table where gres_alloc like '7696487%';
+----------+
| COUNT(*) |
+----------+
|      588 |
+----------+

I'll attach the config files (gres.conf and slurm.conf) as well as the slurmctld log and the slurmd log from a relevant node (sh-115-01).

Thanks!
--
Kilian
I'd actually like to make the config and log file attachments private, but there's no Privacy setting checkbox when I try to upload an attachment. :(
(In reply to Kilian Cavalotti from comment #1)
> I'd actually like to make the config and log files attachment private, but
> there's no Privacy setting checkbox when I try to upload an attachment. :(

Hi Kilian,

Your bug is already private, so it is safe to attach the files here.

Thanks
Hey Kilian, We figured out where the problem is. We are just making sure that our fix looks good before committing. Thanks, -Michael
(In reply to Michael Hinton from comment #9)
> Hey Kilian,
>
> We figured out where the problem is. We are just making sure that our fix
> looks good before committing.

Good news! Thanks for the update.

Cheers,
--
Kilian
Kilian,

Here is the patch, slated for 18.08.5:
https://github.com/SchedMD/slurm/commit/588aacf5b13da5ef

Let me know if that works for you!

Thanks,
-Michael
Here's a quick explanation:

`7696487` is the gpu plugin id, and is simply a special hash of the string `gpu`.

Before, Slurm would try to see if that id matched any gpu records parsed from (effectively) a random node's gres.conf. On heterogeneous systems (or homogeneous systems with gres.confs that are mismatched, like in bug 4650), this means that sometimes a gpu record wasn't found, so the string "gpu" wasn't found. The fallback was to use the plugin id instead.

The simplifying realization was that searching gpu records from a random node's gres.conf for a gres name string was not very smart. Instead, we can simply check the GresTypes strings configured in the controller's slurm.conf.
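To make the explanation above concrete, here is a minimal Python sketch of the idea. It is not taken from the Slurm source: the function names `build_plugin_id` and `resolve_gres_name` are hypothetical, and the byte-packing scheme is an assumption chosen because it reproduces `7696487` for the string `gpu` (103 + 112 * 2^8 + 117 * 2^16). The second function illustrates looking the id up in a list of names such as the GresTypes values from slurm.conf, with the numeric id as the fallback when no name matches.

```python
from typing import Sequence


def build_plugin_id(name: str) -> int:
    """Pack the bytes of `name` into a 32-bit integer, 8 bits per character.

    Assumption: this packing is only an illustration that happens to yield
    7696487 for "gpu"; it is not copied from Slurm.
    """
    plugin_id = 0
    shift = 0
    for byte in name.encode("ascii"):
        plugin_id = (plugin_id + (byte << shift)) & 0xFFFFFFFF
        shift = (shift + 8) % 32  # wrap around after four characters
    return plugin_id


def resolve_gres_name(plugin_id: int, gres_types: Sequence[str]) -> str:
    """Return the configured GRES name matching `plugin_id`.

    `gres_types` stands in for the GresTypes list from slurm.conf.
    If no configured name hashes to the id, fall back to the numeric id,
    which is how a value like "7696487" can end up in a record.
    """
    for name in gres_types:
        if build_plugin_id(name) == plugin_id:
            return name
    return str(plugin_id)


if __name__ == "__main__":
    print(build_plugin_id("gpu"))               # 7696487
    print(resolve_gres_name(7696487, ["gpu"]))  # gpu
    print(resolve_gres_name(7696487, ["mic"]))  # 7696487
```

The last call shows the failure mode in miniature: when the list being searched does not contain `gpu`, the lookup has nothing to match and the raw id is emitted in place of the name.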
Hi Michael,

(In reply to Michael Hinton from comment #13)
> Here's a quick explanation:
>
> `7696487` is the gpu plugin id, and is simply a special hash of the string
> `gpu`.
>
> Before, Slurm would try to see if that id matched any gpu records parsed
> from (effectively) a random node's gres.conf. On heterogeneous systems (or
> homogeneous systems with gres.confs that are mismatched, like in bug 4650),
> this means that sometimes a gpu record wasn't found, so the string "gpu"
> wasn't found. The fallback was to use the plugin id instead.
>
> The simplifying realization was that searching gpu records from a random
> node's gres.conf for a gres name string was not very smart. Instead, we can
> simply check the GresTypes strings configured in the controller's slurm.conf.

Thanks for the explanation, and for the patch!

Cheers,
--
Kilian
Closing
*** Ticket 6599 has been marked as a duplicate of this ticket. ***