| Summary: | AllocGRES gives strange value | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | Accounting | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | felip.moll, kilian |
| Version: | 17.02.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4695, https://bugs.schedmd.com/show_bug.cgi?id=6366 | | |
| Site: | GSK | | |
Description (GSK-ONYX-SLURM, 2018-01-19 08:19:53 MST)
Hello,

This happens in two situations:

1. You are using a GRES plugin (e.g. gpu) and there are inconsistencies between slurm.conf and gres.conf. Can you check whether the GPU counts for your nodes in slurm.conf differ from what is in gres.conf?
2. You start slurmd before starting slurmctld and you have a gres.conf defined. This seems to be a bug; I am investigating this case and will come back to you as soon as possible, but in the meantime please check situation 1.

More detailed explanation: what probably happened here is that some of your nodes have a configuration mismatch between slurm.conf and gres.conf. For example, if slurm.conf contains a line like:

    NodeName=node0001 gres=gpu:1

but gres.conf has no entry for node0001, then when slurmd starts on that node it sends a registration message to the controller with the values read from gres.conf: none. The controller then detects this mismatch and marks the node drained with:

    Reason=gres/gpu count too low (0 < 1) [slurm@2018-01-23T12:51:13]

After that you can force the node up again with:

    scontrol update nodename=node0001 state=resume

and send it jobs requesting gres=gpu, since that GRES is defined in slurm.conf. The job is accepted, but at the time of recording which GRES resources were allocated there is no match between "gpu" and what is in gres.conf, so the numeric plugin ID is saved instead of the string 'gpu'. The ID 7696487 corresponds to GPU.

So, to solve this:

1. Check for a mismatch between slurm.conf and gres.conf, correct it, then run 'scontrol reconfig' and restart the slurmd daemon on the affected nodes, in that order.
2. The string '7696487' is already in the database.
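For reference, a consistent pair of entries for the hypothetical node0001 with one GPU could look like the sketch below; the gres.conf device path is an assumption and must match the actual hardware on the node:

```
# slurm.conf: the node advertises one GPU to the controller
NodeName=node0001 Gres=gpu:1

# gres.conf on node0001: the same count must be resolvable locally
Name=gpu File=/dev/nvidia0
```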
If you want to correct these values you will have to do it manually.

See all values:

> select id_job,gres_alloc from <your_cluster_name>_job_table;

See just the wrong GPU ones:

> select id_job,gres_alloc from <your_cluster_name>_job_table where gres_alloc like '7696487%';

Example of an update:

> update <your_cluster_name>_job_table set gres_alloc='gpu:1' where gres_alloc='7696487:1';

3. I will check internally how to avoid this situation.

Please tell me how 1 and 2 go for you.

---

Hello,

I am marking this bug as resolved, and in parallel I am opening an enhancement request to fix this situation. In the future we will probably deny an admin setting nodes to IDLE when they have a reason related to a bad GRES count, CPU count, or any other config mismatch. Should you have any questions, just raise a new bug. Comment 3 will solve your current issues.

Best regards, and thanks for reporting.

---

Comment 9, Kilian Cavalotti:

Hi Felip,

Has there been any fix for this issue? We seem to be experiencing the same thing, although our slurm.conf and gres.conf seem to be consistent. We definitely have occasions where slurmd starts before slurmctld, though.

So I went to the Slurm DB and updated all the records with ID 7696487, as you detailed in Comment #3. I restarted slurmctld, and then all slurmds on GPU nodes, but I continue to see new jobs being recorded with gres_alloc="7696487:x". None of our nodes is drained with "Reason=gres/gpu count too low" either.

Cheers,
--
Kilian

(In reply to Kilian Cavalotti from comment #9)
> Hi Felip,
>
> Has there been any fix for this issue? We seem to be experiencing the same
> thing, although our slurm.conf and gres.conf seem to be consistent.
>
> We definitely have occasions where slurmd starts before slurmctld, though.
>
> So I went to the Slurm DB and updated all the records with ID 7696487, as
> you detailed in Comment #3. I restarted slurmctld, and then all slurmds on
> GPU nodes, but I continue to see new jobs being recorded with
> gres_alloc="7696487:x".
> None of our nodes is drained with "Reason=gres/gpu count too low" either.
>
> Cheers,
> --
> Kilian

Hi Kilian,

Though this may apply to recent versions, this bug is tagged against the old 17.02. Given that your situation does not exactly match the one in this bug, would you mind opening a new one? You can directly attach your newest slurm.conf/gres.conf, the slurmctld/slurmd logs, and the commands you issued.

Also, remember that commenting on "resolved" bugs without reopening them may lead us to unintentionally miss the notifications. Always reopen the bug or open a new one.

Thanks
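The manual cleanup from Comment 3 boils down to rewriting every gres_alloc value that starts with the numeric plugin ID back to the 'gpu' string. A minimal sketch of that pattern, using an in-memory SQLite table as a stand-in for the real MySQL/MariaDB job table (the table name `cluster_job_table` and the sample rows are illustrative; the real table is `<your_cluster_name>_job_table`):

```python
import sqlite3

GPU_PLUGIN_ID = "7696487"  # numeric GRES plugin ID stored instead of "gpu"

def fix_gres_alloc(conn, table):
    """Rewrite gres_alloc values like '7696487:N' as 'gpu:N'; return rows changed."""
    # The table name is interpolated, so it must come from a trusted source.
    cur = conn.execute(
        f"UPDATE {table} SET gres_alloc = 'gpu' || substr(gres_alloc, ?) "
        "WHERE gres_alloc LIKE ?",
        (len(GPU_PLUGIN_ID) + 1, GPU_PLUGIN_ID + ":%"),
    )
    conn.commit()
    return cur.rowcount

# Demo against a tiny in-memory stand-in for the job table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cluster_job_table (id_job INTEGER, gres_alloc TEXT)")
conn.executemany(
    "INSERT INTO cluster_job_table VALUES (?, ?)",
    [(1, "7696487:1"), (2, "7696487:4"), (3, "gpu:2")],
)
changed = fix_gres_alloc(conn, "cluster_job_table")  # changed == 2
rows = conn.execute(
    "SELECT id_job, gres_alloc FROM cluster_job_table ORDER BY id_job"
).fetchall()
# rows now read gpu:1, gpu:4, gpu:2
```

Note that on the actual MySQL/MariaDB accounting database the string functions differ slightly (CONCAT/SUBSTRING instead of `||`/substr), and as with any manual edit of the Slurm database, take a backup first.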