| Summary: | Jobs going to nodes that are not members of the selected partition | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Adam <adam.munro> |
| Component: | slurmctld | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | felip.moll |
| Version: | 20.02.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9225 | ||
| Site: | Yale | ||
| Attachments: | bug8847_2002_v12.patch, bug8847_2002_v13.patch | ||
Since we upgraded to 20.02.3, we have seen jobs submitted to one partition end up running on nodes that are not members of that partition. For example, all of these:

| User | JobID | Partition | State | Submit | Start | NodeList |
|---|---|---|---|---|---|---|
| ga254 | 65507956 | day | COMPLETED | 2020-08-29T11:24:43 | 2020-08-29T11:24:44 | c31n06 |
| ga254 | 65507958 | day | COMPLETED | 2020-08-29T11:24:45 | 2020-08-29T11:24:46 | c31n06 |
| ch2229 | 62930363 | pi_econ_io | COMPLETED | 2020-08-11T08:22:02 | 2020-08-11T08:22:48 | p08r02n40 |
| ch2229 | 62967916 | pi_econ_io | COMPLETED | 2020-08-11T11:56:25 | 2020-08-11T11:57:04 | p08r02n40 |
| ch2229 | 63006219 | pi_econ_io | COMPLETED | 2020-08-11T17:24:47 | 2020-08-11T17:24:48 | p08r02n44 |
| ch2229 | 63292472 | pi_econ_io | COMPLETED | 2020-08-13T17:38:50 | 2020-08-13T17:39:11 | p08r02n36 |
| lf468 | 62468450 | pi_econ_lp | FAILED | 2020-08-06T18:43:30 | 2020-08-06T18:44:11 | p08r02n40 |
| fd338 | 64246551 | pi_polima+ | COMPLETED | 2020-08-21T15:27:16 | 2020-08-25T14:55:09 | p08r02n36 |

None of the above nodes are or were members of any of the listed partitions (e.g. c31n06 is not a member of "day"). This does not happen very frequently, but it is a big problem because the owners of the nodes are unhappy with other users' jobs running on their nodes.

Thank you,
Adam
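For anyone trying to find further affected jobs, the check described above (does each job's node belong to the partition it was submitted to?) could be sketched roughly as follows. This is only an illustration, not something taken from the site's tooling: the `Job` record, the function names, and the partition membership in the example are all hypothetical, and it assumes node lists have already been expanded into individual hostnames. Partition names that the accounting output truncated with a trailing `+` are matched by prefix, which is an assumption and may be too permissive if several partitions share a prefix.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class Job(NamedTuple):
    """One accounting record (hypothetical structure for this sketch)."""
    user: str
    job_id: str
    partition: str  # may be truncated in the listing, ending in "+"
    node: str


def allowed_nodes(partition: str, members: Dict[str, Set[str]]) -> Set[str]:
    """Return the nodes a job in `partition` may legitimately run on.

    A name ending in "+" is treated as a truncated prefix and matches
    every partition whose name starts with it (assumption of this sketch).
    An unknown partition yields an empty set.
    """
    if partition.endswith("+"):
        prefix = partition[:-1]
        matches = [nodes for name, nodes in members.items()
                   if name.startswith(prefix)]
    else:
        matches = [members[partition]] if partition in members else []
    return set().union(*matches)


def misplaced_jobs(jobs: Iterable[Job],
                   members: Dict[str, Set[str]]) -> List[Job]:
    """Return jobs whose node is not a member of their partition."""
    return [job for job in jobs
            if job.node not in allowed_nodes(job.partition, members)]


if __name__ == "__main__":
    # Hypothetical membership; the real node lists are not given in this report.
    example_members = {"day": {"c01n01", "c01n02"}}
    example_jobs = [
        Job("ga254", "65507956", "day", "c31n06"),  # node taken from the report
        Job("someuser", "1", "day", "c01n01"),      # made-up, correctly placed
    ]
    for job in misplaced_jobs(example_jobs, example_members):
        print(f"{job.job_id} ({job.partition}) ran on {job.node}")
```

Jobs in a partition the membership map does not know about are reported as misplaced, so a stale or incomplete map produces false positives rather than silently hiding records.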