| Summary: | Heterogeneous job components reserving individually more than available resources | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Alejandro Sanchez <alex> |
| Component: | Scheduling | Assignee: | Unassigned Developer <dev-unassigned> |
| Status: | OPEN --- | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | ||
| Version: | 18.08.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5579 | ||
| Site: | SchedMD | Slinky Site: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
Description
Alejandro Sanchez
2018-08-31 09:36:16 MDT
(In reply to Alejandro Sanchez from comment #0)
> This was detected in bug 5579 as a related but separate issue. Core of the
> problem is heterogeneous job components belonging to a single hetjob can
> each of them request up to the total amount of available resources.

Note that "available resources" here can include not just nodes on the cluster, but also:

1. Resources in a specific partition (when multiple hetjob components are submitted to the same partition)
2. GRES when specified as a total count for the job (e.g. two hetjob components specify "--gpus=16", which together exceed the total available on the partition)
3. Licenses
4. Burst buffer space
5. Global limits (total node count for a user, QOS, etc.)

This will be difficult to detect, at least for most of these cases. Ideally this can be detected at submit time, in which case we could reject the job. Otherwise, holding the job is probably the best option.
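
To illustrate the kind of submit-time check discussed here, below is a minimal Python sketch. It is not Slurm code: the function name `oversubscribed`, the dictionary layout, and the resource names are all hypothetical. It also assumes the simplest case, where every component draws from a single shared pool (one partition, one license pool, and so on); the harder cases listed above are not covered.

```python
from collections import Counter


def oversubscribed(components, available):
    """Return {resource: (requested_total, available)} for every resource
    whose summed request across all components exceeds what is available.

    `components` is a list of per-component request dicts and `available`
    is a dict of pool totals. Both layouts are hypothetical.
    """
    total = Counter()
    for comp in components:
        total.update(comp)  # adds each component's counts per resource
    return {
        res: (total[res], available.get(res, 0))
        for res in total
        if total[res] > available.get(res, 0)
    }


# Hypothetical hetjob: each component alone fits, but together they do not.
components = [{"nodes": 2, "gpus": 16}, {"nodes": 2, "gpus": 16}]
available = {"nodes": 8, "gpus": 16}

over = oversubscribed(components, available)
# Reject at submit time if the combined request can never be satisfied.
decision = "reject" if over else "accept"
```

The point of the sketch is that each component must be validated against the sum over all components of the same hetjob, not against the pool on its own. Where the totals are not known at submit time, the same comparison could instead drive the decision to hold the job.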