| Summary: | Jobs vanished when submitted to non-existent partition | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Akmal Madzlan <akmalm> |
| Component: | slurmctld | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.03.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Akmal Madzlan
2014-11-04 13:10:02 MST
I've been able to reproduce this if I remove a partition from the slurm.conf and reconfigure. If you use "scontrol delete PartitionName=<partition>", there is logic to prevent the partition from being deleted if it is in use by any job. I'm looking into a solution for not canceling the jobs on a restart/reconfigure.

A solution that prevents the jobs from being canceled will be considered for 15.08; it's a large project. The solution for now is to use "scontrol delete PartitionName=<partition>" before removing a partition from the slurm.conf and restarting. This will help ensure that there are no jobs referencing the partition at removal.

The FAQ has been updated with this information as well.
https://github.com/SchedMD/slurm/commit/2f4a18fa466ee55baa205390f154e75269074b5f

Let us know if you have any questions.

Thanks,
Brian

Thanks Brian
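
For anyone scripting the workaround above, here is a minimal sketch of the "delete first, then edit slurm.conf" order. The helper names (`delete_partition_cmd`, `safe_to_remove`) are hypothetical, and the sketch assumes that scontrol exits with a non-zero status when it refuses the delete because jobs still reference the partition; check that against your Slurm version before relying on it.

```python
import subprocess


def delete_partition_cmd(partition):
    """Build the scontrol command quoted in the comment above."""
    return ["scontrol", "delete", f"PartitionName={partition}"]


def safe_to_remove(partition, run=subprocess.run):
    """Ask slurmctld to delete the partition before slurm.conf is edited.

    Returns True only if scontrol reported success. ASSUMPTION: a refused
    delete (partition still in use by a job) yields a non-zero exit status.
    The `run` parameter exists only so the call can be stubbed in tests.
    """
    result = run(delete_partition_cmd(partition), capture_output=True, text=True)
    return result.returncode == 0
```

A caller would remove the partition from slurm.conf and restart or reconfigure only when `safe_to_remove("<partition>")` returns True. Otherwise, wait for the jobs in that partition to finish or move them first.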