Ticket 1231 - Jobs vanished when submitted to non-existent partition
Summary: Jobs vanished when submitted to non-existent partition
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.03.9
Hardware: Linux Linux
: 2 - High Impact
Assignee: Brian Christiansen
 
Reported: 2014-11-04 13:10 MST by Akmal Madzlan
Modified: 2014-11-09 14:37 MST
Site: DownUnder GeoSolutions


Description Akmal Madzlan 2014-11-04 13:10:02 MST
Jobs vanished after being submitted to 3 partitions, 2 of which still existed. Some of the jobs went into the SE state. We speculate that jobs that were running when the partition disappeared or was removed were put into the SE state, while jobs that were still pending vanished.

I think Slurm should either move the pending jobs to the other partitions specified or put them into some sort of error state when one of the specified partitions is removed. That way we can fix the error and then bring the jobs back into the queue.

[2014-10-31T22:06:56.198] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202158
[2014-10-31T22:06:56.220] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202159
[2014-10-31T22:06:56.236] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202160
[2014-10-31T22:06:56.253] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202161
[2014-10-31T22:06:56.277] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202162
[2014-10-31T22:06:56.292] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202163
[2014-10-31T22:06:56.310] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202164
[2014-10-31T22:06:56.328] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202165
[2014-10-31T22:06:56.346] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202166
[2014-10-31T22:06:56.366] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202167
[2014-10-31T22:06:56.389] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202168
[2014-10-31T22:06:56.408] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202169
[2014-10-31T22:06:56.427] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202170
[2014-10-31T22:06:56.443] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202171
[2014-10-31T22:06:56.464] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202172
[2014-10-31T22:06:56.486] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202173
[2014-10-31T22:06:56.506] error: Invalid partition (teambond,idle,desktopBigMem) for job 2202174
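
To list the affected jobs, the job IDs can be pulled out of log lines like the ones above. The following is a minimal sketch, assuming the slurmctld error line format is exactly as quoted here; the function name and the regular expression are illustrative and not part of Slurm.

```python
import re

# Matches slurmctld lines of the form quoted above (format is an assumption):
# [<timestamp>] error: Invalid partition (<p1,p2,...>) for job <id>
INVALID_PARTITION_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] error: Invalid partition "
    r"\((?P<partitions>[^)]*)\) for job (?P<job_id>\d+)"
)


def find_invalid_partition_jobs(log_lines):
    """Return a list of (job_id, [partition, ...]) for each matching line.

    Hypothetical helper: lines that do not match the pattern are skipped.
    """
    results = []
    for line in log_lines:
        match = INVALID_PARTITION_RE.match(line.strip())
        if match:
            partitions = match.group("partitions").split(",")
            results.append((int(match.group("job_id")), partitions))
    return results


if __name__ == "__main__":
    sample = [
        "[2014-10-31T22:06:56.198] error: Invalid partition "
        "(teambond,idle,desktopBigMem) for job 2202158",
    ]
    for job_id, partitions in find_invalid_partition_jobs(sample):
        print(job_id, partitions)
```

The resulting job IDs can then be compared against accounting records to see which jobs need to be resubmitted.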
Comment 1 Brian Christiansen 2014-11-05 07:42:07 MST
I've been able to reproduce this if I remove a partition from the slurm.conf and reconfigure. If you use "scontrol delete PartitionName=<partition>", there is logic to prevent the partition from being deleted if the partition is in use by any job.

I'm looking into a solution for not canceling the jobs on a restart/reconfigure.
Comment 2 Brian Christiansen 2014-11-06 07:36:33 MST
The solution to prevent the jobs from being canceled will be considered for 15.08, since it's a large project. The solution for now is to use "scontrol delete PartitionName=<partition>" before removing a partition from the slurm.conf and restarting. This will help ensure that there are no jobs referencing the partition at removal.

The FAQ has been updated with this information as well.
https://github.com/SchedMD/slurm/commit/2f4a18fa466ee55baa205390f154e75269074b5f
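
As a pre-check before running "scontrol delete PartitionName=<partition>", it can help to see which jobs still reference the partition. The following is a rough sketch, not an official tool. It assumes that `squeue -h -o "%i %P"` prints one job per line as a job ID followed by its partition, and that a pending job submitted to several partitions shows them as a comma-separated list. The function names are hypothetical.

```python
import subprocess


def jobs_referencing_partition(squeue_output, partition):
    """Return job IDs whose partition list contains `partition`.

    `squeue_output` is expected to hold lines of "<job_id> <partitions>",
    where <partitions> may be a comma-separated list (an assumption).
    """
    job_ids = []
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        job_id, partition_list = fields
        # Compare whole names so "team" does not match "teambond".
        if partition in partition_list.split(","):
            job_ids.append(job_id)
    return job_ids


def query_squeue():
    """Run squeue without a header, printing job ID and partition per job."""
    completed = subprocess.run(
        ["squeue", "-h", "-o", "%i %P"],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


if __name__ == "__main__":
    # "teambond" is the partition name taken from the log in this ticket.
    for job_id in jobs_referencing_partition(query_squeue(), "teambond"):
        print(job_id)
```

If the list is empty, the partition should be safe to delete and then remove from the slurm.conf; otherwise the listed jobs can be moved or left to finish first.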

Let us know if you have any questions.

Thanks,
Brian
Comment 3 Akmal Madzlan 2014-11-09 14:37:21 MST
Thanks Brian