Ticket 4476

Summary: jobs on all clusters in the federation, removing jobs from one cluster
Product: Slurm Reporter: Phil Eckert <eckert2>
Component: Federation    Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN --- QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: sts
Version: 17.11.0   
Hardware: Linux   
OS: Linux   
Site: LLNL

Description Phil Eckert 2017-12-05 15:41:34 MST
I have jobs submitted and running on all clusters in the federation. Is there a way to remove my jobs from a single cluster without having to list them individually?
Comment 1 Isaac Hartung 2017-12-05 16:10:09 MST
Do you want to cancel or requeue these jobs?
Comment 2 Phil Eckert 2017-12-05 16:12:46 MST
Cancel. It appears that I can cancel all jobs or one at a time, but not at the cluster level.
Comment 3 Isaac Hartung 2017-12-05 16:15:28 MST
scancel has a clusters option ("-M" or "--clusters="), which sends the scancel to only that cluster; you could add the username option to cancel all of your jobs there.
Comment 4 Phil Eckert 2017-12-05 16:16:51 MST
I used that, but it cancelled all jobs, not just the cluster I requested.
Comment 5 Phil Eckert 2017-12-05 16:18:41 MST
To be more clear, I used:

scancel -u eckert -M icrm1

and it removed jobs from all the clusters.
Comment 6 Isaac Hartung 2017-12-05 16:19:40 MST
(In reply to Phil Eckert from comment #4)
> I used that, but it cancelled all jobs, not just the cluster I requested.

OK.  I'm working on this and will get back to you as soon as possible.
Comment 7 Phil Eckert 2017-12-05 16:29:18 MST
A guess, because I like to guess ;-)

My guess would be that scancel might be working from the VIABLE_SIBLINGS list when canceling, rather than the ACTIVE_SIBLINGS list.


Another thought: if a job is not active, I would think that scancel should only act on jobs on the origin host, since I'm sure this is going to be confusing to users as it is.

I'm still fairly convinced that jobs should only be viable on the host they are submitted on, and that if a user desires multiple hosts, they would request them.

/g/g0/eckert[22] squeue
JOBID    CLUSTER  ST    ORIGIN          VIABLE_SIBLINGS          ACTIVE_SIBLINGS    TIME NODES         REASON                              NODELIST
67108914   icrm3  PD     icrm1        icrm1,icrm2,icrm3                       NA    0:00    20      Resources                                      
67108915   icrm3  PD     icrm1        icrm1,icrm2,icrm3                       NA    0:00    20       Priority                                      
67108916   icrm3  PD     icrm1        icrm1,icrm2,icrm3                       NA    0:00    20       Priority                                      
67108910   icrm2   R     icrm1        icrm1,icrm2,icrm3                    icrm2    1:44    20           None                     icrm-2-host[1-20]
67108913   icrm3   R     icrm1        icrm1,icrm2,icrm3                    icrm3    1:41    20           None                    icrm-3-host[21-40]
67108912   icrm3   R     icrm1        icrm1,icrm2,icrm3                    icrm3    1:42    20           None                     icrm-3-host[1-20]
67108911   icrm1   R     icrm1        icrm1,icrm2,icrm3                    icrm1    1:42    20           None                    icrm-1-host[21-40]
67108909   icrm1   R     icrm1        icrm1,icrm2,icrm3                    icrm1    1:45    20           None                     icrm-1-host[1-20]
Comment 8 Isaac Hartung 2017-12-05 17:01:47 MST
So, looking at test37.10 in the testsuite, it seems that you are experiencing the expected behavior, since scancel requests are propagated throughout the federation, regardless of which cluster you send them to.

I will research further as to whether this behavior should be modified and/or if there is an alternative, preferable method to cancel jobs only on targeted clusters.

I'll take a look at your suggestions and respond to those as I am able.
Comment 10 Isaac Hartung 2017-12-05 21:52:55 MST
Here is what I have been told:

The federation was designed to connect multiple homogeneous clusters together and make it feel largely like one cluster. Each cluster independently schedules each of the sibling jobs, coordinating with the origin cluster. 

When jobs are submitted to the federation, the origin cluster (the cluster that receives the job submit request) submits sibling jobs to each viable cluster — where viable clusters are: (all clusters in the federation || the subset of clusters requested with --clusters and/or --cluster-constraint). The active clusters are the clusters that have an actual job (a viable cluster could have rejected the job and thus not be an active cluster). Currently only the origin job knows about the active siblings (you can see this by doing squeue from the origin cluster). If the job is revoked on the origin cluster (meaning it isn't a viable cluster or the federated job was started on a sibling cluster) you can see it with the -a option with an squeue to the origin cluster (or an squeue -a --sibling from any cluster in the federation). The origin job needs to stay around to handle requests like starting, updates, cancellations, etc. to the federated job.

Even though the clusters each have a copy of the federated job, they act as one because they are tied to the origin job on the cluster. So if you scancel a federated job all of them are removed. 

The scancel --sibling= option removes the job from the active siblings. If the job is requeued it would still be eligible to run on the viable siblings.

If you want to modify the viable siblings you can use "scontrol update jobid=<jobid> clusters=<clusters>" or "scontrol update jobid=<jobid> clusterfeatures=<features>" (test37.6).
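For example, using job IDs and cluster names from the squeue output above (illustrative only; these commands must run against a live federation):

```shell
# Remove the job only from the given active sibling. If the job is
# requeued it remains eligible to run on any viable sibling:
scancel --sibling=icrm2 67108910

# Shrink the set of viable siblings instead, so the job can only be
# scheduled on the listed clusters:
scontrol update jobid=67108914 clusters=icrm1,icrm3
```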

I hope this clarifies how the federation was designed to work.

I am going to mark this ticket as resolved, seeing as your question has been addressed and no immediate action is going to be taken, though this feature may be added in the future.  Should you have anything further, feel free to post it here and we will respond.
Comment 11 Phil Eckert 2017-12-06 11:30:17 MST
Unfortunately, the way the federation works is very contrary to how we operate. We have many clusters, with different capabilities, and different users with different accounts. We previously had the Moab grid installed, in which jobs would run on the cluster they were submitted on unless a list of clusters was provided, in which case the submission host was not considered unless it was in the list. I mention Moab only because that is what our users were accustomed to.

Something that I think would make the federation more friendly to our needs would be for a job submitted without a cluster option or list to only be considered for the submitting host, and then we would have expected behavior.

Also, a point that I think important regardless, is how to deal with a cluster in the federation going down or becoming unavailable.

If a federation setup is viewed as a "single" cluster, then a user may have no idea which individual cluster in the larger cluster their job is running on. If that individual cluster goes down, the user will not be able to discern what has happened, since the "federation cluster" is still up. Somewhere there needs to be a means of finding out the "last known state" of the jobs, to avoid the confusion that could be caused.
Comment 12 Isaac Hartung 2017-12-07 12:01:27 MST
Squeue can be formatted to show which cluster a job is running on: when a job is running, the cluster named in the active siblings column is the cluster running that job.
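For example, a format string along these lines shows the origin and sibling information per job (the cluster, origin, siblingsviable, and siblingsactive field names mirror the federated squeue columns shown earlier; verify them against squeue(1) for your Slurm version):

```shell
squeue --Format=jobid,cluster,statecompact,origin,siblingsviable,siblingsactive
```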

I am unfamiliar with the mechanisms (if they exist) that are in place to handle a federated cluster failure, but I will do some research and get you an answer as I am able.   

The federation is the result of sponsored development aimed at satisfying a specific set of needs.  Having met those, we are open to enhancement requests.  I will reopen this bug as a severity 5.
Comment 14 Isaac Hartung 2017-12-07 12:36:34 MST
(In reply to Phil Eckert from comment #11)
> Unfortunately, the way the federation works is very contrary to how we
> operate. We have many cluster, with different capabilities, different users
> with different accounts. We previously had the Moab grid installed, which
> jobs would run on the cluster they were submitted on, unless a list of
> clusters were provided, of which the submission host was not considered
> unless it was in the list. I mention Moab, only because that is what our
> users were accustomed to.
> 
> Something that I think would make the federation more friendly to our needs
> would be for a job submitted without a cluster option or list to only be
> considered for the submitting host, and then we would have expected behavior.


This might be accomplished with the job_submit plugin.  You could catch any job without an explicit cluster list and add the -M or --clusters=<submission host> option to it, allowing it to run on only that cluster.
Comment 16 Phil Eckert 2017-12-07 12:50:14 MST
While we could write a plugin, it would be more manageable to have a slurm.conf setting (hint hint). That way it would be part of the slurm.conf documentation. I say this because I believe that more sites than just LLNL would desire this option.
Comment 17 Isaac Hartung 2017-12-07 13:50:38 MST
Understood.  The bug has been reopened as an enhancement request and will be addressed at some future point.  Until then I hope the plugin solution proves sufficient.
Comment 19 Phil Eckert 2018-02-08 13:19:18 MST
For the lua plugin, could you please tell me which structure elements contain the "--clusters=<name>" and "-M <name>" data?

Thank you
Phil
Comment 20 Brian Christiansen 2018-02-08 13:51:16 MST
e.g.
function slurm_job_submit(job_desc, part_list, submit_uid)
	if job_desc ~= nil and job_desc.clusters ~= nil then
		slurm.log_user("Clusters: " .. job_desc.clusters .. "\n");
	end
	return slurm.SUCCESS
end

c1$ sbatch --wrap="hostname" -Mc1,c2
sbatch: Clusters: c1,c2
Submitted batch job 67317899 on cluster c1
Comment 21 Phil Eckert 2018-02-09 13:08:34 MST
I thought I sent this yesterday, but I am not seeing it. How do I determine the cluster name using lua?
Comment 22 Brian Christiansen 2018-02-09 14:18:49 MST
I don't see it exported to the lua interface. But since the script is being run by the controller, it could be hard-coded in the lua script to the cluster associated with the script.
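A minimal sketch of that approach, combining the hard-coded cluster name with the default-to-submission-host idea from comment 14 (the name "c1" is a placeholder; hard-code the ClusterName of the cluster whose slurmctld runs this script):

```lua
-- job_submit.lua: jobs submitted with no explicit -M/--clusters list
-- are restricted to the local cluster.
-- "c1" is a placeholder; set it to this cluster's ClusterName.
local LOCAL_CLUSTER = "c1"

function slurm_job_submit(job_desc, part_list, submit_uid)
	if job_desc.clusters == nil then
		-- No cluster list given: keep the job on the local cluster.
		job_desc.clusters = LOCAL_CLUSTER
	end
	return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, submit_uid)
	return slurm.SUCCESS
end
```

Jobs that do pass -M/--clusters keep their requested list, so users can still opt into multiple clusters explicitly.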