| Summary: | two questions about Native Slurm features | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | James Botts <jfbotts> |
| Component: | Documentation | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | brian, da |
| Version: | 15.08.x | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | | |
Description
James Botts
2015-03-31 07:53:49 MDT
Comment 1, Danny Auble:

Hey James, the answer to your first question is yes.

My guess is that you want the backup to accept jobs, just not run them. Is that correct? If so, there are two ways to make it work: one requires changes to the code, and the other requires an admin to run a few simple commands. In both cases you would run the backup slurmctld on a node outside the Cray, as you want, making sure it can talk to the primary slurmctld just as on other clusters.

The admin-command option: when the Cray system is down, use scontrol to mark the partitions or nodes down. If the system really is gone, the slurmds won't be responding in the first place, so this will happen automatically, as it would on a normal cluster. In that case the admin will have to resume the nodes when the primary controller comes back online, unless you have the ReturnToService flag set in slurm.conf.

The other way is to change the code, as was done for ALPS. I don't think this is needed, since the first method should work as it does on other systems.

Let me know if this does what you would expect. I have an email out to Cray about CCM; I'll update the bug when I have more information.

Comment 2, James Botts:

(In reply to Danny Auble from comment #1)

Hi Danny,

We can retire #2 (the list of CCM features that will not be supported in native Slurm), as Cray has stated that CCM won't be ready for native Slurm until some time in 2016. This is disappointing, but fortunately we don't have to rely on Cray: Doug Jacobsen has created a framework to enable passwordless SSH between the nodes Slurm allocates to a job, which appears to be all that we need to run codes like Gaussian.

As to #1, let me rephrase what we have in mind and what I have found so far.

We currently run the Moab/Torque batch system server daemons for our Cray systems on external servers. When maintenance is done on the inside of the Cray, the sdb, all the service nodes, compute nodes, etc. are down, but users can still submit jobs, which queue up with the pre-existing jobs that have not yet run. Then, when the inside is booted, after health checks, the job queues are opened.

We would like this functionality in native Slurm. I tried having an external node as the backup controller, but in the Cray environment this fails once that node becomes the primary: slurmctld dies because it cannot talk to the sdb. This is a result of having SelectType=select/cray, which is required in the Cray environment.
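For reference, a minimal sketch of the external-backup setup being described. The controller hostnames mom and alva01 appear in the log output later in this bug; the StateSaveLocation path is purely illustrative:

```
# slurm.conf (excerpt, illustrative)
ControlMachine=mom                        # primary slurmctld, inside the Cray
BackupController=alva01                   # backup slurmctld, external to the Cray
SelectType=select/cray                    # required in the Cray environment
StateSaveLocation=/shared/slurmSaveState  # shared by both controllers (path hypothetical)
```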
Now, since the files in the pre-existing, shared slurmSaveState directory have a format that depends on the SelectType, it appears to me that one can't use a different SelectType on the outside, even if one didn't want to run any jobs. Am I missing something on how to do this?

Thanks, James

Comment 3, Danny Auble:

(In reply to James Botts from comment #2)

Regarding CCM, that is my understanding as well. I agree it is sad; I don't know if there is much we can do about it, though.

As to whether you are missing something: I don't think so, James. I think what you are looking for is the way the Slurm backup works on a Slurm+ALPS system, where the backup only allows jobs to be submitted, but runs nothing until the primary comes up. If that is what you are looking for, we don't believe it would be hard to put this logic in the code.

Another option would be to down all your nodes or partitions during the maintenance period. The most common way to handle maintenance windows is to create a maintenance reservation on the entire system. This prevents new jobs from being run while still allowing jobs to be submitted.

As you say, though, with select/cray the code today fails because of the ALPS-related code. That code could probably be skipped when running on an external node. Perhaps we could make the code check whether it is on a node that knows about the Cray libraries before making those calls.

What is the exact failure when the backup fails on the external node?

Comment 4, James Botts:

Hi Danny,

Thanks for the quick response. On the external node (alva01), we get:

[2015-05-29T11:44:29.294] error: ControlMachine mom not responding, BackupController alva01 taking over
[2015-05-29T11:44:29.294] Terminate signal (SIGINT or SIGTERM) received
[2015-05-29T11:44:29.616] layouts: no layout to initialize
[2015-05-29T11:44:29.617] layouts: loading entities/relations information
[2015-05-29T11:44:29.619] Recovered state of 14 nodes
[2015-05-29T11:44:29.622] Recovered state of 12 partitions
[2015-05-29T11:44:29.624] Recovered information about 0 jobs
[2015-05-29T11:44:29.624] error: (select_cray.c: 1536: select_p_node_init) Could not get system topology info: src/lib/alpscomm_sn/topology.c:128 Couldn't connect to the sdb
[2015-05-29T11:44:29.624] fatal: failed to initialize node selection plugin state, Clean start required.

I'll look through the code as well with your hints. One of the great advantages of open source!
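(For reference, the two interim approaches suggested above would amount to roughly the following; node names and reservation parameters are illustrative, not from this bug:)

```shell
# Option 1: mark the Cray nodes down while the inside is offline
scontrol update NodeName=nid000[00-13] State=DOWN Reason="Cray maintenance"
# ...and resume them once the primary controller is back
scontrol update NodeName=nid000[00-13] State=RESUME

# Option 2: a system-wide maintenance reservation; jobs can still be
# submitted, but none will start during the reservation window
scontrol create reservation ReservationName=maint StartTime=now \
    Duration=240 Flags=maint,ignore_jobs Nodes=ALL Users=root
```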
Thanks, James

Comment 5:

I'll take a look and get you a patch to try.

Comment 6, Brian Christiansen:

We have an idea of how to do this. We added a "no_backup_scheduling" SchedulerParameter. The parameter stops the backup controller from scheduling new jobs when it takes over; new jobs can still be submitted while the backup is in control. For a Cray native Slurm setup, it also stops the backup from trying to do anything Cray-related, so that it can run on a node external to the Cray.

This was added to 15.08, but can be back-patched to 14.11:

https://github.com/SchedMD/slurm/commit/f9d132fc927266475277c7e35888686d41a59cdb
https://github.com/SchedMD/slurm/commit/5671bde20ff929dc8dfb64ebb6bea742dddc7a30

Are you able to try this on your external setup?

Thanks, Brian

Comment 7, Brian Christiansen:

Closing the bug for now. Please reopen if you have any further questions or find anything in your testing.

Thanks, Brian
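Assuming the patch referenced above is applied, enabling the behavior it describes should just be a matter of adding the parameter to slurm.conf on both controllers (a sketch; combine with any SchedulerParameters values already in use):

```
# slurm.conf (excerpt, illustrative)
SchedulerParameters=no_backup_scheduling
```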