Ticket 689

Summary: advice for running in production
Product: Slurm
Reporter: Stuart Midgley <stuartm>
Component: Other
Assignee: David Bigagli <david>
Status: RESOLVED INFOGIVEN
Severity: 5 - Enhancement
Priority: ---
CC: da
Version: 14.03.0
Hardware: Linux
OS: Linux
Site: DownUnder GeoSolutions

Description Stuart Midgley 2014-04-09 11:42:22 MDT
Morning

I was after some advice for running in production.  While I've been pushing these test workloads through the queue, I've been happily stopping and restarting the slurmctld and slurmd's.  Is there any issue with this?  For example, under SGE, if you restart the mom on the client node, you effectively lose control of the jobs.  When the mom restarts, it is able to track the jobs, but is much less reliable at monitoring, killing and suspending running jobs on the node.

Should I ever be concerned about restarting slurmd's or the slurmctld?

The reason I ask is that I'll probably go into production this weekend and will be running from HEAD rather than a released version.  I'm also quite keen to stay up to date with Slurm (rather than get 7 years behind as we have done with SGE).  This will mean a lot more restarting of daemons etc.
Comment 1 David Bigagli 2014-04-09 11:59:13 MDT
You should not worry about restarting daemons. The slurmctld keeps its state in the StateSaveLocation directory and the job information in the SlurmdSpoolDir, so as long as these directories exist you have the information the controller needs
to restore its state.
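
As a quick pre-restart sanity check, one could confirm those directories from slurm.conf. This is only a sketch: the config path and values below are assumptions (a live system can read them with `scontrol show config`), and the fallback sample config exists just so the script runs anywhere.

```shell
#!/bin/sh
# Sketch: confirm the controller's state directories before a restart.
# Path and values are assumptions; on a real system compare against
# the output of `scontrol show config`.
CONF=/etc/slurm/slurm.conf
if [ ! -f "$CONF" ]; then
    # No Slurm config here; fall back to a sample so the sketch runs.
    CONF=$(mktemp)
    cat > "$CONF" <<'EOF'
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
EOF
fi
state_dir=$(sed -n 's/^StateSaveLocation=//p' "$CONF")
spool_dir=$(sed -n 's/^SlurmdSpoolDir=//p' "$CONF")
echo "StateSaveLocation: $state_dir"
echo "SlurmdSpoolDir:    $spool_dir"
```

If either directory were missing or unwritable, a restarted slurmctld would have nothing to restore its state from, so this is worth checking before any planned restart.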

Each job is administered by the slurmstepd daemon. This process is forked from
slurmd and is responsible for monitoring the job, collecting its resource usage, and so on. It is unaffected by a slurmd restart, so restarting slurmd does not affect the job. Should you lose the slurmstepd, then you would lose the job,
but in practice that never happens: slurmstepd is a small and very stable daemon. You can test this scenario yourself.
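
A minimal sketch of that restart test, assuming a systemd-managed slurmd on a node where you can submit jobs and restart services (the sleep job, timings, and restart command are illustrative; the script skips itself where Slurm is not installed):

```shell
#!/bin/sh
# Sketch: submit a throwaway job, restart slurmd, and confirm the job
# survives because slurmstepd keeps running. Assumes systemd and
# sufficient privileges; adjust to your init system and site.
if command -v sbatch >/dev/null 2>&1; then
    # "Submitted batch job N" -> grab N
    jobid=$(sbatch --wrap='sleep 120' | awk '{print $4}')
    sleep 5
    systemctl restart slurmd        # may require root/sudo
    sleep 5
    state=$(squeue -j "$jobid" -h -o '%T')
    echo "job $jobid state after slurmd restart: $state"
    scancel "$jobid"
else
    state="SKIPPED"                 # no Slurm on this machine
    echo "Slurm not installed; skipping live restart test."
fi
```

On a healthy node the job should still report RUNNING after the restart, since only slurmd (not the per-job slurmstepd) was bounced.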

David
Comment 2 Stuart Midgley 2014-04-09 12:11:59 MDT
Cool, thanks.
Comment 3 Moe Jette 2014-04-09 14:22:31 MDT
I would not recommend pulling code from the GitHub master going forward. That code base is under active development. Either pull from the head of the "slurm-14.03" branch as needed, or, even better, just download the latest tagged version from
http://www.schedmd.com/#repos

We do a fair bit of testing before releasing a tagged version and make releases about once each month. That should give you stable and current code.

We do major upgrades about every 8 months. Update to v14.11 late in the year after we've worked out the bugs. If you pull from the GitHub master, the chances of losing state between upgrades are high, as the format of state files and RPCs is unstable and under development. If you wait for a release from us, you should not need to worry about losing jobs or other state.
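
If you do want to follow the maintenance branch by git rather than tarballs, a hedged sketch of that workflow might look like the following. The repository URL and branch name are assumptions; confirm them against http://www.schedmd.com/#repos before relying on them.

```shell
#!/bin/sh
# Sketch: track the "slurm-14.03" maintenance branch (bug fixes only)
# instead of master. Repo URL and branch name are assumptions.
REPO=https://github.com/SchedMD/slurm.git
if git ls-remote --exit-code "$REPO" slurm-14.03 >/dev/null 2>&1; then
    git clone --depth 1 --branch slurm-14.03 "$REPO" slurm-14.03
    result="cloned"
else
    echo "Branch or network unavailable; fetch a tagged tarball instead."
    result="skipped"
fi
echo "result: $result"
```

Tagged releases on a maintenance branch are the safer upgrade points, since (as noted above) state-file and RPC formats only change between major releases.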
Comment 4 Stuart Midgley 2014-04-09 14:25:23 MDT
Yeah, I accept that.  While I've been testing, I've just been using HEAD.

For production, I want to get onto your stable releases as quickly as possible... which may not be possible for a go-live this weekend...  but that is the plan.