| Summary: | advice for running in production | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Stuart Midgley <stuartm> |
| Component: | Other | Assignee: | David Bigagli <david> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | da |
| Version: | 14.03.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Stuart Midgley
2014-04-09 11:42:22 MDT
You should not worry about restarting the daemons. The slurmctld keeps its state in the StateSaveLocation directory and the job information in the SlurmdSpoolDir, so as long as these directories exist you have the information the controller needs to restore its state.

Each job is administered by the slurmstepd daemon. This process is forked from slurmd and is responsible for job monitoring, resource collection and so on. It is unaffected by a slurmd restart, so restarting slurmd does not affect the job. Should you lose the slurmstepd then you would lose the job, but in practice that never happens: slurmstepd is a small and very stable daemon. You can test this scenario, though.

David

Cool, thanks.

I would not recommend pulling code from the github master going forward. That code base is under active development. Either pull from the head of the "slurm-14.03" branch as needed or, even better, just download the latest tagged version from http://www.schedmd.com/#repos

We do a fair bit of testing before releasing a tagged version and make releases about once each month. That should give you stable and current code. We do major upgrades about every 8 months. Update to v14.11 late in the year, after we've worked out the bugs.

If you pull from the github master, the chances of losing state between upgrades are high, as the format of files and RPCs is unstable and under development. If you wait for a release from us, you should not need to worry about losing jobs or other state.

Yeah, I accept that. While I've been testing, I've just been using HEAD. For production, I want to get onto your stable releases as quickly as possible... which may not be possible for a go-live this weekend... but that is the plan.
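The advice on restarting daemons depends on the StateSaveLocation and SlurmdSpoolDir directories existing. As a rough illustration only, the following Python sketch reads slurm.conf-style text and reports which of the two directories are not present. The helper names `parse_slurm_conf` and `missing_state_dirs` and the sample paths are hypothetical, not part of Slurm. It also simplifies: a key that is absent from the file is reported as missing even though Slurm would fall back to a built-in default, and included files and path substitutions are not handled.

```python
import os


def parse_slurm_conf(text):
    """Return a dict of lower-cased key -> value from 'Key=Value' lines.

    Hypothetical helper: handles only simple lines and '#' comments.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def missing_state_dirs(settings, is_dir=os.path.isdir):
    """List the state directory settings that are unset or not a directory."""
    wanted = ("StateSaveLocation", "SlurmdSpoolDir")
    missing = []
    for name in wanted:
        path = settings.get(name.lower())
        if path is None or not is_dir(path):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Sample configuration text with made-up paths, for illustration only.
    sample = """
    StateSaveLocation=/var/spool/slurm/state
    SlurmdSpoolDir=/var/spool/slurm/d   # per-node spool directory
    """
    conf = parse_slurm_conf(sample)
    print("Missing before restart:", missing_state_dirs(conf))
```

Running a check like this before a restart (on the controller for StateSaveLocation, on a compute node for SlurmdSpoolDir) is one way to confirm the precondition described above. The `is_dir` parameter exists so the check can be exercised without touching the real filesystem.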