| Summary: | slurmctld state data corrupted. Recovery options? | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Greg Wickham <greg.wickham> |
| Component: | slurmctld | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex |
| Version: | 18.08.1 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | KAUST | Alineos Sites: | --- |
|
Description
Greg Wickham
2018-11-07 22:50:30 MST
Hi Greg, looking at our last shaheen slurm.conf copy I see these:

    SlurmdSpoolDir=/var/spool/slurmd
    StateSaveLocation=/var/spool/slurm/state

I guess the only filesystem that filled up was /var/spool/slurm. Do you preserve a copy of the .old files in the StateSaveLocation? They should correspond to the last state before the slurmctld restart.

When StateSaveLocation becomes full, subsequent attempts to write state to it obviously fail, and messages like these are logged to the slurmctld.log file:

    slurmctld: error: Error writing file /tmp/slurm/state/hash.5/job.20005/environment, No space left on device
    ...
    slurmctld: error: Error writing file /tmp/slurm/state/node_state.new, No space left on device
    ...
    slurmctld: error: Error writing file /tmp/slurm/state/priority_last_decay_ran.new, No space left on device

State files are first written to an <entity_state>.new file. If the filesystem is full, all the state changes between the last successful write and now won't be saved. If the write to <entity_state>.new succeeds, <entity_state>.old is unlink'ed (removed), <entity_state> is rotated to <entity_state>.old, and <entity_state>.new is rotated to <entity_state>. At most, what you can do is start slurmctld from the last valid state. Can you attach slurmctld.log covering the period from before the filesystem filled up until now? Thanks.

Unfortunately we didn't notice early enough that slurmctld had an issue. When slurmctld was restarted it looks like it initialised from scratch (job IDs started at 1) and then killed all the running jobs. As we had lost so much state (long-running jobs had been terminated), we didn't try to recover.

-greg

So do you need anything else, or do you want me to look at your slurmctld.log? Are you back to a stable situation? It's kind of odd that the job IDs wrapped back around to 1. Did you restart slurmctld with specific flags?

Hi Greg. Is there anything else you need from here? Thank you.
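The write-then-rotate scheme described above can be sketched roughly as follows. This is a minimal shell illustration of the pattern, not Slurm's actual C implementation; the `rotate_state` function name is hypothetical, and the file suffixes follow the <entity_state>.new/.old convention from the log excerpts:

```shell
#!/bin/sh
# Sketch (illustrative only) of slurmctld's state-file rotation:
# rotate_state FILE CONTENT writes CONTENT to FILE via FILE.new,
# keeping the previous version as FILE.old.
rotate_state() {
    state="$1"

    # 1. Write the new state to <entity_state>.new first. If the
    #    filesystem is full this fails, and the current state file
    #    is left untouched (so it cannot be corrupted).
    if ! printf '%s' "$2" > "$state.new"; then
        rm -f "$state.new"
        return 1
    fi

    # 2. Drop the previous backup, then rotate:
    #    <entity_state>     -> <entity_state>.old
    #    <entity_state>.new -> <entity_state>
    rm -f "$state.old"
    if [ -f "$state" ]; then
        mv "$state" "$state.old"
    fi
    mv "$state.new" "$state"
}
```

The key property, as described in the comment above, is that the live state file is only ever replaced by a fully written .new file, which is why a full filesystem loses recent changes but should not corrupt the last saved state.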
Hi Alejandro,

I have no further issues, but I do request two bugs/features be created:

1/ sdiag reports the health of slurmctld. Does it show any information indicating that slurmctld had issues writing the state data? We are using 'sdiag' output to provide some information about whether slurmctld is functioning ok (or not). If slurmctld has problems, they should be visible via sdiag (rather than by looking for obscure messages in slurmctld.log).

2/ slurmctld started even though the state data files were corrupt. This is a major issue (bug): when slurmctld started it initialised again, and then when all the slurmds reported back, the running jobs were killed (I expect because slurmctld had no record of them). The result was that all running jobs were terminated, which is why recovering from the previously saved state was futile. If slurmctld has any reason to believe saved state exists but can't be read, it should specifically exit with an error, and then if the admin wants to force slurmctld to initialise there should be a '--init' option.

BTW - you asked how we started slurmctld? The systemd config file we are using has no options (so it's just: <path to slurmctld>/slurmctld).

Thanks,

-greg

(In reply to Greg Wickham from comment #6)
> Hi Alejandro,

Hi Greg,

> I have no further issues, but I do request two bugs/features be created:
>
> 1/ sdiag reports the health of slurmctld. Does it show any information
> indicating that slurmctld had issues writing the state data? We are using
> 'sdiag' output to provide some information about whether slurmctld is
> functioning ok (or not). If slurmctld has problems, they should be visible
> via sdiag (rather than by looking for obscure messages in slurmctld.log).

sdiag is currently a scheduling/RPC load oriented counter/statistics tool. As stated in the man page: "sdiag shows information related to slurmctld execution about: threads, agents, jobs, and scheduling algorithms. The goal is to obtain data from slurmctld behaviour helping to adjust configuration parameters or queues policies."

While it can be useful to identify the current load in terms of RPCs, jobs being scheduled, agent messages and such, I wouldn't say sdiag is intended as a general slurmctld health tool. And no, it currently doesn't show any information related to problems unpacking/packing and/or reading/writing from/to the state.

Relying exclusively upon sdiag output to discern whether everything is fine within slurmctld isn't a good idea. There are a lot of different use cases that could lead to errors that aren't shown in sdiag. It is not a fully "integrated dashboard" or "cockpit" where you can see if anything bad happened while the controller daemon runs, and it doesn't show an associated counter for every single point in the slurmctld code where an error happens.

The StateSaveLocation filesystem being full, and/or the fact that sdiag doesn't show that there are problems writing/reading to/from the state, isn't a bug to me. If you still want to file a separate sev-5 enhancement request asking for counters exposed in sdiag showing problems reading/writing from/to the state, feel free to do so, but I don't see any real benefit from that, and we can't commit to if/when we would address it unless someone is willing to sponsor such work. Periodically looking for error() messages in slurmctld.log is highly advisable, since problems might be triggered in a huge variety of contexts/scenarios.

> 2/ slurmctld started even though the state data files were corrupt. This is
> a major issue (bug): when slurmctld started it initialised again, and then
> when all the slurmds reported back, the running jobs were killed (I expect
> because slurmctld had no record of them). The result was that all running
> jobs were terminated, which is why recovering from the previously saved
> state was futile.
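Since the advice above is to watch slurmctld.log for error() messages rather than rely on sdiag, a site could wire something like the following into its monitoring. This is a hypothetical helper sketched for this ticket, not a Slurm tool; the function name and the grep pattern (matching the "No space left on device" errors quoted earlier) are assumptions:

```shell
#!/bin/sh
# count_state_errors LOGFILE — hypothetical monitoring helper: count
# slurmctld error lines indicating the state filesystem is full, the
# symptom seen in this ticket's slurmctld.log excerpts.
count_state_errors() {
    # grep -c prints the number of matching lines (exit 1 if none).
    grep -c 'error: .*No space left on device' "$1"
}
```

A cron job could run this against the live slurmctld.log and alert when the count is non-zero; the same pattern extends to any other error() message a site considers actionable.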
First of all, I want to mention that there's a big difference between the StateSaveLocation filesystem being full and the state files being corrupt. When the state fs is full, new information can't be written, but this doesn't mean that the state files written beforehand are corrupt. In fact, if I emulate a full state fs like I did in comment 2, and as I've just done again now, then when I restart the ctld, the pending and running jobs that were there before the restart are still there after the restart.

Writing the state is effectively an atomic operation: either it fully completes or nothing is written. It's not obvious to me how a full fs can lead to corrupt state files, since, as I mentioned, the ctld first writes to a .new file, and _once_ it is fully written, the state and the .old file are rotated; if the .new file can't be written, the current state isn't touched, so it can't be corrupted.

I'm still curious to see your slurmctld.log from before the time the StateSaveLocation first filled up; theoretically, error() messages like those in comment 2 should have been logged from then until after the slurmctld restart. I think if your jobs got killed there are other reasons for that, not just that the state fs was full.

> If slurmctld has any reason to believe saved state exists
> but can't be read, it should specifically exit with an error, and then if
> the admin wants to force slurmctld to initialise there should be a '--init'
> option.

If there are errors reading from the state, slurmctld is aware of them, and such errors aren't ignored unless slurmctld is started with the -i option:

https://slurm.schedmd.com/slurmctld.html#OPT_-i

> BTW - you asked how we started slurmctld? The systemd config file we are
> using has no options (so it's just: <path to slurmctld>/slurmctld).

I asked this, among other things, to see if you used the -i or any similar option.

> Thanks,
>
> -greg

Please let me know if all of this makes sense to you and/or if you have further questions.
If you still want me to look at your slurmctld.log, please attach it for the time range mentioned. Thanks.

(In reply to Alejandro Sanchez from comment #7)
> When the state fs is full, new information can't be written, but this
> doesn't mean that the state files written beforehand are corrupt. In fact,
> if I emulate a full state fs like I did in comment 2, and as I've just done
> again now, then when I restart the ctld, the pending and running jobs that
> were there before the restart are still there after the restart.

I want to clarify the previous statement further. New jobs can't be submitted once the fs is full:

    alex@polaris:~/t$ sbatch --wrap "sleep 9999"
    sbatch: error: Batch job submission failed: I/O error writing script/environment to file
    alex@polaris:~/t$
    slurmctld: error: Error writing file /tmp/slurm/state/hash.1/job.20011/environment, No space left on device
    slurmctld: _slurm_rpc_submit_batch_job: I/O error writing script/environment to file

If, after the last successful state write, a pending (PD) job starts running, that change of job state can't be reflected in the state files because the fs is full. Upon restart we will then see this logged in slurmd.log:

    slurmd: debug: _fill_registration_msg: found apparently running job 20012
    ...
    slurmd: debug2: Processing RPC: REQUEST_ABORT_JOB
    slurmd: debug: _rpc_abort_job, uid = 1000
    slurmd: debug: task_p_slurmd_release_resources: affinity jobid 20012
    slurmd: debug: credential for job 20012 revoked
    slurmd: debug2: container signal 997 to job 20012.4294967294
    slurmd: debug2: set revoke expiration for jobid 20012 to 1542721726 UTS

And this in slurmctld.log:

    slurmctld: error: Orphan JobId=20012 StepId=Batch reported on node compute2

so after the restart such a job goes back to the PD state.

Hi Greg, I am lowering the severity of this issue down to 3 based on comment #6, and I will let Alex follow up with you a bit later.

Hi Jason, Alejandro,

Thanks - ok.
We've since upgraded slurmctld to 18.08.3 and put measures in place to ensure we are alerted if the state save location gets anywhere close to full again.

Also, 18.08.1 had some issues: we had 100K+ "error:" messages in the slurmctld log per day, but on 18.08.3 that seems to have abated.

Let me know if there's anything more you need from us regarding this ticket.

-greg

(In reply to Greg Wickham from comment #10)
> Hi Jason, Alejandro,
>
> Thanks - ok.
>
> We've since upgraded slurmctld to 18.08.3 and put measures in place to
> ensure we are alerted if the state save location gets anywhere close to
> full again.

Ok, great.

> Also, 18.08.1 had some issues: we had 100K+ "error:" messages in the
> slurmctld log per day, but on 18.08.3 that seems to have abated.
>
> Let me know if there's anything more you need from us regarding this ticket.
>
> -greg

Sounds good. Thanks for your feedback. I'm tagging this as infogiven.
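The "alert before the state save location fills up" measure Greg mentions could look something like the following. This is a hypothetical sketch, not anything KAUST or SchedMD published; the function name, the directory, and the threshold are all site choices:

```shell
#!/bin/sh
# check_statesave DIR PCT_LIMIT — hypothetical monitoring helper: warn
# when the filesystem holding DIR is at or above PCT_LIMIT percent usage.
# In practice DIR would be the configured StateSaveLocation
# (e.g. /var/spool/slurm/state in this ticket).
check_statesave() {
    # POSIX df -P keeps each filesystem on one line; column 5 is the
    # usage percentage (e.g. "42%"), which we strip to a bare number.
    pct=$(df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    if [ "$pct" -ge "$2" ]; then
        echo "WARNING: filesystem holding $1 is ${pct}% full"
        return 1
    fi
}
```

Run from cron (or a monitoring agent) with a conservative threshold such as 80, this catches the condition well before slurmctld starts logging "No space left on device" errors.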