| Summary: | slurmd log file output corruption | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | slurmd | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 5 - Enhancement | | |
| Priority: | --- | CC: | jbooth, tim |
| Version: | 17.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | GSK | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | ? | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
GSK-ONYX-SLURM
2018-07-25 04:32:38 MDT
That does look annoying. I'll look into it! :)

Hey Mark,

Could you send more example output demonstrating this issue? I noticed that the log lines get interrupted after the 33rd character, which is very curious; I would have expected it to be more random than that. Do you have any more info regarding when exactly this happens, or how I can duplicate this issue?

Thanks,
Michael

Can you also tell me how frequently you see this issue?

Hi. Yesterday afternoon we ran around 20,000 jobs and saw:

```
[root@uk1salx00553 slurm]# grep -F ".[" slurmd-uk1salx00553.log | more
[2018-07-26T15:40:46.579] [34009.[2018-07-26T15:40:46.579] [34054.batch] task/cgroup: /slurm/uid_62356/job_34054: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-26T16:29:47.087] [34641.[2018-07-26T16:29:47.087] [34631.batch] done with job
[2018-07-26T16:40:42.075] [35009.[2018-07-26T16:40:42.075] [34978.batch] done with job
[2018-07-26T16:49:27.675] [36005.[2018-07-26T16:49:27.675] [36048.batch] task/cgroup: /slurm/uid_62356/job_36048: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-26T16:57:26.112] [36136.[2018-07-26T16:57:26.112] [36137.batch] done with job
[2018-07-26T16:57:26.412] [36143.[2018-07-26T16:57:26.412] [36205.batch] task/cgroup: /slurm/uid_62356/job_36205: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-26T17:23:20.297] [36981.[2018-07-26T17:23:20.297] [36977.batch] done with job
[2018-07-26T18:14:46.867] [38506.[2018-07-26T18:14:46.867] [38509.batch] done with job
[2018-07-26T19:46:17.288] [41068.[2018-07-26T19:46:17.288] [41070.batch] done with job
```

The output from the 23rd/24th July is similar:

```
[root@uk1salx00553 slurm]# grep -F ".[" slurmd-uk1salx00553-25Jul2018.log | more
[2018-07-23T20:42:12.329] [10759.[2018-07-23T20:42:12.329] [10754.batch] done with job
[2018-07-24T14:26:00.008] [13361.[2018-07-24T14:26:00.008] [13411.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T19:41:07.165] [20193.[2018-07-24T19:41:07.166] task_p_slurmd_batch_request: 20244
[2018-07-24T20:46:17.696] [23567.[2018-07-24T20:46:17.696] [23582.batch] done with job
[2018-07-24T20:50:39.015] [24023.[2018-07-24T20:50:39.015] [24080.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T21:05:33.494] [25558.[2018-07-24T21:05:33.494] [25595.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T21:12:35.593] [26377.[2018-07-24T21:12:35.593] [26385.batch] done with job
[2018-07-24T21:17:00.815] [26929.[2018-07-24T21:17:00.815] [26928.batch] done with job
[2018-07-24T21:19:46.381] [27254.[2018-07-24T21:19:46.381] [27246.batch] done with job
[2018-07-24T21:21:58.198] [27494.[2018-07-24T21:21:58.198] [27478.batch] done with job
[2018-07-24T21:23:34.607] [27685.[2018-07-24T21:23:34.607] [27694.batch] done with job
[2018-07-24T21:25:01.718] [27858.[2018-07-24T21:25:01.718] [27835.batch] done with job
[2018-07-24T21:26:02.230] [27982.[2018-07-24T21:26:02.230] Launching batch job 28024 for UID 62356
[2018-07-24T21:26:19.820] [28021.[2018-07-24T21:26:19.820] [28013.batch] done with job
[2018-07-24T21:30:15.926] [28406.[2018-07-24T21:30:15.926] [28412.batch] done with job
[2018-07-24T21:35:08.732] [28996.[2018-07-24T21:35:08.732] [28999.batch] done with job
[2018-07-24T21:37:59.156] [29347.[2018-07-24T21:37:59.156] [29341.batch] done with job
[2018-07-24T21:42:01.533] [29815.[2018-07-24T21:42:01.533] [29784.batch] done with job
[2018-07-24T21:44:23.917] [30077.[2018-07-24T21:44:23.917] [30071.batch] done with job
[2018-07-24T21:53:13.065] [30944.[2018-07-24T21:53:13.065] [30938.batch] done with job
[2018-07-24T21:53:43.007] [31007.[2018-07-24T21:53:43.007] [31005.batch] done with job
```

Thanks.
Mark.
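The truncation pattern in the grep output above — a line cut off right after its `[jobid.` prefix, with a complete line from another job step spliced in — is what unsynchronized appends from multiple processes to a shared log file can produce. Below is a minimal C sketch of one plausible mechanism; it is illustrative only, not Slurm source code, and the file name `demo.log` and the two-write `log_line()` helper are assumptions made for the demonstration:

```c
/* interleave_demo.c — hypothetical sketch of how two processes appending
 * to one log file can interleave their output. Not Slurm source code. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* One log line emitted as two separate write() calls: the gap between
 * them is where another process's output can land. */
static void log_line(int fd, const char *prefix, const char *body)
{
    (void)write(fd, prefix, strlen(prefix));
    (void)write(fd, body, strlen(body));
}

int main(void)
{
    /* Both processes append to the same file, much as slurmd and the
     * per-job step daemons share a single SlurmdLogFile. */
    int fd = open("demo.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (fork() == 0) {
        for (int i = 0; i < 10000; i++)
            log_line(fd, "[2018-07-26T15:40:46.579] [34009.",
                     "batch] done with job\n");
        _exit(0);
    }
    for (int i = 0; i < 10000; i++)
        log_line(fd, "[2018-07-26T15:40:46.579] [34054.",
                 "batch] done with job\n");

    wait(NULL);
    close(fd);
    return 0;
}
```

Each individual write() to an O_APPEND descriptor lands atomically at the end of the file, but a line assembled from several write() calls can be split by another process's output in exactly the way shown in the logs above.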
Hey Mark, it looks like this is a known issue. The reason it hasn't been fixed is that a fix would carry a performance penalty, due to the synchronization steps that would need to take place.

For now, as a workaround, you could switch to using syslog instead and route the slurmd-specific output to a dedicated file. This will prevent messages from interfering with each other, since syslog has built-in synchronization. Just leave SlurmdLogFile unspecified in slurm.conf so that Slurm uses syslog; you may want to set SlurmdSyslogDebug as well. See https://slurm.schedmd.com/slurm.conf.html (a configuration sketch follows at the end of this report).

We're looking into possibly adding a SynchronizeSlurmdLog option to slurm.conf, which would let you choose to trade performance for a synchronized slurmd log. If we do, it wouldn't be slated until at least 19.05. I'm changing this bug to a feature enhancement.

OK, thanks for the info. Please go ahead and close the bug. Thanks. Mark.

Closing out bug.
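For reference, the workaround described above might look like the following, assuming rsyslog is the syslog daemon. The drop-in file path, the destination log file, and the `startswith` filter (which would also catch slurmstepd messages) are illustrative assumptions, not taken from this report:

```
# slurm.conf (sketch): leaving SlurmdLogFile unset makes slurmd log via syslog
#SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdSyslogDebug=info
```

```
# /etc/rsyslog.d/30-slurm.conf (hypothetical path): route slurmd/slurmstepd
# messages to a dedicated file and stop further processing of them
if $programname startswith 'slurm' then /var/log/slurm/slurmd-syslog.log
& stop
```

After a change like this, rsyslog would need to be reloaded and slurmd restarted on the compute nodes for the new logging path to take effect.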