Hi. I am seeing corruption in the slurmd log file. See how other output is getting injected into the "done with job" message:

[2018-07-24T14:25:59.325] Launching batch job 13412 for UID 62356
[2018-07-24T14:25:59.999] [13361.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-07-24T14:26:00.008] [13361.[2018-07-24T14:26:00.008] [13411.batch] task_p_pre_launch: Using sched_affinity for tasks
batch] done with job
[2018-07-24T14:26:00.043] task_p_slurmd_batch_request: 13413
[2018-07-24T14:26:00.043] task/affinity: job 13413 CPU input mask for node: 0x000000000400
[2018-07-24T14:26:00.043] task/affinity: job 13413 CPU final HW mask for node: 0x000000000020
[2018-07-24T19:41:04.036] [20243.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T19:41:07.137] [20193.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-07-24T19:41:07.165] [20193.[2018-07-24T19:41:07.166] task_p_slurmd_batch_request: 20244
batch] done with job
[2018-07-24T19:41:07.166] task/affinity: job 20244 CPU input mask for node: 0x200000000000
[2018-07-24T19:41:07.166] task/affinity: job 20244 CPU final HW mask for node: 0x400000000000
[2018-07-24T19:41:07.167] _run_prolog: run job script took usec=12
[2018-07-24T21:25:01.709] [27884.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T21:25:01.718] [27858.[2018-07-24T21:25:01.718] [27835.batch] done with job
batch] done with job
[2018-07-24T21:25:01.788] [27885.batch] task/cgroup: /slurm/uid_62356/job_27885: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-24T21:25:01.845] [27885.batch] task/cgroup: /slurm/uid_62356/job_27885/step_batch: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-24T21:26:02.226] _run_prolog: run job script took usec=6
[2018-07-24T21:26:02.226] _run_prolog: prolog with lock for job 28024 ran for 0 seconds
[2018-07-24T21:26:02.230] [27982.[2018-07-24T21:26:02.230] Launching batch job 28024 for UID 62356
batch] done with job
[2018-07-24T21:26:02.250] task_p_slurmd_batch_request: 28025
[2018-07-24T21:26:02.250] task/affinity: job 28025 CPU input mask for node: 0x000008000000
[2018-07-24T21:26:02.250] task/affinity: job 28025 CPU final HW mask for node: 0x002000000000

Not a major issue, but it makes the log messy when looking into problems.

Thanks. Mark.
That does look annoying. I'll look into it! :)
Hey Mark,

Could you send more example output demonstrating this issue? I noticed that the log lines get interrupted after the 33rd character, which is very curious; I would have expected the cut-off point to be more random. Do you have any more info on when exactly this happens, or on how I can reproduce it?

Thanks,
Michael
Can you also tell me how frequently you see this issue?
Hi. Yesterday afternoon we ran around 20,000 jobs and saw:

[root@uk1salx00553 slurm]# grep -F ".[" slurmd-uk1salx00553.log | more
[2018-07-26T15:40:46.579] [34009.[2018-07-26T15:40:46.579] [34054.batch] task/cgroup: /slurm/uid_62356/job_34054: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-26T16:29:47.087] [34641.[2018-07-26T16:29:47.087] [34631.batch] done with job
[2018-07-26T16:40:42.075] [35009.[2018-07-26T16:40:42.075] [34978.batch] done with job
[2018-07-26T16:49:27.675] [36005.[2018-07-26T16:49:27.675] [36048.batch] task/cgroup: /slurm/uid_62356/job_36048: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-26T16:57:26.112] [36136.[2018-07-26T16:57:26.112] [36137.batch] done with job
[2018-07-26T16:57:26.412] [36143.[2018-07-26T16:57:26.412] [36205.batch] task/cgroup: /slurm/uid_62356/job_36205: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-26T17:23:20.297] [36981.[2018-07-26T17:23:20.297] [36977.batch] done with job
[2018-07-26T18:14:46.867] [38506.[2018-07-26T18:14:46.867] [38509.batch] done with job
[2018-07-26T19:46:17.288] [41068.[2018-07-26T19:46:17.288] [41070.batch] done with job

The log from the 23rd/24th of July is similar:

[root@uk1salx00553 slurm]# grep -F ".[" slurmd-uk1salx00553-25Jul2018.log | more
[2018-07-23T20:42:12.329] [10759.[2018-07-23T20:42:12.329] [10754.batch] done with job
[2018-07-24T14:26:00.008] [13361.[2018-07-24T14:26:00.008] [13411.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T19:41:07.165] [20193.[2018-07-24T19:41:07.166] task_p_slurmd_batch_request: 20244
[2018-07-24T20:46:17.696] [23567.[2018-07-24T20:46:17.696] [23582.batch] done with job
[2018-07-24T20:50:39.015] [24023.[2018-07-24T20:50:39.015] [24080.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T21:05:33.494] [25558.[2018-07-24T21:05:33.494] [25595.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-24T21:12:35.593] [26377.[2018-07-24T21:12:35.593] [26385.batch] done with job
[2018-07-24T21:17:00.815] [26929.[2018-07-24T21:17:00.815] [26928.batch] done with job
[2018-07-24T21:19:46.381] [27254.[2018-07-24T21:19:46.381] [27246.batch] done with job
[2018-07-24T21:21:58.198] [27494.[2018-07-24T21:21:58.198] [27478.batch] done with job
[2018-07-24T21:23:34.607] [27685.[2018-07-24T21:23:34.607] [27694.batch] done with job
[2018-07-24T21:25:01.718] [27858.[2018-07-24T21:25:01.718] [27835.batch] done with job
[2018-07-24T21:26:02.230] [27982.[2018-07-24T21:26:02.230] Launching batch job 28024 for UID 62356
[2018-07-24T21:26:19.820] [28021.[2018-07-24T21:26:19.820] [28013.batch] done with job
[2018-07-24T21:30:15.926] [28406.[2018-07-24T21:30:15.926] [28412.batch] done with job
[2018-07-24T21:35:08.732] [28996.[2018-07-24T21:35:08.732] [28999.batch] done with job
[2018-07-24T21:37:59.156] [29347.[2018-07-24T21:37:59.156] [29341.batch] done with job
[2018-07-24T21:42:01.533] [29815.[2018-07-24T21:42:01.533] [29784.batch] done with job
[2018-07-24T21:44:23.917] [30077.[2018-07-24T21:44:23.917] [30071.batch] done with job
[2018-07-24T21:53:13.065] [30944.[2018-07-24T21:53:13.065] [30938.batch] done with job
[2018-07-24T21:53:43.007] [31007.[2018-07-24T21:53:43.007] [31005.batch] done with job

Thanks. Mark.
Hey Mark,

It looks like this is a known issue: slurmd and the per-job slurmstepd processes all append to the same log file, and their writes are not synchronized, so a message from one process can land in the middle of another's. The reason this hasn't been fixed is the performance penalty from the synchronization steps that would need to take place.

For now, as a workaround, you could switch to syslog and route the slurmd-specific output to a dedicated file. Syslog serializes incoming messages, so they will no longer interfere with each other. To do this, leave SlurmdLogFile unspecified in slurm.conf and Slurm will log to syslog; you may want to set SlurmdSyslogDebug as well. See https://slurm.schedmd.com/slurm.conf.html

We're looking into possibly adding a SynchronizeSlurmdLog option to slurm.conf, which would let you choose to trade performance for a synchronized slurmd log. If we do, it wouldn't be slated until at least 19.05. I'm changing this bug to a feature enhancement.
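In case it's useful, here's a rough sketch of that syslog setup (the rsyslog rule and file paths below are illustrative assumptions; adapt them to your syslog daemon):

  # slurm.conf: leave SlurmdLogFile unset so slurmd logs to syslog,
  # and pick how much detail should reach it.
  #SlurmdLogFile=
  SlurmdSyslogDebug=info

  # /etc/rsyslog.d/10-slurmd.conf (hypothetical path; assumes rsyslog):
  # route slurmd's messages to a dedicated file, then stop processing them.
  if $programname == 'slurmd' then /var/log/slurm/slurmd.log
  & stop

You'd restart slurmd after the slurm.conf change and rsyslog after adding the rule. Depending on how the daemons identify themselves to syslog, a similar rule may be needed for the slurmstepd messages.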
OK, thanks for the info. Please go ahead and close the bug. Thanks. Mark.
Closing out bug.