Hi, we have recently upgraded Slurm from 20.02.7 to 21.08.7 with the CVE-2022-29500_29501_29502-21.08-v2.patch patch provided by Tim. Since then we have noticed that slurmstepd sometimes uses too much CPU. For example:

[root@nia0015 ~]# ps -auxwwf | grep -E "slurmstepd|RSS"
USER        PID %CPU %MEM     VSZ   RSS TTY   STAT START  TIME COMMAND
root      53520  0.0  0.0  112812   984 pts/0 S+   10:22  0:00  \_ grep --color=auto -E slurmstepd|RSS
root     234727 75.0  0.0 1896692 18432 ?     Sl   08:15 95:13 slurmstepd: [7454418.extern]
root     234893  0.0  0.0  272732  3756 ?     Sl   08:15  0:01 slurmstepd: [7454418.batch]

slurmd.log shows that some connections to slurmstepd are timing out:

[2022-05-26T08:15:13.843] [7454418.extern] task/cgroup: _memcg_initialize: job: alloc=175000MB mem.limit=175000MB memsw.limit=unlimited
[2022-05-26T08:15:13.844] [7454418.extern] task/cgroup: _memcg_initialize: step: alloc=175000MB mem.limit=175000MB memsw.limit=unlimited
[2022-05-26T08:15:15.562] Launching batch job 7454418 for UID 3054483
[2022-05-26T08:15:15.591] [7454418.batch] task/cgroup: _memcg_initialize: job: alloc=175000MB mem.limit=175000MB memsw.limit=unlimited
[2022-05-26T08:15:15.591] [7454418.batch] task/cgroup: _memcg_initialize: step: alloc=175000MB mem.limit=175000MB memsw.limit=unlimited
[2022-05-26T08:45:34.822] error: slurm_msg_sendto: address:port=10.20.1.15:33358 msg_type=5020: No error
[2022-05-26T08:56:47.589] error: Timeout waiting for slurmstepd
[2022-05-26T08:56:47.589] error: gathering job accounting: 0
[2022-05-26T09:06:53.607] error: Timeout waiting for slurmstepd
[2022-05-26T09:06:53.607] error: gathering job accounting: 0
[2022-05-26T09:16:15.661] error: Timeout waiting for slurmstepd
[2022-05-26T09:16:15.662] error: gathering job accounting: 0
[2022-05-26T09:27:07.624] error: Timeout waiting for slurmstepd
[2022-05-26T09:27:07.624] error: gathering job accounting: 0
[2022-05-26T09:36:11.682] error: Timeout waiting for slurmstepd
[2022-05-26T09:36:11.682] error: gathering job accounting: 0
[2022-05-26T09:38:41.675] error: /var/spool/slurm/scripts/nodecheck2: exited with status 0x0100
[2022-05-26T09:43:40.110] error: /var/spool/slurm/scripts/nodecheck2: exited with status 0x0100

and our nodecheck script picks up on slurmstepd using too much CPU and drains the node for investigation. Nothing else points to a problem with this job from the slurmctld point of view:

[2022-05-26T08:15:12.424] _slurm_rpc_submit_batch_job: JobId=7454418 InitPrio=2630799 usec=11246
[2022-05-26T08:15:13.413] sched/backfill: _start_job: Started JobId=7454418 in compute on nia0015

This particular job asks for 1 node and 40 tasks per node and uses gnu-parallel to handle the several serial jobs launched. Do you have any idea why slurmstepd is taking so much CPU?

Thanks,
Bruno.
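For context, the kind of check the node health script performs is along these lines (a minimal sketch only; the 50% threshold, the helper name and the message format are illustrative, not the actual nodecheck2 script):

```shell
#!/bin/bash
# Minimal sketch of a nodecheck-style probe: flag any slurmstepd whose
# CPU usage exceeds a threshold. The threshold and helper name are
# illustrative assumptions, not the real nodecheck2 script.
CPU_MAX=50

# Reads "PID %CPU COMMAND" lines on stdin, prints one line per offender.
flag_hot_stepds() {
    awk -v max="$CPU_MAX" '
        $3 ~ /^slurmstepd/ && ($2 + 0) > max {
            printf "slurmstepd pid %s at %s%% CPU exceeds %s%%\n", $1, $2, max
        }'
}

# Feed it live process data; a non-empty result would drain the node.
ps -eo pid=,pcpu=,comm= | flag_hot_stepds
```

A run that sees the slurmstepd from the ps output above would report the 75% extern-step process and stay silent about the idle batch-step one.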
Bruno - I will have Carlos take a look at this and get back to you. In the interim, please attach your current slurm.conf and cgroup.conf.
Created attachment 25248 [details]
slurm.conf and cgroup.conf

I have attached a tarball with slurm.conf and cgroup.conf as requested. The case I reported this morning resolved itself once the job was done. That seems to be an exception, since in the previous cases we had to intervene. I have another case now and I will follow it closely.
Bruno,

Is it possible to execute:

gdb -batch -p [PID] -ex "thread apply all bt full"

for the runaway slurmstepd? Additionally, could you please send us the output of:

strace -p [PID]

for a few seconds, to see if we spot any pattern there? Let's see if we can narrow down the code area where the stepd is misbehaving.

I see "error: Timeout waiting for slurmstepd", and that's mostly the slurmd end complaining that the extern stepd is not answering some message. Possibly the stepd is locked in some section forever.

Thanks,
Carlos.
Looking into the info we already have:

[2022-05-26T08:45:34.822] error: slurm_msg_sendto: address:port=10.20.1.15:33358 msg_type=5020: No error
[2022-05-26T08:56:47.589] error: Timeout waiting for slurmstepd
[2022-05-26T08:56:47.589] error: gathering job accounting: 0

msg_type=5020 is RESPONSE_JOB_STEP_STAT. This RPC is only created in _rpc_stat_jobacct. _rpc_stat_jobacct calls stepd_stat_jobacct, which is the function that logs:

error("gathering job accounting: %d", rc);

stepd_stat_jobacct calls wait_fd_readable, which is the function that logs:

error("Timeout waiting for slurmstepd");

This is the call tree for the error messages; it arises when slurmd receives a REQUEST_JOB_STEP_STAT RPC:

wait_fd_readable(int, int) : int - /slurm-22.05_14152/src/common/fd.c
  stepd_stat_jobacct(int, uint16_t, slurm_step_id_t *, job_step_stat_t *) : int - /slurm-22.05_14152/src/common/stepd_api.c
    _rpc_stat_jobacct(slurm_msg_t *) : void - /slurm-22.05_14152/src/slurmd/slurmd/req.c
      slurmd_req(slurm_msg_t *) : void - /slurm-22.05_14152/src/slurmd/slurmd/req.c

The REQUEST_JOB_STEP_STAT RPC is issued by only 2 callers:

slurm_job_step_stat(slurm_step_id_t *, char *, uint16_t, job_step_stat_response_msg_t * *) : int - /slurm-22.05_14152/src/api/job_step_info.c
  _do_stat(slurm_step_id_t *, char *, uint32_t, uint32_t, uint32_t, uint16_t) : int - /slurm-22.05_14152/src/sstat/sstat.c
    main(int, char * *) : int - /slurm-22.05_14152/src/sstat/sstat.c
XS(int) : int - /slurm-22.05_14152/contribs/perlapi/libslurm/perl/Slurm.c

Please report back whether you use the Perl API or not. I'm mostly sure the issue could be reproduced by issuing an sstat command, but I need to check whether it was you or the user who did that, and how it was issued (the exact sstat command).
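For reference, an sstat invocation along these lines (illustrative only, not necessarily the exact command that was issued; the job id is taken from the log above and the format fields are just examples) would send REQUEST_JOB_STEP_STAT to the node:

```shell
# Illustrative sstat call: querying a running job's steps makes slurmd
# contact the matching slurmstepd, which is the path where "Timeout
# waiting for slurmstepd" gets logged when the stepd does not answer.
if command -v sstat >/dev/null 2>&1; then
    sstat --allsteps -j 7454418 --format=JobID,AveCPU,MaxRSS,NTasks || true
else
    echo "sstat not in PATH: run this on a Slurm login or compute node"
fi
```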
Continuation.

What stepd_stat_jobacct triggers in the slurmstepd is the REQUEST_STEP_STAT request.

On the stepd side, this is handled in the _handle_stat_jobacct function.

As the rogue stepd belongs to the extern step, we can point to a potential issue in this block:

	if (job->step_id.step_id == SLURM_EXTERN_CONT) {
		pid_t *pids = NULL;
		int npids = 0;

		/*
		 * We only have one task in the extern step on each node,
		 * despite many pids may have been adopted.
		 */
		num_tasks = 1;
		proctrack_g_get_pids(job->cont_id, &pids, &npids);

		for (i = 0; i < npids; i++) {
			temp_jobacct = jobacct_gather_stat_task(pids[i]);
			if (temp_jobacct) {
				jobacctinfo_aggregate(jobacct, temp_jobacct);
				jobacctinfo_destroy(temp_jobacct);
			}
			log_flag(JAG, "%s: step_extern cont_id=%"PRIu64" includes pid=%"PRIu64,
				 __func__, job->cont_id, (uint64_t) pids[i]);
		}

		xfree(pids);
	} else {

I can't see any way to make the "for" statement unbounded, so it's something deeper. proctrack_g_get_pids calls this tree because you have ProctrackType=proctrack/cgroup in your slurm.conf:

cgroup_g_step_get_pids(pid_t * *, int *) : int - /slurm-22.05_14152/src/common/cgroup.c
proctrack_p_get_pids(uint64_t, pid_t * *, int *) : int - /slurm-22.05_14152/src/plugins/proctrack/cgroup/proctrack_cgroup.c

which calls cgroup_p_step_get_pids from either cgroup_v1.c or cgroup_v2.c. So I need to know whether your system is running Slurm under cgroup v1 or v2 to properly continue my investigation.

I think I need you to send us *if possible* a reproduction of this error with debug3 and DebugFlags=Cgroup,JobAccountGather set. If you can't, please let us know.

Regards,
Carlos.
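A quick way to tell which cgroup version a node is running (a sketch assuming the standard mount point at /sys/fs/cgroup: the v2 unified hierarchy reports cgroup2fs there, while v1 mounts a tmpfs holding the per-controller hierarchies):

```shell
# Detect the cgroup version from the filesystem type mounted at
# /sys/fs/cgroup (assumes the standard mount point).
fstype=$(stat -fc %T /sys/fs/cgroup)
case "$fstype" in
    cgroup2fs) echo "cgroup v2 (unified hierarchy)" ;;
    tmpfs)     echo "cgroup v1 (legacy hierarchies)" ;;
    *)         echo "unexpected fs type: $fstype" ;;
esac
```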
Created attachment 25255 [details]
gdb backtrace and strace of slurmstepd

I have attached most of the information you asked for so far. The output of strace was just a wait call, doing nothing. I waited a few seconds before cancelling. Let me know if I should wait longer.
(In reply to Carlos Tripiana Montes from comment #6)
> Continuation.
>
> What stepd_stat_jobacct is triggering in the slurmstepd is request
> REQUEST_STEP_STAT.
>
> On the stepd side, this is handled in _handle_stat_jobacct function.
>
> As the rogue stepd belongs to extern step, we can point potential issue in
> this block:
>
> 	if (job->step_id.step_id == SLURM_EXTERN_CONT) {
> 		pid_t *pids = NULL;
> 		int npids = 0;
>
> 		/*
> 		 * We only have one task in the extern step on each node,
> 		 * despite many pids may have been adopted.
> 		 */
> 		num_tasks = 1;
> 		proctrack_g_get_pids(job->cont_id, &pids, &npids);
>
> 		for (i = 0; i < npids; i++) {
> 			temp_jobacct = jobacct_gather_stat_task(pids[i]);
> 			if (temp_jobacct) {
> 				jobacctinfo_aggregate(jobacct, temp_jobacct);
> 				jobacctinfo_destroy(temp_jobacct);
> 			}
> 			log_flag(JAG, "%s: step_extern cont_id=%"PRIu64" includes pid=%"PRIu64,
> 				 __func__, job->cont_id, (uint64_t) pids[i]);
> 		}
>
> 		xfree(pids);
> 	} else {
>
> I can't see any way to make the "for" statement unbound, so it's something
> deeper. proctrack_g_get_pids calls this tree because you have
> ProctrackType=proctrack/cgroup in your slurm.conf:
>
> cgroup_g_step_get_pids(pid_t * *, int *) : int -
> /slurm-22.05_14152/src/common/cgroup.c
> proctrack_p_get_pids(uint64_t, pid_t * *, int *) : int -
> /slurm-22.05_14152/src/plugins/proctrack/cgroup/proctrack_cgroup.c

Just to remind you that I am using Slurm 21.08.7 patched to address the security CVE. I see you seem to be inspecting source code from slurm-22.05_14152.

> Which calls cgroup_p_step_get_pids either from cgroup_v1/2.c. So I need to
> know if your system is running Slurm under Cgroup v1 or v2 to properly
> continue my investigation.

I believe it is cgroup v1, the default on CentOS 7.9. Please take a look at the README file with the info I sent you a few minutes ago.

> I think, I need you to send us *if possible* a reproduction of this error
> having debug3 and DebugFlags=Cgroup,JobAccountGather set.
> If you can't please let us know.

It depends. Can I use these flags localized on one node only? Otherwise I will try to reproduce the problem on our testing system and will get back to you later.

Thanks,
Bruno.

> Regards,
> Carlos.
> Just to remind you that I am using slurm 21.08.7 patched to address the
> security CVE. I see you seem to be inspecting source code from
> slurm-22.05_14152.

I'm targeting 22.05 to see if, once we're able to spot the issue, it's already fixed there. This is a fairly common situation. So I'm following your issue down to 22.05 to see if git blame shows lines with commits whose description points to a fix of a similar issue.

> It depends. Can I use these flags localized on one node only? Otherwise I
> will try to reproduce the problem on our testing system and will get back
> to you later.

If you know how to hit the issue, setting these debug flags on just 1 compute node is enough.

I'm looking into the GDB trace and it seems to be a deadlock on the pthread mutex "task_list_lock". But I'm not 100% sure yet, nor have I seen any commit fixing this issue, or a possible deadlock with this mutex.

So, to summarise, debug logs would be much appreciated to speed up the investigation.

Thanks,
Carlos.
The usage of the mutex task_list_lock is fairly simple and I can't see new commits going around the lock. But from your GDB trace it indeed seems like the threads are stuck on "lock" for that mutex, so some other thread left it locked.

The question now is that I can't see how this could have happened. I guess it might not be a bug about failing to unlock the mutex, but a thread having died without releasing it.

All the threads in the GDB trace that are around the functions that use this lock are blocked. I haven't been able to spot a function that uses it and is not blocked. This suggests a dead thread that, of course, is not present in the GDB trace.

All this reinforces the need for a reproducer with debug logs, so I can see how the story evolves in the logs and whether there's an error that points to a dead thread as well.

I'm still looking into this, so it might happen that I discover something new, but up to now this is what I know.

Cheers,
Carlos.
(In reply to Carlos Tripiana Montes from comment #10)
> > Just to remind you that I am using slurm 21.08.7 patched to address the
> > security CVE. I see you seem to be inspecting source code from
> > slurm-22.05_14152.
>
> I'm targeting 22.05 to see if, when we'll able to spot the issue, it's
> already fixed. This is fairly common situation. So I'm following your issue
> down to 22.05 and see if git blame show lines with commits that their
> description point to a fix of a similar issue.

Ok, that sounds good!

> > It depends. Can I use these flags localized on one node only? Otherwise I
> > will try to reproduce the problem on our testing system and will get back
> > to you later.
>
> If you know how to hit the issue, setting these debug flags in just 1
> compute node is enough.

I tried on our test system with a submission script similar to the user's, but I couldn't reproduce the bug yet. The script I used is very simple though:

###
#!/bin/bash
# SLURM submission script for multiple serial jobs on multiple Niagara nodes
#
#SBATCH --nodes=1
#SBATCH --account=scinet
#SBATCH --ntasks-per-node=40
#SBATCH --time=01:00:00

# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1

module load NiaEnv/2019b gnu-parallel

HOSTS=$(scontrol show hostnames $SLURM_NODELIST | tr '\n' ,)
NCORES=40

parallel --env OMP_NUM_THREADS,PATH,LD_LIBRARY_PATH --joblog slurm-$SLURM_JOBID.log -j $NCORES -S $HOSTS --wd $PWD "echo {}; sleep 600" ::: {001..120}
###

So far slurmstepd is fine, using only 1.0% of CPU.

I guess it won't help if I turn on the debug flags after the user's job triggers this bug, right? If so, then I would have to turn on the slurmd debug flags on the whole cluster and monitor until the next case pops up, right? Will that slow down slurmd considerably and affect job launching performance, or will it just blow up the slurmd.log size?

> I'm looking into the GDB trace and seems a deadlock with the pthread mutex
> "task_list_lock".
> But I'm not 100% sure yet, nor seen any commit fixing this issue, nor a
> possible deadlock with this mutex.
>
> So, to summarise, debug logs would be much appreciated to speed the
> investigation.

Ok, I will check the queue now to see if there is any similar job pending there, and will turn on the debug flags.

> Thanks,
> Carlos.
Unfortunately I can't reproduce the bug on demand. I contacted one of our users who had a job triggering this bug and asked him to resubmit it to a reservation early next week. Hopefully this way, with all debug flags turned on, we can reproduce it.

Thanks!
Bruno.
Created attachment 25319 [details]
slurmd log, strace, gdb backtrace and other relevant info

I created a reservation and asked the user to submit the exact same job that triggered this bug. For the first few minutes slurmstepd seems fine in terms of CPU consumption. Then it starts to slowly use more and more CPU. I collected the slurmd logs with the flags you asked for, dumped the gdb backtrace several times, plus other info that might be relevant. It is all in the attachment posted here.

I cancelled the job after collecting this data, but slurmstepd was not killed by the scheduler and the node was drained with the "Kill task failed" reason. After killing -9 slurmstepd by hand and restarting slurmd, the node came back to normal.

Anyway, I hope this data is helpful. Please let me know how it goes and whether I need to rerun it with other debug flags.

Cheers,
Bruno.
Thanks for the feedback. I'm going to have a look.

The only thing that remains unknown to me is the exact jobscript for this test. I guess it's similar to the snippet in comment 12, but I just want to be sure.

I'll let you know of any discovery asap.

Regards,
Carlos.
I'm seeing evidence of this job spawning a massive number of processes. I'm fairly sure the spawning isn't happening as expected. I'd recommend getting rid of parallel and using srun to spawn the tasks. The job run with srun doesn't have to be MPI or otherwise parallel; srun can act as a mere spawner.

----

I wonder whether it is possible to confirm or test again the following:

At the moment you had to manually kill -9 the extern step, were there processes belonging to the user's execution still going around? How many of those, if any, were still rogue (count their number if you test this again)?

I'm asking this because I see processes like PID:259236 with parent PPID:20240 that are having their accounting gathered *after* the job has been cancelled. So they must have still been alive at that point in time.

----

Finally, I've been able to spot the running thread both in your 1st GDB backtrace and in the new ones. All the JAG threads are stuck waiting, and there's 1 in _get_offspring_data. So to clarify my previous investigation: I was wrong, there's no deadlock.

My worry is that I see a huge number of independent processes that each need a thread to collect their resource usage. I'm fairly sure disabling job accounting gather would make the CPU usage drop to zero, but that's not the solution. The point is more that the stepd CPU usage seems fine if you take into account that it is getting overloaded by a massive number of processes, which need 1 thread each to gather the accounting.

----

To summarize, I'll take the user's job and migrate it from parallel to srun, then see if the issue still happens. I'm not 100% sure, but I feel that would solve the issue.

Thanks,
Carlos.
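As a sketch of what using srun as a mere spawner could look like (illustrative only: the task count, time limit, and file name are placeholders, not the user's actual job), the jobscript is written to a file here so it can be inspected; on a real cluster it would be submitted with sbatch:

```shell
# Sketch of a gnu-parallel-free jobscript using srun purely as a task
# spawner (counts, time limit and file name are placeholder assumptions).
cat > srun_spawner.sbatch <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=01:00:00

export OMP_NUM_THREADS=1

# Each serial task becomes its own job step; with --exact, srun holds
# back extra steps until a core frees up, so at most 40 run at once.
for i in $(seq -w 1 120); do
    srun --exact -N1 -n1 bash -c "echo $i; sleep 600" &
done
wait
EOF
bash -n srun_spawner.sbatch && echo "jobscript parses"
```

Note this loses gnu-parallel's joblog/resume features; it only shows the spawning pattern.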
(In reply to Carlos Tripiana Montes from comment #15)
> Thanks for the feedback. I'm going to have a look at.
>
> The only thing it remains unknown to me is the exact jobscript for this
> test. I guess it's similar to the snippet in Comment 12, but just to be
> sure.

I don't know the gory details of what this user is doing, but I have seen the same problem coming from other users using gnu-parallel. This user's script is roughly the same as the one I mentioned earlier:

#!/bin/bash
# SLURM submission script for multiple serial jobs on multiple Niagara nodes
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=06:00:00
#SBATCH --job-name Y1_OV_64

# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR

# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1

module load NiaEnv/2018a
module load python/3.6.5
source ~/.virtualenvs/python_3_6_5/bin/activate
module load gnu-parallel
module load gcc/7.3.0
module load bedtools/2.27.1
module load r/3.5.1

HOSTS=$(scontrol show hostnames $SLURM_NODELIST | tr '\n' ,)

parallel --env OMP_NUM_THREADS,PATH,LD_LIBRARY_PATH --joblog slurm-$SLURM_JOBID.log -j ${SLURM_TASKS_PER_NODE:0:2} -S $HOSTS --wd $PWD "/path/to/user/executable/command_{}.sh" ::: {1..40}

where the several command_1.sh, etc. are just Python scripts running on the user's data:

python /path/to/user/python_script.py USER_SCRIPT_ARGUMENT /path/to/user/data.bed /path/to/user/another/data.bed

> I'll let you know of any discovery asap.

Thanks!

> Regards,
> Carlos.
(In reply to Carlos Tripiana Montes from comment #16)
> I'm seeing evidences of this job spawning a massive number of processes. I'm
> fairly sure the spawning isn't happening as expected, I'd recommend to get
> rid of parallel and use srun for the spawning of the tasks. It's not
> mandatory the job run with srun is MPI or whatever parallel thing, srun can
> act as a mere spawner.

I understand that srun can act as a spawner; however, it doesn't have the capabilities that the parallel command from the gnu-parallel package (https://www.gnu.org/software/parallel/) has to offer. For example, if you have lots of tasks with varying completion times, gnu-parallel will take care of the load balancing and keep as many cores busy as you have specified. In the case of Niagara, 40 tasks should be running on the 40 physical cores of the compute nodes all the time. The nice thing about this tool is that you can ask to run 4000 tasks 40 at a time, and if you reach the end of the allocation, when you submit your job again it will pick up the tasks that were not completed and restart from there. We have written a wiki page with a few examples: https://docs.scinet.utoronto.ca/index.php/Running_Serial_Jobs_on_Niagara. Maybe srun could adopt most of these features in the future?

> ----
>
> I wonder whether is possible or not to confirm or test again the following:
>
> At the moment you had to manually kill -9 the extern step, there were
> processes belonging to the user execution still going around?

Good question! I don't know the answer. I didn't record the output of ps for all processes.

> How many of those, if any, were still roge (count it number if you test
> this again).
>
> I'm asking this because I see processes like PID:259236 with parent
> PPID:20240, that are having their accounting gathered *after* the job has
> been cancelled. So they must have been still alive at that point in time.
> ----

So maybe they could be stuck at IO? I will have to ask the user to submit again.
Please let me know if you have more ideas on what kind of extra info to collect.

> Finally, I've been able to spot the running thread both in your 1st GDB
> backtrace and in the new ones. All the JAG threads are stuck waiting, and
> there's 1 in _get_offspring_data. So to clarify my previous investigation: I
> was wrong, there's no deadlock.

Ok. What are these JAG threads you are referring to? jobacct_gather threads?

> My worries are that I see a huge amount of independent processes that needs
> a thread to catch it's resources' usage. I'm fairly sure disabling job
> accounting gather will make the cpu usage to just drop to zero, but that's
> not the solution. The question here is more that seems like the stepd cpu
> usage is fine, if you take into account that is getting overloaded by a
> massive number of processes -that needs 1 thread each to get the accounting.
> ----

What is the evidence for a massive number of processes? As I understood it, the user has only 40 tasks launched at a time. If I understood you correctly, that would require 40 extra JAG threads to collect resource consumption, right? Why would that overload slurmstepd?

> To summarize, I'll take the user job, and migrate from parallel to srun.
> Then see if the issue still happens. I'm not 100% sure, but I feel that
> would solve the issue.

We can't really do that because of the missing capabilities in srun. However, it is true that so far this is happening only with gnu-parallel commands, and only after the Slurm upgrade.

Thanks,
Bruno.

> Thanks,
> Carlos.
> For example, if you have lots of tasks with varying time for completion,
> gnu-parallel will take care of the load balance and keep as many cores busy
> as you have specified.
> In the case of Niagara 40 tasks should be running on the 40 physical cores
> of the compute nodes all the time. The nice thing about this tool is that
> you can ask to run 4000 tasks 40 at the time, and if you reach the end of
> the allocation, when you submit your job again, it will pick up the tasks
> that were not completed and restart from them.

OK, that makes sense. You are doing greedy stuff. Now I understand the logs better. Looking at the 2nd bunch of GDB traces, you'll see more than, but close to, 40 threads keeping track of the parallel tasks for the job account gather plugin (yes, JAG means exactly that :)). But if you look at the 1st GDB trace, you'll see quite a few more than ~40. My guess is that, at that time, things got quite a bit more overloaded.

> > At the moment you had to manually kill -9 the extern step, there were
> > processes belonging to the user execution still going around?
>
> Good question! I don't know the answer. I didn't record the output of ps
> for all processes.

This is something I think really needs to be checked. See below.

> So maybe they could be stuck at IO? I will have to ask the user to submit
> again. Please let me know if you have more ideas on what kind of extra info
> to collect.

Yes, this could for sure be a possibility. We need to check why there are still JAG threads looking at the stats (those from /proc) of parallel tasks.

> What is the evidence for a massive number of processes? As I understood the
> user has only 40 tasks launched at the time. If I understood you correctly
> that would request 40 extra JAG threads to collect resource consumption,
> right? Why would that overload slurmstepd?

In fact, you are right for the 2nd bunch of GDB traces. In the logs, you'll see a massive number of threads for the same jobs doing JAG work.
But after looking at the GDB traces, you see the slurmd is consistently around 40 JAG threads. Then, a second look at the logs exposes the PIDs tracked by JAG in "blocks" of around 40 threads.

I was more worried about the 1st GDB trace, where you'll clearly see there were way more than 40 JAG threads. Either the user did something wrong with parallel, or something else bad happened.

All in all, the point that is causing problems is related to things like:

2982:[2022-06-01T14:15:06.639] [7503353.extern] debug: Handling REQUEST_ADD_EXTERN_PID
2983:[2022-06-01T14:15:06.639] [7503353.extern] debug: _handle_add_extern_pid_internal: for StepId=7503353.extern, pid 250376

Every time you see this, a new JAG thread for this PID is created. What I don't really know yet is why this is happening. I wonder whether the parallel job is creating tasks that subsequently get detached from the parallel process. My bet is yes, because the "-S HOSTLIST" flag makes parallel spawn every single task using SSH.

Have a look at the comparison between these 2 commands:

parallel -j 40 "sleep 60" ::: {1..40}
parallel -j 40 -S localhost "sleep 60" ::: {1..40}

The second one exposes the issue I'm pointing out.

If the user is only intending to use 1 node, just wipe out the "-S" flag so Slurm can keep the whole work as part of the batch step. If the SSH in the middle causes the parallel tasks to get detached from the batch step, Slurm tries to catch them all in the extern step to prevent users from running beyond what is allowed, but this causes the extra overhead you see.

In any case, this is the explanation and the workaround to get rid of the 40 JAG threads. I can't really know why at some point the node got so overloaded that slurmd got ~150 JAG threads.

Regards,
Carlos.
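One way to see where a task ended up (a sketch; the step cgroup path layout shown in the comment is what a typical proctrack/cgroup setup on cgroup v1 looks like, and may differ slightly on your nodes) is to read the process's cgroup membership from /proc:

```shell
# Sketch: inspect which cgroup a process belongs to. On a compute node
# with proctrack/cgroup on cgroup v1, a task adopted by the extern step
# shows a freezer path like
#   /slurm/uid_<uid>/job_<jobid>/step_extern
# while a batch-step task shows .../step_batch (path layout is an
# assumption for a typical v1 setup). Here we inspect our own shell as
# a stand-in for a task PID.
pid=$$
grep -H . "/proc/$pid/cgroup"
```

Running this against the PID of a parallel-spawned task would confirm whether the SSH hop moved it into the extern step.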
Hi Bruno,

Please update us whenever possible. It's been more than a week since our last communication.

Regards,
Carlos.
(In reply to Carlos Tripiana Montes from comment #20)
> Hi Bruno,
>
> please, update us whenever possible. It's been more than a week since our
> last communication.
>
> Regards,
> Carlos.

Sorry Carlos, we have maintenance going on at the datacenter this week and I will need an extra day or two to get back to you on this.

Thanks,
Bruno.
(In reply to Carlos Tripiana Montes from comment #19)
> > For example, if you have lots of tasks with varying time for completion,
> > gnu-parallel will take care of the load balance and keep as many cores
> > busy as you have specified.
> > In the case of Niagara 40 tasks should be running on the 40 physical
> > cores of the compute nodes all the time. The nice thing about this tool
> > is that you can ask to run 4000 tasks 40 at the time, and if you reach
> > the end of the allocation, when you submit your job again, it will pick
> > up the tasks that were not completed and restart from them.
>
> OK, that makes sense. You are doing greedy stuff. Now I understand better
> the logs. Looking at the 2nd bunch of GDB traces, you'll see more than, but
> close to, 40 threads keeping track of the parallel tasks for the job
> account gather plugin (yes, JAG means exactly that :)). But if you look at
> the 1st GDB trace you'll see quite more than ~40. My guess is that, at that
> time, things gone quite more overloaded.

But then why didn't it decrease with time?

> > > At the moment you had to manually kill -9 the extern step, there were
> > > processes belonging to the user execution still going around?
> >
> > Good question! I don't know the answer. I didn't record the output of ps
> > for all processes.
>
> This is something I think it really needs to be checked. See below.

I reran your suggestions below, and also did one run with a Python script of mine, and this problem was absent in all 3 cases. Not sure how to reproduce it yet.

> > So maybe they could be stuck at IO? I will have to ask the user to submit
> > again. Please let me know if you have more ideas on what kind of extra
> > info to collect.
>
> Yes, this could be for sure a possibility. We need to check why there's
> still JAG threads looking at the stats (those from /proc) of parallel
> tasks.

Do you see evidence of that in the logs for the new tests? (which I am about to upload)

> > What is the evidence for a massive number of processes?
> > As I understood the user has only 40 tasks launched at the time. If I
> > understood you correctly that would request 40 extra JAG threads to
> > collect resource consumption, right? Why would that overload slurmstepd?
>
> In fact, you are right for the 2nd bunch of GDB traces. In the logs, you'll
> see a massive amount of threads for the same jobs doing JAG work. But after
> looking at the GDB traces, you see the slurmd is consistently around 40 JAG
> threads. Then, a second look at logs exposes the PIDs tracked by JAG in
> "blocks" of around 40 threads.
>
> I was more afraid if the 1st GDB trace where you'll clearly see there were
> way more than 40 JAG threads. Either the user did wrong with parallel, or
> something else bad happened.

Could it be the extra ssh processes used to launch the tasks on the node?

> All in all, the point that is causing problems is related to things like:
>
> 2982:[2022-06-01T14:15:06.639] [7503353.extern] debug: Handling
> REQUEST_ADD_EXTERN_PID
> 2983:[2022-06-01T14:15:06.639] [7503353.extern] debug:
> _handle_add_extern_pid_internal: for StepId=7503353.extern, pid 250376
>
> Every time you see this, a new JAG thread for this PID is created. What I
> don't really know yet is why this is happening. I wonder whether the
> parallel job is creating tasks that subsequently get detached from the
> parallel process. My bet is yes, because the "-S HOSTLIST" flag is making
> parallel to spawn every single task using SSH.

So you think the parallel process is forking itself 40 times?

> Have a look at the comparison between this 2 commands:
>
> parallel -j 40 "sleep 60" ::: {1..40}
> parallel -j 40 -S localhost "sleep 60" ::: {1..40}
>
> The second one exposes the issue I'm pointing out.

Yes, I do see all the ssh connections collected by the sshd daemon. However, I ran both of the commands above, but sleeping for 10 minutes, and I didn't see slurmstepd increasing its CPU consumption.
> IF the user is only intending to use 1 node, just wipe out the "-S" flag so
> Slurm can keep the whole work as part of the batch step. If the SSH is in
> the middle causing the parallel tasks to get detached from the batch step,
> slurm tries to catch them all in the extern step to avoid users from
> running beyond the allowed, but this causes the extra overhead you see.

But why does that load persist after the connection is established? We can definitely advise users to use the -S flag only when using more than one node; however, I didn't notice any difference in slurmstepd CPU consumption.

> In any case, this is the explanation and the workaround to get rid of 40
> JAG threads. I can't really know why at some point the node got so
> overloaded that slurmd got ~150 JAG threads.

Ok, I have attached the results of 3 runs. The first two use the sleep 600 command but differ in using the -S flag or not. I couldn't see any difference between them, beyond the extra ssh processes. The last run uses a simple Python code that just loads 3 2d arrays with random numbers and keeps averaging their values in an infinite loop. I tried to simulate a reasonably large memory footprint per core and a constant use of CPU. I do see a spike in slurmstepd CPU usage at the beginning, but it reduces back to almost zero as time goes on.

The bottom line is that even with a Python script that actually does something, I still couldn't reproduce the problem. Next I will try to add some heavy IO to see if it helps.

Thanks,
Bruno.

> Regards,
> Carlos.
Created attachment 25691 [details]
Comparisons between sleep with and without host, and with a python script

Several output files and logs comparing, first, the influence of the -S flag on the parallel command and on slurmstepd CPU consumption. No increase was found. The third set of output is for a simple Python script keeping a reasonable memory footprint and the CPU busy. Again, no increase in slurmstepd was observed.

I hope you can spot the differences between the runs that worked fine here and the previous one from the user, where slurmstepd ended up slowly using more and more CPU with time.

Thanks,
Bruno.
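For future runs, a simple sampler like this (a sketch; the function name, sample count and interval are arbitrary choices) could record slurmstepd's %CPU over time so the slow ramp-up can be correlated with the slurmd logs afterwards:

```shell
# Sketch: sample a process's %CPU periodically so a slow ramp-up can be
# correlated with slurmd/slurmstepd logs later. The defaults target this
# shell itself for demonstration; in production pass the slurmstepd PID.
sample_cpu() {
    pid=$1 samples=$2 interval=$3
    for _ in $(seq "$samples"); do
        printf '%s %s\n' "$(date +%s)" "$(ps -o pcpu= -p "$pid")"
        sleep "$interval"
    done
}

sample_cpu $$ 3 1    # e.g. sample_cpu <slurmstepd pid> 600 10
```

Each output line is "unix-timestamp  %cpu", which is easy to plot against the log timestamps.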
Hi Bruno,

Sorry for the delay. Overall, I'm still confident that the high number of open JAG threads is due to the extra ssh processes used to launch the tasks. This scenario is prone to causing cpu contention, as there are more threads than cpus. If, for some reason, this coincides with a slowdown in grabbing the proc stats (e.g. due to IO overload), it could lead to your situation. In the 1st gdb trace, with way more than 40 threads, the logs showed JAG threads still getting stats from PIDs of an already cancelled job, so those processes must have been alive at that time; this also points to some IO overload or similar.

The high cpu usage of slurmstepd isn't only related to the ssh spawning but increases over time, and I think it is the JAG thread that is consuming the cycles, computing the energy or doing related accounting work. All in all, I can't confirm this unless we put the process under perf record and look at the cycles report, so I am just betting on the most probable cause. And, most importantly, as you have seen, this isn't something easy to reproduce, so I think this is not the common situation; it must be something that happens only in an overload scenario.

I wonder whether you are willing/able to run perf record when the problem arises in a production job, or whether you have a reproducer job. That would tell us the truth. I didn't want to go down that route until now, as it requires some time from you and I don't know if you have it. Please tell us your options.

Thanks, Carlos.
Hi,

It has been two weeks since my last reply. Please tell us what you think.

Thanks, Carlos.
Hi Carlos,

Yes, I would like to try your suggestion. I am a bit short on time this week and the next, but I could start working on it. Could you please elaborate on your suggestion of using perf?

Thanks, Bruno.
If you spot a slurmstepd consuming a lot of CPU, at that precise moment collect the perf data by doing:

perf record -s --call-graph dwarf -p [PID OF slurmstepd] sleep 600
perf archive perf.data

with the appropriate PID. Then send us both perf.data.tar.bz2 and perf.data.

Thanks, Carlos.
Bruno,

We need to close this bug as timed out. Please reopen it and provide the requested information when possible.

Thanks.
Created attachment 26430 [details]
perf data collected and attached

I have attached the perf data collected from one instance of slurmstepd running with very high cpu usage. This example collects data for a user using R (https://www.r-project.org/) instead of the previously used gnu-parallel program. Please let me know if you find it helpful.

Thanks, Bruno.
Carlos is out of the office on vacation, so I will have Oriol look over this data and let you know what he finds.
Created attachment 26441 [details]
perf data for slurmstepd for a gnu-parallel job

Thank you! I have also attached here data for a recent gnu-parallel job for which slurm seems to have failed to kill slurmstepd; it is currently at 97% cpu usage. Please let me know if this data for both cases is enough for your analysis, so I can bring those nodes back into production.

Thanks, Bruno.
From the compute node, please also gather the following:

sudo lsof > /tmp/lsof_14179.out
sudo pstree -p > /tmp/pstree_14179.out
sudo pldd <PROCESS_ID> >> /tmp/pldd_14179.out
sudo cat /proc/<PROCESS_ID>/maps | awk '{print $6}' | grep '\.so' | sort | uniq >> /tmp/proc_14179.out

Please run the last two commands with the process ids of both slurmstepd and slurmd. Gather the output files and attach them to this issue. You can then return these nodes to the cluster.
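As a quick sanity check of the maps pipeline above (the /tmp file names are just the examples given; here the current shell's own PID stands in for the slurmstepd/slurmd PID), the shared-object listing can be tried on any readable process:

```shell
# List the unique shared objects mapped into a process, as in the
# /proc/<PROCESS_ID>/maps pipeline above; $$ (this shell) stands in
# for the slurmstepd or slurmd PID.
pid=$$
awk '{print $6}' "/proc/$pid/maps" | grep '\.so' | sort | uniq
```

On a typical glibc system this prints one path per mapped .so, which is what the pldd output should roughly agree with.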
Created attachment 26451 [details]
pstree, lsof etc. for slurmd on both nodes

I have attached the output of the commands you asked for, from 2 nodes with the slurmstepd problem. Unfortunately, by the time I collected this data the slurmstepd process had already been killed. I have taken note of it and will submit new data when I observe this problem in the cluster again.

Thanks, Bruno.
Created attachment 26457 [details]
Data collected for a running case

I was able to catch a case while it was still running today. All the data you asked for is attached. I have contacted the user to try to understand what he is doing. Most of his R jobs trigger high cpu usage from slurmstepd, which takes a long time to clear even after the user's job finishes.

Thanks, Bruno.
Hi Bruno,

> Thank you! I have also attached here data for a recent gnu-parallel job which slurm seems to fail to kill slurmstepd and it is currently at 97% cpu usage.
> Please let me know if this data for both cases are enough for your analyses so I can bring those nodes to production again.

Yes, with all the data you've provided I have plenty of information to look into; you can bring those nodes back into production. I've read the whole bug history and I'm going through the logs and perf data you sent. For the moment I can confirm that the cpu usage is concentrated in the JAG threads, so Carlos's idea is confirmed — the issue is there. I will continue to study the perf data as well as all the other information you provided.

In order to rule out points of failure, would you be open to changing jobacct_gather from linux to cgroup and seeing whether the same R job that increased the cpu usage does it again?

Regards.
(In reply to Oriol Vilarrubi from comment #36)
> In order to discard points of failure, would you be open to change to
> jobacct_gather from linux to cgroup and see if the same R job that increased
> the cpu usage does it again?

Do you think it is possible to make this change on the fly? Ideally I would identify the node with the slurmstepd problem, change the configuration just there, and reload slurmd. Would that be possible, instead of changing the configuration for the whole cluster?

I have failed to come up with a test that reproduces this issue so far. A colleague just suggested creating one that spawns new processes and destroys them very fast, as a possible way to reproduce the problem. If you think it is useful and have an idea of a test case for it, please let me know.

Thanks, Bruno.
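Following that suggestion, a minimal process-churn sketch might look like the following (hypothetical code, not the user's actual workload; the burst count is an assumed parameter): it forks trivial children in bursts and reaps them immediately, so that most PIDs sampled by the accounting threads are already gone by the next pass. The assumed usage would be one copy per task inside an sbatch job.

```shell
#!/bin/bash
# Hypothetical reproducer sketch: spawn and destroy processes as fast as
# possible, so the jobacct_gather sampling mostly sees already-dead PIDs.
bursts=${1:-200}                # number of bursts to run (assumed parameter)
count=0
i=0
while [ "$i" -lt "$bursts" ]; do
    for _ in 1 2 3 4 5 6 7 8 9 10; do
        /bin/true &             # trivial child that exits immediately
    done
    wait                        # reap the whole burst before starting the next
    count=$((count + 10))
    i=$((i + 1))
done
echo "spawned $count short-lived processes"
```

Whether this churn rate is enough to push slurmstepd's JAG threads into visible cpu usage is exactly what the test would establish.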
(In reply to Bruno Mundim from comment #37)
> (In reply to Oriol Vilarrubi from comment #36)
>
> > In order to discard points of failure, would you be open to change to
> > jobacct_gather from linux to cgroup and see if the same R job that increased
> > the cpu usage does it again?
>
> Do you think it is possible to do this change on the fly? Ideally I would
> identify the node with the slurmstepd problem and change the configuration
> just there and reload slurmd. Would that be possible instead of a change of
> configuration for the whole cluster?

Yes, it can be done, but it requires the slurmd node to be empty of jobs, as this changes the communication between slurmd and the steps, so if any step were running the communication would fail. Also, if you change the configuration on only one node, add the option NO_CONF_HASH to the DebugFlags parameter in your slurm.conf (https://slurm.schedmd.com/slurm.conf.html#OPT_NO_CONF_HASH); this way you will not get error messages in the log about having different config files.

> I failed to come up with a test to reproduce this issue so far. A colleague
> just suggested to create one that spawns new processes and destroy them very
> fast as a possible way to reproduce the problem. If you think it is useful
> and have an idea of a test case for it, please let me know.

Let me take a deeper look into all the information and logs you provided, as well as into the code, before answering that, as I want to be sure.

Regards.
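For reference, the per-node override described above might look like this fragment (a sketch based on the slurm.conf options mentioned; exact placement depends on how your configuration is managed):

```
# slurm.conf on the test node only:
JobAcctGatherType=jobacct_gather/cgroup

# And wherever config hashes are compared, to silence the
# "different config files" errors while this node diverges:
DebugFlags=NO_CONF_HASH
```

After editing, slurmd on that node needs a restart for the plugin change to take effect.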
Hello Bruno,

First, let me apologize for the late response on this bug. I've been investigating it further and I think I have a lead we can use to isolate the issue, but I want to study the code further before asking you to run any tests. The issue you are seeing is caused by the poll syscall taking more than 300 seconds. As Carlos already mentioned in past comments, this could be caused by lots of processes generating too much IO on the system, but it could also be caused by a non-responsive slurmstepd, as the poll timeout is on the UNIX domain socket used for communication between slurmd and slurmstepd. As the next step, I would like you to switch to the cgroup jobacct_gather plugin, so we can be sure this is a general jobacct_gather issue and not one specific to the linux plugin.

To do so, you will need to create a reservation for one node and, when there are no jobs on it, change the local slurm.conf on that node to use the cgroup job account gather plugin and restart the slurmd daemon; then retest with the user job that tends to reproduce the issue. If we still see the "Timeout waiting for slurmstepd" errors, we know that the problem is not in the plugin but in the generic JAG platform. Meanwhile, I am working on a debug patch that you can apply to this same "reserved" node so we can obtain further information on what is happening.

Thanks, and sorry again for the late response.
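As a toy illustration of that failure mode (not Slurm code: a FIFO stands in for the slurmd-slurmstepd UNIX socket, and a 2-second limit for the 300-second poll timeout), a reader waiting on a peer that never responds eventually gives up with a timeout status:

```shell
# A reader blocks on a pipe whose other end never answers, like slurmd's
# poll on the slurmstepd socket; `timeout` plays the role of the 300 s limit.
d=$(mktemp -d)
mkfifo "$d/stepd.fifo"
status=0
timeout 2 cat "$d/stepd.fifo" || status=$?   # blocks: no writer ever appears
echo "reader exit status: $status"           # 124 means the wait timed out
rm -rf "$d"
```

In the real case the reader (slurmd) then logs "Timeout waiting for slurmstepd", regardless of whether the stepd is stuck or the whole node is IO-bound.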
Hi Oriol,

I am waiting for a user's code to trigger the problem again. Sometimes it happens almost daily, and sometimes we have a long hiatus, like now.

Thanks, Bruno.

(In reply to Oriol Vilarrubi from comment #39)
> Hello Bruno,
>
> First let me apologize for the late response in this bug. I've been
> investigating it further and I think I came up with a lead that we can take
> to further isolate down the issue, but I want to investigate the code
> further before asking you to do any tests. The issue you are seeing is
> caused by the syscall poll taking more than 300 seconds, as Carlos already
> mentioned in past comments this could be caused to having lots of processes
> having too much IO on the system, but it can also be caused to non
> responsive slurmstepd as the poll timeout is in the UNIX filesystem socket
> that is used to communicate slurmd and slurmstepd. As the next step I would
> like you to try to change to the cgroup procktrack plugin in order to be
> sure this is a general jobaccountgather issue not a linux specific one.
>
> In order to do so, you will need to create a reservation for one node, and
> when there are no jobs in it, you need to change the local slurm.conf on
> that node to have the cgroup job account gather plugin and restart the
> slurmd daemon, and then retest with the user job that tends to reproduce the
> issue. If we still see the errors "Timeout waiting for slurmstepd" we know
> that the error is not in the plugin but in the generic JAG platform.
> Meanwhile I am working on a debug patch, so that you can apply to this same
> "reserved" node and we can obtain further information on what is happening.
>
> Thanks, and sorry again for the late response.
Hi Bruno,

Did the user's code trigger the problem again?

Regards.
(In reply to Oriol Vilarrubi from comment #41)
> Hi Bruno,
>
> Did the user's code triggered the problem again?
>
> Regards.

It is happening less often, and when it does I am usually late to the party. Sorry about that.

Thanks, Bruno.
Hi Bruno,

Did the user's code trigger the problem another time? And have you by any chance tried the same thing on a newer slurm version?

Regards.
Hi Bruno,

I am closing this bug, as at the moment this behaviour has not been appearing that often. Also, newer slurm versions include performance improvements in the jobacct_gather code for jobs with many processes that are no longer present at the end of the job. Please do not hesitate to reopen this bug in case you observe this behaviour again.

Regards.