Description
Institut Pasteur HPC Admin
2022-04-15 04:54:03 MDT
Hi,
> Can it be due to the substantial changes in Slurm
> cgroup code to prepare it for cgroup V2 as Felip Moll says in the following
> ticket
> https://bugs.schedmd.com/show_bug.cgi?id=7536 ?
This is very unlikely :)
I understand you are asking for 4GB (with --mem) and you see the processes consuming up to 9GB.
- Is it possible to see how you launched the job?
- Can you upload the slurmd log from the node where the job ran?
- Can you show me the output of the command "cat /proc/mounts" on the node?
- Running dmesg is also a good way to verify whether there have been OOM kills, and for which process and at which memory limit. The kernel OOM log shows the limit applied to the cgroup.
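For example (a minimal sketch, not from the original report: it assumes the cgroup v1 memory controller mounted under /sys/fs/cgroup/memory and the /slurm/uid_<uid>/job_<jobid> hierarchy created by task/cgroup; substitute the real IDs), the OOM events and the limit actually enforced on the job can be cross-checked on the node with:

  # recent kernel OOM messages, with human-readable timestamps
  dmesg -T | grep -i "memory cgroup out of memory"
  # limit applied to the job's memory cgroup, and the peak usage the kernel recorded
  cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.limit_in_bytes
  cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.max_usage_in_bytes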
Created attachment 24585 [details]
script used to launch the job array

Hello,

> I understand you are asking for 4GB (with --mem) and you see the processes
> consuming up to 9GB.

Not in real time (our atop monitoring has a one-minute interval), just in the sacct output.

> - Is it possible to see how you launched the job?

The user wrote a rather simple job script launched in a job array. See the attached batch_script.sh.

> - Can you upload the slurmd log from the node where the job ran?
> - Can you show me the output of the command "cat /proc/mounts" on the node?

Both are attached.

> - Running dmesg is also a good way to verify whether there have been OOM kills, and for which
>   process and at which memory limit. The kernel OOM log shows the limit applied to the cgroup.

Please find attached an excerpt of the kernel logs, which contains lines of this type:

kernel: [426332.241768] Memory cgroup out of memory: Killed process 419423 (nnUNet_train) total-vm:8989768kB, anon-rss:763324kB, file-rss:113076kB, shmem-rss:92236kB, UID:39157

Thanks in advance,

Created attachment 24586 [details]
slurmd.log
Created attachment 24587 [details]
content of /proc/mounts
Created attachment 24588 [details]
kernel logs excerpt
Hi,

The cgroup works as expected. For example, the process you pointed out (419423), which corresponds to step 0 of job 6062068 and uid 39157, had a limit of exactly 4GiB of memory, and it was killed exactly at that limit by the cgroup:

2022-04-12T15:30:14.821431+02:00 maestro-3010 kernel: [426332.241768] Memory cgroup out of memory: Killed process 419423 (nnUNet_train) total-vm:8989768kB, anon-rss:763324kB, file-rss:113076kB, shmem-rss:92236kB, UID:39157
.....
2022-04-12T15:30:19.461128+02:00 maestro-3010 kernel: [426336.914774] Task in /slurm/uid_39157/job_6062068/step_0 killed as a result of limit of /slurm/uid_39157/job_6062068
2022-04-12T15:30:19.493669+02:00 maestro-3010 kernel: [426336.933313] memory: usage 4194304kB, limit 4194304kB, failcnt 769041
2022-04-12T15:30:19.493673+02:00 maestro-3010 kernel: [426336.947373] memory+swap: usage 4194304kB, limit 9007199254740988kB, failcnt 0

Instead, what I think happens is that the stats of the job are incorrect.

Does the nnUNet_train software fork any processes?

Can you repeat the test adding "SlurmdDebug=debug2" and "DebugFlags=JobAccountGather" to slurm.conf (+ reconfig)? Then upload the slurmd logs again, please.

A last question just in case the info I request doesn't give me enough, is there any possibility to apply a debug patch in your environment?

Thanks

> > I understand you are asking for 4GB (with --mem) and you see the processes
> > consuming up to 9GB.
>
> Not in real time (our atop monitoring has a one-minute interval), just in
> the sacct output.
What does your "atop" do exactly with the consumed memory of the job?
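To illustrate the suspected accounting problem while such a job is still running, the sum of per-process RSS (roughly what jobacct_gather/linux aggregates) can be compared with what the memory cgroup itself measures. A rough sketch, not from the thread, assuming cgroup v1 and a step path like the one in the kernel log above (substitute the IDs of the running job):

  CG=/sys/fs/cgroup/memory/slurm/uid_39157/job_6062068/step_0
  # sum VmRSS (kB) over all processes of the step
  for pid in $(cat $CG/cgroup.procs); do
      awk '/^VmRSS:/ {print $2}' /proc/$pid/status
  done | awk '{sum+=$1} END {print "summed per-process RSS:", sum, "kB"}'
  # what the cgroup actually charges the step
  awk '{print "cgroup usage:", $1/1024, "kB"}' $CG/memory.usage_in_bytes

If the processes share a lot of memory (forked workers, shared libraries, shmem), the first number can be much larger than the second, which is the kind of gap that would later show up in sacct.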
Hi, do you have any feedback for me? Thanks!

Hi Felip,

Not yet, but I hope to be able to repeat the test at the beginning of next week.

Thank you for asking and sorry for the delay,

(In reply to Institut Pasteur HPC Admin from comment #9)
> Hi Felip,
>
> Not yet, but I hope to be able to repeat the test at the beginning of
> next week.
>
> Thank you for asking and sorry for the delay,

Ok, no problem. I will be waiting for your testing. Thanks!

Created attachment 25021 [details]
kernel logs excerpt for job 8948024

Hi Felip,

I was able to reproduce the behavior:

$ sacct --format=jobid%15,jobname%15,state%15,exitcode,ncpus%5,reqmem%6,maxrss -j 8948024
          JobID         JobName           State ExitCode NCPUS ReqMem     MaxRSS
--------------- --------------- --------------- -------- ----- ------ ----------
        8948024    nnUNet_train   OUT_OF_MEMORY    0:125     1    4Gc
 8948024.extern          extern       COMPLETED      0:0     1    4Gc      1296K
      8948024.0    nnUNet_train   OUT_OF_MEMORY    0:125     1    4Gc  49763804K

As you noticed previously, the attached kernel logs show that the job was killed by the cgroup "out of memory" handler, as expected. So the reported MaxRSS must be incorrect.

> Can you repeat the test adding "SlurmdDebug=debug2" and
> "DebugFlags=JobAccountGather" to slurm.conf (+ reconfig)?

Unfortunately "DebugFlags=JobAccountGather" doesn't seem to exist in 20.11.7, so the output of slurmd in debug2 mode doesn't provide any relevant piece of information.

> A last question just in case the info I request doesn't give me enough,
> is there any possibility to apply a debug patch in your environment?

We had already planned to upgrade Slurm in June. Now that we are able to reproduce the problem almost at will, maybe we could wait for that upgrade to check whether the phenomenon still exists, and try a debug patch then if you have one?

> Does the nnUNet_train software fork any processes?

Yes, it creates 12 threads.

> What does your "atop" do exactly with the consumed memory of the job?

Nothing, I just meant that with the atop one-minute sampling, we couldn't catch the RSize at the moment the job was killed.

Thanks for your help,

Created attachment 25022 [details]
debug2 mode slurmd logs for job 8948024
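While a reproduction job is still running, the same accounting values that later end up in sacct's MaxRSS can also be watched live with sstat (shown here for the step from the sacct output above), which makes it easier to spot the moment the reported RSS starts to exceed the 4GiB the cgroup actually allows:

  sstat -j 8948024.0 --format=JobID,MaxRSS,AveRSS,MaxRSSTask,MaxRSSNode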
> Unfortunately "DebugFlags=JobAccountGather" doesn't seem to exist in 20.11.7,
> so the output of slurmd in debug2 mode doesn't provide any relevant piece of
> information.

Ahh, my bad. Yes, the info is not very useful.

> > A last question just in case the info I request doesn't give me enough,
> > is there any possibility to apply a debug patch in your environment?
>
> We had already planned to upgrade Slurm in June. Now that we are able to reproduce
> the problem almost at will, maybe we could wait for that upgrade to check
> whether the phenomenon still exists, and try a debug patch then if you have one?

That's an idea, but in the meantime I will try to reproduce it locally. If you can provide me with the process layout, doing a "pstree -psla" on a node where the job is running, I will try to do the same. Ideally I would test nnUNet_train itself, but I cannot run it without CUDA.

> > Does the nnUNet_train software fork any processes?
>
> Yes, it creates 12 threads.

I said that because there is a specific case where I've seen processes becoming orphans and being double-accounted, but it shouldn't apply here. Can you show me the batch script you're using to run the job?

> > > What does your "atop" do exactly with the consumed memory of the job?
> >
> > Nothing, I just meant that with the atop one-minute sampling, we couldn't
> > catch the RSize at the moment the job was killed.

Got it.

Hi Felip,

> That's an idea, but in the meantime I will try to reproduce it locally.
> If you can provide me with the process layout, doing a "pstree -psla"
> on a node where the job is running, I will try to do the same. Ideally
> I would test nnUNet_train itself, but I cannot run it without CUDA.

Here it is:

|-slurmd,711535 -d /usr/local/sbin/slurmstepd
|-slurmstepd,1681028
| |-sleep,1681033 100000000
| |-{slurmstepd},1681029
| |-{slurmstepd},1681030
| |-{slurmstepd},1681031
| `-{slurmstepd},1681032
|-slurmstepd,1681036
| |-nnUNet_train,1681048 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681121 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681151 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681152 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681212 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681213 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681215 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681216 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681217 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681224 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681230 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681231 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681234 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-{nnUNet_train},1681102
| | |-{nnUNet_train},1681103
| | `-{nnUNet_train},1681235
| |-{slurmstepd},1681037
| |-{slurmstepd},1681038
| |-{slurmstepd},1681039
| |-{slurmstepd},1681041
| |-{slurmstepd},1681042
| |-{slurmstepd},1681043
| |-{slurmstepd},1681044
| |-{slurmstepd},1681045
| `-{slurmstepd},1681047

> Can you show me the batch script you're using to run the job?
I just use srun, as the user told me to:

srun -p gpu -q gpu --gres=gpu:1 nnUNet_train 2d nnUNetTrainerV2 500 0

The first time, I had to clone his git repository and create a virtual environment with the pytorch version corresponding to our CUDA version. Before launching srun, I source it, plus a file that sets some environment variables for the input/output directories. Sorry, nothing really useful.

Thanks for your help,

Hi,

I am trying to reproduce this, but I realize I have two slurm.conf files from you and I don't know which one is the most recent. Can you please upload your latest slurm.conf?

Are you using the JobAcctGather linux plugin or the cgroup one? I am also wondering what JobAcctGatherParams you have set.

It is possible that these forked processes are created after opening shared libraries, so the space accounted by jobacctgather will be counted per process even though they are not really using all this space in memory. Cgroup may account only once for an opened shared library, but jobacctgather/linux may account for it more than once... unless:

Can you set JobAcctGatherParams=UsePss and try again?

The available options are:

NoShared
    Exclude shared memory from accounting.
UsePss
    Use PSS value instead of RSS to calculate real usage of memory. The PSS value will be saved as RSS.

----
From:

RSS is the total memory actually held in RAM for a process. RSS can be misleading, because it reports the total of all of the shared libraries that the process uses, even though a shared library is only loaded into memory once regardless of how many processes use it. RSS is not an accurate representation of the memory usage for a single process.

PSS differs from RSS in that it reports the proportional size of its shared libraries, i.e. if three processes all use a shared library that has 30 pages, that library will only contribute 10 pages to the PSS that is reported for each of the three processes. PSS is a more useful number because when the PSS for all processes in the system are summed together, that is a good representation for the total memory usage in the system. When a process is killed, the shared libraries that contributed to its PSS will be proportionally distributed to the PSS totals for the remaining processes still using that library. In this way PSS can be slightly misleading, because when a process is killed, PSS does not accurately represent the memory returned to the overall system.
----

Hi,

After looking more internally, I am quite sure this can be due to
not having UsePss or NoShared in JobAcctGatherParams when using
jobacctgather/linux.

You could add this parameter, or, as an alternative option, switch
to jobacctgather/cgroup.

Please let me know if you already have NoShared set, or what your
current config is.
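To see how much of the reported RSS is actually shared between the nnUNet_train processes, per-process RSS and PSS can be compared while the step is running. A rough sketch, not part of the original exchange: it assumes the node kernel exposes /proc/<pid>/smaps_rollup (available since 4.14), must run with enough privileges to read it, and <uid> stands for the job owner:

  for pid in $(pgrep -u <uid> nnUNet_train); do
      rss=$(awk '/^VmRSS:/ {print $2}' /proc/$pid/status)
      pss=$(awk '/^Pss:/ {print $2}' /proc/$pid/smaps_rollup)
      echo "pid=$pid rss=${rss}kB pss=${pss}kB"
  done

If PSS comes out much lower than RSS for each of these processes, then summing RSS per process, which is essentially what jobacct_gather/linux reports, would overshoot the real usage and would be consistent with the MaxRSS values shown above.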
Hi Felip,

Sorry for the delay, I was on vacation.

> After looking more internally, I am quite sure this can be due to
> not having UsePss or NoShared in JobAcctGatherParams when using
> jobacctgather/linux.
>
> You could add this parameter, or, as an alternative option, switch
> to jobacctgather/cgroup.
>
> Please let me know if you already have NoShared set, or what your
> current config is.

Here is what we have in slurm.conf:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

JobAcctGatherFrequency=task=10
JobAcctGatherType=jobacct_gather/linux
# Do not count mmap on files in RSS
JobAcctGatherParams=NoShared

> Can you set JobAcctGatherParams=UsePss and try again?

Do you still want me to try it?

Regarding your former question:

> A last question just in case the info I request doesn't give me enough,
> is there any possibility to apply a debug patch in your environment?

We should upgrade to 21.08.8-2 next week. Do you want me to take this opportunity to apply a debug patch?

Thanks in advance,

(In reply to Institut Pasteur HPC Admin from comment #24)
> Hi Felip,
>
> Sorry for the delay, I was on vacation.

No problem!

> > You could add this parameter, or, as an alternative option, switch
> > to jobacctgather/cgroup.
> >
> > Please let me know if you already have NoShared set, or what your
> > current config is.
>
> Here is what we have in slurm.conf:
>
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/cgroup,task/affinity
>
> JobAcctGatherFrequency=task=10
> JobAcctGatherType=jobacct_gather/linux
> # Do not count mmap on files in RSS
> JobAcctGatherParams=NoShared

Interesting, you already had this option set. Would it be possible for you to try with jobacct_gather/cgroup? It should report the same memory metrics the OOM killer sees, since it grabs its stats from the same place as task/cgroup.

> > Can you set JobAcctGatherParams=UsePss and try again?
>
> Do you still want me to try it?

UsePss can be an option, but it is incompatible with NoShared. I really think NoShared should have worked, but I am not sure what kind of memory your processes use. If it is not a big deal, you can try UsePss.

In order to see which memory is accounted for, we should investigate the memory layout of these jobs from /proc/<pid>/status just before the process is killed. Is it possible to get this information from one job which suffers from this issue?

> Regarding your former question:
>
> > A last question just in case the info I request doesn't give me enough,
> > is there any possibility to apply a debug patch in your environment?
>
> We should upgrade to 21.08.8-2 next week. Do you want me to take this
> opportunity to apply a debug patch?
>
> Thanks in advance,

Thanks, let's first see the outcomes of UsePss + the /proc/pid/status. And maybe setting jobacct_gather/cgroup too. If these are not conclusive we can think of a patch.

Hi Felip,

Thanks for your quick reply.

> Thanks, let's first see the outcomes of UsePss + the /proc/pid/status.

Yes, okay. I think we will just save the content of /proc/pid/status, since I can't predict when the job will exceed the limit.

> And maybe setting jobacct_gather/cgroup too.

Hmmm. Given the note on JobAcctGatherType:

"""
NOTE: Changing this configuration parameter changes the contents of the messages between Slurm daemons.
Any previously running job steps are managed by a slurmstepd daemon that will persist through the lifetime of that job step and not change its communication protocol. Only change this configuration parameter when there are no running job steps.
"""

> I think I'll wait for the upgrade to change that parameter.

In theory what would happen is that currently running steps would gather metrics with the Linux plugin and new steps with the cgroup one, so for the same job there would be different ways of getting metrics, which is not desirable. So it is ok to wait until the upgrade. In the meantime let's look at the other info I requested.

> I'll get back to you as soon as I have the outcomes of the change to UsePSS
> and the last /proc/pid/status of nnUNet_train before being killed. Given
> the upgrade preparation (it's only one step in our shutdown roadmap),
> I'm not sure I'll be able to do it this week.

Ok, no problem, I will keep this open and waiting.

Hello! I just wanted to know if you have been able to do the upgrade and test UsePSS and/or the JobAcctGather cgroup plugin. (Remember that UsePSS or NoShared is not compatible with the cgroup plugin; it has no effect there.) Thanks!

Hello,

I am marking this bug as infogiven. Please reopen it when you have more feedback for me.

Regards
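For reference, the two configuration alternatives discussed above would look roughly like this in slurm.conf. This is only a sketch: as noted in the thread, UsePss and NoShared are mutually exclusive, neither has any effect with the cgroup plugin, and the gatherer should only be changed while no job steps are running:

  # Option A: keep the linux gatherer, but account shared pages proportionally (PSS)
  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherParams=UsePss

  # Option B: gather the stats from the same cgroup hierarchy that enforces the limit
  JobAcctGatherType=jobacct_gather/cgroup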