Hi Guys,

One of our users testing a new 17.02.2 cluster is noticing that most of his jobs are failing with the following in his output file:

--
slurmstepd: error: Exceeded job memory limit at some point.
srun: error: amber305: task 0: Out Of Memory
[mpiexec@amber305.sc1.roche.com] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@amber305.sc1.roche.com] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@amber305.sc1.roche.com] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@amber305.sc1.roche.com] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
slurmstepd: error: Exceeded job memory limit at some point.
--

Here is what I see him using for this job:

[root@amber600 5B3v1.2]# sacct -o MaxRSS,ReqMem -j 573
    MaxRSS     ReqMem
---------- ----------
                  2Gc
   137376K        2Gc
      812K        2Gc
  1095752K        2Gc

[root@amber600 5B3v1.2]# sacct | egrep -i 'Job|573'
JobID      Partition  JobName          User  State      NodeList  NCPUS  NNode  AllocG  Start                End                  Elapsed   ExitCo
573        amber3     5B3v1.2.equil    tomp  FAILED     amber305  1      1      gpu:2   2017-05-15T13:23:51  2017-05-15T13:29:36  00:05:45  127:0
573.batch             batch                  FAILED     amber305  1      1      gpu:2   2017-05-15T13:23:51  2017-05-15T13:29:36  00:05:45  127:0
573.0                 hydra_pmi_proxy        COMPLETED  amber305  1      1      gpu:2   2017-05-15T13:28:26  2017-05-15T13:28:41  00:00:15  0:0
573.1                 hydra_pmi_proxy        COMPLETED  amber305  1      1      gpu:2   2017-05-15T13:28:42  2017-05-15T13:29:35  00:00:53  0:0

Why would this job fail? It looks like it is using less than the allocated 2 GB. Does this have something to do with MPI?

Thanks,
-Simran
Do you mind attaching your current slurm.conf? I'd suggest setting JobAcctGatherParams=UsePSS for a start, and if you're using task/cgroup with its memory enforcement enabled, you can set NoOverMemoryKill as well.
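For reference, the change would look something like the following in slurm.conf. This is only a sketch; I'm assuming both options go into JobAcctGatherParams as a comma-separated list, so please check the spelling against the slurm.conf(5) man page shipped with your 17.02 build:

--
# Sketch of the suggested change (verify option names against slurm.conf(5)):
JobAcctGatherParams=UsePSS,NoOverMemoryKill
--

You'll likely need to push the updated slurm.conf out to the nodes and reconfigure/restart the daemons for it to take effect.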
My apologies. Here is my slurm.conf:

--
# grep -v '#' /etc/slurm/slurm.conf
ControlMachine=amber600
AuthType=auth/munge
CryptoType=crypto/munge
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
SchedulerParameters=kill_invalid_depend
AccountingStorageEnforce=associations
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=amber
JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmdDebug=3
NodeName=amber[301-314] CPUs=24 RealMemory=128656 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:8 State=UNKNOWN
PartitionName=amber3 Nodes=amber[301-314] Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=2048
--

Regards,
-Simran
Also, setting NoOverMemoryKill would mean that the user's job can use more memory than it is supposed to, which I don't think we want. However, in this case it looks like the user is not using more memory than he is requesting, yet his job is still being killed, unless I am missing something here.

Thanks,
-Simran
(In reply to Simran from comment #3)
> Also, setting NoOverMemoryKill would mean that the user's job can use more
> memory than it is supposed to, which I don't think we want. However, in
> this case it looks like the user is not using more memory than he is
> requesting, yet his job is still being killed, unless I am missing
> something here.

Sorry for the brevity there; it's after hours and I was dashing off a quick response before I headed out. To better break down the settings:

NoOverMemoryKill - this only disables one specific type of memory limit enforcement, likely the one you're seeing issues with, and is different from the cgroup memory enforcement (ConstrainRAMSpace=yes in cgroup.conf). Disabling it should prevent this type of enforcement while still allowing the cgroup limits to be respected.

UsePSS - this changes the memory statistic gathered for the job from RSS to PSS. The summed RSS for certain parallel applications can be significantly higher than the actual usage because of how memory for shared segments is accounted for. The PSS statistic compensates for this by dividing the usage of shared segments between the processes that share them, preventing erroneously high values. I believe this difference is what accounts for the wide range in usage you're seeing.

I think that setting either of these (or both) options should prevent that apparently incorrect enforcement from happening.

- Tim
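P.S. If you want to see the RSS-vs-PSS gap concretely while one of these jobs is running, something like the following sums the two counters from /proc for a single process. This is purely illustrative; <pid> is a placeholder for one of the job's MPI ranks on the node:

--
# Sum the per-mapping Rss and Pss values for one process from /proc/<pid>/smaps.
# For processes that share large segments (e.g. MPI ranks), summing RSS across
# ranks can greatly exceed the real usage, while PSS splits shared pages
# between the sharers.
pid=<pid>   # placeholder: substitute the PID of one of the job's processes
awk '/^Rss:/ {rss += $2} /^Pss:/ {pss += $2}
     END {printf "RSS: %d kB   PSS: %d kB\n", rss, pss}' /proc/$pid/smaps
--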
This can be closed. Thanks for your help.
Closing