Ticket 3813

Summary: Most of a user's jobs on new 17.02.2 cluster are failing
Product: Slurm Reporter: Simran <simran>
Component: slurmctld Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 17.02.2   
Hardware: Linux   
OS: Linux   
Site: Genentech (Roche)

Description Simran 2017-05-15 19:43:07 MDT
Hi Guys,

One of our users testing a new 17.02.2 cluster has noticed that most of his jobs are failing with the following in his output file:

--
slurmstepd: error: Exceeded job memory limit at some point.
srun: error: amber305: task 0: Out Of Memory
[mpiexec@amber305.sc1.roche.com] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@amber305.sc1.roche.com] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@amber305.sc1.roche.com] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@amber305.sc1.roche.com] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
slurmstepd: error: Exceeded job memory limit at some point.
--

Here is what I see him using for this job:

[root@amber600 5B3v1.2]# sacct -o MaxRSS,ReqMem -j  573
    MaxRSS     ReqMem 
---------- ---------- 
                  2Gc 
   137376K        2Gc 
      812K        2Gc 
  1095752K        2Gc 

[root@amber600 5B3v1.2]# sacct | egrep -i 'Job|573'
       JobID Partition                          JobName       User      State     NodeList NCPUS NNode AllocG                Start                  End      Elapsed ExitCo 
573          amber3    5B3v1.2.equil                    tomp       FAILED     amber305     1     1     gpu:2  2017-05-15T13:23:51  2017-05-15T13:29:36  00:05:45     127:0  
573.batch              batch                                       FAILED     amber305     1     1     gpu:2  2017-05-15T13:23:51  2017-05-15T13:29:36  00:05:45     127:0  
573.0                  hydra_pmi_proxy                             COMPLETED  amber305     1     1     gpu:2  2017-05-15T13:28:26  2017-05-15T13:28:41  00:00:15     0:0    
573.1                  hydra_pmi_proxy                             COMPLETED  amber305     1     1     gpu:2  2017-05-15T13:28:42  2017-05-15T13:29:35  00:00:53     0:0    

Why would this job fail?  It looks like it is using less than the allocated 2 GB.  Does this have something to do with MPI?

Thanks,
-Simran
Comment 1 Tim Wickberg 2017-05-15 20:32:05 MDT
Do you mind attaching your current slurm.conf?

I'd suggest setting JobAcctGatherParams=UsePSS for a start, and if you're using task/cgroup with memory enforcement enabled there, you can set NoOverMemoryKill as well.
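A minimal sketch of the relevant settings (adjust to your site; NoOverMemoryKill is another JobAcctGatherParams flag, while ConstrainRAMSpace belongs in cgroup.conf, not slurm.conf):

```
# slurm.conf -- sketch, not a drop-in replacement
JobAcctGatherParams=UsePSS,NoOverMemoryKill

# cgroup.conf -- keeps cgroup-based memory enforcement active
ConstrainRAMSpace=yes
```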
Comment 2 Simran 2017-05-15 20:38:12 MDT
My apologies.  Here is my slurm.conf:

--
# grep -v '#' /etc/slurm/slurm.conf
ControlMachine=amber600
AuthType=auth/munge
CryptoType=crypto/munge
GresTypes=gpu 
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
SchedulerParameters=kill_invalid_depend
AccountingStorageEnforce=associations
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=amber
JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmdDebug=3
NodeName=amber[301-314] CPUs=24 RealMemory=128656 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:8 State=UNKNOWN 
PartitionName=amber3 Nodes=amber[301-314] Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=2048
--

Regards,
-Simran
Comment 3 Simran 2017-05-15 20:39:43 MDT
Also, setting NoOverMemoryKill would mean that the user job can use more memory than it is supposed to, which I don't think we want.  However, in this case it looks like the user is not using more memory than he is requesting, but his job is still being killed, unless I am missing something here.

Thanks,
-Simran
Comment 4 Tim Wickberg 2017-05-15 23:21:11 MDT
(In reply to Simran from comment #3)
> Also, setting NoOverMemoryKill would mean that the user job can use more
> memory than it is supposed to, which I don't think we want.  However, in
> this case looks like the user is not using more memory than what he is
> requesting but his job is still being killed unless I am missing something
> here.

Sorry for the brevity there; it's after hours and I was dashing off a quick response before I headed out.

To better break down the settings:

NoOverMemoryKill - this only disables a specific type of memory limit enforcement, likely the one you're seeing issues with, and is different from the cgroup memory enforcement (ConstrainRAMSpace=yes in cgroup.conf).

Disabling that should prevent this type of enforcement, while still allowing the cgroup limits to be respected.

UsePSS - this changes the memory statistic gathered for the job from RSS to PSS. The sum of the RSS stat for certain parallel applications can be significantly higher than the actual usage due to how memory for shared segments is accounted for. The PSS stat tries to compensate for this issue, dividing the usage of the shared segments between the different processes to prevent erroneously high values. I believe this difference is what accounts for the wide range in usage you're seeing.
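To illustrate the RSS vs. PSS difference with made-up numbers: suppose two MPI ranks each map the same 1024 MiB shared segment and each hold 100 MiB of private memory. RSS charges the full shared segment to every process that maps it, while PSS (proportional set size) splits it among the sharers:

```python
# Hypothetical numbers in MiB -- two ranks sharing one segment.
SHARED = 1024   # shared mapping, mapped by both ranks
PRIVATE = 100   # private memory per rank
NPROCS = 2

# Summed RSS: the whole shared mapping is counted once per process.
rss_sum = NPROCS * (PRIVATE + SHARED)               # 2248 MiB

# Summed PSS: the shared mapping is divided among its sharers.
pss_sum = NPROCS * (PRIVATE + SHARED / NPROCS)      # 1224 MiB

print(rss_sum, pss_sum)
```

With RSS-based accounting this hypothetical job would appear to use 2248 MiB and trip a 2 GiB limit, while its real footprint (what PSS reports) is only 1224 MiB.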

I think that setting either of these options (or both) should prevent that apparently incorrect enforcement from happening.

- Tim
Comment 5 Simran 2017-05-16 20:40:57 MDT
This can be closed.  Thanks for your help.
Comment 6 Simran 2017-05-16 20:41:53 MDT
Closing