| Summary: | Most user jobs in the new 17.02.2 cluster are failing | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Simran <simran> |
| Component: | slurmctld | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 17.02.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Genentech (Roche) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Simran 2017-05-15 19:43:07 MDT

---

Do you mind attaching your current slurm.conf? I'd suggest setting `JobAcctGatherParams=UsePSS` for a start, and if you're using task/cgroup with the memory enforcement from that, you can set `NoOverMemoryKill` as well.

---

My apologies. Here is my slurm.conf:

```
# grep -v '#' /etc/slurm/slurm.conf
ControlMachine=amber600
AuthType=auth/munge
CryptoType=crypto/munge
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
SchedulerParameters=kill_invalid_depend
AccountingStorageEnforce=associations
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=amber
JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmdDebug=3
NodeName=amber[301-314] CPUs=24 RealMemory=128656 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:8 State=UNKNOWN
PartitionName=amber3 Nodes=amber[301-314] Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=2048
```

Regards,
-Simran

---

Also, setting `NoOverMemoryKill` would mean that a user job can use more memory than it is supposed to, which I don't think we want. However, in this case it looks like the user is not using more memory than he is requesting, but his job is still being killed, unless I am missing something here.

Thanks,
-Simran

---

(In reply to Simran from comment #3)
> Also, setting NoOverMemoryKill would mean that the user job can use more
> memory than it is supposed to, which I don't think we want. However, in
> this case looks like the user is not using more memory than what he is
> requesting but his job is still being killed unless I am missing something
> here.

Sorry for the brevity there; it was after hours and I was dashing off a quick response before I headed out. To better break down the settings:

- `NoOverMemoryKill`: this only disables one specific type of memory-limit enforcement, likely the one you're seeing issues with, and is different from the cgroup memory enforcement (`ConstrainRAMSpace=yes` in cgroup.conf). Disabling it should prevent this type of enforcement while still allowing the cgroup limits to be respected.

- `UsePSS`: this changes the memory statistic gathered for the job from RSS to PSS. For certain parallel applications, the sum of the per-process RSS values can be significantly higher than actual usage because of how memory for shared segments is accounted for. The PSS statistic compensates by dividing the usage of shared segments among the processes that map them, preventing erroneously high values. I believe this difference is what accounts for the wide range in usage you're seeing.

I think that setting either of these options (or both) should prevent that apparently incorrect enforcement from happening.

- Tim

---

This can be closed. Thanks for your help.

---

Closing
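Editor's note: for readers hitting the same symptom, the two parameters recommended above combine into a single slurm.conf line. This is a sketch based only on the settings named in this ticket; adapt it to your own configuration.

```
# slurm.conf: gather PSS instead of summed RSS, and disable the
# jobacct-gather-based over-memory kill. Both values go in the same
# comma-separated JobAcctGatherParams list.
JobAcctGatherParams=NoOverMemoryKill,UsePSS

# cgroup.conf (relevant only with TaskPlugin=task/cgroup): cgroup-level
# memory enforcement is separate and stays in effect regardless of the
# setting above.
ConstrainRAMSpace=yes
```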
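A toy calculation (not Slurm code; the segment sizes and rank count are hypothetical) illustrates the RSS-vs-PSS point made above: each process's RSS counts a shared segment in full, so summing RSS across a parallel job over-counts it N times, while PSS charges each process only its proportional share.

```python
def summed_rss(private_mb, shared_mb, nprocs):
    """Sum of per-process RSS: every process counts the whole shared segment."""
    return nprocs * (private_mb + shared_mb)

def summed_pss(private_mb, shared_mb, nprocs):
    """Sum of per-process PSS: the shared segment is split across processes."""
    return nprocs * (private_mb + shared_mb / nprocs)

# Hypothetical job: 8 ranks, 100 MB private each, one 1000 MB shared segment.
rss_total = summed_rss(100, 1000, 8)  # 8800 MB -- looks like an 8.8 GB job
pss_total = summed_pss(100, 1000, 8)  # 1800 MB -- the actual footprint
```

With summed RSS, this job appears to use 8800 MB and could be killed for exceeding a memory request it never actually exceeded; the PSS sum of 1800 MB matches real usage, which is why `UsePSS` can eliminate the spurious kills.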