Ticket 8011

Summary: After upgrade to 19.05.3 jobs run with DefMemPerCPU instead of requested memory
Product: Slurm
Reporter: Martin Siegert <siegert>
Component: slurmctld
Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Priority: ---
CC: asa188, felip.moll, kaizaad, nate
Version: 19.05.3
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7876
https://bugs.schedmd.com/show_bug.cgi?id=6950
https://bugs.schedmd.com/show_bug.cgi?id=7499
Site: Simon Fraser University
Version Fixed: 19.05.4
Attachments: Slurmctld log after reconfigure
slurm.conf current 20191029 1300PDT

Description Martin Siegert 2019-10-28 19:26:36 MDT
Created attachment 12128 [details]
slurm config file

After upgrading from 17.11.10 to 19.05.3, all jobs that were still in the queue run with DefMemPerCPU (in our case 256M) instead of the memory requested in the submission script. E.g., a job that requested --mem=8G and --cpus-per-task=8 got
NumNodes=1 NumCPUs=8 NumTasks=0 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=2G,node=1,billing=8
(2G = 8 CPUs x 256M.) Consequently the OOM killer running in the cgroup terminated the job.
We tried both cons_res and cons_tres - no difference.
Jobs that were submitted after the upgrade appear to receive the correct amount of memory.
We were forced to pause all partitions.
Comment 1 Nate Rini 2019-10-28 20:45:26 MDT
Martin,

Do you have a backup of statesave location from before the upgrade?
> StateSaveLocation=/var/spool/slurmctld

Thanks,
--Nate
Comment 2 Adam 2019-10-28 20:52:03 MDT
Hi Nate,

Yes we do have a full tarball of the slurmctld state location.

Regards,

Adam Spencer
Comment 4 Nate Rini 2019-10-28 20:56:34 MDT
(In reply to Adam from comment #2)
> Yes we do have a full tarball of the slurmctld state location.

Please attach it (if possible) along with your full /etc/slurm directory. My attempts to replicate your issue with a test upgrade from 17.11 -> 19.05 have not reproduced it.

Please also call the following:
> scontrol show jobs
> scontrol show config
> srun -V
> scontrol -V

How did you upgrade? Was it done in place?

Thanks,
--Nate
Comment 5 Adam 2019-10-28 21:04:29 MDT
Ok, it will take me a bit to gather.

It was done with slurmctld offline and slurmdbd stopped; the database upgrade was done by launching slurmdbd -Dvv and waiting overnight.  We started slurmctld in the morning with all nodes offline, waited the few minutes that took, and then brought the nodes online.  As soon as we removed the maintenance reservation (with no users able to submit, since we had blocked access to the head nodes) we saw pending jobs start and then immediately OOM.  We began verifying their RAM requests against their batch scripts and it seems possible that the upgraded version is taking their CPU count and multiplying it by our DefMemPerCPU of 256M; that is currently speculation, though.  It did not seem to affect jobs that we submitted from 19.05.3.
Comment 6 Nate Rini 2019-10-28 21:12:06 MDT
(In reply to Adam from comment #5)
> It was done with slurmctld offline and slurmdb off and then done by
> launching slurmdbd -Dvv and waiting overnight.  Started slurmctld in the
> morning with all nodes offline and waited the few minutes for that and then
> brought the nodes online.
Please attach the slurmdbd and slurmctld logs from after the upgrade.

> As soon as we removed the maintenance reservation
> (with no users able to submit since we had blocked access to head nodes) 
FYI: Setting state=down on the partitions will allow users to be able to submit jobs but will not allow any to start.

> we saw pending jobs start and then immediately OOM and began verifying their
> RAM request against their batch scripts and found that it seemed possible
> that the upgraded version is possibly taking their cpu count and multiplying
> it by our defmempercpu of 256M, that is currently speculation though.
I would like to verify whether this happened to all of your 17.11 jobs or just certain jobs, such as ones that have specific memory limits. Were there any other changes made to the configuration between 17.11 and 19.05 (including changes made using sacctmgr)?
Comment 7 Martin Siegert 2019-10-28 21:56:30 MDT
Created attachment 12129 [details]
slurmdbd.log

Attached are slurmdbd.log and slurmctld.log from today after the upgrade.
Comment 8 Martin Siegert 2019-10-28 22:05:58 MDT
Created attachment 12130 [details]
slurmctld.log
Comment 9 Adam 2019-10-28 22:10:52 MDT
Created attachment 12131 [details]
/etc/slurm along with requested commands
Comment 10 Adam 2019-10-28 22:12:32 MDT
Created attachment 12132 [details]
StateSaveLocation from 17.11.10 immediately prior to upgrade
Comment 11 Adam 2019-10-28 22:17:10 MDT
Hi Nate,

We are currently avoiding letting people submit, in case we need to go back to that statesave snapshot for any reason.  We are trying to see if we can track this down and get a resolution before muddying the water.

I'd say it's likely not all jobs that present this issue, at least not ones that ask for memory per node.  I'm trying to track down a job that might suffer the issue if submitted.  The ones that we spotted are no longer in scontrol since they ran, went OOM, and ended up in the DB now (thousands, possibly).
Comment 12 Adam 2019-10-28 22:18:05 MDT
By "submitted" in that last paragraph I mean tracking down a job that might exhibit the issue if it runs on a node.
Comment 13 Adam 2019-10-28 22:20:19 MDT
There were no sacctmgr changes made during the upgrade period.  I had stopped all cron jobs during that time, and slurmdbd.log doesn't show any sign of unexpected alterations that I can tell.
Comment 14 Nate Rini 2019-10-28 22:23:55 MDT
(In reply to Adam from comment #11)
> The ones that we spotted are no longer
> in scontrol since they ran, went OOM and ended up in the DB now(thousands
> possibly)

Can you please provide the slurmd log from the head node (first node in the job) where this happened? Dmesg from the nodes too. Looking at the `scontrol show jobs` output, I don't see any jobs whose memory restrictions look like they were overridden by a multiple of MinMemoryCPU=256M.
Comment 16 Adam 2019-10-28 22:31:56 MDT
I'll grab those in a minute.

Here's a job that passed already and is in the DB listing this issue.

sacct -j 29914274 -o 'ReqTRES%60,AllocTRES%60'
                                                     ReqTRES                                                    AllocTRES
       -----------------------------------------------------        -----------------------------------------------------
                           cpu=6,gres/gpu=1,mem=3200M,node=1                             billing=1,cpu=6,mem=1.50G,node=1
Comment 18 Nate Rini 2019-10-28 22:33:28 MDT
> [2019-10-28T16:48:18.986] debug2: JobId=29733646 can't run in partition cpubase_bycore_b5: Access/permission denied

Please call `scontrol reconfigure`. Please provide slurmctld logs after about 5 minutes.
Comment 20 Adam 2019-10-28 22:43:56 MDT
(In reply to Nate Rini from comment #18)
> > [2019-10-28T16:48:18.986] debug2: JobId=29733646 can't run in partition cpubase_bycore_b5: Access/permission denied
> 
> Please call `scontrol reconfigure`. Please provide slurmctld logs after
> about 5 minutes.

Ah, that error is old.  It's an odd side-effect I noticed in the upgrade.

Some of our job submission hosts don't have the same domain name as the slurmctld host; we were previously using short names and /etc/hosts for that, but now it seems Slurm takes the munge-passed hostname. So AllocNode=submission-host no longer works for submission-host.otherdomain.com; it has to be AllocNode=submission-host.otherdomain.com now.  This affected new submissions as well as past submissions.
Comment 21 Adam 2019-10-28 22:45:06 MDT
Created attachment 12133 [details]
Dmesg and slurmd.log for host with bad job
Comment 23 Nate Rini 2019-10-28 22:55:42 MDT
(In reply to Adam from comment #16)
> sacct -j 29914274 -o 'ReqTRES%60,AllocTRES%60'
> cpu=6,gres/gpu=1,mem=3200M,node=1                

Looking at the 17.11 statesave of this job:

> $ scontrol show job 29914274
> JobId=29914274 JobName=fold2asv
>   NumNodes=1 NumCPUs=6 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>   TRES=cpu=6,mem=3200M,node=1,gres/gpu=1
>   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>   MinCPUsNode=1 MinMemoryNode=3200M MinTmpDiskNode=0

The requested TRES in 17.11 matches the value in 19.05?
Comment 25 Adam 2019-10-28 23:05:32 MDT
(In reply to Nate Rini from comment #23)
> (In reply to Adam from comment #16)
> > sacct -j 29914274 -o 'ReqTRES%60,AllocTRES%60'
> > cpu=6,gres/gpu=1,mem=3200M,node=1                
> 
> Looking at the 17.11 statesave of this job:
> 
> > $ scontrol show job 29914274
> > JobId=29914274 JobName=fold2asv
> >   NumNodes=1 NumCPUs=6 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> >   TRES=cpu=6,mem=3200M,node=1,gres/gpu=1
> >   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >   MinCPUsNode=1 MinMemoryNode=3200M MinTmpDiskNode=0
> 
> The requested TRES in 17.11 matches the value in 19.05?

I don't think I can get you the scontrol output for that job in 19.05 since it's already run, just the sacct, did you want that?
Comment 26 Adam 2019-10-28 23:07:22 MDT
Created attachment 12137 [details]
Slurmctld log after reconfigure
Comment 27 Nate Rini 2019-10-28 23:11:17 MDT
(In reply to Adam from comment #21)
> Created attachment 12133 [details]
> Dmesg and slurmd.log for host with bad job

From the dmesg log:
> memory: usage 1572864kB -> 1572.8MB -> 1.57GB

Using the 17.11 statesave (job that got oomkilled):

> $ scontrol show job 29914275
> JobId=29914275 JobName=fold2asv
>   NumNodes=1 NumCPUs=6 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>   TRES=cpu=6,mem=3200M,node=1,gres/gpu=1
>   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>   MinCPUsNode=1 MinMemoryNode=3200M MinTmpDiskNode=0

The job should have been given mem=3200M -> 3.1GB, which is well above the 1.57GB actually used.

The slurmd log has this:
> [2019-10-28T16:56:05.057] [29914275.extern] task/cgroup: /slurm/uid_3071072/job_29914275: alloc=1536MB mem.limit=1536MB memsw.limit=1536MB
> [2019-10-28T16:56:05.065] [29914275.extern] task/cgroup: /slurm/uid_3071072/job_29914275/step_extern: alloc=1536MB mem.limit=1536MB memsw.limit=1536MB

It appears the job memory limits are getting set low in the cgroups.
Comment 28 Nate Rini 2019-10-28 23:12:00 MDT
(In reply to Adam from comment #25)
> I don't think I can get you the scontrol output for that job in 19.05 since
> it's already run, just the sacct, did you want that?

Yeah, also if you can grep out that job from the slurmd and slurmctld logs that would be great.
Comment 29 Nate Rini 2019-10-28 23:19:11 MDT
(In reply to Adam from comment #21)
> Created attachment 12133 [details]
> Dmesg and slurmd.log for host with bad job

Can you please call the following on this node:
> slurmd -C
> slurmd -V
Comment 30 Adam 2019-10-28 23:22:15 MDT
(In reply to Nate Rini from comment #29)
> (In reply to Adam from comment #21)
> > Created attachment 12133 [details]
> > Dmesg and slurmd.log for host with bad job
> 
> Can you please call the following on this node:
> > slurmd -C
> > slurmd -V

NodeName=cdr118 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=128818
UpTime=0-19:35:56
slurm 19.05.3-2
Comment 31 Adam 2019-10-28 23:22:53 MDT
(In reply to Nate Rini from comment #28)
> (In reply to Adam from comment #25)
> > I don't think I can get you the scontrol output for that job in 19.05 since
> > it's already run, just the sacct, did you want that?
> 
> Yeah, also if you can grep out that job from the slurmd and slurmctld logs
> that would be great.

Not sure I follow, aren't those the logs I sent to you already?
Comment 32 Adam 2019-10-28 23:24:36 MDT
(In reply to Adam from comment #31)
> (In reply to Nate Rini from comment #28)
> > (In reply to Adam from comment #25)
> > > I don't think I can get you the scontrol output for that job in 19.05 since
> > > it's already run, just the sacct, did you want that?
> > 
> > Yeah, also if you can grep out that job from the slurmd and slurmctld logs
> > that would be great.
> 
> Not sure I follow, aren't those the logs I sent to you already?

sacct -j 29914274 -pl
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
29914274|29914274|fold2asv|gpubase_bygpu_b4||||||||||||||||||6|00:01:53|OUT_OF_MEMORY|0:125||Unknown|Unknown|Unknown|256Mc|0|||||||||gpu:0|gpu:0|cpu=6,gres/gpu=1,mem=3200M,node=1|billing=1,cpu=6,mem=1.50G,node=1||||||||||||||
29914274.batch|29914274.batch|batch||147008K|cdr118|0|147008K|702K|cdr118|0|702K|0|cdr118|0|0|00:00:00|cdr118|0|00:00:00|1|6|00:01:53|OUT_OF_MEMORY|0:125|1.50G|0|0|0|256Mc|0|0.00M|cdr118|0|0.00M|0|cdr118|0|0|gpu:0|gpu:0||cpu=6,mem=1.50G,node=1|cpu=00:00:00,energy=0,fs/disk=1465,mem=702K,pages=0,vmem=147008K|cpu=00:00:00,energy=0,fs/disk=1465,mem=702K,pages=0,vmem=147008K|cpu=cdr118,energy=cdr118,fs/disk=cdr118,mem=cdr118,pages=cdr118,vmem=cdr118|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1465,mem=702K,pages=0,vmem=147008K|cpu=cdr118,energy=cdr118,fs/disk=cdr118,mem=cdr118,pages=cdr118,vmem=cdr118|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1465,mem=702K,pages=0,vmem=147008K|energy=0,fs/disk=0|energy=cdr118,fs/disk=cdr118|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
29914274.extern|29914274.extern|extern||146788K|cdr118|0|146788K|37K|cdr118|0|37K|0|cdr118|0|0|00:00:00|cdr118|0|00:00:00|1|6|00:01:54|COMPLETED|0:0|1.29G|0|0|0|256Mc|0|0.00M|cdr118|0|0.00M|0|cdr118|0|0|gpu:0|gpu:0||billing=1,cpu=6,mem=1.50G,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=37K,pages=0,vmem=146788K|cpu=00:00:00,energy=0,fs/disk=2012,mem=37K,pages=0,vmem=146788K|cpu=cdr118,energy=cdr118,fs/disk=cdr118,mem=cdr118,pages=cdr118,vmem=cdr118|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=37K,pages=0,vmem=146788K|cpu=cdr118,energy=cdr118,fs/disk=cdr118,mem=cdr118,pages=cdr118,vmem=cdr118|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=37K,pages=0,vmem=146788K|energy=0,fs/disk=0|energy=cdr118,fs/disk=cdr118|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
Comment 33 Adam 2019-10-28 23:27:18 MDT
All the nodes share(via NFS) one /etc/slurm and one /opt/software/slurm/[bin,libexec,...]
So as long as they were restarted at the proper times, they will contain the same config and the binary versions will post a match.
Comment 34 Nate Rini 2019-10-28 23:28:02 MDT
(In reply to Adam from comment #30)
> (In reply to Nate Rini from comment #29)
> > (In reply to Adam from comment #21)
> NodeName=cdr118 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12
> ThreadsPerCore=1

Can you please set CPUs=# (in slurm.conf) for your nodes to match the value returned by `slurmd -C`? I suspect proctrack/cgroup is getting confused by the CPU count not being defined. Does your site have hyperthreading disabled in the UEFI/BIOS?
Comment 35 Nate Rini 2019-10-28 23:29:30 MDT
(In reply to Adam from comment #33)
> All the nodes share(via NFS) one /etc/slurm and one
> /opt/software/slurm/[bin,libexec,...]
> So as long as they were restarted at the proper times, they will contain the
> same config and the binary versions will post a match.

When it comes to the installed Slurm version, I generally default to "safe rather than confused".
Comment 36 Nate Rini 2019-10-28 23:38:42 MDT
(In reply to Adam from comment #32)
> sacct -j 29914274 -pl
> ReqTRES
>  cpu=6,gres/gpu=1,mem=3200M,node=1

> AllocTRES
>  billing=1,cpu=6,mem=1.50G,node=1
>  cpu=6,mem=1.50G,node=1
>  billing=1,cpu=6,mem=1.50G,node=1

> ReqMem=256Mc
> AllocCPUS=6

Based on your provided logs, MinMemoryNode is not being honored. Instead, AllocCPUS=6 * ReqMem=256Mc = 1536MB per step is being calculated and is overriding MinMemoryNode=3200M.

Is it more important to have your system up than to have memory enforcement working as expected? I currently see no evidence that 17.11 job parameters are being corrupted; rather, MinMemoryNode is not being enforced.

If you wanted to get cluster running now, you could set `ConstrainRAMSpace=No` in your cgroup.conf while we figure out why MinMemoryNode is not getting enforced.
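As a quick sanity check (an editor's illustration, not from the ticket), the observed cgroup limit can be reproduced arithmetically from the values discussed above:

```shell
# Illustration of the suspected miscalculation: the per-CPU default is
# applied even though the user requested memory per node.
alloc_cpus=6            # AllocCPUS from sacct
def_mem_per_cpu_mb=256  # DefMemPerCPU from slurm.conf (ReqMem=256Mc)
requested_mb=3200       # MinMemoryNode=3200M from the 17.11 statesave

applied_mb=$((alloc_cpus * def_mem_per_cpu_mb))
echo "applied=${applied_mb}MB requested=${requested_mb}MB"
```

The result, 1536MB, matches the mem.limit=1536MB seen in the slurmd cgroup log above.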
Comment 37 Adam 2019-10-28 23:41:29 MDT
(In reply to Nate Rini from comment #34)
> (In reply to Adam from comment #30)
> > (In reply to Nate Rini from comment #29)
> > > (In reply to Adam from comment #21)
> > NodeName=cdr118 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12
> > ThreadsPerCore=1
> 
> Can you please set your CPUs=# (in slurm.conf) for your nodes to match the
> value returned from `slurmd -C`? I suspect proc_track/cgroups is getting
> confused by CPU count not being defined. Does your site has hyperthreads
> disabled in the UEFI/BIOS?

Yes, hyperthreading is disabled and has never been enabled at our site.  I believe we've always followed Sockets * CoresPerSocket * ThreadsPerCore = CPUs.
Are you thinking proctrack/cgroup might be failing to deduce that in this version, or are you just wanting to be thorough and cover all bases?
Comment 38 Adam 2019-10-28 23:43:14 MDT
(In reply to Nate Rini from comment #36)
> (In reply to Adam from comment #32)
> > sacct -j 29914274 -pl
> > ReqTRES
> >  cpu=6,gres/gpu=1,mem=3200M,node=1
> 
> > AllocTRES
> >  billing=1,cpu=6,mem=1.50G,node=1
> >  cpu=6,mem=1.50G,node=1
> >  billing=1,cpu=6,mem=1.50G,node=1
> 
> > ReqMem=256Mc
> > AllocCPUS=6
> 
> Based on your provided logs MinMemoryNode is not being honored. Instead the
> AllocCPUS=6 * ReqMem=256Mc = 1536MB per step is being calculated and
> overriding MinMemoryNode=3200M.
> 
> Is it more important to have your system up than having memory enforcement
> working as expected. I currently see no evidence that 17.11 job parameters
> are being corrupted but instead MinMemoryNode is not being enforced.
> 
> If you wanted to get cluster running now, you could set
> `ConstrainRAMSpace=No` in your cgroup.conf while we figure out why
> MinMemoryNode is not getting enforced.

I'm not sure we could run without memory enforcement.  We are usually running 5-10k jobs at any time and packing the nodes as tightly as possible.  I'd imagine there would be a lot of random OOMs hitting undeserving and unsuspecting users.
Comment 41 Nate Rini 2019-10-29 00:00:34 MDT
(In reply to Adam from comment #38)
> I'm not sure we could run without memory enforcement.  We are usually
> running 5-10k jobs at any time and packing the nodes as tightly as possible.
> I'd imagine there would be a lot of random OOM from undeserving and
> unsuspecting users?

Would it be possible to switch back to cons_res temporarily?
> SelectType=select/cons_res

It should only require a config change and a restart of the controller and daemons. cons_tres handles memory slightly differently, and I would rather get your site running and do the long-term debugging outside of an outage if possible.
Comment 42 Adam 2019-10-29 00:03:42 MDT
(In reply to Nate Rini from comment #41)
> (In reply to Adam from comment #38)
> > I'm not sure we could run without memory enforcement.  We are usually
> > running 5-10k jobs at any time and packing the nodes as tightly as possible.
> > I'd imagine there would be a lot of random OOM from undeserving and
> > unsuspecting users?
> 
> Would it be possible to switch back to cons_res temporarily?
> > SelectType=select/cons_res
> 
> It should only require a config change and a restart of the controller and
> daemons. Cons_tres handles memory differences slightly differently and I
> rather get your site running and do the long term debugging outside of an
> outage if possible.

I can, yes, but this problem actually first showed up in a big way while we were on cons_res. I had heard we might have some QoS and/or accounting problems on cons_tres, so I didn't switch to it until I saw these OOM issues popping up.
Comment 43 Nate Rini 2019-10-29 00:11:51 MDT
Please call this command to set a higher debug level on slurmctld:
> scontrol setdebug debug2
wait 5 mins (submit a job too to kick scheduler)
> scontrol setdebug info #to reverse

I'm mainly looking for a log message similar to this:
> Setting job's pn_min_cpus to %u due to memory limit
Comment 44 Nate Rini 2019-10-29 00:17:38 MDT
Please also call this:
> scontrol show part
Comment 45 Adam 2019-10-29 00:20:26 MDT
Created attachment 12139 [details]
scontrol show part
Comment 46 Nate Rini 2019-10-29 00:26:43 MDT
Please set DefMemPerCPU=0 in all of your partitions temporarily. That should disable the entire code path that may be causing the issue in comment #36.

Please also set CPUs= per comment #34, as that is required for 'SelectTypeParameters=CR_Core_Memory' with cons_tres/cons_res.

Once done, please restart all of your slurmctld and slurmd daemons.
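A sketch of what those slurm.conf changes might look like; the node line reuses the `slurmd -C` output above, while the partition name and node range are placeholders, not taken from the attached config:

```
# Hypothetical slurm.conf fragment (sketch only).
# CPUs= set explicitly to match `slurmd -C`:
NodeName=cdr118 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=128818

# Per-CPU default temporarily disabled on every partition:
PartitionName=examplepart Nodes=cdr[101-200] DefMemPerCPU=0 State=UP
```

As noted above, slurmctld and all slurmd daemons then need to be restarted.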
Comment 49 Adam 2019-10-29 00:41:21 MDT
(In reply to Nate Rini from comment #43)
> Please call this command to set a higher debug level on slurmctld:
> > scontrol setdebug debug2
> wait 5 mins (submit a job too to kick scheduler)
> > scontrol setdebug info #to reverse
> 
> I'm mainly looking for a log message similar to this:
> > Setting job's pn_min_cpus to %u due to memory limit

I didn't see any lines referencing "Setting job's pn_min_cpus...". My jobs had the correct mem-per-cpu and MinMemoryCPU set, it seems, in line with our observation that jobs submitted after the upgrade seemed OK.

Currently doing those slurm.conf changes and restarting
Comment 50 Martin Siegert 2019-10-29 01:00:14 MDT
Could these problems have anything to do with compiling slurm-19.05.3 with --enable-pam? We did not specify --enable-pam when compiling 17.11.10 (and previous versions). We do have UsePAM=0 in slurm.conf though.

- Martin
Comment 51 Adam 2019-10-29 01:24:47 MDT
(In reply to Nate Rini from comment #46)
> Please set DefMemPerCPU=0 in all of your partitions temporarily. That should
> disable the entire code path that may be causing issue in comment #36.
> 
> Please also set CPUs= per comment#34 as that is required for
> 'SelectTypeParameters=CR_Core_Memory' with cons_tres/cons_res.
> 
> Once done, please restart all of your slurmctld and slurmd daemons.

DefMemPerCPU=0, CPUs= and cons_res set are done along with those restarts.

So far it looks like we're not really getting OOMs, but we're also really low on jobs to test with, because they had mostly drained out by this point.

Is it safe to restore the StateSave data and restart slurmctld or would that confuse the database records?

This is not on my list of things to do tonight, since we're in a reported outage and need some rest before finishing this, but I'm curious whether that is an option or, if this config works, whether we just need to ask users to resubmit the 18000 jobs that were in the scheduler prior to this...
Also, I'm not sure what DefMemPerCPU=0 will imply if the user doesn't submit any memory request, or only --mem?
Thanks.
Comment 52 Felip Moll 2019-10-29 02:46:42 MDT
(In reply to Martin Siegert from comment #50)
> Could these problems have anything to do with compiling slurm-19.05.3 with
> --enable-pam?

Hi Martin,

No, this has nothing to do with UsePAM.


(In reply to Adam from comment #51)
> DefMemPerCPU=0, CPUs= and cons_res set are done along with those restarts.

It seems that cons_tres vs. cons_res won't make a difference. We're studying the impact that commit 8a1e5a5250b3ce469c71cf5f91abf506c0fe8a84 could have had, since there we changed how DefMemPer[CPU|Node] assignment is done for multi-partition jobs.

As far as I have seen, your jobs are not multi-partition, is that right?


> Is it safe to restore the StateSave data and restart slurmctld or would that
> confuse the database records?

I checked that when stopping slurmctld, restoring the state, and starting slurmctld again, the job records in the database are overwritten.


For example, this is id_job 728 (FAILED) before restoring the state:

MariaDB [slurm_acct_db_1905]> select job_db_inx,deleted,array_task_str,job_name,id_job,time_submit,time_eligible,time_start,time_end,time_suspended from llagosti_job_table where id_job=728;
+------------+---------+----------------+----------+--------+-------------+---------------+------------+------------+----------------+
| job_db_inx | deleted | array_task_str | job_name | id_job | time_submit | time_eligible | time_start | time_end   | time_suspended |
+------------+---------+----------------+----------+--------+-------------+---------------+------------+------------+----------------+
|      13129 |       0 | NULL           | hostname |    728 |  1572337883 |    1572337941 |          0 | 1572337936 |              0 |
+------------+---------+----------------+----------+--------+-------------+---------------+------------+------------+----------------+
2 rows in set (0.001 sec)

I restore the state and the job runs again; the database entries I then see are:

MariaDB [slurm_acct_db_1905]> select job_db_inx,deleted,array_task_str,job_name,id_job,time_submit,time_eligible,time_start,time_end,time_suspended from llagosti_job_table where id_job=728;
+------------+---------+----------------+----------+--------+-------------+---------------+------------+------------+----------------+
| job_db_inx | deleted | array_task_str | job_name | id_job | time_submit | time_eligible | time_start | time_end   | time_suspended |
+------------+---------+----------------+----------+--------+-------------+---------------+------------+------------+----------------+
|       2967 |       0 | NULL           | wrap     |    728 |  1546531261 |    1546531261 | 1546531261 | 1546531416 |              0 |
|      13129 |       0 | NULL           | hostname |    728 |  1572337883 |    1572337941 | 1572337969 | 1572337979 |              0 |
+------------+---------+----------------+----------+--------+-------------+---------------+------------+------------+----------------+
2 rows in set (0.001 sec)

This can conflict with usage records in the usage_* tables, but those can be fixed afterwards by hacking the last_ran_table values.

Nevertheless, let me do a couple more checks before giving you the final "OK, this is harmless". In any case I'd recommend a mysqldump if possible.

> Also, not sure what DefMemPerCPU will imply if the user doesn't submit any
> memory requestion, or just --mem ?
> Thanks.

When DefMemPerCPU is 0 it means unlimited. If the user doesn't request memory, the job will take all the memory on the selected nodes. For example, with 4 nodes of 100, 200, 300 and 400MB RealMemory, running srun -N4 hostname gives me:

   NumNodes=4 NumCPUs=4 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=1000M,node=4,billing=4



I will come back to you in a while.
Comment 53 Felip Moll 2019-10-29 03:15:04 MDT
> Is it safe to restore the StateSave data and restart slurmctld or would that
> confuse the database records?

What I see for now is that:

1. PD jobs that were not running and hadn't run before will be in the queue again with no further issues.
2. Jobs that were R will be R again, and they will have to be terminated manually if the real processes have already ended. These job entries will then be overwritten in the database.
3. R srun/salloc jobs that still have a connection to the client will keep running. If the srun/salloc is no longer there because the job has finished, errors like these will be logged:

slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.0.1:34953: Connection refused

4. If you had no R jobs at all when you saved the state, this should be safe, but you may still need to update last_ran_table.
5. There is a possibility of getting some duplicated entries in the database, for example if a job was R but hadn't yet received a job_db_inx from the database.

Honestly, though I think this may be possible to do, I am not sure we aren't forgetting some special corner case. This is not a standard procedure. If you had R jobs and you have the option, I would resubmit everything again. That would ensure no errors happen later.
Comment 55 Felip Moll 2019-10-29 05:24:17 MDT
We think we've found the issue.

What's happening is due to new code in 19.05 that uses a new flag in the job description (JOB_MEM_SET) and conditionally updates the job's memory limits in the job_limits_check() function.

The flag did not exist in 17.11 or 18.08, so jobs created before the upgrade lack it; for those jobs the case where memory was set by the user is ignored, and if they are in partitions that have DefMemPerCPU set, that value is taken as the per-node minimum memory instead of what the user specified.

The workaround is to temporarily unset DefMemPerCPU in the partitions and globally, which will cause all jobs to get the maximum memory available on their nodes. A job_submit.lua could be put in place to require new incoming jobs to specify memory while the old jobs are being processed. When all older jobs have been processed and are running or completed, revert DefMemPerCPU in the partitions and remove the job_submit.lua.

But this may be undesirable, since jobs will use as much memory as they can on the nodes.

The other immediate solution would be to cancel everything and resubmit all the pending jobs.

We're looking into other possibilities too.
Comment 56 Felip Moll 2019-10-29 06:44:52 MDT
I have another, probably better, workaround. It consists of updating the jobs, which will make them pick up the missing flag.

0. Take the Slurm nodes or partitions offline.
1. Set sched debug:

SlurmSchedLogFile=/home/lipi/slurm/19.05/inst/log/slurm_scheduler.log
SlurmSchedLogLevel=1

2. Do an 'scontrol show job'.

Identify all PENDING jobs that have a mem= field in TRES:

JobId=30123 JobName=wrap
...
TRES=cpu=1,mem=10M,node=1,billing=1
...

3. For each of them do:

scontrol update jobid <the_job_id> minmemorynode=<the_original_mem_value - 1>
scontrol update jobid <the_job_id> minmemorynode=<the_original_mem_value>

4. Open queues

With this update we fall into the _update_job section, which sets job_ptr->bit_flags |= JOB_MEM_SET; on the jobs.
We do two updates because if we updated with the same value we would see:

sched_debug("%s: new memory limit identical to old limit for %pJ");

Since these logs use sched_*(), they are written to SlurmSchedLogFile. For each job you should see 4 lines (the duplicated message is a bug, which I will address separately):

For example:

sched: [2019-10-29T13:35:54.124] _update_job: setting min_memory_job to 9 for JobId=3
sched: [2019-10-29T13:35:54.124] _update_job: setting min_memory_job to 9 for job_id 3

sched: [2019-10-29T13:35:55.915] _update_job: setting min_memory_job to 10 for JobId=3
sched: [2019-10-29T13:35:55.915] _update_job: setting min_memory_job to 10 for job_id 3


This will indicate the job has received the flag since the addition is just a line before this message.
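The per-job update in step 3 can be scripted. A hedged sketch (an editor's addition, not part of the ticket) that defaults to a dry run and only prints the scontrol commands; the job list here is hard-coded for illustration (the first line mirrors JobId=30123 from step 2, the second is made up), whereas on a real cluster it would be built from 'scontrol show job' output:

```shell
# Re-apply MinMemoryNode twice per pending job so that _update_job sets
# JOB_MEM_SET (see comment 56). DRY_RUN=1 (the default) only echoes the
# commands; set DRY_RUN=0 on a live system.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Input format: "<jobid> <minmemorynode_in_MB>" per line.
while read -r jobid mem_mb; do
    run scontrol update jobid "$jobid" minmemorynode="$((mem_mb - 1))"
    run scontrol update jobid "$jobid" minmemorynode="$mem_mb"
done <<'EOF'
30123 10
30124 3200
EOF
```

Afterwards the SlurmSchedLogFile should show the pairs of "setting min_memory_job" messages described above for each updated job.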
Comment 61 Martin Siegert 2019-10-29 11:16:41 MDT
We would like to investigate the feasibility of restoring the slurmctld state location.
The tarball was created more than 9 hours after all compute nodes were shut down, i.e., no jobs were running at that time. Under those conditions I would assume that after restoring:
- all jobs that failed because the OOM killer ran due to this bug will get run again;
- the few jobs that were not affected because they use only minimal memory, and that have finished by now, will get run again; we would need to scan for such jobs and remove them before starting the scheduler;
- similarly, all jobs that are currently running would need to be removed from the restored state directory.

Is this possible?

Alternatively we could kill all running jobs and then start again from the state at the time of the shutdown. Is that feasible?

- Martin
Comment 62 Nate Rini 2019-10-29 11:31:01 MDT
(In reply to Martin Siegert from comment #61)
> We would like to investigate the feasibility of restoring the slurmctld
> state location.
> The tarball was created more than 9 hours after all compute nodes were shut
> down, i.e., no jobs were running at that time. Under those conditions I
> would assume that after restoring
> - all jobs that failed due to the OOM killer running because of this bug
> will get run again;
> - the few jobs that were not affected because they use only minimal memory
> and finished by now, will get run again; we would need to scan for such jobs
> and remove them before starting the scheduler;
> - similarly all jobs that are currently running would need to be removed
> from the restored statedir.
> 
> Is this possible?
Yes, but your accounting will be wrong during this period.
 
> Alternatively we could kill all running jobs and then start again from the
> state at the time of the shutdown. Is that feasible?
Yes, assuming you can reload your slurm database from before the upgrade. You will still need to apply the workarounds in comment #56 or comment #51.

Please upload an updated copy of your slurm.conf.

Thanks,
--Nate
Comment 63 Nate Rini 2019-10-29 11:53:31 MDT
(In reply to Nate Rini from comment #62)
> (In reply to Martin Siegert from comment #61)
> > Alternatively we could kill all running jobs and then start again from the
> > state at the time of the shutdown. Is that feasible?
> Yes, assuming you can reload your slurm database from before the upgrade.
> You will still need to apply the workarounds in comment #56 or comment #51.

Just to be clear, you can go back to the state (statesavedir and slurm database) from before the upgrade safely. Any jobs run after the upgrade will be lost. It is not recommended to keep previously run jobs and load the old statesave directory.
Comment 64 Martin Siegert 2019-10-29 12:24:10 MDT
We do not have a backup of the slurmdb from after the shutdown.
Do I understand correctly that we nevertheless could use the first method to restore jobs? I.e.,
- generate a list of all jobs that are currently running or have been run after the upgrade;
- stop slurmctld and slurmdbd;
- create tarball of current 19.05 slurmctld state save location;
- wipe 19.05 slurmctld state save location;
- restore the slurmctld state save location from 17.11;
- remove from that location all jobs that were found in the first step;
- untar tarball from 19.05 slurmctld state save location;
- start slurmdbd and slurmctld with all partitions down;
- update memory of all jobs according to comment #56;
- unpause partitions.

Correct?
Comment 65 Felip Moll 2019-10-29 12:58:25 MDT
(In reply to Martin Siegert from comment #64)
> We do not have a backup of the slurmdb from after the shutdown.
> Do I understand correctly that we nevertheless could use the first method to
> restore jobs? I.e.,
> - generate a list of all jobs that are currently running or have been run
> after the upgrade;
> - stop slurmctld and slurmdbd;
> - create tarball of current 19.05 slurmctld state save location;
> - wipe 19.05 slurmctld state save location;
> - restore the slurmctld state save location from 17.11;
> - remove from that location all jobs that were found in the first step;

No, you can't modify the state save location contents. They are binary files, and doing so would require writing new, non-trivial code.

> - untar tarball from 19.05 slurmctld state save location;

And you cannot mix state saves at all; that won't work in any way.
Forget about manipulating the internals of state files.


1. The safe way is this one:

a) Cancel all running jobs (if any)
b) Put partitions or nodes offline, don't allow anything to run
c) Update memory of all old PD jobs still in the queue
d) Resubmit finished jobs since the upgrade
e) open queues

2. The riskier way, with possible undesired effects on the accounting, but which will put all jobs back in the queue:

a) Cancel all running jobs (if any)
b) Put partitions or nodes offline, don't allow anything to run
c) Stop slurmctld
d) Wipe current state file, restore old state file
e) Start slurmctld
f) Update all PD jobs memory
g) Put partitions online

I recommend the first way. We cannot ensure the accounting will be perfect afterwards; at least the last_ran_table should be updated to regenerate the usage tables.

Reread comment 53 before considering option 2.
Comment 66 Adam 2019-10-29 14:15:06 MDT
(In reply to Felip Moll from comment #65)
> 1. The safe way is this one:
> 
> a) Cancel all running jobs (if any)
> b) Put partitions or nodes offline, don't allow anything to run
> c) Update memory of all old PD jobs still in the queue
> d) Resubmit finished jobs since the upgrade
> e) open queues

Hi Felip,

For clarification on option 1, what would 'resubmit finished jobs since the upgrade' be? Is that a `scontrol requeue <joblist>`?

Thanks.
Comment 67 Adam 2019-10-29 14:20:39 MDT
Created attachment 12150 [details]
slurm.conf current 20191029 1300PDT
Comment 69 Felip Moll 2019-10-29 14:57:48 MDT
> Hi Felip,
> 
> For clarification on option 1, what would 'resubmit finished jobs since the
> upgrade' be?  Is that a `scontrol requeue <joblist>`
> 
> Thanks.

I checked it in my environment and it is safe to do a 'scontrol requeue' for the finished jobs.

This will create new entries in the database and will be recorded with different primary keys, which is good:

MariaDB [slurm_acct_db_1905]> select job_db_inx,deleted,array_task_str,job_name,id_job,time_submit,time_eligible,time_start,time_end,time_suspended from llagosti_job_table where id_job=17;
+------------+---------+----------------+----------+--------+-------------+---------------+------------+------------+----------------+
| job_db_inx | deleted | array_task_str | job_name | id_job | time_submit | time_eligible | time_start | time_end   | time_suspended |
+------------+---------+----------------+----------+--------+-------------+---------------+------------+------------+----------------+
|      11762 |       0 | NULL           | wrap     |     17 |  1571325508 |    1571325508 | 1571325509 | 1571325521 |              0 |
|      13167 |       0 | NULL           | wrap     |     17 |  1572382005 |    1572382005 | 1572382005 | 1572382012 |              0 |
|      13169 |       0 | NULL           | wrap     |     17 |  1572382012 |    1572382133 | 1572382149 | 1572382202 |              0 |
|      13170 |       0 | NULL           | wrap     |     17 |  1572382202 |    1572382323 |          0 |          0 |              0 |
+------------+---------+----------------+----------+--------+-------------+---------------+------------+------------+----------------+

Nevertheless, take into account that the script must still be in the state save location:

job.18]$ ll
total 12
-rw------- 1 lipi lipi 6202 29 oct. 21:53 environment
-rwx------ 1 lipi lipi  189 29 oct. 21:53 script

job.18]$ pwd
/..../state/hash.8/job.18

If the script is not found, then it won't be possible to requeue the job and a new submission will be needed.
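As a sanity check before requeueing in bulk, one could verify per job that the script survives. A hypothetical helper (the function name and logic are mine, untested against a real cluster) that prints the requeue command instead of running it:

```shell
#!/bin/sh
# For each candidate job id, print a requeue command only if the job's
# script still exists somewhere under the state save location; otherwise
# flag the job as needing a fresh submission.
check_requeueable() {   # usage: check_requeueable STATEDIR JOBID...
  state=$1; shift
  for jobid in "$@"; do
    if find "$state" -path "*/job.$jobid/script" 2>/dev/null | grep -q .; then
      echo "scontrol requeue $jobid"
    else
      echo "job $jobid: script missing, resubmit manually"
    fi
  done
}
```

For example, `check_requeueable /var/spool/slurmctld 30123 30124` (the path is this site's StateSaveLocation and an assumption for other setups); review the output before piping it to `sh`.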
Comment 70 Adam 2019-10-29 15:06:11 MDT
(In reply to Felip Moll from comment #69)
> I checked it in my environment and it is safe to do a 'scontrol requeue' for
> the finished jobs.
> 
> Nevertheless, take into account that the script must still be in the state
> save location:
> 
> If the script is not found, then it won't be possible to requeue the job and
> a new submission will be needed.

Pretty sure none of them are still in the statedir since they're well past OOM completion. Guessing I can't pull those out of the previous statedir hash directories and just place them in?

Also, on option 1 step 1 I don't see any reason for cancelling currently running jobs?
Comment 71 Felip Moll 2019-10-29 15:13:00 MDT
> Pretty sure none of them are still in the statedir since they're well past
> OOM completion. Guessing I can't pull those out of the previous statedir
> hash directories and just place them in?

They are saved in a specific hash calculated during runtime. I am not sure the same hash would be used now, so I'd say it is better to resubmit.

> Also, on option 1 step 1 I don't see any reason for cancelling currently
> running jobs?

They may be running with incorrect memory and eventually fail with OOM, so I thought it was better to start again.

--

1. The safe way is this one:

a) Cancel all running jobs (if any)  [ since they may be running with an incorrect memory setting ]
b) Put partitions or nodes offline, don't allow anything to run
c) Update memory of all old PD jobs still in the queue
d) Resubmit finished jobs since the upgrade [ this includes the jobs from a); since they were cancelled recently and their job scripts are still in the state save location, they are candidates for requeuing ]
e) open queues
Comment 72 Felip Moll 2019-10-29 15:43:36 MDT
> They are saved in a specific hash calculated during runtime. I am not sure
> the same hash would be used now, so I'd say it is better to resubmit.

I checked the function which calculates the hash, _copy_job_desc_to_file() in job_mgr.c:

It turns out it is quite basic. The hash is:

	hash = job_id % 10;

Two files must remain inside hash.X/job.<jobid>: environment and script.

I haven't tested it, but after seeing the code it may theoretically work. I cannot guarantee it right now. If you are still interested we can check this option.
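Given that, restoring a job's two files just means putting environment and script back under the directory this sketch computes. The /var/spool/slurmctld prefix is this site's StateSaveLocation and is an assumption for other setups:

```shell
#!/bin/sh
# Sketch of the path implied by hash = job_id % 10 in
# _copy_job_desc_to_file(); not an official tool.
job_state_dir() {   # usage: job_state_dir JOBID
  echo "/var/spool/slurmctld/hash.$(( $1 % 10 ))/job.$1"
}
job_state_dir 18    # -> /var/spool/slurmctld/hash.8/job.18
```

The demo value matches the hash.8/job.18 example shown earlier in this ticket.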
Comment 74 Adam 2019-10-29 16:13:23 MDT
(In reply to Felip Moll from comment #72)
> I haven't tested it, but after seeing the code it may theoretically work. I
> cannot guarantee it right now. If you are still interested we can check this
> option.

Wouldn't I find the hash just by looking in the tarball for job.<jobid>, and it would already show me the parent hash folder to copy it to?
Comment 75 Nate Rini 2019-10-29 16:18:07 MDT
(In reply to Adam from comment #74)
> Wouldn't I find the hash just by looking in the tarball for job.<jobid>, and
> it would already show me the parent hash folder to copy it to?
Yes, that should work.

Please note that requeue has an important catch: the jobs must still be known to slurmctld (`scontrol show job $jobid` must work). Slurm frees the job information based on the MinJobAge setting (which defaults to 300 seconds). From the jobs I looked at in your hash directory, most did not use the #SBATCH syntax in the batch files for submission, meaning the required information for the job is not in the batch scripts themselves.
Comment 76 Nate Rini 2019-10-29 16:18:47 MDT
Is the cluster currently up and running jobs? If I understand the flow of this ticket, the cluster is now up and we are trying to find a way to recover the lost queue.
Comment 78 Adam 2019-10-29 16:33:13 MDT
(In reply to Nate Rini from comment #76)
> Is the cluster currently up and running jobs? If I understand the flow of
> this ticket, the cluster is now up and we are trying to find a way to
> recover the lost queue.

Not currently; we're trying to figure out how to set minmemorynode when scontrol only shows minmemorycpu=1G and tres=mem=64G on a 2-node, 64-core, 64-task job. If we use the mem= value for minmemorynode then their job will now require 64G per node, not 32 cores * 1G like originally. Any ideas?
Comment 79 Nate Rini 2019-10-29 16:38:06 MDT
(In reply to Adam from comment #78)
> Not currently; we're trying to figure out how to set minmemorynode when
> scontrol only shows minmemorycpu=1G and tres=mem=64G on a 2-node, 64-core,
> 64-task job. If we use the mem= value for minmemorynode then their job will
> now require 64G per node, not 32 cores * 1G like originally. Any ideas?

Is this for new job submission or overriding values with scontrol?
Comment 83 Adam 2019-10-29 17:10:53 MDT
(In reply to Nate Rini from comment #79)
> Is this for new job submission or overriding values with scontrol?

This is for overriding the values for existing jobs so that they don't go OOM.
Comment 84 Martin Siegert 2019-10-29 17:14:48 MDT
We are wondering what value minmempernode should be set to for jobs that did not use the --mem=... argument in the job submission but, e.g., used --ntasks=... --mem-per-cpu=...
In those cases minmempernode is not even defined, is it?
Comment 85 Felip Moll 2019-10-29 17:18:32 MDT
(In reply to Martin Siegert from comment #84)
> We are wondering what value minmempernode should be set to for jobs that did
> not use the --mem=... argument in the job submission but, e.g., used
> --ntasks=... --mem-per-cpu=...
> In those cases minmempernode is not even defined, is it?

Correct, in these cases it is not defined.

What does the job look like now? (scontrol show job)

Is this how you would like it to look?:

]$ sbatch --begin=now+3minute --wrap uptime --mem-per-cpu=1G --cpus-per-task=1 --ntasks-per-node=32 -N2
Submitted batch job 30
]$ scontrol show job 30
JobId=30 JobName=wrap
   ....
   NumNodes=2-2 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=64G,node=2,billing=64
   Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=1G MinTmpDiskNode=0
   ....
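To spell out the arithmetic behind the display above (and why reusing the TRES mem value as a per-node value would double the request, as raised in comment #78):

```shell
#!/bin/sh
# --mem-per-cpu=1G with 32 tasks per node on 2 nodes:
mem_per_cpu_mb=1024
cpus_per_node=32
nodes=2
per_node_mb=$(( mem_per_cpu_mb * cpus_per_node ))   # 32768M = 32G per node
total_mb=$(( per_node_mb * nodes ))                 # 65536M = 64G, the TRES mem value
echo "per node: ${per_node_mb}M, TRES mem total: ${total_mb}M"
```

So TRES mem is per-CPU memory times the total CPU count across all nodes, not a per-node figure.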
Comment 86 Felip Moll 2019-10-29 17:28:59 MDT
But if you are trying to update running jobs' memory, you won't be able to.

Memory can only be modified while a job is PD.

]$ scontrol update job 31 MinMemoryCPU=2000
Job is no longer pending execution for job 31

]$ scontrol update job 31 MinMemoryNode=2000
Job is no longer pending execution for job 31

Updating the memory of a running job would be complicated since other allocations would have to be taken into account, the cgroup plugins would have to change limits, accounting would change, and so on.

There are other parameters, like number of tasks, nodes, CPUs per task, and reservations, that are likewise not changeable when a job is already running.
Comment 87 Martin Siegert 2019-10-29 17:32:41 MDT
(In reply to comment #85)
Correct, that's what we see for such jobs as well.
Hence the question becomes: what variable do we modify instead of minmemorynode (see comment #56) in order to set job_ptr->bit_flags for such jobs? Do we modify MinMemoryCPU?
Comment 88 Felip Moll 2019-10-29 17:43:37 MDT
(In reply to Martin Siegert from comment #87)
> (In reply to comment #85)
> Correct, that's what we see for such jobs as well.
> Hence the question becomes: what variable do we modify instead of
> minmemorynode (see comment #56) in order to set job_ptr->bit_flags for such
> jobs? Do we modify MinMemoryCPU?

Yes. 

Check this example to see if it is what you're looking for:

[lipi@llagosti inst]$ sbatch --begin=now+3minute --wrap uptime --mem-per-cpu=1G --cpus-per-task=1 --ntasks-per-node=32 -N2
Submitted batch job 36
[lipi@llagosti inst]$ scontrol show job 36
JobId=36 JobName=wrap
   UserId=lipi(1000) GroupId=lipi(1000) MCS_label=N/A
   Priority=146607 Nice=0 Account=lipi QOS=normal WCKey=*
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-10-30T00:42:53 EligibleTime=2019-10-30T00:45:53
   AccrueTime=2019-10-30T00:45:53
   StartTime=2019-10-30T00:45:53 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-10-30T00:42:53
   Partition=debug AllocNode:Sid=llagosti:4936
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=2-2 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=64G,node=2,billing=64
   Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/lipi/slurm/19.05/inst
   StdErr=/home/lipi/slurm/19.05/inst/slurm-36.out
   StdIn=/dev/null
   StdOut=/home/lipi/slurm/19.05/inst/slurm-36.out
   Power=

[lipi@llagosti inst]$ scontrol update JobId=36 MinMemoryCPU=1023
[lipi@llagosti inst]$ scontrol update JobId=36 MinMemoryCPU=1024

[lipi@llagosti inst]$ scontrol show job 36
JobId=36 JobName=wrap
   UserId=lipi(1000) GroupId=lipi(1000) MCS_label=N/A
   Priority=146607 Nice=0 Account=lipi QOS=normal WCKey=*
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-10-30T00:42:53 EligibleTime=2019-10-30T00:45:53
   AccrueTime=2019-10-30T00:45:53
   StartTime=2019-10-30T00:45:53 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-10-30T00:42:53
   Partition=debug AllocNode:Sid=llagosti:4936
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=2-2 NumCPUs=64 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=64G,node=2,billing=64
   Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/lipi/slurm/19.05/inst
   StdErr=/home/lipi/slurm/19.05/inst/slurm-36.out
   StdIn=/dev/null
   StdOut=/home/lipi/slurm/19.05/inst/slurm-36.out
   Power=
Comment 89 Martin Siegert 2019-10-29 22:19:48 MDT
We fixed the jobs in the queue and are back in production.
Thanks for all your help!

Cheers,
Martin
Comment 90 Nate Rini 2019-10-29 22:27:14 MDT
(In reply to Martin Siegert from comment #89)
> We fixed the jobs in the queue and are back in production.

Reducing severity per your response while we QA a patch set for this issue.
Comment 94 Felip Moll 2019-11-11 22:41:33 MST
Hi,

Issue has been fixed in 19.05:

commit 6abe1e7592b8e35bceedddc63033c5e82434d7a6
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Mon Nov 4 16:29:06 2019 +0100
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Thu Nov 7 17:40:23 2019 -0700

    Fix regression on update from older versions with DefMemPerCPU
    
    In 19.05 JOB_MEM_SET flag was added along with a conditional check on
    this flag that changed the pn_min_memory when validating job limits.
    This caused that after an upgrade, PD jobs in earlier versions didn't
    have this flag and the memory was incorrectly set when their limits were
    checked before starting. The patch here addresses this issue adding this
    flag to jobs from an older protocol version when loading the state
    files.
    
    Bug 8011


We fixed it by setting the JOB_MEM_SET flag, which did not exist in older versions, on all jobs read from the state files when upgrading.
This forces the memory settings to be preserved for all jobs. The only situation that will not behave perfectly is when DefMemPerCPU is changed during or after the upgrade to a new version: the old PD jobs won't get their memory values updated based on the new DefMemPerCPU. This cannot be fixed, since in older versions we don't distinguish who set the memory on a job. It is a minor issue which will disappear in 20.11, since the oldest supported version will then be 19.05, which already sets JOB_MEM_SET.

I am closing the issue, thanks!!