Ticket 8224 - When changing a pending job from 4 to 1 node via scontrol, the change is not reflected in the squeue listing of the job; it is in scontrol
Summary: When changing a pending job from 4 to 1 node via scontrol, the change is not ...
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld (show other tickets)
Version: 19.05.2
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Duplicates: 8747 (view as ticket list)
Depends on:
Blocks:
 
Reported: 2019-12-12 08:46 MST by Jenny Williams
Modified: 2020-09-02 10:27 MDT (History)
3 users (show)

See Also:
Site: University of North Carolina at Chapel Hill
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.02.2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Jenny Williams 2019-12-12 08:46:36 MST
# squeue -j 44289341
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          44289341       hov     wrap hepperla PD       0:00      4 (JobHeldUser)
[root@longleaf-sched slurm_utils]# scontrol show job 44289341
JobId=44289341 JobName=wrap
   UserId=hepperla(214234) GroupId=its_graduate_psx(203) MCS_label=N/A
   Priority=0 Nice=0 Account=rc_ijdavis_pi QOS=normal
   JobState=PENDING Reason=JobHeldUser Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=10-12:00:00 TimeMin=N/A
   SubmitTime=2019-12-12T09:33:38 EligibleTime=2019-12-12T09:33:38
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-12T10:41:49
   Partition=hov AllocNode:Sid=longleaf-login2.its.unc.edu:13492
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=c0308
   NumNodes=1-1 NumCPUs=12 NumTasks=12 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=12,mem=200G,node=1,billing=12
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=50G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/proj/dllab/Austin/TCGA/BRCA/ATAC_bams
   StdErr=/proj/dllab/Austin//SLURM_logs//2019/12/12/20191212_09-33-38-508
   StdIn=/dev/null
   StdOut=/proj/dllab/Austin//SLURM_logs/2019/12/12/20191212_09-33-38-508
   Power=
Comment 1 Dominik Bartkiewicz 2019-12-12 09:47:04 MST
Hi

Does this occur only for held jobs?
If so, I can reproduce it.

Dominik
Comment 2 Jenny Williams 2019-12-12 10:12:04 MST
No. I held it afterwards.

Comment 3 Dominik Bartkiewicz 2019-12-12 10:38:12 MST
Hi

squeue displays the node count based on the value returned from the select plugin, and this can take some time to be reevaluated.
Could you check whether this value is updated if you wait a few minutes?

Dominik
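(For reference, the stale NODES value can be pulled out of captured squeue output with awk. The sample line below is copied from the report above; treating NODES as the 7th whitespace-separated field is an assumption based on squeue's default output format.)

```shell
# Extract the NODES column (7th field in squeue's default format) for one job.
# The sample output is copied from this ticket; on a live system you would
# pipe `squeue -j <jobid>` directly into awk instead.
squeue_out='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          44289341       hov     wrap hepperla PD       0:00      4 (JobHeldUser)'
echo "$squeue_out" | awk -v job=44289341 '$1 == job { print $7 }'
```

Here this prints 4, the select-plugin-derived count, even though scontrol already shows NumNodes=1-1 for the same job.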
Comment 6 Ryan Day 2020-03-11 09:34:42 MDT
We're seeing this bug at LLNL too. The job waiting on resources gets updated in the squeue output, but jobs lower in the queue don't. I waited at least five minutes and it never updated. We're using the cons_res select plugin and Slurm 19.05.5, FWIW.

Here's what I see:
[day36@ipa15:~]$ srun -N3 sleep 600 &
[1] 126473
[day36@ipa15:~]$ squeue -p pall
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             22277      pall    sleep    day36  R       0:06      3 ipa[4-5,7]
[day36@ipa15:~]$ srun -N3 sleep 600 &
[2] 126492
[day36@ipa15:~]$ srun: job 22278 queued and waiting for resources
srun -N3 sleep 600 &
[3] 126498
[day36@ipa15:~]$ srun: job 22279 queued and waiting for resources
squeue -p pall
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             22278      pall    sleep    day36 PD       0:00      3 (Resources)
             22279      pall    sleep    day36 PD       0:00      3 (Priority)
             22277      pall    sleep    day36  R       0:14      3 ipa[4-5,7]
[day36@ipa15:~]$ scontrol update jobid=22279 numnodes=1-1
[day36@ipa15:~]$ squeue -p pall
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             22278      pall    sleep    day36 PD       0:00      3 (Resources)
             22279      pall    sleep    day36 PD       0:00      3 (Priority)
             22277      pall    sleep    day36  R       0:49      3 ipa[4-5,7]
[day36@ipa15:~]$ scontrol show job 22279 | grep -i numnodes
   NumNodes=1-1 NumCPUs=3 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
[day36@ipa15:~]$ squeue -p pall
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             22278      pall    sleep    day36 PD       0:00      3 (Resources)
             22279      pall    sleep    day36 PD       0:00      3 (Priority)
             22277      pall    sleep    day36  R       1:12      3 ipa[4-5,7]
[day36@ipa15:~]$ scontrol update jobid=22278 numnodes=1-1
[day36@ipa15:~]$ squeue -p pall
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             22278      pall    sleep    day36 PD       0:00      1 (Resources)
             22279      pall    sleep    day36 PD       0:00      3 (Priority)
             22277      pall    sleep    day36  R       1:23      3 ipa[4-5,7]
[day36@ipa15:~]$ scontrol show job 22278 | grep -i numnodes
   NumNodes=1-1 NumCPUs=3 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
[day36@ipa15:~]$ squeue -p pall
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             22278      pall    sleep    day36 PD       0:00      1 (Resources)
             22279      pall    sleep    day36 PD       0:00      3 (Priority)
             22277      pall    sleep    day36  R       5:28      3 ipa[4-5,7]
[day36@ipa15:~]$ squeue -p pall
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             22278      pall    sleep    day36 PD       0:00      1 (Resources)
             22279      pall    sleep    day36 PD       0:00      3 (Priority)
             22277      pall    sleep    day36  R       6:43      3 ipa[4-5,7]
[day36@ipa15:~]$
Comment 9 Dominik Bartkiewicz 2020-03-30 05:49:26 MDT
Hi

This commit should fix this issue:
https://github.com/SchedMD/slurm/commit/623574431d545b2ff0

We are still waiting for a review of additional patches (bug 8110) to be able to return a more precise estimate of the node count required by jobs.

I'll go ahead and close this ticket, but feel free to let me know if you have
any additional questions about the fix.

Dominik
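(A quick way to see the disagreement the commit addresses is to compare the two views directly. This is an illustrative sketch over the values captured in comment 6 for job 22279; the parsing is hypothetical, and on a live system the two inputs would come from `squeue` and `scontrol show job`.)

```shell
# Compare squeue's stale NODES value with the authoritative NumNodes range
# from scontrol. Sample values are taken from job 22279 in comment 6.
squeue_nodes=3                                   # NODES column as squeue showed it
scontrol_line='NumNodes=1-1 NumCPUs=3 NumTasks=1'
# Pull the maximum of the NumNodes=min-max range.
scontrol_max=$(printf '%s\n' "$scontrol_line" |
  sed -n 's/.*NumNodes=[0-9]*-\([0-9]*\).*/\1/p')
if [ "$squeue_nodes" != "$scontrol_max" ]; then
  echo "mismatch: squeue=$squeue_nodes scontrol=$scontrol_max"
fi
```

With the fix applied, `squeue -j 22279 -h -o '%D'` and the NumNodes field from `scontrol show job 22279` should agree shortly after the update instead of diverging like this.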
Comment 10 Ryan Day 2020-03-30 15:33:26 MDT
Thanks Dominik. At first glance, it looks like it shouldn't be any problem to apply this patch back to 19.05 as well. Do you know of any reason that wouldn't work?

(In reply to Dominik Bartkiewicz from comment #9)
> Hi
> 
> This commit should fix this issue:
> https://github.com/SchedMD/slurm/commit/623574431d545b2ff0
> 
> We are still waiting for a review of additional patches (bug 8110) to be
> able to return a more precise estimate of the node count required by jobs.
> 
> I'll go ahead and close this ticket, but feel free to let me know if you have
> any additional questions about the fix.
> 
> Dominik
Comment 11 Dominik Bartkiewicz 2020-03-31 07:39:58 MDT
Hi

This patch was prepared against 19.05 and will work correctly with it.

Dominik
Comment 12 Dominik Bartkiewicz 2020-04-01 05:08:40 MDT
*** Ticket 8747 has been marked as a duplicate of this ticket. ***