Ticket 8747 - squeue doesn't update number of nodes for pending jobs when it is changed by the user
Status: RESOLVED DUPLICATE of ticket 8224
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.0
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
Reported: 2020-03-27 12:23 MDT by Ryan Day
Modified: 2020-04-01 05:08 MDT

Site: LLNL


Description Ryan Day 2020-03-27 12:23:26 MDT
I believe this is the same problem described in bug 8224, but that bug doesn't appear to be getting much attention, and I confirmed that the issue also exists in 20.02.0, so I thought I'd bring it up again. Basically, if a user updates the number of nodes that a pending job is requesting, the change appears to take effect, but it is not reflected in the output of squeue for any job except the highest-priority one. Bug 8224 also suggested that squeue might just take longer to get an updated value from the select plugin, but if that's the case, it's taking more than an hour.

Here's an example:
[day36@haze2:~]$ srun -N3 -t60 sleep 1h &
[1] 66888
[day36@haze2:~]$ srun: job 157 queued and waiting for resources

[day36@haze2:~]$ srun: job 157 has been allocated resources

[day36@haze2:~]$ srun -N3 sleep 10m &
[2] 66904
[day36@haze2:~]$ srun: job 158 queued and waiting for resources
srun -N3 sleep 10m &
[3] 66907
[day36@haze2:~]$ srun: job 159 queued and waiting for resources
[day36@haze2:~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               158    pdebug    sleep    day36 PD       0:00      3 (Resources) 
               159    pdebug    sleep    day36 PD       0:00      3 (Priority) 
               157    pdebug    sleep    day36  R       0:10      3 haze[6-8] 
[day36@haze2:~]$ scontrol update jobid=158 numnodes=1-1
[day36@haze2:~]$ scontrol update jobid=159 numnodes=1-1
[day36@haze2:~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               159    pdebug    sleep    day36 PD       0:00      3 (Priority) 
               158    pdebug    sleep    day36 PD       0:00      1 (Resources) 
               157    pdebug    sleep    day36  R       1:21      3 haze[6-8] 
[day36@haze2:~]$ for I in `seq 1 12`; do echo $I; squeue; sleep 5m; done
1
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               158    pdebug    sleep    day36 PD       0:00      1 (Resources) 
               159    pdebug    sleep    day36 PD       0:00      3 (Priority) 
               157    pdebug    sleep    day36  R       3:04      3 haze[6-8] 
2
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               158    pdebug    sleep    day36 PD       0:00      1 (Resources) 
               159    pdebug    sleep    day36 PD       0:00      3 (Priority) 
               157    pdebug    sleep    day36  R       8:04      3 haze[6-8] 
...
11
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               158    pdebug    sleep    day36 PD       0:00      1 (Resources) 
               159    pdebug    sleep    day36 PD       0:00      3 (Priority) 
               157    pdebug    sleep    day36  R      53:04      3 haze[6-8] 
12
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               158    pdebug    sleep    day36 PD       0:00      1 (Resources) 
               159    pdebug    sleep    day36 PD       0:00      3 (Priority) 
               157    pdebug    sleep    day36  R      58:04      3 haze[6-8] 
srun: job 159 has been allocated resources
srun: job 158 has been allocated resources
[1]   Done                    srun -N3 -t60 sleep 1h
[day36@haze2:~]$ 
[day36@haze2:~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               158    pdebug    sleep    day36  R       4:06      1 haze6 
               159    pdebug    sleep    day36  R       4:06      1 haze7 
[day36@haze2:~]$

This is with SelectType=select/cons_res. I haven't checked select/cons_tres yet.
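A quick way to tell whether the controller actually stored the new node count (i.e. whether only squeue's display is stale) is to compare scontrol's view of the job against squeue's NODES column. A minimal sketch using job 158 from the example above; the `sample` line is a stand-in for live scontrol output, not a capture from this ticket:

```shell
# On a live cluster, query the controller directly:
#   scontrol show job 158 | tr ' ' '\n' | grep '^NumNodes='
#   squeue -j 158 --noheader -o '%D'    # %D = node count squeue would print
# scontrol reflects "scontrol update jobid=158 numnodes=1-1" immediately,
# while squeue's NODES column may stay stale as described above.

# Stand-in for a captured "scontrol show job" line:
sample='JobId=158 JobName=sleep NumNodes=1-1 NumCPUs=1 NumTasks=1'

# Split the record into one field per line and pull out NumNodes.
num_nodes=$(printf '%s\n' "$sample" | tr ' ' '\n' | grep '^NumNodes=')
echo "$num_nodes"    # NumNodes=1-1
```

If the two disagree (scontrol shows `NumNodes=1-1` while `squeue -o '%D'` still prints 3), the update was accepted by the controller and only the squeue-visible value is stale.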
Comment 1 Jason Booth 2020-03-27 13:15:52 MDT
>I believe that this is the same problem described in bug 8224, but that bug doesn't appear to be getting a lot of attention and I confirmed that the issue exists in 20.02.0 as well...

Hi Ryan - We apologize for the confusion in bug #8224. We have a review process that all potential fixes go through to avoid introducing new issues into the codebase. There is a patch pending for this, and I will leave it to Dominik whether he wants to make it available for you to test.

I hope you understand why there seems to be a delay in that ticket while we review the patch.
Comment 2 Ryan Day 2020-03-27 14:08:40 MDT
Ah. That makes sense. Thank you for the update Jason. Since 19.05.6 was just released as the last 19.05 release aside from security fixes, should I also assume that the patch will only be available for 20.02?

Thanks,
Ryan

Comment 3 Jason Booth 2020-03-27 14:12:11 MDT
> ... should I also assume that the patch will only be available for 20.02?

That is correct. Bug fixes are being targeted for 20.02 now that it has been released; only fixes for security issues or crashes will be targeted for 19.05.
Comment 4 Dominik Bartkiewicz 2020-04-01 05:08:40 MDT
Hi

The fix is committed to the repo and will be available in the 20.02.2 release.
https://github.com/SchedMD/slurm/commit/623574431d545b2ff0

This patch goes only to the 20.02 branch, but you can apply it to 19.05 safely.
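Applying the fix to 19.05 amounts to cherry-picking the linked commit onto a 19.05 source tree. The mechanics can be sketched on a throwaway repository (the branch and file names below are placeholders; on a real system you would cherry-pick the commit from the GitHub link above onto your checkout of SchedMD's 19.05 release branch and rebuild slurmctld):

```shell
set -e
# Build a throwaway repo with a "release" branch and a later fix commit.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.email=x@y -c user.name=x commit -q --allow-empty -m 'base'
git branch maint-19.05                  # placeholder for the 19.05 branch
echo 'fix' > select_nodes.c             # placeholder file touched by the fix
git add select_nodes.c
git -c user.email=x@y -c user.name=x commit -q -m 'fix pending-job node count'
fix=$(git rev-parse HEAD)               # stand-in for the linked commit hash

# Backport: switch to the release branch and cherry-pick the fix onto it.
git checkout -q maint-19.05
git -c user.email=x@y -c user.name=x cherry-pick "$fix"
```

In a real slurm checkout the cherry-pick may conflict if the surrounding code diverged between branches; resolve as usual, then rebuild and restart slurmctld.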

I'm going to go ahead and close this bug as a duplicate of 8224.
If you have any questions, feel free to reopen.

Dominik

*** This ticket has been marked as a duplicate of ticket 8224 ***