Ticket 3551 - Allocation for reserved node hangs
Summary: Allocation for reserved node hangs
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 16.05.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2017-03-08 04:00 MST by Sven Sternberger
Modified: 2017-05-23 18:49 MDT

Site: DESY


Attachments
slurm.conf.old (8.16 KB, application/x-trash)
2017-04-11 01:34 MDT, Sven Sternberger
Details
slurm.conf (8.01 KB, application/octet-stream)
2017-04-11 01:34 MDT, Sven Sternberger
Details

Description Sven Sternberger 2017-03-08 04:00:54 MST
Hello!

We configured a partition with a single node. The partition config
is:

PartitionName=parta    Priority=10   Default=NO  State=UP OverSubscribe=FORCE:42   MaxNodes=1   AllowGroups=xxxuser    Nodes=foo

Then we create a reservation:

ReservationName=faa StartTime=2017-03-07T17:12:23 EndTime=2017-03-10T09:00:00 Duration=2-15:47:37
   Nodes=foo NodeCnt=1 CoreCnt=32 Features=(null) PartitionName=parta Flags=IGNORE_JOBS,SPEC_NODES
   TRES=cpu=64
   Users=joe,mike,tim
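As a consistency check on those reservation numbers: Slurm counts one CPU per hardware thread, so TRES=cpu=64 with CoreCnt=32 would mean foo is configured with ThreadsPerCore=2 (an assumption on our part; foo's NodeName line is not shown in this ticket). A minimal sketch:

```python
# Sanity check on the reservation's TRES counters: Slurm derives
# cpu = CoreCnt * ThreadsPerCore. ThreadsPerCore=2 is assumed here,
# since node foo's definition does not appear in the ticket.
core_cnt = 32          # CoreCnt from the reservation
threads_per_core = 2   # assumed for node foo
tres_cpu = core_cnt * threads_per_core
print(tres_cpu)  # 64, matching TRES=cpu=64
```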

This works most of the time, and joe, mike, and tim can run salloc
against the partition (parta) and log into foo in parallel.

But from time to time the salloc stays pending even though the
oversubscribe limit has not been reached. If we restart slurmctld,
the pending salloc runs.

We think the problem is triggered when some other job is submitted to a
different partition (partx). The node foo is not part of partx, but we see
the following pattern in the log:

debug:  sched: Running job scheduler
debug2: found 6 usable nodes from config containing fee[003-008]
...

debug2: Advanced reservation removed 7 nodes from consideration for job 222372
select_p_job_begin: job 222372 could not find partition partx for node foo
...
error: select_g_job_begin(222372): No error

We hope you can help us or advise us on how to debug this further.

best regards,

Sven Sternberger
Comment 1 Sven Sternberger 2017-03-08 04:17:43 MST
We removed the oversubscribe settings, and the allocation still goes
pending from time to time.
Comment 2 Sven Sternberger 2017-03-09 05:45:37 MST
I have investigated further and still have no clue.

It looks like there are a few nodes which are no longer accessible to Slurm
after the log shows this line:
 
select_p_job_begin: job 222372 could not find partition partx for node foo

The affected nodes show up in sinfo as idle, but I can't use them.
I can connect to the slurm daemon on the node with netcat,
and wireshark shows me that there is no communication to the affected nodes
when I try an srun. But from time to time there is communication
to these nodes (status updates, perhaps?).

The workaround is to restart slurmctld, but after a while (we assume it is
triggered by a large job submitted to a particular partition) the
select_p_job_begin log line appears again, and then we are back in the same
situation.
Comment 3 Sven Sternberger 2017-03-10 02:09:04 MST
Now we have found the source of our problem:

We have several overlapping partitions, node allocation, and the backfill
scheduler. We added one partition which has only one node:

PartitionName=foo    Priority=10   Default=NO  State=UP OverSubscribe=FORCE:42   MaxNodes=1   AllowGroups=usergrp    Nodes=nodex

That worked as expected. But when we submitted a job which got queued, we
saw these lines in the log from time to time:

select_p_job_begin: job 222372 could not find partition partx for node nodey

Since neither that partition nor that node appears in the new partition's
definition, we didn't see the connection at first.

After the line appears, some nodes can no longer execute jobs; their status
is idle and they are reachable. I set them to drain and resumed them, and I
rebooted a node, all without any change. I realized that I couldn't even set
the status to down. When I restart slurmctld it works again, until a job is
queued again. The affected nodes were always more or less the same: one was
always the node from the new partition, and the other affected nodes were in
different partitions.

We removed the OverSubscribe=FORCE:42 flag from the partition and added
"Shared=No" like on the other partitions, and now everything works fine
(except that we can't share the node :-/).

best regards!
Comment 4 Sven Sternberger 2017-03-16 04:08:21 MDT
The problem is back. We thought we had fixed it, but it now occurs once a
week instead of every day. We also suspect that it depends on how busy the
cluster is.

So we see again:

slurmctld[11758]: select_p_job_begin: job 123456 could not find partition foo for node wn0815

The node wn0815 is not part of the partition definition for foo.
Afterwards wn0815 can't get jobs, and the node's status is idle.

best regards
Comment 5 Tim Wickberg 2017-04-10 10:32:24 MDT
Hi Sven -

Do you mind attaching a recent copy of your slurm.conf file? That'll help me understand the system layout. I'm working through the rest of the problem description now.

cheers,
- Tim
Comment 6 Sven Sternberger 2017-04-11 01:34:31 MDT
Created attachment 4333 [details]
slurm.conf.old

Hi Tim! 

slurm.conf is the current config, and slurm.conf.old is the one which we
assume triggered the problem.

cheers! 

Sven 

Comment 7 Sven Sternberger 2017-04-11 01:34:32 MDT
Created attachment 4334 [details]
slurm.conf
Comment 8 Tim Wickberg 2017-05-03 13:06:34 MDT
The one thing I'm noticing is the change of ThreadsPerCore on the max-wgs001 node. When you did that, did you run 'scontrol reconfigure' to update the config throughout the cluster?

The 'NodeName=max-wgs[001-001]' syntax is a bit unusual. I don't think that should cause any problems, but just to humor me do you mind changing that to max-wgs001 in the config, then running 'scontrol reconfigure' to sync up everything?
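For reference, the bracketed form is Slurm's hostlist syntax, and [001-001] is simply a one-element range. A minimal expansion sketch (a throwaway helper, not Slurm's full hostlist grammar, which also supports comma-separated lists and multiple bracket groups):

```python
import re

def expand_hostlist(expr):
    # Expand a single zero-padded bracket range like "max-wgs[001-003]"
    # into the individual host names. Anything without a bracket range
    # is returned as-is.
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if not m:
        return [expr]
    prefix, lo, hi, suffix = m.groups()
    width = len(lo)  # preserve zero-padding
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(lo), int(hi) + 1)]

print(expand_hostlist("max-wgs[001-001]"))  # ['max-wgs001']
print(expand_hostlist("fee[003-008]"))      # the 6 nodes from the earlier log line
```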

tim@redshift:~$ diff Downloads/slurm.conf.old  Downloads/slurm.conf 
99c99
< NodeName=max-wgs[001-001]  Weight=9  RealMemory=256 Sockets=2 CoresPerSocket=8  ThreadsPerCore=2 State=UNKNOWN Feature=INTEL,V3,E5-2640
---
> NodeName=max-wgs[001-001]  Weight=9  RealMemory=256 Sockets=2 CoresPerSocket=8  ThreadsPerCore=1 State=UNKNOWN Feature=INTEL,V3,E5-2640
137d136
< PartitionName=p3bl-test    Priority=10   Default=NO  State=UP OverSubscribe=FORCE:42   MaxNodes=1   AllowGroups=max-p3-testuser    Nodes=max-wng003
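The practical effect of that ThreadsPerCore change: slurmctld derives a node's CPU count as Sockets x CoresPerSocket x ThreadsPerCore, so the edit halves max-wgs001 from 32 CPUs to 16, which is why an un-reconfigured cluster can end up with inconsistent state. A quick sketch (a throwaway parser for illustration, not Slurm code):

```python
import re

def cpus_from_node_line(line):
    # Compute the CPU count slurmctld derives from a slurm.conf NodeName
    # line: Sockets * CoresPerSocket * ThreadsPerCore.
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    return (int(fields["Sockets"])
            * int(fields["CoresPerSocket"])
            * int(fields["ThreadsPerCore"]))

old = ("NodeName=max-wgs[001-001] Weight=9 RealMemory=256 "
       "Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN")
new = old.replace("ThreadsPerCore=2", "ThreadsPerCore=1")
print(cpus_from_node_line(old))  # 32
print(cpus_from_node_line(new))  # 16
```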
Comment 9 Tim Wickberg 2017-05-23 18:49:19 MDT
(In reply to Tim Wickberg from comment #8)
> The one thing I'm noticing is the change of ThreadsPerCore on the max-wgs001
> node. When you did that, did you run 'scontrol reconfigure' to update the
> config throughout the cluster?

Marking this as resolved/timedout for now - I'd expected a reply on this, and haven't been pursuing this in the meantime. Please reopen if this is still an active concern.

cheers,
- Tim