Hello! We configured one partition with one node. The partition config is:

PartitionName=parta Priority=10 Default=NO State=UP OverSubscribe=FORCE:42 MaxNodes=1 AllowGroups=xxxuser Nodes=foo

Then we create a reservation:

ReservationName=faa StartTime=2017-03-07T17:12:23 EndTime=2017-03-10T09:00:00 Duration=2-15:47:37 Nodes=foo NodeCnt=1 CoreCnt=32 Features=(null) PartitionName=parta Flags=IGNORE_JOBS,SPEC_NODES TRES=cpu=64 Users=joe,mike,tim

This works most of the time: joe, mike, and tim can each run a salloc against the partition (parta) and log into foo in parallel. BUT from time to time the salloc stays pending even though the overcommit limit is not reached. If we restart slurmctld, the pending salloc is executed.

We think the problem is triggered when some other job is submitted to another partition (partx). The node foo is not part of partx, but we see the following pattern in the log:

debug: sched: Running job scheduler
debug2: found 6 usable nodes from config containing fee[003-008]
...
debug2: Advanced reservation removed 7 nodes from consideration for job 222372
select_p_job_begin: job 222372 could not find partition partx for node foo
...
error: select_g_job_begin(222372): No error

We hope you can help us or give us advice on how to debug the problem further.

best regards,
Sven Sternberger
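To correlate these errors with the affected nodes, we pull the job id, partition, and node out of the "select_p_job_begin" lines. This is just a sketch of that extraction; the sample line is copied from the log excerpt above, and in practice you would feed it real slurmctld log lines (the log path on your system is an assumption, so it is left out here).

```shell
# Sketch: extract job id, partition, and node from a
# "select_p_job_begin" error line for correlation with sinfo.
# The sample line is copied from the log excerpt above.
line='select_p_job_begin: job 222372 could not find partition partx for node foo'

job=$(echo "$line"  | sed -n 's/.*job \([0-9]*\) could not.*/\1/p')
part=$(echo "$line" | sed -n 's/.*find partition \([^ ]*\) for.*/\1/p')
node=$(echo "$line" | sed -n 's/.*for node \([^ ]*\)$/\1/p')

echo "job=$job partition=$part node=$node"
```

Running this on the sample line prints `job=222372 partition=partx node=foo`.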
So we removed the overcommit settings, and the allocation still goes pending from time to time.
Okay, I investigated further and still have no clue. It looks like a few nodes become inaccessible to Slurm after the log shows this line:

select_p_job_begin: job 222372 could not find partition partx for node foo

The affected nodes show up in sinfo as idle, but I can't access them. I can connect to the slurmd daemon on the node with netcat, and Wireshark shows me there is no communication to the affected nodes when I try an srun. But from time to time there is communication to these nodes (status update?).

The workaround is to restart slurmctld, but after a while (we assume it is triggered by a large job submitted to a particular partition) the log line (select_p_job_begin*) appears again, and afterwards we have the same situation.
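To spot the stuck nodes quickly we filter the node-oriented sinfo output for idle state and then try to reach each candidate. This is only a sketch over a fabricated sample; on a real cluster you would pipe in something like `sinfo -N -h -o '%N %P %t'` (that exact format string is an assumption about a convenient layout, not what we actually ran).

```shell
# Sketch: filter node-oriented sinfo-style output (node, partition, state)
# for nodes reported as idle. The sample data below is fabricated.
sample='foo parta idle
fee003 partx alloc
fee004 partx idle'

echo "$sample" | awk '$3 == "idle" {print $1}'
```

On the fabricated sample this prints `foo` and `fee004`, one node per line; each printed node would then be a candidate for a manual srun/netcat check.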
Now we have found the source of our problem: we have several overlapping partitions, node allocations, and the backfill scheduler. We added one partition which has only one node:

PartitionName=foo Priority=10 Default=NO State=UP OverSubscribe=FORCE:42 MaxNodes=1 AllowGroups=usergrp Nodes=nodex

That worked as expected. But when we submitted a job that got queued, we saw from time to time these lines in the log:

select_p_job_begin: job 222372 could not find partition partx for node nodey

As neither the partition nor the node was in the partition definition, we didn't see the relation. After the line appeared, some nodes couldn't execute jobs; their status was idle and they were reachable. I set them to drain and resume, and I rebooted the node, without any change. I realized that I couldn't set the status to down. When I restart slurmctld it works again until a job is queued again.

The affected nodes were always more or less the same; one was always the node from the new partition, and the other affected nodes were in different partitions. We removed the OverSubscribe=FORCE:42 flag from the partition and added "Shared=No" like on the other partitions, and now everything works fine (except that we can't share the node :-/).

best regards!
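Concretely, assuming the rest of the partition line stayed unchanged, the slurm.conf edit we made was roughly:

```
# before (node oversubscribed; this definition coincided with the problem):
PartitionName=foo Priority=10 Default=NO State=UP OverSubscribe=FORCE:42 MaxNodes=1 AllowGroups=usergrp Nodes=nodex

# after (workaround; matches our other partitions, node no longer shared):
PartitionName=foo Priority=10 Default=NO State=UP Shared=No MaxNodes=1 AllowGroups=usergrp Nodes=nodex
```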
The problem is back. We thought we had fixed it, but it now occurs about once a week instead of every day. We also suspect that it depends on how busy the cluster is. So we see again:

slurmctld[11758]: select_p_job_begin: job 123456 could not find partition foo for node wn0815

The node wn0815 is not part of the partition definition for foo. Afterwards wn0815 can't get jobs, and the status of the node is idle.

best regards
Hi Sven -

Do you mind attaching a recent copy of your slurm.conf file? That'll help me understand the system layout. I'm working through the rest of the problem description now.

cheers,
- Tim
Created attachment 4333 [details]
slurm.conf.old

Hi Tim!

slurm.conf is the current config, and slurm.conf.old is the one which, in our assumption, triggered the problem.

cheers!
Sven
Created attachment 4334 [details] slurm.conf
The one thing I'm noticing is the change of ThreadsPerCore on the max-wgs001 node. When you did that, did you run 'scontrol reconfigure' to update the config throughout the cluster?

The 'NodeName=max-wgs[001-001]' syntax is a bit unusual. I don't think that should cause any problems, but just to humor me, do you mind changing that to max-wgs001 in the config, then running 'scontrol reconfigure' to sync up everything?

tim@redshift:~$ diff Downloads/slurm.conf.old Downloads/slurm.conf
99c99
< NodeName=max-wgs[001-001] Weight=9 RealMemory=256 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN Feature=INTEL,V3,E5-2640
---
> NodeName=max-wgs[001-001] Weight=9 RealMemory=256 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN Feature=INTEL,V3,E5-2640
137d136
< PartitionName=p3bl-test Priority=10 Default=NO State=UP OverSubscribe=FORCE:42 MaxNodes=1 AllowGroups=max-p3-testuser Nodes=max-wng003
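Concretely, the node line I'd suggest is something like this (same parameters as in your current slurm.conf, just the hostname spelled out instead of the single-element range):

```
NodeName=max-wgs001 Weight=9 RealMemory=256 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN Feature=INTEL,V3,E5-2640
```

followed by an 'scontrol reconfigure' so the change propagates to all daemons.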
(In reply to Tim Wickberg from comment #8)
> The one thing I'm noticing is the change of ThreadsPerCore on the max-wgs001
> node. When you did that, did you run 'scontrol reconfigure' to update the
> config throughout the cluster?

Marking this as resolved/timedout for now - I'd expected a reply on this, and haven't been pursuing it in the meantime. Please reopen if this is still an active concern.

cheers,
- Tim