Ticket 3287

Summary: slurmctld thinks a node has been powering up for 8 days (during which time srun jobs hang)
Product: Slurm    Reporter: Phil Schwan <phils>
Component: slurmctld    Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN    QA Contact:
Severity: 3 - Medium Impact
Priority: ---    CC: alex, paull, stuartm
Version: 14.11.10
Hardware: Linux
OS: Linux
Site: DownUnder GeoSolutions

Description Phil Schwan 2016-11-23 03:58:37 MST
It's as simple as this:

srun -N 1 -p idle hostname

It works on some nodes:

> $ srun -N 1 -p idle -w lnod0046 hostname
> srun: job 6047591 queued and waiting for resources
> srun: job 6047591 has been allocated resources
> lnod0046

but not on others:

> $ srun -N 1 -p idle -w lnod0007 hostname
> srun: job 6047605 queued and waiting for resources
> srun: job 6047605 has been allocated resources
> (infinite hang)

When I run srun with -vvvvv:

> srun: job 6047632 has been allocated resources
> srun: Waiting for nodes to boot

Waiting for nodes to boot, you say?  That's a clue:

> $ scontrol show node lnod0007
> NodeName=lnod0007 CoresPerSocket=6
>    CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=N/A Features=intel,nogpu,gpu,8gpu,fastio
>    Gres=(null)
>    NodeAddr=lnod0007 NodeHostName=lnod0007 Version=(null)
>    RealMemory=129083 AllocMem=112640 Sockets=2 Boards=1
>    MemSpecLimit=122939
>    State=ALLOCATED+POWER ThreadsPerCore=2 TmpDisk=750 Weight=1
>    BootTime=None SlurmdStartTime=None
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

This document (http://slurm.schedmd.com/sinfo.html#lbAG) doesn't acknowledge "POWER" as a valid state, but maybe that documentation refers to a newer version.  I assume that this means "the node has been allocated to a job, and is in the process of being powered up"?

But the node has been up for 8 days!

> $ ssh lnod0007 uptime
>  10:43:45 up 8 days,  8:59,  0 users,  load average: 0.35, 0.86, 1.84

And moreover, slurm knows it -- it's running another job on it right now:

> 2016-11-23T10:43:46+00:00 lnod0007 slurmd[5988]: Launching batch job 6047735 for UID 1278

...and that job is running just fine, producing output, etc.

So I feel like there are two problems here:

a. Why does it think it's still in POWER state, with no BootTime or SlurmdStartTime, when lnod0007's slurmd is clearly talking to the slurmctld?

b. Why does an srun job hang, but a non-srun job works fine?

The combination of these two is pernicious, because slurmctld will *try* to schedule an srun job on these machines, but it will simply hang forever.  That's what makes it high-impact -- that job will get stuck forever, meanwhile those nodes are doing no work.

Thanks!
Comment 1 Alejandro Sanchez 2016-11-23 04:17:15 MST
Hi Phil - 14.11 is still supported, but it's already a fairly old version and many fixes have been added since. Just taking a quick look at the NEWS file I see a few fixes that could be related to this, such as:

[...]
-- Don't mark powered down node as not responding. This could be triggered by
    race condition of the node suspend and ping logic, preventing use of the
    node.
[...]
-- If a node is in DOWN or DRAIN state, leave it unavailable for allocation
    when powered down.
[...]

Can you attach slurm.conf, slurmctld.log and slurmd.log from lnod0007?
Comment 3 Alejandro Sanchez 2016-11-23 06:03:15 MST
Please note also that 14.11 support ends at the end of the month.
Comment 4 Phil Schwan 2016-11-28 02:31:55 MST
(In reply to Alejandro Sanchez from comment #1)
> 
> -- Don't mark powered down node as not responding. This could be triggered by
>     race condition of the node suspend and ping logic, preventing use of the
>     node.
> -- If a node is in DOWN or DRAIN state, leave it unavailable for allocation
>     when powered down.

Hmm.  These seem like kind of the opposite situation, non?  Isn't the issue here that it in fact is *not* powered down, but some part of slurm (but only part) thinks it is?

> Can you attach slurm.conf, slurmctld.log and slurmd.log from lnod0007?

Should be in your inbox now.
Comment 9 Alejandro Sanchez 2016-11-28 05:01:29 MST
Phil - isn't this bug a duplicate of bug #2965, opened by Paul and handled by Tim? I think it's the same issue and we should mark this one as a duplicate of the other and close it. Let me know what you think.

Anyhow, unless the config is inconsistent across nodes in the cluster, the config you attached has no Suspend/ResumeTimeout values, so the default SuspendTimeout is 30s and the default ResumeTimeout is 60s. max_delay should then be (30 + 60) * 5 = 450s. When you say it is an "infinite hang", is it actually infinite, or does something happen after 450s (job starts, fails, node changes state)? Also, is there any prolog script or SPANK plugin that might be influencing the initiation of the job?

src/srun/libsrun/allocate.c _wait_nodes_ready()

[...]
        suspend_time = slurm_get_suspend_timeout();
        resume_time  = slurm_get_resume_timeout();
        if ((suspend_time == 0) || (resume_time == 0))
                return 1;       /* Power save mode disabled */
        max_delay = suspend_time + resume_time;
        max_delay *= 5;         /* Allow for ResumeRate support */

        pending_job_id = alloc->job_id;

        for (i = 0; (cur_delay < max_delay); i++) {
                if (i) {
                        if (i == 1)
                                verbose("Waiting for nodes to boot");
[...]
Comment 11 Phil Schwan 2016-11-28 05:54:14 MST
Thanks, Alex -- I do agree that it looks like a duplicate.

Although it seems to me that that bug is off in the weeds.  Understanding why there are communication issues and job launch delays is one thing, and probably worth solving.  But I see that as merely a symptom of the REAL problem, which is that the internal slurm state is clearly deeply confused.

When I run an MPI job, it's "waiting for nodes to boot".

Whereas if I run a non-MPI job, it works just fine, runs immediately.  So slurm is totally aware that the node is alive and responsive, yet it persists with this fantasy that it's waiting to boot.

Surely this is the real problem?

You're right, though, it's not infinite, merely indefinite:

$ date ; srun -N 1 -p lud58 -w lud58 hostname ; date
Mon Nov 28 12:10:53 GMT 2016
srun: job 6057741 queued and waiting for resources
srun: job 6057741 has been allocated resources
srun: error: Nodes lud58 are still not ready
srun: error: Something is wrong with the boot of the nodes.
srun: Force Terminated job 6057741
Mon Nov 28 12:46:02 GMT 2016

Not sure if 36 minutes means anything to you, but there it is.
Comment 12 Alejandro Sanchez 2016-11-28 06:56:58 MST
Is it possible that the nodes that currently belong to the lugy cluster were used in the past with Power Save mode enabled? Perhaps some nodes saved the NODE_STATE_POWER_UP flag to the node state file, then you changed the configuration and restarted the nodes, so this flag still persists on those nodes.
Comment 13 Alejandro Sanchez 2016-11-28 07:26:51 MST
Also what happens if you restart slurmd on one of these ALLOCATED+POWER nodes? Does the state change?
Comment 14 Alejandro Sanchez 2016-11-28 07:35:39 MST
And I see this other commit that might be related to this:

https://github.com/SchedMD/slurm/commit/c8d46bfe2819f24fc

These types of commits, plus 14.11 being unsupported at the end of the month, are why we encourage upgrading to the latest 16.05 version.
Comment 15 paull 2016-11-28 09:10:17 MST
Hi Alejandro,

I think this has everything to do with the powersaving in 14.11. It may be slurm related or it could be our powersaving scripts. 

I have been working on a newer version of our scripts but for now we are using the same version that has been in place for months. 

Every single node in this cluster was set to either IDLE+POWER or ALLOCATED+POWER. To remove the power flag I simply did scontrol update node=node212 state=power_down. It is my understanding that setting a node state to power_down changes the state in the internal tables, but does not execute the power down script. The job on the node continued running. I'm curious to see what happens with the node after the job completes; I will need to see this before I run the scontrol command on all nodes.

When a node "resumes" from powersave, what are some possible causes of the state retaining the +POWER flag?

State prior to changes
....
[root@lugy ~]# scontrol show node=lnod0004
NodeName=lnod0004 CoresPerSocket=6
   State=ALLOCATED+POWER ThreadsPerCore=2 TmpDisk=750 Weight=1
....

Command run:
....
[root@lugy ~]# ping lnod0004
PING lnod0004.dugeo.com (172.29.4.4) 56(84) bytes of data.
64 bytes from lnod0004.dugeo.com (172.29.4.4): icmp_seq=1 ttl=64 time=0.129 ms
64 bytes from lnod0004.dugeo.com (172.29.4.4): icmp_seq=2 ttl=64 time=0.113 ms
^C
--- lnod0004.dugeo.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1843ms
rtt min/avg/max/mdev = 0.113/0.121/0.129/0.008 ms
[root@lugy ~]# scontrol update node=lnod0004 state=power_down
....

State after changes:
....
[root@lugy ~]# scontrol show node=lnod0004
NodeName=lnod0004 CoresPerSocket=6
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=750 Weight=1
....

I took this a step further and ran power_up:
....
[root@lugy ~]# scontrol update node=lnod0004 state=power_up
[root@lugy ~]# scontrol show node=lnod0004
NodeName=lnod0004 CoresPerSocket=6
   CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=4.33 Features=intel,nogpu,gpu,8gpu,fastio
   Gres=(null)
   NodeAddr=lnod0004 NodeHostName=lnod0004 Version=(null)
   RealMemory=129083 AllocMem=112640 Sockets=2 Boards=1
   MemSpecLimit=122939
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=750 Weight=1
   BootTime=None SlurmdStartTime=None
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
....

As far as upgrading to 16.05, there is an issue that needs to be cleared before we upgrade (#3239). Once we have a fix for this we will be set to upgrade to that version. I will be testing our new powersave scripts against that version this week.

Thanks,
Paul
Comment 16 paull 2016-11-28 09:36:34 MST
Hi Alejandro,

We have discovered that, due to invalid nodes in the SuspendExcNodes list, the powersaving module was disabled for this cluster.

Since this is the case, why is slurm trying to manage power states at all?

See below:
....
[2016-11-28T14:03:36.272] error: power_save module disabled, invalid SuspendExcNodes lud[1-2,4-5,7-8,10-13,15-16,18,21,23-27,29-32,34-57]
....

We are currently making efforts to correct this across all clusters. Just hope this is not an issue in version 16.05.
Comment 17 Alejandro Sanchez 2016-11-28 09:38:33 MST
It might be related to bug #3078 which we are currently working on. It seems SuspendExcNodes is ignored on SIGHUP (scontrol reconfigure).
Comment 19 paull 2016-11-28 10:57:30 MST
After correcting the SuspendExcNodes list in the cluster in question in this ticket, the +POWER was automatically removed from each node's STATE without manual intervention. I now need to do some tests to see whether or not this solves the ticketed issue.
Comment 20 paull 2016-11-28 11:10:49 MST
These ran swiftly as desired:

[root@lugy 000scratch]# srun -N 1 -p lud60 hostname                                                                                                 
srun: job 6059449 queued and waiting for resources
srun: job 6059449 has been allocated resources
lud60
[root@lugy 000scratch]# srun -N 1 -p lud61 hostname                                                                                                 
srun: job 6059450 queued and waiting for resources
srun: job 6059450 has been allocated resources
lud61
[root@lugy 000scratch]# srun -N 1 -p lud62 hostname                                                                                                 
srun: job 6059452 queued and waiting for resources
srun: job 6059452 has been allocated resources
lud62


We still need to see the results of the MPI job but they should not wait for nodes to boot since all nodes will no longer have the +POWER flag attached unless they are truly in powersave state and are booting via the ResumeProgram.
Comment 22 Alejandro Sanchez 2016-11-28 12:54:27 MST
(In reply to paull from comment #15)
> It is my understanding that setting a node state to power_down changes the 
> state in the internal tables, but does not execute the power down script.

scontrol man page says:
"POWER_DOWN" and "POWER_UP" will use the configured SuspendProg and ResumeProg programs to explicitly place a node in or out of a power saving mode. If a node is already in the process of being powered up or down, the command will have no effect until the configured ResumeTimeout or SuspendTimeout is reached.

So it seems updating the node state invokes the Progs.
Comment 23 Alejandro Sanchez 2016-11-29 07:05:56 MST
(In reply to paull from comment #16)
> Hi Alejandro,
> 
> We have discovered that due to invalid nodes in the SuspendExcNodes list,
> powersaving module was disabled for this cluster.
> 
> Since this the case, why is slurm trying to manage power states at all?
> 
> See below:
> ....
> [2016-11-28T14:03:36.272] error: power_save module disabled, invalid
> SuspendExcNodes lud[1-2,4-5,7-8,10-13,15-16,18,21,23-27,29-32,34-57]
> ....
> 
> We are currently making efforts to correct this across all clusters. Just
> hope this is not an issue in version 16.05.

Paul - this other commit addressed to 16.05.3:

https://github.com/SchedMD/slurm/commit/6ef8369ec1fbc4a11

moves the place where the SuspendExcNodes and SuspendExcParts configuration parameters are processed (it needs to happen AFTER the partition and node tables in the slurmctld daemon are built), and thus the error you see is also addressed by this change.

Besides that, we already have a fix to not ignore the values of SuspendExcNodes and SuspendExcParts on slurmctld SIGHUP; it is pending review and will be pushed. I think that after that, and once the MemSpecLimit bug #3239 is solved, you can upgrade to the latest 16.05. I expect most of these power save problems you're experiencing in 14.11 will then be solved.

Is it fine with you to close this bug until you upgrade, and reopen it if you then experience more power save problems?
Comment 24 Alejandro Sanchez 2016-12-06 05:28:57 MST
(In reply to paull from comment #15)
> As far as upgrading to 16.05, there is an issue that needs to be cleared
> before we upgrade (#3239). Once we have a fix for this we will be set to
> upgrade to that version. I will be testing our new powersave scripts against
> that version this week.

MemSpecLimit bug #3239 and SuspendExcNodes/SuspendExcParts bug #3078 have been fixed in slurm-16.05.7. So I'd encourage you to test your new powersave scripts against the latest 16.05 (including the fixes for these two bugs) and see how it goes.
Comment 25 paull 2016-12-06 08:58:32 MST
Thanks Alejandro, I will install 16.05.7 in our test environment and test these out. Thanks for your help, and I'll update as soon as that happens.

Thanks,
Paul
Comment 26 paull 2016-12-07 14:50:13 MST
Alejandro,

I just did a git pull on the slurm git repo; I see 16.05.6.1 but not 16.05.7.

[paull@hud6 slurm]$ git checkout slurm-16
slurm-16.05           slurm-16-05-0-0pre2   slurm-16-05-0-0rc2    slurm-16-05-1-1       slurm-16-05-3-1       slurm-16-05-5-1 
slurm-16-05-0-0pre1   slurm-16-05-0-0rc1    slurm-16-05-0-1       slurm-16-05-2-1       slurm-16-05-4-1       slurm-16-05-6-1

Has it been pushed to github yet? 

Thanks,
Paul
Comment 27 Tim Wickberg 2016-12-07 14:55:53 MST
(Stepping in for Alex, it's getting late over in Europe.)

16.05.7 should be released on Thursday, we're running some final tests on it today. Sorry if he didn't mention that yet.

If you're working off the github branch, slurm-16.05 should be okay, although there are at least a few more patches you may want that'll be committed in the next day; if you can wait for that, we'd highly prefer you stick to a specific released version.

- Tim
Comment 28 paull 2016-12-07 14:57:44 MST
Thanks for the update. I will wait for 16.05.7 as I was told it fixes various issues I have currently ticketed.

Thanks,
Paul
Comment 29 Alejandro Sanchez 2016-12-13 05:17:33 MST
Paul, any updates so far after 16.05.7 release?
Comment 30 Alejandro Sanchez 2016-12-20 06:15:25 MST
Switching severity from 2 to 3 since we haven't had feedback in a week. Please let us know how it goes after the update to 16.05.7.
Comment 31 Alejandro Sanchez 2017-01-11 09:02:12 MST
Paul/Phil - did you update Slurm to 16.05.7 or higher? Your last comment is from December 7th. I'll mark the bug as resolved/timedout in a few days, assuming you upgraded and there are no more issues with the Power logic. Please let me know otherwise. Thank you!
Comment 32 paull 2017-01-11 09:26:46 MST
Hi Alejandro,

I did see that the newest version is out. I have not had a chance to test it but will be working on it soon. I have been tweaking the powersave scripts themselves with promising results.

Question: What happens when the exit code of the powersave script is non-zero? Does Slurm reverse course, present an error, or something else?

Thanks,
Paul
Comment 33 Alejandro Sanchez 2017-01-13 06:12:53 MST
(In reply to paull from comment #32)
> Hi Alejandro,
> 
> I did see that the newest version is out. I have not had a chance to test it
> but will be working on it soon. I have been tweaking the powersave scripts
> themselves with promising results.

Ok.
 
> Question: What happens with the when the exit code of the powersave script
> is non-zero? Does Slurm reverse course, present error or something else?
> 
> Thanks,
> Paul

In order to execute the Suspend/ResumeProgram(s), Slurm uses the same function, _run_prog(), in src/slurmctld/power_save.c. This function fork/execv's the configured program, passing these arguments:

* prog IN      - program to run
* arg1 IN      - first program argument, the hostlist expression
* arg2 IN      - second program argument, or NULL
* job_id IN    - Passed as SLURM_JOB_ID environment variable

After execv is run, the function just calls exit(1) if the exec fails, but it never checks any exit status from the program.
Comment 34 Alejandro Sanchez 2017-01-13 06:14:28 MST
Also the pid of the program is returned and one of these messages is logged:

static void _do_resume(char *host)
{
        pid_t pid = _run_prog(resume_prog, host, NULL, 0);
#if _DEBUG
        info("power_save: pid %d waking nodes %s", (int) pid, host);
#else
        verbose("power_save: pid %d waking nodes %s", (int) pid, host);
#endif
}

static void _do_suspend(char *host)
{
        pid_t pid = _run_prog(suspend_prog, host, NULL, 0);
#if _DEBUG
        info("power_save: pid %d suspending nodes %s", (int) pid, host);
#else
        verbose("power_save: pid %d suspending nodes %s", (int) pid, host);
#endif
}
Comment 35 paull 2017-01-16 12:16:09 MST
Hi Alejandro,

I have successfully compiled version 17.02.0.0 and have started testing. Huge concern that I am now facing: Upgrading from 14.11.10 to 17.02.0.0 according to documentation will remove job state information. Is there a workaround?

Thanks,
Paul
Comment 36 Alejandro Sanchez 2017-01-17 03:22:44 MST
(In reply to paull from comment #35)
> Hi Alejandro,
> 
> I have successfully compiled version 17.02.0.0 and have started testing.
> Huge concern that I am now facing: Upgrading from 14.11.10 to 17.02.0.0
> according to documentation will remove job state information. Is there a
> workaround?
> 
> Thanks,
> Paul

Slurm daemons will support RPCs and state files from the two previous minor releases (e.g. a version 16.05.x SlurmDBD will support slurmctld daemons and commands of version 16.05.x, 15.08.x, or 14.11.x). I'd suggest upgrading to the latest stable production release (currently 16.05.8). I wouldn't use the latest release from the master branch (currently 17.02.0pre5) for production use, since pre-releases are still subject to protocol changes and may contain some instability. Anyhow, if you want to upgrade to 17.02 from 14.11, you should first upgrade to an intermediate minor release, such as 15.08 or 16.05, to preserve the state.
Comment 37 Alejandro Sanchez 2017-02-02 10:29:38 MST
Paul - any update on this?
Comment 38 paull 2017-02-02 10:38:07 MST
Currently testing 16.05.7.1, which looks promising. Once it is installed we will see whether this was fixed in the update.
Comment 39 Alejandro Sanchez 2017-02-06 03:30:11 MST
For the same effort I'd go for 16.05.9 or the current latest.
Comment 40 Alejandro Sanchez 2017-02-14 08:41:28 MST
Paul - please note that at least one new fix related to Power Save has been added in 16.05.10:

https://github.com/SchedMD/slurm/commit/f6d42fdbb293ca (Bug #3446).

Since you're testing Power Save in newer versions, I just wanted to bring it to your attention. Please let us know whether it works as expected for you in 16.05.10, or in 16.05.9 with this patch applied.
Comment 41 paull 2017-02-15 09:47:40 MST
Hi Alejandro,

I just did a git pull and I see slurm-16-05-9-1 but not slurm-16-05-10-1. When will this version be released?

Thanks,
Paul
Comment 42 Alejandro Sanchez 2017-02-15 10:03:56 MST
(In reply to paull from comment #41)
> I just did a git pull and I see slurm-16-05-9-1 but not slurm-16-05-10-1.
> When will this version be released?

We don't have an estimated date yet, but maintenance versions are usually released monthly. Slurm 16.05.9 was released ~January 31st, so I think 16.05.10 will most probably be tagged around the last week of February or the first week of March.
Comment 43 paull 2017-02-15 10:54:07 MST
Hi Alejandro,

Thanks for your update.

Version: 16.05.07.1

I am testing powersaving by explicitly telling slurm to power_down a node. It goes to idle*, then nothing:

State: idle hnod0229

scontrol update node=hnod0229 state=power_down

State: idle* hnod0229

Then after a couple minutes:

State: idle hnod0229

Config:

SuspendTime=3600
SuspendRate=1
ResumeRate=12
SuspendProgram="/d/sw/slurm-test/etc/off-test.sh"
ResumeProgram="/d/sw/slurm-test/etc/on-test.sh"
SuspendTimeout=360
ResumeTimeout=1200


Permissions on Suspend/ResumePrograms: 
-rwxrwxr-x 1 root      prod    4944 Feb 15 11:28 off-test.sh
-rwxrwxr-x 1 root      prod    2413 Feb 15 11:23 on-test.sh

Logs:

[2017-02-15T11:42:48.716] powering down node hnod0229
[2017-02-15T11:42:48.716] debug2: _slurm_rpc_update_node complete for hnod0229 usec=98
[2017-02-15T11:42:49.598] debug2: Performing purge of old job records
[2017-02-15T11:42:49.598] debug:  sched: Running job scheduler
[2017-02-15T11:42:59.611] debug2: Testing job time limits and checkpoints
[2017-02-15T11:42:59.612] debug2: Performing purge of old job records
[2017-02-15T11:43:09.627] debug2: Performing purge of old job records
[2017-02-15T11:43:09.628] debug:  sched: Running job scheduler
[2017-02-15T11:43:17.843] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2017-02-15T11:43:17.843] debug2: _slurm_rpc_dump_partitions, size=1265 usec=158
[2017-02-15T11:43:17.844] debug3: Processing RPC: REQUEST_NODE_INFO from uid=0
[2017-02-15T11:43:19.645] debug2: Performing purge of old job records
[2017-02-15T11:43:28.333] debug:  backfill: beginning
[2017-02-15T11:43:28.333] debug:  backfill: no jobs to backfill
[2017-02-15T11:43:29.662] debug2: Testing job time limits and checkpoints
[2017-02-15T11:43:29.662] debug2: Performing purge of old job records
[2017-02-15T11:43:29.663] debug:  sched: Running job scheduler
[2017-02-15T11:43:29.663] debug2: Performing full system state save
[2017-02-15T11:43:29.666] debug3: Writing job id 960 to header record of job_state file
[2017-02-15T11:43:29.676] debug2: Sending tres '1=738,2=377942,3=0,4=22' for cluster
[2017-02-15T11:43:32.723] debug:  Spawning ping agent for hnod0229
[2017-02-15T11:43:32.723] debug2: Spawning RPC agent for msg_type REQUEST_PING
Comment 44 Alejandro Sanchez 2017-02-17 07:09:42 MST
Paul - With 16.05.9 + commit f6d42fdbb293ca (16.05.10) I'm not experiencing any more issues when Slurm automatically handles Power Save. There's a note in the scontrol man page warning about manually updating node states:

"Generally only "DRAIN", "FAIL" and "RESUME" should be used."

So we don't encourage manually powering nodes down/up through scontrol. Anyhow, we do support it, and it should work as expected. With 16.05.9 + latest commits I'm seeing some issues when manually powering nodes down/up.

Initial node state: idle
$ scontrol update nodename=compute1 state=power_down
Node state after update: idle~ // OK
$ sbatch -w compute1 --wrap "sleep 9999"
Node state after allocation: mix~ // I believe BAD. I think it should be mix#/mix
Job state: CF

Eventually:

Node state: mix
Job state: R

My best guess is that there's a small divergence between the logic managing state transitions in the power_save.c thread _do_power_work() and the logic in node_mgr.c update_node(). We're working on this and will come back to you. Again, I highly recommend using the latest 16.05 and not manually powering nodes up/down.
Comment 47 paull 2017-02-17 11:54:19 MST
I have upgraded my test environment to 16.05.9.1 + commit f6d42fdbb293ca. When I issue power_down, the node goes to idle* (no response, right?) and not idle~, which shows the power state. Why?

Is there a rule in slurm that will not allow a node into powersave if it's the only node available?

[root@htst0001 init.d]# scontrol update node=hnod0229 state=power_down
[root@htst0001 init.d]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
vms          up   infinite     10  down* htst[0701-0710]
ports        up   infinite     10  down* htclus[001-010]
teamdev      up   infinite      1  idle* hnod0229
teamdev      up   infinite     20  down* htclus[001-010],htst[0701-0710]
idle         up   infinite      1  idle* hnod0229
idle         up   infinite     21  down* hnod0248,htclus[001-010],htst[0701-0710]
noGPU        up   infinite      1  idle* hnod0229
noGPU        up   infinite     10  down* htclus[001-005],htst[0701-0705]
hasGPU       up   infinite     10  down* htclus[006-010],htst[0706-0710]
all        down   infinite     20  down* htclus[001-010],htst[0701-0710]
[root@htst0001 init.d]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
vms          up   infinite     10  down* htst[0701-0710]
ports        up   infinite     10  down* htclus[001-010]
teamdev      up   infinite      1  idle* hnod0229
teamdev      up   infinite     20  down* htclus[001-010],htst[0701-0710]
idle         up   infinite      1  idle* hnod0229
idle         up   infinite     21  down* hnod0248,htclus[001-010],htst[0701-0710]
noGPU        up   infinite      1  idle* hnod0229
noGPU        up   infinite     10  down* htclus[001-005],htst[0701-0705]
hasGPU       up   infinite     10  down* htclus[006-010],htst[0706-0710]
all        down   infinite     20  down* htclus[001-010],htst[0701-0710]
[root@htst0001 init.d]# scontrol show node=hnod0229
NodeName=hnod0229 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.05
   AvailableFeatures=localdisk,nogpu,intel
   ActiveFeatures=localdisk,nogpu,intel
   Gres=(null)
   NodeAddr=hnod0229 NodeHostName=hnod0229 Version=16.05
   OS=Linux RealMemory=48390 AllocMem=0 FreeMem=41723 Sockets=2 Boards=1
   MemSpecLimit=6144
   State=IDLE* ThreadsPerCore=2 TmpDisk=107894 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2017-02-09T23:55:06 SlurmdStartTime=2017-02-17T11:26:10
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

[root@htst0001 init.d]# scontrol show node=hnod0229 | grep -E "NodeName|State|Reason"
NodeName=hnod0229 Arch=x86_64 CoresPerSocket=6
   State=IDLE ThreadsPerCore=2 TmpDisk=107894 Weight=1 Owner=N/A MCS_label=N/A
[root@htst0001 init.d]# scontrol --version
slurm 16.05.9
Comment 52 Alejandro Sanchez 2017-02-20 09:14:25 MST
(In reply to paull from comment #47)
> I have upgraded my test environment to 16.05.9.1 + commit f6d42fdbb293ca.
> When I issue power_down the node goes to idle* (no response, right?) and not
> idle~ which shows power state. Why?

It's strange that, with an initial state of idle, manually issuing power_down sends the node to idle* (which effectively means it is not responding) instead of idle~ (which is what I see when I do it locally). What do your suspend/resume programs do? Can you attach them?

> Is there a rule in slurm that will not allow a node into powersave if its
> the only node available?

I don't believe so.
Comment 54 paull 2017-02-20 18:05:36 MST
Hi Alejandro,

This was a mistake on my end. I have corrected it and the desired effect happened.

I will continue testing.
Comment 55 Alejandro Sanchez 2017-02-21 07:49:41 MST
(In reply to paull from comment #54)
> Hi Alejandro,
> 
> This was a mistake on my end. I have corrected it and the desired effect
> happened.
> 
> I will continue testing.

No problem. Please note that in comment #44 I said that once you allocate a job to an idle~ (IDLE+POWER, i.e. powersaved) node, it becomes mix~ and the job goes to CF state. I thought mix~ was a bug, expecting mix# on the assumption that the node starts powering up right after job allocation. But the node doesn't start powering up (mix#), and the ResumeProgram is not executed, until SuspendTimeout has passed; otherwise we could end up running the Suspend and Resume programs at the same time, behavior we obviously do not wish to happen. So please let me know how it goes with the Power Save tests and if there's anything else we can assist you with on this bug.
Comment 56 Alejandro Sanchez 2017-03-10 06:17:40 MST
Paul - any progress with this? can we close this bug as well? Thanks.
Comment 57 paull 2017-03-10 09:51:11 MST
I am in the midst of upgrading to version 16.05.9.1. Once this is done, we will test out powersave again but for now it is disabled.
Comment 58 Tim Wickberg 2017-03-15 11:27:12 MDT
Paul -

Marking this as resolved/infogiven; comment 54 indicates this behavior appears to have been resolved in the newer releases.

- Tim