Description
Tom Wurgler
2021-10-12 13:55:38 MDT
Please note we do have primary and backup slurmctld's. So we would need to shut both of them down before changing the slurm.conf? Thanks.

Hi Tom,

Shutting down slurmctld is recommended when adding nodes because srun jobs can otherwise fail. Slurm uses bitmaps to keep track of node information, and these are created when the daemon starts up, based on slurm.conf. When srun attempts to make an allocation while there is a discrepancy between the slurmctld and slurmd bitmaps, the job fails. Stopping slurmctld ensures that no srun jobs can be scheduled or allocated while slurm.conf is altered and the daemons restarted, preventing srun jobs from failing during the node-adding process.

There is a technical difference between an srun job and an sbatch job with srun commands. To make it clear: an srun job is one created directly by srun, which handles both making the allocation and the step; an sbatch job with srun commands in it is one where sbatch makes the allocation and srun only makes the steps.

If your site can ensure that no srun jobs will be submitted and scheduled during the node-adding process (or is okay with them failing, for that matter), then you could get away with not stopping slurmctld. It should be noted that stopping slurmctld only prevents further scheduling. Moreover, restarting slurmd will not interrupt running jobs, and completing jobs will wait for slurmctld to return to service.

Best, Skyler

OK, so we follow the steps below. A couple more questions: Do we need to stop the backup slurmctld? Before or after the primary? When restarting things, do we start the backup slurmctld before or after the primary? And how long do we have to get things restarted once slurmctld is shut down before pending or running jobs start dying? Thanks.

> Do we need to stop the backup slurmctld? Before or after the primary?
> When restarting things, do we start the backup slurmctld before or after the primary?

Backup daemons form a linear hierarchy -- think of an array of daemons. The zeroth daemon is the primary, followed by the backup daemons in order. It is best to stop them in reverse order (slurmctld[n], ..., then slurmctld[0]) and start them in forward order (slurmctld[0], ..., then slurmctld[n]) so as to reduce switching of control.

> And how long do we have to get things restarted once slurmctld is shut down before pending or running jobs start dying?

When slurmctld is shut down, only scheduling stops. Jobs which have been allocated and started keep running on the slurmd(s). Running jobs will not die; rather, once they have finished on slurmd they stay in a completing state and wait to notify slurmctld. Pending jobs stay in the queue until slurmctld schedules them or they are cancelled by a user/admin.

This was another catastrophic day. The cluster was 100% used for the nodes in slurm.conf, with literally hundreds of jobs pending (over 5000 cores' worth). We wanted to add another chassis of nodes that weren't in slurm.conf to start with.
1) Shut down the backup slurmctld.
2) Shut down the primary slurmctld.
3) Installed the new slurm.conf and topology file with the new nodes added.
4) Restarted slurmd on all cluster nodes.
5) Restarted slurmd on all desktops.
6) Started the primary slurmctld.
7) Started the backup slurmctld.
Within seconds users were calling with failed jobs. 86 jobs died that we know of; most still say they are running but aren't. What can we do to prevent this kind of thing? No typos in the files this time. The new nodes are now running jobs happily, but loss of jobs is about the worst thing the admins can do. We just can't have this happen. I am raising the priority as this is so important. Please advise. Thanks, Tom.

Would you please attach your slurm.conf's (before and after the node addition) and the slurmctld.log?

Created attachment 21768 [details]
slurmctld.log
Created attachment 21769 [details]
part of slurm.conf
Created attachment 21770 [details]
new slurm.conf
Created attachment 21771 [details]
previous slurm.conf
Created attachment 21772 [details]
new topology.conf
Created attachment 21773 [details]
previous topology.conf
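The node-addition steps and the controller stop/start ordering discussed in the comments above can be sketched as a dry-run shell script. The hostnames (ctl-primary, ctl-backup, compute-nodes) and the config path are placeholders rather than this site's real names, and `run` only echoes each command, so the sketch is safe to execute as-is:

```shell
#!/bin/sh
# Sketch of the node-add procedure with primary+backup slurmctld.
# Index 0 of CONTROLLERS is the primary; later entries are backups.
# Hostnames are hypothetical; every command is echoed, not executed.
CONTROLLERS="ctl-primary ctl-backup"

run() { echo "+ $*"; }  # dry run: print the command instead of executing it

node_add_dryrun() {
    # 1) Stop controllers in reverse order: backups first, primary last.
    for host in $(printf '%s\n' $CONTROLLERS | tac); do
        run ssh "$host" systemctl stop slurmctld
    done
    # 2) Install the new slurm.conf/topology.conf (shared NFS install here).
    run cp slurm.conf.new /apps/share/slurm/conf/slurm.conf
    # 3) Restart every slurmd so all daemons rebuild bitmaps from the new conf.
    run ssh compute-nodes systemctl restart slurmd
    # 4) Start controllers in forward order: primary first, then backups.
    for host in $CONTROLLERS; do
        run ssh "$host" systemctl start slurmctld
    done
}

node_add_dryrun
```

Running it prints the ordered plan; removing the echo in `run` would turn the sketch into the real procedure.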
Thanks for looking into this.

I notice a number of errors that should be looked into on your end.

> [2021-10-14T15:00:24.678] error: WARNING: switches lack access to 2 nodes: alnxrsch1,rdsxenhn
> [2021-10-14T15:00:24.678] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.

Please update the topology to connect the remaining nodes (alnxrsch1,rdsxenhn).

> [2021-10-05T10:04:41.460] Batch JobId=1144530 missing from batch node rdsxen134 (not found BatchStartTime after startup), Requeuing job
> [2021-10-05T10:04:41.460] _job_complete: JobId=1144530 WTERMSIG 126
> [2021-10-05T10:04:41.460] _job_complete: JobId=1144530 cancelled by node failure

Please attach the slurmd.log for one of these nodes. Thanks.

> [2021-10-14T15:00:42.017] error: Node rdsvm210 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2021-10-14T15:26:42.413] error: Node alnxr500 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2021-10-14T15:26:42.442] error: Node rdsvm109 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Out-of-sync nodes mean out-of-sync bitmaps, which can be a big issue. This can certainly contribute to jobs failing.
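The "different slurm.conf" errors above can be caught before restarting daemons by comparing checksums of each node's copy against the controller's. A minimal local sketch for illustration (on a real cluster the per-node files would first be gathered with ssh or pdsh; the function name is my own):

```shell
#!/bin/sh
# Compare slurm.conf copies by checksum and report any that differ from
# the reference copy (first argument, e.g. the controller's file).
# Purely local sketch: on a real cluster you would first copy each
# node's /etc/slurm/slurm.conf somewhere central.
check_conf_sync() {
    ref=$(sha256sum "$1" | cut -d' ' -f1)
    for f in "$@"; do
        h=$(sha256sum "$f" | cut -d' ' -f1)
        [ "$h" = "$ref" ] || echo "OUT OF SYNC: $f"
    done
}
```

Any file reported as out of sync would warrant a config push and a slurmd restart on that node before slurmctld comes back.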
> [2021-10-15T07:00:06.765] error: slurm_receive_msg [163.243.17.44:45114]: Zero Bytes were transmitted or received
> [2021-10-15T07:13:42.972] error: slurm_receive_msg [10.103.143.108:35822]: Zero Bytes were transmitted or received
> [2021-10-15T07:19:55.208] error: slurm_receive_msg [10.103.143.121:48600]: Zero Bytes were transmitted or received

Verify these IP addresses. If they correspond to nodes, then you should verify the network routing and diagnose the nodes -- drain, restart, check logs, etc. -- as needed.

> [2021-10-14T14:45:01.868] error: _find_node_record(763): lookup failure for rdsxen7
> [2021-10-14T14:45:01.878] error: _find_node_record(763): lookup failure for rdsxen8
> [2021-10-14T14:45:01.880] error: _find_node_record(763): lookup failure for rdsxen11
> [2021-10-14T14:45:01.993] error: _find_node_record(763): lookup failure for rdsxen6
> [2021-10-14T14:45:02.052] error: _find_node_record(763): lookup failure for rdsxen14

It appears as though the new nodes are not resolving correctly, yet you report jobs are running on them. Please look into this as well.

Please provide an update. Were you able to resolve or verify the networking issues? What issues remain?

I have cleaned up most of the errors regarding the nodes that still needed slurmd restarted. However, we don't feel that any of these issues would or should have caused us to lose jobs. Looking at the support contract, we think we have something like 8 hours of support time for talking with you folks via Teams. We'd like to go into all this much deeper to figure it out. So what I propose is:
1) Let me poll our admins for any Slurm questions to bring up (on this topic or any other).
2) I will send the questions to you for preview.
3) We set up a Teams meeting to discuss, and perhaps have you comment further on our setup, on this topic, and on whatever questions we have.
How about a meeting Nov 1 (a week from Monday)? I am out next Thursday and another admin is out Friday...
Will this work for your team? Should I work with Jess Arington to arrange it? Thanks, Tom.

Our new best date for a technical meeting is November 4. Would 10:00 AM our time work for you?

Tom, I am looping Jess into the conversation so he is aware and can set that up.
> Our new best date for a technical meeting is November 4.
> Would 10:00AM our time work for you?
Would you also send the list of questions you are looking at covering in that conversation? I want to make sure I put the right engineer on that call when we settle on a day and time.
Tom, I'll be taking over for Skyler on this ticket in preparation for our meeting next week. Can you please provide your current slurm.conf and friends if they have changed since comment#11? Please run these commands on your controller as root and attach the output:
> scontrol show config
> scontrol show nodes
> sdiag
> sacctmgr show stats

(In reply to Tom Wurgler from comment #18)
> 1) let me poll our admins for any slurm questions to bring up (this topic or any other)
> 2) I will send the questions to you for preview
Please attach these questions. The more lead time I have, the easier it is to answer them beforehand, or at least to answer them at the meeting.
> 3) We set up a teams meeting to discuss and perhaps have you comment further on our setup on this topic and on whatever questions we have.
How are the nodes getting added to the cluster (outside of Slurm)?
Thanks, --Nate

Hi Nate, I'll get the output of those commands in a bit or first thing tomorrow. But I wanted to ask about your last line: how are nodes getting added to the cluster (outside of Slurm)?

We have a cluster with 26 chassis. Twenty-five of the chassis had been defined in "prod" Slurm. Chassis #1 was in a separate Slurm install for testing Slurm configurations, Slurm versions, etc., with a different primary controller and a different MariaDB install. Prod Slurm and Test Slurm were independent. This was all working fine.

Our prod cluster was full, with hundreds of jobs pending (something like nearly 10000 cores' worth of jobs). So I shut down the test Slurm environment with the intent to add that first chassis' worth of nodes to the prod Slurm environment. I reimaged the nodes to our production level (RHEL 7.5). Once all were up again, I followed the step-by-step I listed in this ticket. Note: when the step came to start slurmd on the compute nodes, that first chassis wasn't ready after all; I had to add hwloc-libs. I had already shut down slurmctld on the backup and then the primary controllers. So there was a few minutes of delay with slurmctld down while hwloc-libs was added.

Is this what you meant by how we add nodes outside of Slurm? The nodes we were adding are identical to all the other nodes. All part of the same cluster.

(In reply to Tom Wurgler from comment #23)
> I reimaged the nodes to our production level (RHEL 7.5). Once all were up
> again, I followed the step-by-step I listed in this ticket.
I take it that these nodes are stateful, then? Is Slurm installed on local disk too?

Our Slurm is installed via NFS on a shared disk. Neither the prod nor the test Slurm environment is local.

(In reply to Tom Wurgler from comment #26)
> Our slurm is installed via NFS on a shared disk.
Is it versioned out as suggested by slide 25:
> https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf
> Neither the prod or test slurm environments are local.
Do the clusters share a common interconnect?

(In reply to Tom Wurgler from comment #23)
> We have a cluster with 26 chassis.
Is it possible to get the output of 'slurmd -C' from all of the compute nodes and login nodes (and any other nodes that users have access to)?
> I reimaged the nodes to our production level (RHEL 7.5). Once all were up
> again, I followed the step-by-step I listed in this ticket.
What cluster management software is being used for imaging? The specific one doesn't matter to Slurm, but it may help inform my suggestions.
> Note, when the step came to start slurmd on compute nodes, that first
> chassis wasn't ready after all. I had to add hwloc-libs. I had already
> shut down slurmctld on the backup and then the primary controllers. So there
> was a few minutes of delay with slurmctld down while hwloc-libs was added.
Has your site considered a health script, or at the very least a set of standard quick test jobs for nodes after reboot/addition?
> Is this what you meant by how we add nodes outside of Slurm? The nodes
> we were adding are identical to all the other nodes. All part of the same cluster.
I don't like to assume that clusters are perfectly homogeneous; I'd rather be sure before making suggestions. For instance, the config provided in comment#11 shows that there is at least one node with a different GRES configuration. Please also provide 'slurmctld -V' from the controllers on your prod and test clusters.

Yes, it is versioned with prod and test symlinks. There is InfiniBand on all of the cluster (except the headnode, which does not have IB). And the cluster headnode is our slurmctld. We also have our desktops as part of Slurm, of course. And they are not on IB, nor even in the same physical building. I'll attach the slurmd -C for the cluster. Did you want the desktops as well? We use RHEL kickstart for imaging the cluster (and the desktops, as a matter of fact). We currently run Node Health Check across our cluster and desktops. It runs at either 5 or 10 minute intervals; I don't remember which.

root@rdsxenhn: ~ # /usr/local/slurm/sbin/slurmctld -V
slurm 20.11.8

I didn't run slurmctld -V on the test environment.

Created attachment 21967 [details]
slurmd -C for the cluster compute nodes
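The controller outputs requested above can be bundled with a small script. A sketch, with a hypothetical `collect` helper and file names of my own choosing; commands that are not installed are skipped, so the sketch can be tried anywhere:

```shell
#!/bin/sh
# Sketch: gather controller diagnostics into a tarball for a support
# ticket. Commands not found on this machine are noted and skipped.
OUT="${OUT:-./slurm-diag}"
mkdir -p "$OUT"

collect() {
    # $1 = output file name, remaining args = command line to run
    name=$1; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$OUT/$name" 2>&1
    else
        echo "skipped: $* (not installed)" > "$OUT/$name"
    fi
}

collect config.txt  scontrol show config
collect nodes.txt   scontrol show nodes
collect sdiag.txt   sdiag
collect dbstats.txt sacctmgr show stats

tar -czf slurm-diag.tar.gz -C "$OUT" .
```

On the actual controller this would be run as root, and `slurm-diag.tar.gz` attached to the ticket.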
(In reply to Bill Benedetto from comment #28)
> Yes, it is versioned with prod and test symlinks.
Is systemd used to manage the daemons?
> There is infiniband on all of the cluster (except the headnode, which does not have IB).
I assume this means the IB firmwares are also kept in sync and there is a single opensm server?
> And the cluster headnode is our slurmctld.
I assume it also has slurmdbd and the MySQL DB?
> We also have our desktops as part of slurm, of course.
Do these desktops run slurmd? Are they using Munge or JWT for auth?
> And they are not on IB nor even in the same physical building.
Is the connection to Slurm wrapped by TLS or a VPN? Slurm is not designed to go in the clear over the internet.
> I'll attach the slurmd -C for the cluster.
> Did you want the desktops as well?
If they are submit-only, then no. If there is a possibility of running jobs on them, then yes.
> We use RHEL kickstart for imaging the cluster (and the desktops, as a matter of fact).
How are the nodes kept in sync after that?
> We currently run Node Health Check across our cluster and desktops.
> It runs at either 5 or 10 minute intervals. I don't remember which.
I didn't see it in the config provided. Please attach a current version, including the other config files for Slurm. We generally ask sites to tarball and/or zip files when attaching.
> root@rdsxenhn: ~ # /usr/local/slurm/sbin/slurmctld -V
> slurm 20.11.8
>
> I didn't run slurmctld -V on the test environment.
Are there any patches active on either cluster outside of the normal tagged releases?
On Wed, 2021-10-27 at 17:38 +0000, bugs@schedmd.com wrote:

(In reply to Nate Rini from comment #30)
> Is systemd used to manage the daemons?
I have no idea what this question means. We use systemd to START the daemons everywhere.
> I assume this means the IB firmwares are also kept in sync and there is a single opensm server?
We have IB switches in the cluster; they are all interconnected, and one is acting as the master. When we put the cluster together they would all have been at the same firmware level, and we wouldn't have updated it. So based on our experience, we're going to say they are kept in sync.
> I assume it also has slurmdbd and the MySQL DB?
No. They run on another host, a VM.
> Do these desktops run slurmd? Are they using Munge or JWT for auth?
They run slurmd and are using Munge.
> Is the connection to Slurm wrapped by TLS or a VPN? Slurm is not designed to go in the clear over the internet.
As far as the networking goes, the traffic is all internal to Goodyear, with no VPN.
> If they are submit-only, then no. If there is a possibility of running jobs on them, then yes.
They are all active participants, submitting and/or running. See the next attachment.
> How are the nodes kept in sync after that?
Kickstart. The nodes don't get updated unless we re-image them.
> I didn't see it in the config provided. Please attach a current version, including the other config files for Slurm.
You want our NHC config? What does this have to do with our issue?
> Are there any patches active on either cluster outside of the normal tagged releases?
OS patches? I imagine that there are loads of them. We include a bunch of them during the kickstart process. Are all of the systems the same? Yes. The desktops are kickstart'ed from a file. The cluster nodes are kickstart'ed from a different file.

Created attachment 21970 [details]
slurmd -C for the desktop nodes
(In reply to Bill Benedetto from comment #31)
> I have no idea what this question means. We use systemd to START the daemons everywhere.
Is your site using the included systemd unit files generated by the Slurm installer? Please run on a compute node:
> systemctl show slurmd
> systemctl status slurmd
> You want our NHC config? What does this have to do with our issue?
I'm attempting to understand your site's setup in order to provide the best advice with respect to Slurm. No, I don't need the NHC config. Please attach a current slurm.conf and friends.

We started out with a slurm install-generated slurmd.service file.
We've made some minor changes.
Here are the details from one of our compute nodes:
root@rdsxen66: ~ # systemctl cat slurmd
# /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service multi-user.target
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
[Install]
WantedBy=multi-user.target
root@rdsxen66: ~ # systemctl show slurmd
Type=forking
Restart=no
PIDFile=/var/run/slurmd.pid
NotifyAccess=none
RestartUSec=100ms
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
WatchdogUSec=0
WatchdogTimestampMonotonic=0
StartLimitInterval=10000000
StartLimitBurst=5
StartLimitAction=none
FailureAction=none
PermissionsStartOnly=no
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
MainPID=172709
ControlPID=0
FileDescriptorStoreMax=0
StatusErrno=0
Result=success
ExecMainStartTimestamp=Thu 2021-10-14 14:58:04 EDT
ExecMainStartTimestampMonotonic=19178138232539
ExecMainExitTimestampMonotonic=0
ExecMainPID=172709
ExecMainCode=0
ExecMainStatus=0
ExecStart={ path=/usr/local/slurm/sbin/slurmd ; argv[]=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS ; ignore_errors=no ; start_time=[Thu 2021-10-14 14:58:03 EDT] ; stop_time=[Thu 2021-10-14 14:58:04 EDT] ; pid=172704 ; code=exited ; status=0 }
ExecReload={ path=/bin/kill ; argv[]=/bin/kill -HUP $MAINPID ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
Slice=system.slice
ControlGroup=/system.slice/slurmd.service
MemoryCurrent=17121280
TasksCurrent=67
Delegate=yes
CPUAccounting=no
CPUShares=18446744073709551615
StartupCPUShares=18446744073709551615
CPUQuotaPerSecUSec=infinity
BlockIOAccounting=no
BlockIOWeight=18446744073709551615
StartupBlockIOWeight=18446744073709551615
MemoryAccounting=no
MemoryLimit=18446744073709551615
DevicePolicy=auto
TasksAccounting=no
TasksMax=18446744073709551615
EnvironmentFile=/etc/sysconfig/slurmd (ignore_errors=yes)
UMask=0022
LimitCPU=18446744073709551615
LimitFSIZE=18446744073709551615
LimitDATA=18446744073709551615
LimitSTACK=18446744073709551615
LimitCORE=18446744073709551615
LimitRSS=18446744073709551615
LimitNOFILE=51200
LimitAS=18446744073709551615
LimitNPROC=514533
LimitMEMLOCK=18446744073709551615
LimitLOCKS=18446744073709551615
LimitSIGPENDING=514533
LimitMSGQUEUE=819200
LimitNICE=0
LimitRTPRIO=0
LimitRTTIME=18446744073709551615
OOMScoreAdjust=0
Nice=0
IOScheduling=0
CPUSchedulingPolicy=0
CPUSchedulingPriority=0
TimerSlackNSec=50000
CPUSchedulingResetOnFork=no
NonBlocking=no
StandardInput=null
StandardOutput=journal
StandardError=inherit
TTYReset=no
TTYVHangup=no
TTYVTDisallocate=no
SyslogPriority=30
SyslogLevelPrefix=yes
SecureBits=0
CapabilityBoundingSet=18446744073709551615
AmbientCapabilities=0
MountFlags=0
PrivateTmp=no
PrivateNetwork=no
PrivateDevices=no
ProtectHome=no
ProtectSystem=no
SameProcessGroup=no
IgnoreSIGPIPE=yes
NoNewPrivileges=no
SystemCallErrorNumber=0
RuntimeDirectoryMode=0755
KillMode=process
KillSignal=15
SendSIGKILL=yes
SendSIGHUP=no
Id=slurmd.service
Names=slurmd.service
Requires=basic.target
Wants=system.slice
WantedBy=multi-user.target
Conflicts=shutdown.target
Before=shutdown.target
After=network.target basic.target system.slice systemd-journald.socket munge.service multi-user.target
Description=Slurm node daemon
LoadState=loaded
ActiveState=active
SubState=running
FragmentPath=/usr/lib/systemd/system/slurmd.service
UnitFileState=enabled
UnitFilePreset=disabled
InactiveExitTimestamp=Thu 2021-10-14 14:58:03 EDT
InactiveExitTimestampMonotonic=19178137721959
ActiveEnterTimestamp=Thu 2021-10-14 14:58:04 EDT
ActiveEnterTimestampMonotonic=19178138232625
ActiveExitTimestamp=Thu 2021-10-14 14:58:03 EDT
ActiveExitTimestampMonotonic=19178137713209
InactiveEnterTimestamp=Thu 2021-10-14 14:58:03 EDT
InactiveEnterTimestampMonotonic=19178137717692
CanStart=yes
CanStop=yes
CanReload=yes
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureJobMode=replace
IgnoreOnIsolate=no
IgnoreOnSnapshot=no
NeedDaemonReload=no
JobTimeoutUSec=0
JobTimeoutAction=none
ConditionResult=yes
AssertResult=yes
ConditionTimestamp=Thu 2021-10-14 14:58:03 EDT
ConditionTimestampMonotonic=19178137717966
AssertTimestamp=Thu 2021-10-14 14:58:03 EDT
AssertTimestampMonotonic=19178137721659
Transient=no
root@rdsxen66: ~ # systemctl status slurmd
* slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2021-10-14 14:58:04 EDT; 1 weeks 5 days ago
Process: 172704 ExecStart=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 172709 (slurmd)
Tasks: 67
Memory: 16.3M
CGroup: /system.slice/slurmd.service
|- 24454 slurmstepd: [1176940.batch]
|- 24519 /bin/bash /var/spool/slurmd/job1176940/slurm_script
|- 25017 /bin/csh -f /apps/tpm/lsf/launch -n 96 eagle -i ac.cw.80.m...
|- 25150 /usr/local/slurm/bin/srun -n 96 /apps/tpm/SIERRA/Eagle3_co...
|- 25152 /usr/local/slurm/bin/srun -n 96 /apps/tpm/SIERRA/Eagle3_co...
|- 25162 slurmstepd: [1176940.1]
|- 25168 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25169 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25170 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25171 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25172 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25173 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25174 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25175 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25176 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25177 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25178 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25179 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25180 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25181 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25182 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25183 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25184 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25185 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25186 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25187 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25188 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25189 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25190 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25191 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
`-172709 /usr/local/slurm/sbin/slurmd
Oct 14 14:58:03 rdsxen66 systemd[1]: Starting Slurm node daemon...
Oct 14 14:58:04 rdsxen66 systemd[1]: PID file /var/run/slurmd.pid not readab...t.
Oct 14 14:58:04 rdsxen66 systemd[1]: Started Slurm node daemon.
Hint: Some lines were ellipsized, use -l to show in full.
root@rdsxen66: ~ #
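As an aside on the unit file above: site-local settings such as the raised limits can be kept in a systemd drop-in rather than edits to the packaged unit file. A sketch, with the drop-in directory parameterized to a local path so it is safe to run anywhere; on a real node it would be /etc/systemd/system/slurmd.service.d:

```shell
#!/bin/sh
# Sketch: carry local slurmd.service tweaks in a systemd drop-in so the
# packaged unit file can be replaced on upgrade without losing them.
# DROPIN_DIR defaults to a local path here for safety; on a node it
# would be /etc/systemd/system/slurmd.service.d
DROPIN_DIR="${DROPIN_DIR:-./slurmd.service.d}"

mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/override.conf" <<'EOF'
[Service]
# Site-local limits, mirroring the modified unit file shown above
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
EOF

# On a real node, follow with:
#   systemctl daemon-reload && systemctl restart slurmd
```

Settings in the drop-in override the matching keys in the base unit, so the shipped slurmd.service never needs to be touched.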
(In reply to Bill Benedetto from comment #34)
> We started out with a slurm install-generated slurmd.service file.
> We've made some minor changes.
For future reference, I suggest using a drop-in config instead of modifying the file directly. This will make future upgrades easier, but is of course at the discretion of your site.
> ExecStart=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS
Is this slurmd a symlink?

(In reply to Nate Rini from comment #33)
> Please attach a current slurm.conf and friends.
If possible, for both the prod and test clusters.

On Wed, 2021-10-27 at 18:53 +0000, bugs@schedmd.com wrote:

(In reply to Nate Rini from comment #35)
> For future reference, I suggest using a drop-in config instead of modifying the file directly.
Like I said, we started with a slurm installation-generated one. We rarely update the slurmd.service files. If it works, we're happy. I CAN see, though, that it would make sense to do what you say, as things may change.
> Is this slurmd a symlink?
root@rdsxen66: ~ # ls -Fl /usr/local/slurm
lrwxrwxrwx 1 root root 25 Mar 27 2021 /usr/local/slurm -> /apps/share/slurm/current/
root@rdsxen66: ~ # ls -Fl /apps/share/slurm
total 168
drwxr-xr-x 9 root    root 8192 Mar  4  2020 19.05.5/
drwxr-xr-x 8 root    root  152 Apr  1  2020 20.02.1/
drwxr-xr-x 8 root    root 8192 Sep 30  2020 20.02.2/
drwxr-xr-x 9 lda6434 C15  8192 Oct 13  2020 20.02.4-1/
drwxr-xr-x 9 lda6434 C15  8192 Mar 18  2021 20.02.6-1/
drwxr-xr-x 9 lda6434 C15  8192 Mar 11  2021 20.11.4-1/
drwxr-xr-x 9 lda6434 C15  8192 Mar 27  2021 20.11.5-1/
drwxr-xr-x 3 lda6434 C15   152 Jul 15 13:32 20.11.5-1-desktop/
drwxr-xr-x 2 lda6434 C15  8192 Jun  2 05:50 20.11.5-systemd/
drwxr-xr-x 9 lda6434 C15  8192 Jun  4 05:40 20.11.7-1/
drwxr-xr-x 9 lda6434 C15  8192 Jul 15 03:51 20.11.8-1/
drwxr-xr-x 3 lda6434 C15  8192 Sep  9 02:05 GY-bin/
drwxr-xr-x 3 root    root 8192 Dec  3  2020 GY-bin-RCS/
drwxr-xr-x 2 lda6434 C15  8192 Apr 13  2021 GY-bin-from-Jenkins/
drwxr-xr-x 3 lda6434 root 8192 Dec  1  2020 GY-bin-svn.OLD/
drwxr-xr-x 2 root    root 8192 Oct  7 14:45 RCS/
drwxr-xr-x 3 lda6434 root 8192 Aug 26 11:59 conf/
drwxr-xr-x 3 root    root 8192 Oct 22  2019 contribs/
lrwxrwxrwx 1 root    root    9 Aug 21 12:51 current -> 20.11.8-1/
lrwxrwxrwx 1 lda6434 C15    17 Jul 15 14:42 current-desktop -> 20.11.5-1-desktop/
lrwxrwxrwx 1 root    root    9 Mar 27  2021 current.210821 -> 20.11.5-1/
lrwxrwxrwx 1 root    root    9 Oct 13  2020 old_current -> 20.02.4-1/
lrwxrwxrwx 1 root    root    9 Oct 13  2020 old_prod -> 20.02.4-1/
lrwxrwxrwx 1 root    root    7 May  8  2020 prior -> 20.02.2/
lrwxrwxrwx 1 root    root    9 Aug 21 12:51 prod -> 20.11.8-1/
lrwxrwxrwx 1 root    root    9 Mar 27  2021 prod.210821 -> 20.11.5-1/
-r-xr-xr-x 1 root    root 4435 Dec 12  2019 setup_desktop_client*
-r-xr-xr-x 1 root    root 5054 Oct  7 14:45 setup_desktop_compute*
-rwxr-xr-x 1 root    root 5434 Mar 25  2021 setup_desktop_compute_test*
-r-xr-xr-x 1 root    root 6924 Dec 12  2019 setup_head*
-rwxr-xr-x 1 root    root 7203 Mar 25  2021 setup_head_test*
lrwxrwxrwx 1 root    root    9 Jul 21 10:11 test -> 20.11.8-1/
lrwxrwxrwx 1 root    root    9 Jun  7 12:07 test-prior -> 20.11.7-1/
root@rdsxen66: ~ #

(In reply to Bill Benedetto from comment #37)
> root@rdsxen66: ~ # ls -Fl /usr/local/slurm
> lrwxrwxrwx 1 root root 25 Mar 27 2021 /usr/local/slurm -> /apps/share/slurm/current/
Is your site interested in running two sets of slurmd on the test machines? While there is work on the test machine, a simple reservation can be placed on the prod cluster, or the nodes can be marked down. This would make capacity increases simple while still allowing your site to test. Since both clusters already share the same IB fabric, and I can only assume the same NFS mounts, they are already pretty intertwined.

Created attachment 21972 [details]
slurm.conf as of 27-Oct-2021
This should be the same as "21770: new slurm.conf".
(Or close to it, TBH)
Created attachment 21973 [details]
slurm.conf for test environment as of 27-Oct-2021
Please also attach:
> /apps/share/slurm/conf/slurm_common.conf
How does your site handle name resolution? /etc/hosts?
(In reply to Nate Rini from comment #41)
> Please also attach:
> > /apps/share/slurm/conf/slurm_common.conf
>
> How does your site handle name resolution? /etc/hosts?

Primarily DNS. /etc/hosts tends to be pretty small everywhere.

Created attachment 21992 [details]
slurm_common.conf
Please also provide:
> gres.conf
> topology.conf
> acctgather.conf
Also is it possible to get some background on this?
> ##########################################
> # THIS CAUSES PACKING TO WORK - BUT WE REALLY WANT cons_tres
> #SelectType=select/cons_res
> ##########################################
> # THIS CAUSES WEIRD SPLITTING PROBLEMS!! #
> SelectType=select/cons_tres # for gpu & abaqus
> ##########################################
(In reply to Nate Rini from comment #44)
> Please also provide:
> > gres.conf
> > topology.conf
> > acctgather.conf

We don't got no acctgather.conf file.... I'll attach the others directly.

Created attachment 21995 [details]
gres.conf
Created attachment 21996 [details]
topology.conf
(In reply to Bill Benedetto from comment #46)
> (In reply to Nate Rini from comment #44)
> > Please also provide:
> > > gres.conf
> > > topology.conf
> > > acctgather.conf
>
> We don't got no acctgather.conf file....

Understood.

> I'll attach the others directly.

To make things easier in the future, please consider just tarballing up all of the config files for support tickets. Attaching a single file is usually easier than multiple files individually, and we have no problem opening tarballs (or zips).

Concerning comment #45: when we started using slurm, we had SelectType=select/cons_res. Our nodes have 24 cores, and if we submitted a 32-way job, for example, it used 24 cores on the first node and 8 on a second node. More jobs would fill in the other 16 cores (plus more nodes if needed). We always called this node packing. But our Lux counterpart really wanted cons_tres for GPU tracking etc. That made packing nodes do weird stuff: a 32-way job would get 16 cores on the first node and 16 on the second. Or worse, 16 on one node, 15 on a second, and 1 on a third. This seemed pretty inefficient and (we believe) would hurt performance to some level. We started adding -N 2 for, say, a 32-way job to keep it to at least 2 nodes. I filed a ticket for this, and the latest slurm 21.08 is supposed to have fixed this so cons_tres packs like cons_res. We haven't tested 21.08 as yet.

(In reply to Tom Wurgler from comment #50)
> I filed a ticket for this and the latest slurm 21.08 is supposed to have
> fixed this so cons_tres packs like cons_res. We haven't tested 21.08 as yet.

Great, I'll defer to that ticket then. We generally prefer not to mix issues in a single ticket, but I wanted to make sure that wasn't outstanding on our part.

Is Slurm installed from source or from RPM? Given its placement on NFS, I assume from source, but I want to verify first.

(In reply to Nate Rini from comment #52)
> Is Slurm installed from source or from RPM?
> Given its placement on NFS, I assume from source but I want to verify first.

From source. We use Jenkins so that it's built the same way every time, regardless of which one of us builds it.

Hi. Here is a list of some questions we'd like info on during our meeting Thursday:
1) Jobs killed when adding nodes ----> this ticket
2) In general, the procedure for changing slurm.conf
3) Node weighting
4) How to force a pending job to run on specific nodes
5) General critique of our config files and setup
Sorry to be so late with this. thanks tom

(In reply to Tom Wurgler from comment #54)
> 1) Jobs killed when adding nodes----> this ticket

Do you have a list of the jobs that failed during the config change? The slurmctld log attached had a good number of authentication errors, which means I need to see the slurmd (and maybe slurmstepd) logs at the time of the failures. Is it possible to get them?

(In reply to Tom Wurgler from comment #54)
> 3) Node weighting

Can you please provide more details on what you mean here. Is this just the node weight parameter in slurm.conf?

Created attachment 22076 [details]
slurmd.log for one of the failed jobs
one slurmd.log file from a failed job
Created attachment 22077 [details]
second slurmd.log file from different failed job
second slurmd.log
(In reply to Nate Rini from comment #56)
> (In reply to Tom Wurgler from comment #54)
> > 3) Node weighting
>
> Can you please provide more details on what you mean here. Is this just the
> node weight parameter in slurm.conf?

I need to defer to Patrick Hock (admin in Luxembourg). There was some reason we couldn't just weight the nodes. Also, another topic for discussion if there is time is potential purging/backup of the database.

(In reply to Tom Wurgler from comment #57)
> Created attachment 22076 [details]
> slurmd.log for one of the failed jobs
>
> one slurmd.log file from a failed job

Is this one of the failed jobs?

> [2021-10-25T20:03:28.697] launch task StepId=1177190.0 request from UID:1084 GID:2910 HOST:163.243.23.94 PORT:46610
> [2021-10-25T20:03:29.865] [1177190.0] task/cgroup: _memcg_initialize: /slurm/uid_1084/job_1177190: alloc=0MB mem.limit=128655MB memsw.limit=unlimited
> [2021-10-25T20:03:29.865] [1177190.0] task/cgroup: _memcg_initialize: /slurm/uid_1084/job_1177190/step_0: alloc=0MB mem.limit=128655MB memsw.limit=unlimited
> [2021-10-27T15:47:25.429] [1177190.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
> [2021-10-27T15:47:28.394] [1177190.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
> [2021-10-27T15:47:28.395] [1177190.0] done with job

(In reply to Nate Rini from comment #60)
> Is this one of the failed jobs?

Please provide this output for at least one of the failed jobs:
> sacct -o all -p -D -j $JOBID

Created attachment 22084 [details]
sacct info for job 1161668
Looks like the slurmctld log for job 1161668 is incomplete. Please grep that number out of the slurmctld logs and attach the output. Created attachment 22085 [details]
output from grep requested
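Greps like the one requested above can be run across all rotated copies of the controller log at once. A generic sketch, self-contained by fabricating a throwaway log directory (the file names and paths are made-up stand-ins, not the site's real ones):

```shell
# Fabricate a "log directory" so the sketch runs anywhere;
# on a real controller this would be e.g. /var/log/slurm/.
logdir=$(mktemp -d)
echo '[2021-10-15T14:36:03.414] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=1161668 uid 2256' \
    > "$logdir/slurmctld.log-20211020"
echo '[2021-10-16T08:00:00.000] sched: Allocate JobId=1199999' \
    > "$logdir/slurmctld.log-20211021"
# With multiple files, grep prefixes each hit with its filename, which
# makes it clear which rotated log each line came from when attached.
grep '1161668' "$logdir"/slurmctld.log-*
```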
(In reply to Tom Wurgler from comment #65)
> Created attachment 22085 [details]
> output from grep requested

This log shows that the user (or a Slurm-aware job) requested the job killed, and not Slurm's internal checks:
> slurmctld.log-20211020:[2021-10-15T14:36:03.414] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=1161668 uid 2256

Please provide slurmd logs from the other job nodes:
> NodeList=rdsxen[303,326]

Not all these jobs exited slurm. They "died" as in went what we call stale. In the partition, tying up resources, but not continuing. I went in and had to remove them, and the users had to resubmit their jobs.

Created attachment 22086 [details]
rdsxen303 slurmd.log
Created attachment 22087 [details]
rdsxen326 slurmd.log
(In reply to Tom Wurgler from comment #67)
> They "died" as in went what we call stale. In the partition, tying up
> resources, but not continuing. I went in and had to remove them and the
> users had to resubmit their jobs.

So they hung? How does your site determine if they died? The Slurm logs presented so far are basically silent on the why, beyond the explicit request to kill the jobs.

We have a script running every 30 minutes on all running jobs to check the directory for files updated in the last 30 minutes. If no files are found, it sends the admins mail. The code we use updates files regularly. The jobs that were hung were hung overnight and into the next morning with no updates. And these are std jobs we run every day.

(In reply to Tom Wurgler from comment #71)
> We have a script running every 30 minutes on all running jobs to check the
> directory for files updated in the last 30 minutes. If no files are found,
> it sends the admins mail. The code we use updates files regularly.
>
> The jobs that were hung were hung overnight and into the next morning with
> no updates. And these are std jobs we run every day.

Are any logs available from the jobs? Do these hangs only happen while restarting slurmctld? Based on the existence of the script, I suspect there are more triggers for this. Are there any relevant logs from dmesg in the time window of the hang, particularly an NFS hang?

Followup from our meeting:
* I tested how `sbatch --wait` handled the loss of slurmctld, and it just waits for slurmctld to come back.
* Node weights and the topology plugin were fixed by bug#9729 in the slurm-21.08 release.
* Please submit an RFE ticket explicitly requesting documentation of how the node weights and topology are calculated. I couldn't find any existing documentation for this.

Are instructions needed in helping to find the logs?

Sorry....took some vacation time.
I will try to get logs today yet.

(In reply to Tom Wurgler from comment #75)
> Sorry....took some vacation time. I will try to get logs today yet

We are happy to work on your schedule here. We just generally ask that the severity be lowered to the appropriate levels as the situation changes.

Created attachment 22249 [details]
list of jobs that actually failed (not hung jobs) and their slurmd.log files
slurmd.log files from the first node of multinode jobs.
failed_jobs has job numbers etc.
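The 30-minute freshness check Tom describes in comment #71 can be sketched with find(1)'s -mmin test. This is a hypothetical reconstruction, not the site's actual script; the mail step is stubbed out with echo, and one stale "job directory" is fabricated so the sketch runs anywhere:

```shell
# Alert if nothing under a running job's work directory was modified
# in the last 30 minutes.
jobdir=$(mktemp -d)
touch -d '2 hours ago' "$jobdir/solver.out"   # simulate a stalled job

recent=$(find "$jobdir" -type f -mmin -30 | head -1)
if [ -z "$recent" ]; then
    # The real script would mail the admins here instead of echoing.
    echo "STALE: no files updated in 30 minutes under $jobdir"
fi
```

The real version would loop over `squeue`'s running jobs and look up each job's working directory, but the freshness test itself is just the -mmin check above.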
(In reply to Tom Wurgler from comment #77)
> Created attachment 22249 [details]
> list of jobs that actuall failed, not hung jobs and their slurmd.log files
>
> slurmd.log files from first node of mulitnode jobs.
> failed_jobs has job numbers etc

Please also grep the slurmctld logs for this too:
> grep -E '1164734|1164738|1164753|1164788|1164830|1164926|1164948|1165048|1165063|1165064|1165065' $PATH_TO_LOG

(In reply to Nate Rini from comment #79)
> Please also grep the slurmctld logs for this too:
> > grep -E '1164734|1164738|1164753|1164788|1164830|1164926|1164948|1165048|1165063|1165064|1165065' $PATH_TO_LOG

Grepping all of your logs would be preferable.

(In reply to Tom Wurgler from comment #77)
> Created attachment 22249 [details]
> list of jobs that actually failed, not hung jobs and their slurmd.log files

Looking at the exact errors:
> [1165065.batch] error: Could not open stdout file /hpc/scratch/a026560/DEW/202110140054.DewLT.tool/LT_Inflate_Deflect/eagle/torsional_2.000/LT285REFCONST_AKR_A1530149_CrdPrd_1_00_000_3d_full_tor_2.000.lsfout: No such file or directoryO setup failed: No such file or directory

The error for all of these jobs is that the output directory doesn't exist, which would cause any job to fail. Are there any jobs that did have this same error? If this is just a user error, please ignore comment#80 and comment#79.

It is not a user error. The file should have been created during the run, and with the run being interrupted it failed.
(In reply to Tom Wurgler from comment #82)
> It is not a user error. The file should have been created during the run,
> and with the run being interrupted it failed.

Where in the job should it have been created? Slurm will not create a missing directory, only a file.

It is created during the fea job.
(In reply to Tom Wurgler from comment #84)
> It is created during the fea job.

I'm not aware of what a 'fea' job is. Is this a job that runs before the current job or in a step that runs before? Possibly in a prolog or jobsubmit script?

The user submits a parallel FEA (finite element analysis) job to Slurm. I don't know when/how that file is created.

(In reply to Tom Wurgler from comment #86)
> The user submits a parallel FEA (finite element analysis) job to Slurm. I
> don't know when/how that file is created.

On their batch script, possibly on the first line, can they call 'mkdir -p /hpc/scratch/a026560/DEW/202110140054.DewLT.tool/LT_Inflate_Deflect/eagle/torsional_2.000/' and try again? Slurm will not create a directory (or directory tree) for stdout/stderr/stdin.

I didn't ask this user, but I'd bet they have since reran the job successfully. The code they run does all the correct stuff normally. But the job got interrupted.

(In reply to Tom Wurgler from comment #88)
> I didn't ask this user, but I'd bet they have since reran the job
> successfully. The code they run does all the correct stuff normally. But
> the job got interrupted.

I'm afraid we do not have enough information to debug the issue. Is it possible to try an HA failover again with all the logging activated for slurmctld and slurmd of these test jobs?

I am currently re-imaging some nodes in the prod slurm. I drained them, then downed them. Now I am doing the imaging. I believe that was all I had to do to this point. When it is time to resume them, I want to resume 2 chassis worth of nodes as is (they remain in prod Slurm, in the same partitions etc). But that first chassis I want to remove from prod Slurm and re-add back into our test Slurm. How do I safely remove them from prod Slurm? You spoke of "future" state. Do I put that in the prod slurm.conf? Should I just remove those nodes from slurm.conf?
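Circling back to the missing-directory failures: Nate's mkdir -p suggestion would look roughly like the batch script below. This is a hypothetical sketch (the job options, path, and solver command are invented, not the site's real ones), written to a file here so the fragment is self-contained:

```shell
# Hypothetical job script illustrating the suggestion from comment #87:
# create the output directory tree up front, since Slurm creates output
# *files* but never their parent directories.
cat > fea_job.sh <<'EOF'
#!/bin/bash
#SBATCH -n 32
#SBATCH --job-name=fea_example
# First thing: make sure the directory the solver (and any later srun
# steps) will write into actually exists. mkdir -p is a no-op if it does.
mkdir -p /hpc/scratch/"$USER"/example_run/torsional_2.000
srun ./fea_solver
EOF
chmod +x fea_job.sh
grep -n 'mkdir -p' fea_job.sh
```

One caveat: the batch job's own #SBATCH --output file is opened by slurmd before the script body runs, so that directory must already exist at submit time; a mkdir inside the script only helps files written afterward.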
Then we were planning on trying to recreate my issues with adding and removing nodes in the test environment with some test jobs. Thanks.

(In reply to Tom Wurgler from comment #90)
> How do I safely remove them from prod Slurm? You spoke of "future" state.
> Do I put that in the prod slurm.conf? Should I just remove those nodes from
> slurm.conf?

Leave the nodes configured in the slurm.conf for both clusters. Set `state=future` to tell Slurm that these nodes will be added at some point in the future. You can just update slurm.conf and reconfigure to do this, or use scontrol to apply the state manually and update slurm.conf for the next cycle. For the nodes themselves, just make sure they have the correct slurm.conf for the cluster you want them to run under currently.

> Then we were planning on trying to recreate my issues with adding and
> removing nodes in the test environment with some test jobs.

Please activate at least debug3 on slurmctld and slurmd for the test.

Hi. We did this in the slurm.conf:

NodeName=DEFAULT RealMemory=128000 CPUs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
NodeName=rdsxen[1,3-16] NodeAddr=rdsxen[1,3-16] Feature=ib,ib_9984,openmpi,xenon,intel GRES=fv:1 State=FUTURE

And did the scontrol reconfigure. No errors, but the nodes still show up as "down" in scontrol show node=rdsxen1. Now if we do scontrol update node=rdsxen1 state=future on the command line, then do an scontrol show node=rdsxen1, it says "Node rdsxen1 not found". So how do we do this? Thanks.

(In reply to Tom Wurgler from comment #92)
> NodeName=DEFAULT RealMemory=128000 CPUs=24 Sockets=2 CoresPerSocket=12
> ThreadsPerCore=1 State=UNKNOWN
> NodeName=rdsxen[1,3-16] NodeAddr=rdsxen[1,3-16]
> Feature=ib,ib_9984,openmpi,xenon,intel GRES=fv:1 State=FUTURE
>
> And did the scontrol reconfigure. No errors, but the nodes still show up as
> "down" in scontrol show node=rdsxen1.

"down" will stop any jobs from falling on the nodes. Can we get the output of 'scontrol show node $NODE' after doing the reconfigure?

> Now if we do scontrol update node=rdsxen1 state=future on the command line,
> then do an scontrol show node=rdsxen1, it says "Node rdsxen1 not found".

This is expected per the slurm.conf man page:
> Until these nodes are made available, they will not be seen using any Slurm commands or nor will any attempt be made to contact them.

Adding the State=FUTURE was the only change we made. The "future" nodes are still in the various partitions etc.
We did the scontrol reconfigure:

t901353@rds4020:gica > scontrol show node rdsxen1
NodeName=rdsxen1 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=24 CPULoad=0.01
   AvailableFeatures=ib,ib_9984,openmpi,xenon,intel
   ActiveFeatures=ib,ib_9984,openmpi,xenon,intel
   Gres=fv:1
   NodeAddr=rdsxen1 NodeHostName=rdsxen1 Version=20.11.8
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Thu Jan 21 16:15:07 EST 2021
   RealMemory=128000 AllocMem=0 FreeMem=123820 Sockets=2 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=linlarge,medium,support
   BootTime=2021-11-16T14:00:06 SlurmdStartTime=2021-11-16T14:01:20
   CfgTRES=cpu=24,mem=125G,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=testme [root@2021-11-18T10:39:55]
   Comment=(null)
(In reply to Tom Wurgler from comment #94)
> Adding the State=FUTURE was the only change we made. The "future" nodes are
> still in the various partitions etc.
> We did the scontrol reconfigure

I had to check this in the code, but it looks like a reconfigure will not update the states. However, a restart would. Using scontrol to manually update the states or restarting the slurmctld daemon will be required. Please choose which you prefer.

Well, I will suggest this is a bug.
The point of putting nodes in the future state is that it won't mess things up when adding nodes and doing a restart etc. So I really would wish that reconfigure did the deed.

So now we have that chassis in future state via the command line and in the slurm.conf as well. At some point we'll do a restart and it will take effect long term. Now every time we do a reconfigure, the nodes come back but are still in the drain state. So at least jobs won't start on those nodes. But they will be running the test env slurmd anyway.
(In reply to Tom Wurgler from comment #96)
> Well, I will suggest this is a bug. The point of attempting nodes being
> future is that it won't mess up when adding nodes and doing a restart etc.
> So I really would wish that reconfigure did the deed.
> So now we have that chassis in future state via the command line and in the
> slurm.conf as well. At some point we'll do a restart and it will take
> effect long term. Now every time we do a reconfigure, the nodes come back
> but are still in the drain state. So at least jobs won't start on those
> nodes. But they will be running the test env slurmd anyway.

Please provide a status update after the test.

Created attachment 22323 [details]
duplicated the problem with test slurm environment
We have duplicated the problem in our test env.
Please find the slurmctld.log and the slurmd.log along with a 00README file in the attachment.
Thanks
tom
(In reply to Tom Wurgler from comment #98)
> Created attachment 22323 [details]

> [2021-11-18T12:26:08.768] error: WARNING: switches lack access to 2 nodes: alnxrsch1,rdsxenhn

Please make sure to fix your topology.conf.

Which host is 10.103.142.199?

That is alnx165, the test slurm master node running the db and slurmctld.

Need to leave for the day; Bill will work with you tomorrow. Thanks
(In reply to Nate Rini from comment #99)
> > [2021-11-18T12:26:08.768] error: WARNING: switches lack access to 2 nodes: alnxrsch1,rdsxenhn
>
> Please make sure to fix your topology.conf

Please update and attach new logs once topology.conf is resynced with the hardware.

(In reply to Nate Rini from comment #103)
> Please update and attach new logs once topology.conf is resynced with the
> hardware.

I can update the topology.conf file and make those warnings go away, but I don't see the point. Is the topology.conf file so key to the operation of Slurm that having it not cover all hosts can cause this type of catastrophic failure? If so, then you should update the documentation to say that, and change it from saying WARNING to ERROR ERROR ERROR.

---

The last set of attachments that Tom sent are from our test environment, not production. I just now went through those logs and removed everything before the time of the test so that only the log messages from the test itself are shown. I'll attach those directly. Regardless, this shows that we were able to duplicate the error in our test environment with only a handful of systems.

- Bill

Created attachment 22389 [details]
Cleaned up logs from test systems, showing that we can recreate the issue.
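For context on the topology warning discussed above: it means alnxrsch1 and rdsxenhn are not listed under any switch in topology.conf. A minimal sketch of a topology.conf that covers every node (the switch names here are hypothetical, not from this site):

```
# topology.conf sketch: every node defined in slurm.conf should appear
# under some SwitchName line; otherwise slurmctld logs the "switches
# lack access" warning seen in the attached logs.
SwitchName=leaf1 Nodes=rdsxen[1,3-16]
SwitchName=leaf2 Nodes=alnxrsch1,rdsxenhn
SwitchName=spine Switches=leaf[1-2]
```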
(In reply to Bill Benedetto from comment #104)
> I can update the topology.conf file and make those warnings go away. But I
> don't see the point.

In general, we prefer to isolate issues, which means not having any warnings/errors in the logs that need not be there.

> Is the topology.conf file so key to the operation of slurm that having it
> not cover all hosts can cause this type of catastrophic failure?

When 'RoutePlugin=route/topology' is not configured and neither is TreeWidth, Slurm will default TreeWidth to 50. If a job has more than 50 nodes, then Slurm will randomly choose one slurmd (which may or may not be in the job) to act as part of the message tree. In this case, it will not use nodes that lack switch access. This is not a critical issue, but we would rather it not be there to avoid any surprises.

> error: Node alnx101 appears to have a different slurm.conf than the slurmctld.

However, this error may cause jobs on the listed nodes to be misplaced. slurmctld will still place jobs in this case (this changes in the 21.08 release) and may result in the job getting more resources allocated than are available. This is unrelated to the current issue, though.

> Regardless, this shows that we were able to duplicate the error in our test
> environment with only a handful of systems.

> [316.2] error: *** STEP 316.2 ON rdsxen16 CANCELLED AT 2021-11-18T16:23:32 ***

The job was canceled by a user. There are no relevant errors in the slurmd log, such as a NODE_FAIL event. We will need a higher level of debug logging from the job and slurmd. Please add this argument to the srun call in the job:
> --slurmd-debug=debug3

or set this in slurm.conf on the test node and restart slurmd to activate the log change:
> SlurmdDebug=debug3

Please add this argument to the srun call in either case:
> -vvvvvvv

Please attach the slurmd log and the srun log from the job when the issue replicates.

Any updates?

There haven't been any updates in a month. I'm going to time this ticket out.
Please reply and we can continue debugging.

Thanks,
--Nate
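For reference, the debug settings Nate requested, collected in one sketch (the actual srun arguments of the test job are not shown in this ticket):

```
# Per-job: add verbose client logging and elevated slurmd log forwarding
# to the srun call inside the batch script.
srun -vvvvvvv --slurmd-debug=debug3 <usual srun arguments>

# Alternatively, node-wide: set "SlurmdDebug=debug3" in slurm.conf on the
# test node and restart slurmd to raise the daemon's log level for all jobs.
```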