Description
Tom Wurgler
2021-10-12 13:55:38 MDT
Please note we do have primary and backup slurmctld's. So we would need to shut both of them down before changing the slurm.conf? Thanks.

Hi Tom,

Shutting down slurmctld is recommended when adding nodes because srun jobs can otherwise fail. Slurm uses bitmaps to keep track of node information, and these are created when the daemon starts up, based on slurm.conf. When srun attempts to make an allocation while there is a discrepancy between the slurmctld and slurmd bitmaps, the job fails. Stopping slurmctld ensures that no srun jobs can be scheduled or allocated while slurm.conf is altered and the daemons restarted, preventing srun jobs from failing during the node-adding process.

There is a technical difference between an srun job and an sbatch job with srun commands. To make it clear: an srun job is one created directly by srun, which handles both making the allocation and the step; an sbatch job with srun commands in it is one where sbatch makes the allocation and srun only makes the steps.

If your site can ensure that no srun jobs will be submitted and scheduled during the node-adding process (or is okay with them failing, for that matter), then you could get away with not stopping slurmctld. It should be noted that stopping slurmctld only prevents further scheduling. Moreover, restarting slurmd will not interrupt running jobs, and completing jobs will wait for slurmctld to return to service.

Best, Skyler

OK, so we follow the steps below. A couple more questions: Do we need to stop the backup slurmctld? Before or after the primary? When restarting things, do we start the backup slurmctld before or after the primary? And how long do we have to get things restarted once slurmctld is shut down before pending or running jobs start dying? Thanks.

> Do we need to stop the backup slurmctld? Before or after the primary?
> When restarting things, do we start the backup slurmctld before or after the primary?

Backup daemons form a linear hierarchy -- think of an array of daemons. The zeroth daemon is the primary, followed by the backup daemons in order. It is best to stop them in reverse order (slurmctld[n], ..., then slurmctld[0]) and start them in forward order (slurmctld[0], ..., then slurmctld[n]) so as to reduce switching of control.

> And how long do we have to get things restarted once slurmctld is shut down before pending or running jobs start dying?

When slurmctld is shut down, only scheduling stops. Jobs which have been allocated and started keep running on the slurmd(s). Running jobs will not die; rather, once they have finished on slurmd they stay in a completing state and wait to notify slurmctld. Pending jobs stay in the queue until slurmctld schedules them or they are cancelled by a user/admin.

This was another catastrophic day. The cluster was 100% used for the nodes in slurm.conf, with literally hundreds of jobs pending (over 5000 cores' worth). We wanted to add another chassis of nodes that weren't in slurm.conf to start with.
1) Shut down the backup slurmctld.
2) Shut down the primary slurmctld.
3) Installed the new slurm.conf and topology file with the new nodes added.
4) Restarted slurmd on all cluster nodes.
5) Restarted slurmd on all desktops.
6) Started the primary slurmctld.
7) Started the backup slurmctld.
Within seconds users were calling with failed jobs. 86 jobs died that we know of; most still say they are running but aren't. What can we do to prevent this kind of thing? No typos in the files this time. The new nodes are now running jobs happily, but loss of jobs is about the worst thing the admins can do. We just can't have this happen. I am raising the priority as this is so important. Please advise. Thanks, Tom.

Would you please attach your slurm.conf's (before and after the node addition) and the slurmctld.log?

Created attachment 21768 [details]
slurmctld.log
Created attachment 21769 [details]
part of slurm.conf
Created attachment 21770 [details]
new slurm.conf
Created attachment 21771 [details]
previous slurm.conf
Created attachment 21772 [details]
new topology.conf
Created attachment 21773 [details]
previous topology.conf
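The node-addition steps and the controller stop/start ordering discussed in the comments above can be sketched as a dry-run shell script. The hostnames (ctl-primary, ctl-backup, compute-nodes) and the config path are placeholders rather than this site's real names, and `run` only echoes each command, so the sketch is safe to execute as-is:

```shell
#!/bin/sh
# Sketch of the node-add procedure with primary+backup slurmctld.
# Index 0 of CONTROLLERS is the primary; later entries are backups.
# Hostnames are hypothetical; every command is echoed, not executed.
CONTROLLERS="ctl-primary ctl-backup"

run() { echo "+ $*"; }  # dry run: print the command instead of executing it

node_add_dryrun() {
    # 1) Stop controllers in reverse order: backups first, primary last.
    for host in $(printf '%s\n' $CONTROLLERS | tac); do
        run ssh "$host" systemctl stop slurmctld
    done
    # 2) Install the new slurm.conf/topology.conf (shared NFS install here).
    run cp slurm.conf.new /apps/share/slurm/conf/slurm.conf
    # 3) Restart every slurmd so all daemons rebuild bitmaps from the new conf.
    run ssh compute-nodes systemctl restart slurmd
    # 4) Start controllers in forward order: primary first, then backups.
    for host in $CONTROLLERS; do
        run ssh "$host" systemctl start slurmctld
    done
}

node_add_dryrun
```

Running it prints the ordered plan; removing the echo in `run` would turn the sketch into the real procedure.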
Thanks for looking into this.

I notice a number of errors that should be looked into on your end.

> [2021-10-14T15:00:24.678] error: WARNING: switches lack access to 2 nodes: alnxrsch1,rdsxenhn
> [2021-10-14T15:00:24.678] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.

Please update the topology to connect the remaining nodes (alnxrsch1,rdsxenhn).

> [2021-10-05T10:04:41.460] Batch JobId=1144530 missing from batch node rdsxen134 (not found BatchStartTime after startup), Requeuing job
> [2021-10-05T10:04:41.460] _job_complete: JobId=1144530 WTERMSIG 126
> [2021-10-05T10:04:41.460] _job_complete: JobId=1144530 cancelled by node failure

Please attach the slurmd.log for one of these nodes. Thanks.

> [2021-10-14T15:00:42.017] error: Node rdsvm210 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2021-10-14T15:26:42.413] error: Node alnxr500 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2021-10-14T15:26:42.442] error: Node rdsvm109 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Out-of-sync nodes mean out-of-sync bitmaps, which can be a big issue. This can certainly contribute to jobs failing.
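The "different slurm.conf" errors above can be caught before restarting daemons by comparing checksums of each node's copy against the controller's. A minimal local sketch for illustration (on a real cluster the per-node files would first be gathered with ssh or pdsh; the function name is my own):

```shell
#!/bin/sh
# Compare slurm.conf copies by checksum and report any that differ from
# the reference copy (first argument, e.g. the controller's file).
# Purely local sketch: on a real cluster you would first copy each
# node's /etc/slurm/slurm.conf somewhere central.
check_conf_sync() {
    ref=$(sha256sum "$1" | cut -d' ' -f1)
    for f in "$@"; do
        h=$(sha256sum "$f" | cut -d' ' -f1)
        [ "$h" = "$ref" ] || echo "OUT OF SYNC: $f"
    done
}
```

Any file reported as out of sync would warrant a config push and a slurmd restart on that node before slurmctld comes back.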
> [2021-10-15T07:00:06.765] error: slurm_receive_msg [163.243.17.44:45114]: Zero Bytes were transmitted or received
> [2021-10-15T07:13:42.972] error: slurm_receive_msg [10.103.143.108:35822]: Zero Bytes were transmitted or received
> [2021-10-15T07:19:55.208] error: slurm_receive_msg [10.103.143.121:48600]: Zero Bytes were transmitted or received

Verify these IP addresses. If they correspond to nodes, then you should verify the network routing and diagnose the nodes -- drain, restart, check logs, etc. -- as needed.

> [2021-10-14T14:45:01.868] error: _find_node_record(763): lookup failure for rdsxen7
> [2021-10-14T14:45:01.878] error: _find_node_record(763): lookup failure for rdsxen8
> [2021-10-14T14:45:01.880] error: _find_node_record(763): lookup failure for rdsxen11
> [2021-10-14T14:45:01.993] error: _find_node_record(763): lookup failure for rdsxen6
> [2021-10-14T14:45:02.052] error: _find_node_record(763): lookup failure for rdsxen14

It appears as though the new nodes are not resolving correctly, yet you report jobs are running on them. Please look into this as well.

Please provide an update. Were you able to resolve or verify the networking issues? What issues remain?

I have cleaned up most of the errors regarding the nodes that still needed slurmd restarted. However, we don't feel that any of these issues would or should have caused us to lose jobs. Looking at the support contract, we think we have something like 8 hours of support time for talking with you folks via Teams. We'd like to go into all this much deeper to figure it out. So what I propose is:
1) Let me poll our admins for any Slurm questions to bring up (on this topic or any other).
2) I will send the questions to you for preview.
3) We set up a Teams meeting to discuss, and perhaps have you comment further on our setup, on this topic, and on whatever questions we have.
How about a meeting Nov 1 (a week from Monday)? I am out next Thursday and another admin is out Friday...
Will this work for your team? Should I work with Jess Arington to arrange it? Thanks, Tom.

Our new best date for a technical meeting is November 4. Would 10:00 AM our time work for you?

Tom, I am looping Jess into the conversation so he is aware and can set that up.
> Our new best date for a technical meeting is November 4.
> Would 10:00AM our time work for you?
Would you also send the list of questions you are looking at covering in that conversation? I want to make sure I put the right engineer on that call when we settle on a day and time.
Tom, I'll be taking over for Skyler on this ticket in preparation for our meeting next week. Can you please provide your current slurm.conf and friends if they have changed since comment#11? Please run these commands on your controller as root and attach the output:
> scontrol show config
> scontrol show nodes
> sdiag
> sacctmgr show stats

(In reply to Tom Wurgler from comment #18)
> 1) let me poll our admins for any slurm questions to bring up (this topic or any other)
> 2) I will send the questions to you for preview
Please attach these questions. The more lead time I have, the easier it is to answer them beforehand, or at least to answer them at the meeting.
> 3) We set up a teams meeting to discuss and perhaps have you comment further on our setup on this topic and on whatever questions we have.
How are the nodes getting added to the cluster (outside of Slurm)?
Thanks, --Nate

Hi Nate, I'll get the output of those commands in a bit or first thing tomorrow. But I wanted to ask about your last line: how are nodes getting added to the cluster (outside of Slurm)?

We have a cluster with 26 chassis. Twenty-five of the chassis had been defined in "prod" Slurm. Chassis #1 was in a separate Slurm install for testing Slurm configurations, Slurm versions, etc., with a different primary controller and a different MariaDB install. Prod Slurm and Test Slurm were independent. This was all working fine.

Our prod cluster was full, with hundreds of jobs pending (something like nearly 10000 cores' worth of jobs). So I shut down the test Slurm environment with the intent to add that first chassis' worth of nodes to the prod Slurm environment. I reimaged the nodes to our production level (RHEL 7.5). Once all were up again, I followed the step-by-step I listed in this ticket. Note: when the step came to start slurmd on the compute nodes, that first chassis wasn't ready after all; I had to add hwloc-libs. I had already shut down slurmctld on the backup and then the primary controllers. So there was a few minutes of delay with slurmctld down while hwloc-libs was added.

Is this what you meant by how we add nodes outside of Slurm? The nodes we were adding are identical to all the other nodes. All part of the same cluster.

(In reply to Tom Wurgler from comment #23)
> I reimaged the nodes to our production level (RHEL 7.5). Once all were up
> again, I followed the step-by-step I listed in this ticket.
I take it that these nodes are stateful, then? Is Slurm installed on local disk too?

Our Slurm is installed via NFS on a shared disk. Neither the prod nor the test Slurm environment is local.

(In reply to Tom Wurgler from comment #26)
> Our slurm is installed via NFS on a shared disk.
Is it versioned out as suggested by slide 25:
> https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf
> Neither the prod or test slurm environments are local.
Do the clusters share a common interconnect?

(In reply to Tom Wurgler from comment #23)
> We have a cluster with 26 chassis.
Is it possible to get the output of 'slurmd -C' from all of the compute nodes and login nodes (and any other nodes that users have access to)?
> I reimaged the nodes to our production level (RHEL 7.5). Once all were up
> again, I followed the step-by-step I listed in this ticket.
What cluster management software is being used for imaging? The specific one doesn't matter to Slurm, but it may help inform my suggestions.
> Note, when the step came to start slurmd on compute nodes, that first
> chassis wasn't ready after all. I had to add hwloc-libs. I had already
> shut down slurmctld on the backup and then the primary controllers. So there
> was a few minutes of delay with slurmctld down while hwloc-libs was added.
Has your site considered a health script, or at the very least a set of standard quick test jobs for nodes after reboot/addition?
> Is this what you meant by how we add nodes outside of Slurm? The nodes
> we were adding are identical to all the other nodes. All part of the same cluster.
I don't like to assume that clusters are perfectly homogeneous; I'd rather be sure before making suggestions. For instance, the config provided in comment#11 shows that there is at least one node with a different GRES configuration. Please also provide 'slurmctld -V' from the controllers on your prod and test clusters.

Yes, it is versioned with prod and test symlinks. There is InfiniBand on all of the cluster (except the headnode, which does not have IB). And the cluster headnode is our slurmctld. We also have our desktops as part of Slurm, of course. And they are not on IB, nor even in the same physical building. I'll attach the slurmd -C for the cluster. Did you want the desktops as well? We use RHEL kickstart for imaging the cluster (and the desktops, as a matter of fact). We currently run Node Health Check across our cluster and desktops. It runs at either 5 or 10 minute intervals; I don't remember which.

root@rdsxenhn: ~ # /usr/local/slurm/sbin/slurmctld -V
slurm 20.11.8

I didn't run slurmctld -V on the test environment.

Created attachment 21967 [details]
slurmd -C for the cluster compute nodes
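The controller outputs requested above can be bundled with a small script. A sketch, with a hypothetical `collect` helper and file names of my own choosing; commands that are not installed are skipped, so the sketch can be tried anywhere:

```shell
#!/bin/sh
# Sketch: gather controller diagnostics into a tarball for a support
# ticket. Commands not found on this machine are noted and skipped.
OUT="${OUT:-./slurm-diag}"
mkdir -p "$OUT"

collect() {
    # $1 = output file name, remaining args = command line to run
    name=$1; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$OUT/$name" 2>&1
    else
        echo "skipped: $* (not installed)" > "$OUT/$name"
    fi
}

collect config.txt  scontrol show config
collect nodes.txt   scontrol show nodes
collect sdiag.txt   sdiag
collect dbstats.txt sacctmgr show stats

tar -czf slurm-diag.tar.gz -C "$OUT" .
```

On the actual controller this would be run as root, and `slurm-diag.tar.gz` attached to the ticket.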
(In reply to Bill Benedetto from comment #28)
> Yes, it is versioned with prod and test symlinks.
Is systemd used to manage the daemons?
> There is infiniband on all of the cluster (except the headnode, which does not have IB).
I assume this means the IB firmwares are also kept in sync and there is a single opensm server?
> And the cluster headnode is our slurmctld.
I assume it also has slurmdbd and the MySQL DB?
> We also have our desktops as part of slurm, of course.
Do these desktops run slurmd? Are they using Munge or JWT for auth?
> And they are not on IB nor even in the same physical building.
Is the connection to Slurm wrapped by TLS or a VPN? Slurm is not designed to go in the clear over the internet.
> I'll attach the slurmd -C for the cluster.
> Did you want the desktops as well?
If they are submit-only, then no. If there is a possibility of running jobs on them, then yes.
> We use RHEL kickstart for imaging the cluster (and the desktops, as a matter of fact).
How are the nodes kept in sync after that?
> We currently run Node Health Check across our cluster and desktops.
> It runs at either 5 or 10 minute intervals. I don't remember which.
I didn't see it in the config provided. Please attach a current version, including the other config files for Slurm. We generally ask sites to tarball and/or zip files when attaching.
> root@rdsxenhn: ~ # /usr/local/slurm/sbin/slurmctld -V
> slurm 20.11.8
>
> I didn't run slurmctld -V on the test environment.
Are there any patches active on either cluster outside of the normal tagged releases?
On Wed, 2021-10-27 at 17:38 +0000, bugs@schedmd.com wrote:

(In reply to Nate Rini from comment #30)
> Is systemd used to manage the daemons?
I have no idea what this question means. We use systemd to START the daemons everywhere.
> I assume this means the IB firmwares are also kept in sync and there is a single opensm server?
We have IB switches in the cluster; they are all interconnected, and one is acting as the master. When we put the cluster together they would all have been at the same firmware level, and we wouldn't have updated it. So based on our experience, we're going to say they are kept in sync.
> I assume it also has slurmdbd and the MySQL DB?
No. They run on another host, a VM.
> Do these desktops run slurmd? Are they using Munge or JWT for auth?
They run slurmd and are using Munge.
> Is the connection to Slurm wrapped by TLS or a VPN? Slurm is not designed to go in the clear over the internet.
As far as the networking goes, the traffic is all internal to Goodyear, with no VPN.
> If they are submit-only, then no. If there is a possibility of running jobs on them, then yes.
They are all active participants, submitting and/or running. See the next attachment.
> How are the nodes kept in sync after that?
Kickstart. The nodes don't get updated unless we re-image them.
> I didn't see it in the config provided. Please attach a current version, including the other config files for Slurm.
You want our NHC config? What does this have to do with our issue?
> Are there any patches active on either cluster outside of the normal tagged releases?
OS patches? I imagine that there are loads of them. We include a bunch of them during the kickstart process. Are all of the systems the same? Yes. The desktops are kickstart'ed from a file. The cluster nodes are kickstart'ed from a different file.

Created attachment 21970 [details]
slurmd -C for the desktop nodes
(In reply to Bill Benedetto from comment #31)
> I have no idea what this question means. We use systemd to START the daemons everywhere.
Is your site using the included systemd unit files generated by the Slurm installer? Please run on a compute node:
> systemctl show slurmd
> systemctl status slurmd
> You want our NHC config? What does this have to do with our issue?
I'm attempting to understand your site's setup in order to provide the best advice with respect to Slurm. No, I don't need the NHC config. Please attach a current slurm.conf and friends.

We started out with a slurm install-generated slurmd.service file.
We've made some minor changes.
Here are the details from one of our compute nodes:
root@rdsxen66: ~ # systemctl cat slurmd
# /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service multi-user.target
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
[Install]
WantedBy=multi-user.target
root@rdsxen66: ~ # systemctl show slurmd
Type=forking
Restart=no
PIDFile=/var/run/slurmd.pid
NotifyAccess=none
RestartUSec=100ms
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
WatchdogUSec=0
WatchdogTimestampMonotonic=0
StartLimitInterval=10000000
StartLimitBurst=5
StartLimitAction=none
FailureAction=none
PermissionsStartOnly=no
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
MainPID=172709
ControlPID=0
FileDescriptorStoreMax=0
StatusErrno=0
Result=success
ExecMainStartTimestamp=Thu 2021-10-14 14:58:04 EDT
ExecMainStartTimestampMonotonic=19178138232539
ExecMainExitTimestampMonotonic=0
ExecMainPID=172709
ExecMainCode=0
ExecMainStatus=0
ExecStart={ path=/usr/local/slurm/sbin/slurmd ; argv[]=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS ; ignore_errors=no ; start_time=[Thu 2021-10-14 14:58:03 EDT] ; stop_time=[Thu 2021-10-14 14:58:04 EDT] ; pid=172704 ; code=exited ; status=0 }
ExecReload={ path=/bin/kill ; argv[]=/bin/kill -HUP $MAINPID ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
Slice=system.slice
ControlGroup=/system.slice/slurmd.service
MemoryCurrent=17121280
TasksCurrent=67
Delegate=yes
CPUAccounting=no
CPUShares=18446744073709551615
StartupCPUShares=18446744073709551615
CPUQuotaPerSecUSec=infinity
BlockIOAccounting=no
BlockIOWeight=18446744073709551615
StartupBlockIOWeight=18446744073709551615
MemoryAccounting=no
MemoryLimit=18446744073709551615
DevicePolicy=auto
TasksAccounting=no
TasksMax=18446744073709551615
EnvironmentFile=/etc/sysconfig/slurmd (ignore_errors=yes)
UMask=0022
LimitCPU=18446744073709551615
LimitFSIZE=18446744073709551615
LimitDATA=18446744073709551615
LimitSTACK=18446744073709551615
LimitCORE=18446744073709551615
LimitRSS=18446744073709551615
LimitNOFILE=51200
LimitAS=18446744073709551615
LimitNPROC=514533
LimitMEMLOCK=18446744073709551615
LimitLOCKS=18446744073709551615
LimitSIGPENDING=514533
LimitMSGQUEUE=819200
LimitNICE=0
LimitRTPRIO=0
LimitRTTIME=18446744073709551615
OOMScoreAdjust=0
Nice=0
IOScheduling=0
CPUSchedulingPolicy=0
CPUSchedulingPriority=0
TimerSlackNSec=50000
CPUSchedulingResetOnFork=no
NonBlocking=no
StandardInput=null
StandardOutput=journal
StandardError=inherit
TTYReset=no
TTYVHangup=no
TTYVTDisallocate=no
SyslogPriority=30
SyslogLevelPrefix=yes
SecureBits=0
CapabilityBoundingSet=18446744073709551615
AmbientCapabilities=0
MountFlags=0
PrivateTmp=no
PrivateNetwork=no
PrivateDevices=no
ProtectHome=no
ProtectSystem=no
SameProcessGroup=no
IgnoreSIGPIPE=yes
NoNewPrivileges=no
SystemCallErrorNumber=0
RuntimeDirectoryMode=0755
KillMode=process
KillSignal=15
SendSIGKILL=yes
SendSIGHUP=no
Id=slurmd.service
Names=slurmd.service
Requires=basic.target
Wants=system.slice
WantedBy=multi-user.target
Conflicts=shutdown.target
Before=shutdown.target
After=network.target basic.target system.slice systemd-journald.socket munge.service multi-user.target
Description=Slurm node daemon
LoadState=loaded
ActiveState=active
SubState=running
FragmentPath=/usr/lib/systemd/system/slurmd.service
UnitFileState=enabled
UnitFilePreset=disabled
InactiveExitTimestamp=Thu 2021-10-14 14:58:03 EDT
InactiveExitTimestampMonotonic=19178137721959
ActiveEnterTimestamp=Thu 2021-10-14 14:58:04 EDT
ActiveEnterTimestampMonotonic=19178138232625
ActiveExitTimestamp=Thu 2021-10-14 14:58:03 EDT
ActiveExitTimestampMonotonic=19178137713209
InactiveEnterTimestamp=Thu 2021-10-14 14:58:03 EDT
InactiveEnterTimestampMonotonic=19178137717692
CanStart=yes
CanStop=yes
CanReload=yes
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureJobMode=replace
IgnoreOnIsolate=no
IgnoreOnSnapshot=no
NeedDaemonReload=no
JobTimeoutUSec=0
JobTimeoutAction=none
ConditionResult=yes
AssertResult=yes
ConditionTimestamp=Thu 2021-10-14 14:58:03 EDT
ConditionTimestampMonotonic=19178137717966
AssertTimestamp=Thu 2021-10-14 14:58:03 EDT
AssertTimestampMonotonic=19178137721659
Transient=no
root@rdsxen66: ~ # systemctl status slurmd
* slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2021-10-14 14:58:04 EDT; 1 weeks 5 days ago
Process: 172704 ExecStart=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 172709 (slurmd)
Tasks: 67
Memory: 16.3M
CGroup: /system.slice/slurmd.service
|- 24454 slurmstepd: [1176940.batch]
|- 24519 /bin/bash /var/spool/slurmd/job1176940/slurm_script
|- 25017 /bin/csh -f /apps/tpm/lsf/launch -n 96 eagle -i ac.cw.80.m...
|- 25150 /usr/local/slurm/bin/srun -n 96 /apps/tpm/SIERRA/Eagle3_co...
|- 25152 /usr/local/slurm/bin/srun -n 96 /apps/tpm/SIERRA/Eagle3_co...
|- 25162 slurmstepd: [1176940.1]
|- 25168 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25169 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25170 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25171 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25172 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25173 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25174 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25175 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25176 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25177 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25178 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25179 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25180 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25181 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25182 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25183 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25184 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25185 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25186 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25187 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25188 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25189 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25190 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
|- 25191 /apps/tpm/SIERRA/Eagle3_compile_I19_IMPI/install/NewSierra...
`-172709 /usr/local/slurm/sbin/slurmd
Oct 14 14:58:03 rdsxen66 systemd[1]: Starting Slurm node daemon...
Oct 14 14:58:04 rdsxen66 systemd[1]: PID file /var/run/slurmd.pid not readab...t.
Oct 14 14:58:04 rdsxen66 systemd[1]: Started Slurm node daemon.
Hint: Some lines were ellipsized, use -l to show in full.
root@rdsxen66: ~ #
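As an aside on the unit file above: site-local settings such as the raised limits can be kept in a systemd drop-in rather than edits to the packaged unit file. A sketch, with the drop-in directory parameterized to a local path so it is safe to run anywhere; on a real node it would be /etc/systemd/system/slurmd.service.d:

```shell
#!/bin/sh
# Sketch: carry local slurmd.service tweaks in a systemd drop-in so the
# packaged unit file can be replaced on upgrade without losing them.
# DROPIN_DIR defaults to a local path here for safety; on a node it
# would be /etc/systemd/system/slurmd.service.d
DROPIN_DIR="${DROPIN_DIR:-./slurmd.service.d}"

mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/override.conf" <<'EOF'
[Service]
# Site-local limits, mirroring the modified unit file shown above
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
EOF

# On a real node, follow with:
#   systemctl daemon-reload && systemctl restart slurmd
```

Settings in the drop-in override the matching keys in the base unit, so the shipped slurmd.service never needs to be touched.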
(In reply to Bill Benedetto from comment #34)
> We started out with a slurm install-generated slurmd.service file.
> We've made some minor changes.
For future reference, I suggest using a drop-in config instead of modifying the file directly. This will make future upgrades easier, but is of course at the discretion of your site.
> ExecStart=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS
Is this slurmd a symlink?

(In reply to Nate Rini from comment #33)
> Please attach a current slurm.conf and friends.
If possible, for both the prod and test clusters.

On Wed, 2021-10-27 at 18:53 +0000, bugs@schedmd.com wrote:

(In reply to Nate Rini from comment #35)
> For future reference, I suggest using a drop-in config instead of modifying the file directly.
Like I said, we started with a slurm installation-generated one. We rarely update the slurmd.service files. If it works, we're happy. I CAN see, though, that it would make sense to do what you say, as things may change.
> Is this slurmd a symlink?
root@rdsxen66: ~ # ls -Fl /usr/local/slurm
lrwxrwxrwx 1 root root 25 Mar 27 2021 /usr/local/slurm -> /apps/share/slurm/current/
root@rdsxen66: ~ # ls -Fl /apps/share/slurm
total 168
drwxr-xr-x 9 root    root 8192 Mar  4  2020 19.05.5/
drwxr-xr-x 8 root    root  152 Apr  1  2020 20.02.1/
drwxr-xr-x 8 root    root 8192 Sep 30  2020 20.02.2/
drwxr-xr-x 9 lda6434 C15  8192 Oct 13  2020 20.02.4-1/
drwxr-xr-x 9 lda6434 C15  8192 Mar 18  2021 20.02.6-1/
drwxr-xr-x 9 lda6434 C15  8192 Mar 11  2021 20.11.4-1/
drwxr-xr-x 9 lda6434 C15  8192 Mar 27  2021 20.11.5-1/
drwxr-xr-x 3 lda6434 C15   152 Jul 15 13:32 20.11.5-1-desktop/
drwxr-xr-x 2 lda6434 C15  8192 Jun  2 05:50 20.11.5-systemd/
drwxr-xr-x 9 lda6434 C15  8192 Jun  4 05:40 20.11.7-1/
drwxr-xr-x 9 lda6434 C15  8192 Jul 15 03:51 20.11.8-1/
drwxr-xr-x 3 lda6434 C15  8192 Sep  9 02:05 GY-bin/
drwxr-xr-x 3 root    root 8192 Dec  3  2020 GY-bin-RCS/
drwxr-xr-x 2 lda6434 C15  8192 Apr 13  2021 GY-bin-from-Jenkins/
drwxr-xr-x 3 lda6434 root 8192 Dec  1  2020 GY-bin-svn.OLD/
drwxr-xr-x 2 root    root 8192 Oct  7 14:45 RCS/
drwxr-xr-x 3 lda6434 root 8192 Aug 26 11:59 conf/
drwxr-xr-x 3 root    root 8192 Oct 22  2019 contribs/
lrwxrwxrwx 1 root    root    9 Aug 21 12:51 current -> 20.11.8-1/
lrwxrwxrwx 1 lda6434 C15    17 Jul 15 14:42 current-desktop -> 20.11.5-1-desktop/
lrwxrwxrwx 1 root    root    9 Mar 27  2021 current.210821 -> 20.11.5-1/
lrwxrwxrwx 1 root    root    9 Oct 13  2020 old_current -> 20.02.4-1/
lrwxrwxrwx 1 root    root    9 Oct 13  2020 old_prod -> 20.02.4-1/
lrwxrwxrwx 1 root    root    7 May  8  2020 prior -> 20.02.2/
lrwxrwxrwx 1 root    root    9 Aug 21 12:51 prod -> 20.11.8-1/
lrwxrwxrwx 1 root    root    9 Mar 27  2021 prod.210821 -> 20.11.5-1/
-r-xr-xr-x 1 root    root 4435 Dec 12  2019 setup_desktop_client*
-r-xr-xr-x 1 root    root 5054 Oct  7 14:45 setup_desktop_compute*
-rwxr-xr-x 1 root    root 5434 Mar 25  2021 setup_desktop_compute_test*
-r-xr-xr-x 1 root    root 6924 Dec 12  2019 setup_head*
-rwxr-xr-x 1 root    root 7203 Mar 25  2021 setup_head_test*
lrwxrwxrwx 1 root    root    9 Jul 21 10:11 test -> 20.11.8-1/
lrwxrwxrwx 1 root    root    9 Jun  7 12:07 test-prior -> 20.11.7-1/
root@rdsxen66: ~ #

(In reply to Bill Benedetto from comment #37)
> root@rdsxen66: ~ # ls -Fl /usr/local/slurm
> lrwxrwxrwx 1 root root 25 Mar 27 2021 /usr/local/slurm -> /apps/share/slurm/current/
Is your site interested in running two sets of slurmd on the test machines? While there is work on the test machine, a simple reservation can be placed on the prod cluster, or the nodes can be marked down. This would make capacity increases simple while still allowing your site to test. Since both clusters already share the same IB fabric, and I can only assume the same NFS mounts, they are already pretty intertwined.

Created attachment 21972 [details]
slurm.conf as of 27-Oct-2021
This should be the same as "21770: new slurm.conf".
(Or close to it, TBH)
Created attachment 21973 [details]
slurm.conf for test environment as of 27-Oct-2021
Please also attach:
> /apps/share/slurm/conf/slurm_common.conf
How does your site handle name resolution? /etc/hosts?
(In reply to Nate Rini from comment #41)
> Please also attach:
> > /apps/share/slurm/conf/slurm_common.conf
>
> How does your site handle name resolution? /etc/hosts?

Primarily DNS. /etc/hosts tends to be pretty small everywhere.

Created attachment 21992 [details]
slurm_common.conf
Please also provide:
> gres.conf
> topology.conf
> acctgather.conf
Also is it possible to get some background on this?
> ##########################################
> # THIS CAUSES PACKING TO WORK - BUT WE REALLY WANT cons_tres
> #SelectType=select/cons_res
> ##########################################
> # THIS CAUSES WEIRD SPLITTING PROBLEMS!! #
> SelectType=select/cons_tres # for gpu & abaqus
> ##########################################
(In reply to Nate Rini from comment #44)
> Please also provide:
> > gres.conf
> > topology.conf
> > acctgather.conf

We don't got no acctgather.conf file.... I'll attach the others directly.

Created attachment 21995 [details]
gres.conf
Created attachment 21996 [details]
topology.conf
(In reply to Bill Benedetto from comment #46)
> (In reply to Nate Rini from comment #44)
> > Please also provide:
> > > gres.conf
> > > topology.conf
> > > acctgather.conf
>
> We don't got no acctgather.conf file....

Understood.

> I'll attach the others directly.

To make things easier in the future, please consider just tarballing up all of the config files for support tickets. Attaching a single file is usually easier than multiple files individually, and we have no problem opening tarballs (or zips).

Concerning comment #45: when we started using slurm, we had SelectType=select/cons_res. Our nodes have 24 cores, and if we submitted a 32-way job, for example, it used 24 cores on the first node and 8 on a second node. More jobs would fill in the other 16 cores (plus more nodes if needed). We always called this node packing. But our Lux counterpart really wanted cons_tres for GPU tracking etc. That made packing nodes do weird stuff: a 32-way job would get 16 cores on the first node and 16 on the second. Or worse, 16 on one node, 15 on a second, and 1 on a third. This seemed pretty inefficient and (we believe) would hurt performance to some level. We started adding -N 2 for, say, a 32-way job to keep it to at least 2 nodes. I filed a ticket for this, and the latest slurm 21.08 is supposed to have fixed this so cons_tres packs like cons_res. We haven't tested 21.08 as yet.

(In reply to Tom Wurgler from comment #50)
> I filed a ticket for this and the latest slurm 21.08 is supposed to have
> fixed this so cons_tres packs like cons_res. We haven't tested 21.08 as yet.

Great, I'll defer to that ticket then. We generally prefer not to mix issues in a single ticket, but I wanted to make sure that wasn't outstanding on our part.

Is Slurm installed from source or from RPM? Given its placement on NFS, I assume from source, but I want to verify first.

(In reply to Nate Rini from comment #52)
> Is Slurm installed from source or from RPM?
> Given its placement on NFS, I assume from source but I want to verify first.

From source. We use Jenkins so that it's built the same way every time, regardless of which one of us builds it.

Hi. Here is a list of some questions we'd like info on during our meeting Thursday:
1) Jobs killed when adding nodes ----> this ticket
2) In general, the procedure for changing slurm.conf
3) Node weighting
4) How to force a pending job to run on specific nodes
5) General critique of our config files and setup
Sorry to be so late with this. thanks tom

(In reply to Tom Wurgler from comment #54)
> 1) Jobs killed when adding nodes----> this ticket

Do you have a list of the jobs that failed during the config change? The slurmctld log attached had a good number of authentication errors, which means I need to see the slurmd (and maybe slurmstepd) logs at the time of the failures. Is it possible to get them?

(In reply to Tom Wurgler from comment #54)
> 3) Node weighting

Can you please provide more details on what you mean here. Is this just the node weight parameter in slurm.conf?

Created attachment 22076 [details]
slurmd.log for one of the failed jobs
one slurmd.log file from a failed job
Created attachment 22077 [details]
second slurmd.log file from different failed job
second slurmd.log
(In reply to Nate Rini from comment #56)
> (In reply to Tom Wurgler from comment #54)
> > 3) Node weighting
>
> Can you please provide more details on what you mean here. Is this just the
> node weight parameter in slurm.conf?

I need to defer to Patrick Hock (admin in Luxembourg). There was some reason we couldn't just weight the nodes. Also, another topic for discussion if there is time is potential purging/backup of the database.

(In reply to Tom Wurgler from comment #57)
> Created attachment 22076 [details]
> slurmd.log for one of the failed jobs
>
> one slurmd.log file from a failed job

Is this one of the failed jobs?

> [2021-10-25T20:03:28.697] launch task StepId=1177190.0 request from UID:1084 GID:2910 HOST:163.243.23.94 PORT:46610
> [2021-10-25T20:03:29.865] [1177190.0] task/cgroup: _memcg_initialize: /slurm/uid_1084/job_1177190: alloc=0MB mem.limit=128655MB memsw.limit=unlimited
> [2021-10-25T20:03:29.865] [1177190.0] task/cgroup: _memcg_initialize: /slurm/uid_1084/job_1177190/step_0: alloc=0MB mem.limit=128655MB memsw.limit=unlimited
> [2021-10-27T15:47:25.429] [1177190.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
> [2021-10-27T15:47:28.394] [1177190.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
> [2021-10-27T15:47:28.395] [1177190.0] done with job

(In reply to Nate Rini from comment #60)
> Is this one of the failed jobs?

Please provide this output for at least one of the failed jobs:
> sacct -o all -p -D -j $JOBID

Created attachment 22084 [details]
sacct info for job 1161668
Looks like the slurmctld log for job 1161668 is incomplete. Please grep that number out of the slurmctld logs and attach the output. Created attachment 22085 [details]
output from grep requested
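Greps like the one requested above can be run across all rotated copies of the controller log at once. A generic sketch, self-contained by fabricating a throwaway log directory (the file names and paths are made-up stand-ins, not the site's real ones):

```shell
# Fabricate a "log directory" so the sketch runs anywhere;
# on a real controller this would be e.g. /var/log/slurm/.
logdir=$(mktemp -d)
echo '[2021-10-15T14:36:03.414] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=1161668 uid 2256' \
    > "$logdir/slurmctld.log-20211020"
echo '[2021-10-16T08:00:00.000] sched: Allocate JobId=1199999' \
    > "$logdir/slurmctld.log-20211021"
# With multiple files, grep prefixes each hit with its filename, which
# makes it clear which rotated log each line came from when attached.
grep '1161668' "$logdir"/slurmctld.log-*
```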
(In reply to Tom Wurgler from comment #65)
> Created attachment 22085 [details]
> output from grep requested

This log shows that the user (or a Slurm-aware job) requested the job killed, and not Slurm's internal checks:
> slurmctld.log-20211020:[2021-10-15T14:36:03.414] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=1161668 uid 2256

Please provide slurmd logs from the other job nodes:
> NodeList=rdsxen[303,326]

Not all these jobs exited slurm. They "died" as in went what we call stale. In the partition, tying up resources, but not continuing. I went in and had to remove them, and the users had to resubmit their jobs.

Created attachment 22086 [details]
rdsxen303 slurmd.log
Created attachment 22087 [details]
rdsxen326 slurmd.log
(In reply to Tom Wurgler from comment #67)
> They "died" as in went what we call stale. In the partition, tying up
> resources, but not continuing. I went in and had to remove them and the
> users had to resubmit their jobs.

So they hung? How does your site determine if they died? The Slurm logs presented so far are basically silent on the why, beyond the explicit request to kill the jobs.

We have a script running every 30 minutes on all running jobs to check the directory for files updated in the last 30 minutes. If no files are found, it sends the admins mail. The code we use updates files regularly. The jobs that were hung were hung overnight and into the next morning with no updates. And these are std jobs we run every day.

(In reply to Tom Wurgler from comment #71)
> We have a script running every 30 minutes on all running jobs to check the
> directory for files updated in the last 30 minutes. If no files are found,
> it sends the admins mail. The code we use updates files regularly.
>
> The jobs that were hung were hung overnight and into the next morning with
> no updates. And these are std jobs we run every day.

Are any logs available from the jobs? Do these hangs only happen while restarting slurmctld? Based on the existence of the script, I suspect there are more triggers for this. Are there any relevant logs from dmesg in the time window of the hang, particularly an NFS hang?

Followup from our meeting:
* I tested how `sbatch --wait` handled the loss of slurmctld, and it just waits for slurmctld to come back.
* Node weights and the topology plugin were fixed by bug#9729 in the slurm-21.08 release.
* Please submit an RFE ticket explicitly requesting documentation of how the node weights and topology are calculated. I couldn't find any existing documentation for this.

Are instructions needed in helping to find the logs?

Sorry....took some vacation time.
I will try to get logs today yet.

(In reply to Tom Wurgler from comment #75)
> Sorry....took some vacation time. I will try to get logs today yet

We are happy to work on your schedule here. We just generally ask that the severity be lowered to the appropriate levels as the situation changes.

Created attachment 22249 [details]
list of jobs that actually failed (not hung jobs) and their slurmd.log files
slurmd.log files from the first node of multinode jobs.
failed_jobs has job numbers etc.
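The 30-minute freshness check Tom describes in comment #71 can be sketched with find(1)'s -mmin test. This is a hypothetical reconstruction, not the site's actual script; the mail step is stubbed out with echo, and one stale "job directory" is fabricated so the sketch runs anywhere:

```shell
# Alert if nothing under a running job's work directory was modified
# in the last 30 minutes.
jobdir=$(mktemp -d)
touch -d '2 hours ago' "$jobdir/solver.out"   # simulate a stalled job

recent=$(find "$jobdir" -type f -mmin -30 | head -1)
if [ -z "$recent" ]; then
    # The real script would mail the admins here instead of echoing.
    echo "STALE: no files updated in 30 minutes under $jobdir"
fi
```

The real version would loop over `squeue`'s running jobs and look up each job's working directory, but the freshness test itself is just the -mmin check above.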
(In reply to Tom Wurgler from comment #77)
> Created attachment 22249 [details]
> list of jobs that actuall failed, not hung jobs and their slurmd.log files
>
> slurmd.log files from first node of mulitnode jobs.
> failed_jobs has job numbers etc

Please also grep the slurmctld logs for this too:
> grep -E '1164734|1164738|1164753|1164788|1164830|1164926|1164948|1165048|1165063|1165064|1165065' $PATH_TO_LOG

(In reply to Nate Rini from comment #79)
> Please also grep the slurmctld logs for this too:
> > grep -E '1164734|1164738|1164753|1164788|1164830|1164926|1164948|1165048|1165063|1165064|1165065' $PATH_TO_LOG

Grepping all of your logs would be preferable.

(In reply to Tom Wurgler from comment #77)
> Created attachment 22249 [details]
> list of jobs that actually failed, not hung jobs and their slurmd.log files

Looking at the exact errors:
> [1165065.batch] error: Could not open stdout file /hpc/scratch/a026560/DEW/202110140054.DewLT.tool/LT_Inflate_Deflect/eagle/torsional_2.000/LT285REFCONST_AKR_A1530149_CrdPrd_1_00_000_3d_full_tor_2.000.lsfout: No such file or directoryO setup failed: No such file or directory

The error for all of these jobs is that the output directory doesn't exist, which would cause any job to fail. Are there any jobs that did have this same error? If this is just a user error, please ignore comment#80 and comment#79.

It is not a user error. The file should have been created during the run, and with the run being interrupted it failed.
(In reply to Tom Wurgler from comment #82)
> It is not a user error. The file should have been created during the run,
> and with the run being interrupted it failed.

Where in the job should it have been created? Slurm will not create a missing directory, only a file.

It is created during the fea job.
(In reply to Tom Wurgler from comment #84)
> It is created during the fea job.

I'm not aware of what a 'fea' job is. Is this a job that runs before the current job or in a step that runs before? Possibly in a prolog or jobsubmit script?

The user submits a parallel FEA (finite element analysis) job to Slurm. I don't know when/how that file is created.

(In reply to Tom Wurgler from comment #86)
> The user submits a parallel FEA (finite element analysis) job to Slurm. I
> don't know when/how that file is created.

On their batch script, possibly on the first line, can they call 'mkdir -p /hpc/scratch/a026560/DEW/202110140054.DewLT.tool/LT_Inflate_Deflect/eagle/torsional_2.000/' and try again? Slurm will not create a directory (or directory tree) for stdout/stderr/stdin.

I didn't ask this user, but I'd bet they have since reran the job successfully. The code they run does all the correct stuff normally. But the job got interrupted.

(In reply to Tom Wurgler from comment #88)
> I didn't ask this user, but I'd bet they have since reran the job
> successfully. The code they run does all the correct stuff normally. But
> the job got interrupted.

I'm afraid we do not have enough information to debug the issue. Is it possible to try an HA failover again with all the logging activated for slurmctld and slurmd of these test jobs?

I am currently re-imaging some nodes in the prod slurm. I drained them, then downed them. Now I am doing the imaging. I believe that was all I had to do to this point. When it is time to resume them, I want to resume 2 chassis worth of nodes as is (they remain in prod Slurm, in the same partitions etc). But that first chassis I want to remove from prod Slurm and re-add back into our test Slurm. How do I safely remove them from prod Slurm? You spoke of "future" state. Do I put that in the prod slurm.conf? Should I just remove those nodes from slurm.conf?
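Circling back to the missing-directory failures: Nate's mkdir -p suggestion would look roughly like the batch script below. This is a hypothetical sketch (the job options, path, and solver command are invented, not the site's real ones), written to a file here so the fragment is self-contained:

```shell
# Hypothetical job script illustrating the suggestion from comment #87:
# create the output directory tree up front, since Slurm creates output
# *files* but never their parent directories.
cat > fea_job.sh <<'EOF'
#!/bin/bash
#SBATCH -n 32
#SBATCH --job-name=fea_example
# First thing: make sure the directory the solver (and any later srun
# steps) will write into actually exists. mkdir -p is a no-op if it does.
mkdir -p /hpc/scratch/"$USER"/example_run/torsional_2.000
srun ./fea_solver
EOF
chmod +x fea_job.sh
grep -n 'mkdir -p' fea_job.sh
```

One caveat: the batch job's own #SBATCH --output file is opened by slurmd before the script body runs, so that directory must already exist at submit time; a mkdir inside the script only helps files written afterward.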
Then we were planning on trying to recreate my issues with adding and removing nodes in the test environment with some test jobs. Thanks.

(In reply to Tom Wurgler from comment #90)
> How do I safely remove them from prod Slurm? You spoke of "future" state.
> Do I put that in the prod slurm.conf? Should I just remove those nodes from
> slurm.conf?

Leave the nodes configured in the slurm.conf for both clusters. Set `state=future` to tell Slurm that these nodes will be added at some point in the future. You can just update slurm.conf and reconfigure to do this, or use scontrol to apply the state manually and update slurm.conf for the next cycle. For the nodes themselves, just make sure they have the correct slurm.conf for the cluster you want them to run under currently.

> Then we were planning on trying to recreate my issues with adding and
> removing nodes in the test environment with some test jobs.

Please activate at least debug3 on slurmctld and slurmd for the test.

Hi. We did this in the slurm.conf:

NodeName=DEFAULT RealMemory=128000 CPUs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
NodeName=rdsxen[1,3-16] NodeAddr=rdsxen[1,3-16] Feature=ib,ib_9984,openmpi,xenon,intel GRES=fv:1 State=FUTURE

And did the scontrol reconfigure. No errors, but the nodes still show up as "down" in scontrol show node=rdsxen1. Now if we do scontrol update node=rdsxen1 state=future on the command line, then do an scontrol show node=rdsxen1, it says "Node rdsxen1 not found". So how do we do this? Thanks.

(In reply to Tom Wurgler from comment #92)
> NodeName=DEFAULT RealMemory=128000 CPUs=24 Sockets=2 CoresPerSocket=12
> ThreadsPerCore=1 State=UNKNOWN
> NodeName=rdsxen[1,3-16] NodeAddr=rdsxen[1,3-16]
> Feature=ib,ib_9984,openmpi,xenon,intel GRES=fv:1 State=FUTURE
>
> And did the scontrol reconfigure. No errors, but the nodes still show up as
> "down" in scontrol show node=rdsxen1.

"down" will stop any jobs from falling on the nodes. Can we get the output of 'scontrol show node $NODE' after doing the reconfigure?

> Now if we do scontrol update node=rdsxen1 state=future on the command line,
> then do an scontrol show node=rdsxen1, it says "Node rdsxen1 not found".

This is expected per the slurm.conf man page:
> Until these nodes are made available, they will not be seen using any Slurm commands or nor will any attempt be made to contact them.

Adding the State=FUTURE was the only change we made. The "future" nodes are still in the various partitions etc.
We did the scontrol reconfigure:

t901353@rds4020:gica > scontrol show node rdsxen1
NodeName=rdsxen1 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=24 CPULoad=0.01
   AvailableFeatures=ib,ib_9984,openmpi,xenon,intel
   ActiveFeatures=ib,ib_9984,openmpi,xenon,intel
   Gres=fv:1
   NodeAddr=rdsxen1 NodeHostName=rdsxen1 Version=20.11.8
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Thu Jan 21 16:15:07 EST 2021
   RealMemory=128000 AllocMem=0 FreeMem=123820 Sockets=2 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=linlarge,medium,support
   BootTime=2021-11-16T14:00:06 SlurmdStartTime=2021-11-16T14:01:20
   CfgTRES=cpu=24,mem=125G,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=testme [root@2021-11-18T10:39:55]
   Comment=(null)
(In reply to Tom Wurgler from comment #94)
> Adding the State=FUTURE was the only change we made. The "future" nodes are
> still in the various partitions etc.
> We did the scontrol reconfigure

I had to check this in the code, but it looks like a reconfigure will not update the states. However, a restart would. Using scontrol to manually update the states or restarting the slurmctld daemon will be required. Please choose which you prefer.

Well, I will suggest this is a bug.
The point of putting nodes in the future state is that it won't mess things up when adding nodes and doing a restart etc. So I really would wish that reconfigure did the deed.

So now we have that chassis in future state via the command line and in the slurm.conf as well. At some point we'll do a restart and it will take effect long term. Now every time we do a reconfigure, the nodes come back but are still in the drain state. So at least jobs won't start on those nodes. But they will be running the test env slurmd anyway.
(In reply to Tom Wurgler from comment #96)
> Well, I will suggest this is a bug. The point of attempting nodes being
> future is that it won't mess up when adding nodes and doing a restart etc.
> So I really would wish that reconfigure did the deed.
> So now we have that chassis in future state via the command line and in the
> slurm.conf as well. At some point we'll do a restart and it will take
> effect long term. Now every time we do a reconfigure, the nodes come back
> but are still in the drain state. So at least jobs won't start on those
> nodes. But they will be running the test env slurmd anyway.

Please provide a status update after the test.

Created attachment 22323 [details]
duplicated the problem with test slurm environment
We have duplicated the problem in our test env.
Please find the slurmctld.log and the slurmd.log along with a 00README file in the attachment.
Thanks
tom
(In reply to Tom Wurgler from comment #98)
> Created attachment 22323 [details]

> [2021-11-18T12:26:08.768] error: WARNING: switches lack access to 2 nodes: alnxrsch1,rdsxenhn

Please make sure to fix your topology.conf.

Which host is 10.103.142.199?

That is alnx165, the test slurm master node running the db and slurmctld.

Need to leave for the day; Bill will work with you tomorrow. Thanks
(In reply to Nate Rini from comment #99)
> > [2021-11-18T12:26:08.768] error: WARNING: switches lack access to 2 nodes: alnxrsch1,rdsxenhn
>
> Please make sure to fix your topology.conf

Please update and attach new logs once topology.conf is resynced with the hardware.

(In reply to Nate Rini from comment #103)
> Please update and attach new logs once topology.conf is resynced with the
> hardware.

I can update the topology.conf file and make those warnings go away, but I don't see the point. Is the topology.conf file so key to the operation of Slurm that having it not cover all hosts can cause this type of catastrophic failure? If so, then you should update the documentation to say that, and change it from saying WARNING to ERROR ERROR ERROR.

---

The last set of attachments that Tom sent are from our test environment, not production. I just now went through those logs and removed everything before the time of the test so that only the log messages from the test itself are shown. I'll attach those directly. Regardless, this shows that we were able to duplicate the error in our test environment with only a handful of systems.

- Bill

Created attachment 22389 [details]
Cleaned up logs from test systems, showing that we can recreate the issue.
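For context on the topology warning discussed above: it means alnxrsch1 and rdsxenhn are not listed under any switch in topology.conf. A minimal sketch of a topology.conf that covers every node (the switch names here are hypothetical, not from this site):

```
# topology.conf sketch: every node defined in slurm.conf should appear
# under some SwitchName line; otherwise slurmctld logs the "switches
# lack access" warning seen in the attached logs.
SwitchName=leaf1 Nodes=rdsxen[1,3-16]
SwitchName=leaf2 Nodes=alnxrsch1,rdsxenhn
SwitchName=spine Switches=leaf[1-2]
```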
(In reply to Bill Benedetto from comment #104)
> I can update the topology.conf file and make those warnings go away. But I
> don't see the point.

In general, we prefer to isolate issues, which means not having any warnings/errors in the logs that need not be there.

> Is the topology.conf file so key to the operation of slurm that having it
> not cover all hosts can cause this type of catastrophic failure?

When 'RoutePlugin=route/topology' is not configured and neither is TreeWidth, Slurm will default TreeWidth to 50. If a job has more than 50 nodes, then Slurm will randomly choose one slurmd (which may or may not be in the job) to act as part of the message tree. In this case, it will not use nodes that lack switch access. This is not a critical issue, but we would rather it not be there to avoid any surprises.

> error: Node alnx101 appears to have a different slurm.conf than the slurmctld.

However, this error may cause jobs on the listed nodes to be misplaced. slurmctld will still place jobs in this case (this changes in the 21.08 release) and may result in the job getting more resources allocated than are available. This is unrelated to the current issue, though.

> Regardless, this shows that we were able to duplicate the error in our test
> environment with only a handful of systems.

> [316.2] error: *** STEP 316.2 ON rdsxen16 CANCELLED AT 2021-11-18T16:23:32 ***

The job was canceled by a user. There are no relevant errors in the slurmd log, such as a NODE_FAIL event. We will need a higher level of debug logging from the job and slurmd. Please add this argument to the srun call in the job:
> --slurmd-debug=debug3

or set this in slurm.conf on the test node and restart slurmd to activate the log change:
> SlurmdDebug=debug3

Please add this argument to the srun call in either case:
> -vvvvvvv

Please attach the slurmd log and the srun log from the job when the issue replicates.

Any updates?

There haven't been any updates in a month. I'm going to time this ticket out.
Please reply and we can continue debugging.

Thanks,
--Nate
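For reference, the debug settings Nate requested, collected in one sketch (the actual srun arguments of the test job are not shown in this ticket):

```
# Per-job: add verbose client logging and elevated slurmd log forwarding
# to the srun call inside the batch script.
srun -vvvvvvv --slurmd-debug=debug3 <usual srun arguments>

# Alternatively, node-wide: set "SlurmdDebug=debug3" in slurm.conf on the
# test node and restart slurmd to raise the daemon's log level for all jobs.
```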