Ticket 16577

Summary: Health Check failure not marking node offline.
Product: Slurm Reporter: Brad Viviano <viviano.brad>
Component: Regression    Assignee: Director of Support <support>
Status: RESOLVED INVALID QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 23.02.1   
Hardware: Linux   
OS: Linux   
Site: EPA

Description Brad Viviano 2023-04-24 07:17:19 MDT
Hello,
   We're in the process of upgrading Slurm from 21.08 to 23.02 and during testing we're noticing that our HealthCheckProgram isn't marking a node offline when there is an error, as it does under 21.08:

[root@a0n14 ~]# scontrol --version
slurm 23.02.1

[root@a0n14 ~]# grep Health /var/spool/slurm/conf-cache/slurm.conf
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=30
HealthCheckNodeState=ANY

[root@a0n14 ~]# /usr/sbin/nhc
ERROR:  nhc:  Health check failed:  check_fs_mount:  /work not mounted

[root@a0n14 ~]# tail /var/log/slurm/slurmd 
[2023-04-24T09:04:38.878] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  check_fs_mount:  /work not mounted
[2023-04-24T09:05:08.898] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  check_fs_mount:  /work not mounted
[2023-04-24T09:05:38.920] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  check_fs_mount:  /work not mounted
[2023-04-24T09:06:08.939] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  check_fs_mount:  /work not mounted
[2023-04-24T09:06:12.769] error: Unable to register: Unable to contact slurm controller (connect failure)

[root@a0n14 ~]# scontrol show node a0n14
NodeName=a0n14 CoresPerSocket=16 
   CPUAlloc=0 CPUEfctv=32 CPUTot=32 CPULoad=0.08
   AvailableFeatures=debug,broadwell
   ActiveFeatures=debug,broadwell
   Gres=(null)
   NodeAddr=a0n14 NodeHostName=a0n14 
   RealMemory=257000 AllocMem=0 FreeMem=252343 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=200 Owner=N/A MCS_label=N/A
   Partitions=debug 
   BootTime=None SlurmdStartTime=None
   LastBusyTime=2023-04-24T08:33:04 ResumeAfterTime=None
   CfgTRES=cpu=32,mem=257000M,billing=32
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


Under 21.08 we don't have this problem:

[root@a0n12 ~]# scontrol --version
slurm 21.08.7

[root@a0n12 ~]# grep Health /etc/slurm/slurm.conf
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=30
HealthCheckNodeState=ANY

[root@a0n12 ~]# tail /var/log/slurmd 
[2023-04-24T09:04:02.595] error: /usr/sbin/nhc: exited with status 0x0100
[2023-04-24T09:10:33.145] error: /usr/sbin/nhc: exited with status 0x0100

[root@a0n12 ~]# /usr/sbin/nhc
ERROR:  nhc:  Health check failed:  check_fs_mount:  /work not mounted

[root@a0n12 ~]# scontrol show node a0n12
NodeName=a0n12 Arch=x86_64 CoresPerSocket=16 
   CPUAlloc=0 CPUTot=32 CPULoad=0.01
   AvailableFeatures=debug,broadwell
   ActiveFeatures=debug,broadwell
   Gres=(null)
   NodeAddr=a0n12 NodeHostName=a0n12 Version=21.08.7
   OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Tue Jul 26 14:15:37 UTC 2022 
   RealMemory=257000 AllocMem=0 FreeMem=247243 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=200 Owner=N/A MCS_label=N/A
   Partitions=admin,scavenger,debug 
   BootTime=2023-02-11T09:10:24 SlurmdStartTime=2023-02-11T09:13:29
   LastBusyTime=2023-04-24T09:04:33
   CfgTRES=cpu=32,mem=257000M,billing=32
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=NHC: check_fs_mount:  /work not mounted [root@2023-04-24T09:10:32]



We are switching from a config-based to a configless setup with 23.02, but otherwise the slurm.conf is essentially the same.  Please advise if there is any known issue with the HealthCheckProgram and 23.02.
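For context on the behavior being compared: in the working 21.08 output above, the drain Reason ("NHC: check_fs_mount: ...") appears to be set by the health check script itself rather than by slurmd, which only logs the nonzero exit. A minimal sketch of that pattern is below; drain_if_failed is a hypothetical helper, and SCONTROL is made overridable purely so the generated command can be inspected without contacting a controller:

```shell
# Sketch of a health-check wrapper that drains its own node on failure.
# Assumes scontrol is in PATH; SCONTROL is overridable for inspection.
SCONTROL=${SCONTROL:-scontrol}

# If a check reported failure (rc != 0), drain the node with a reason so
# that 'scontrol show node' records why it went offline.
drain_if_failed() {
    rc=$1
    reason=$2
    node=${3:-$(hostname -s)}
    if [ "$rc" -ne 0 ]; then
        "$SCONTROL" update NodeName="$node" State=DRAIN Reason="$reason"
    fi
    return "$rc"
}
```

A check would then be invoked as, e.g., `grep -q " /work " /proc/mounts; drain_if_failed $? "NHC: /work not mounted"`, so the node is drained with the failing check's message as the Reason.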
Comment 1 Brad Viviano 2023-04-24 07:38:46 MDT
Never mind, I think I have a bug in my script that I need to figure out.