Ticket 16331 - "slurmstepd: error: _get_joules_task: can't get info from slurmd" on Some nodes after upgrade to 23.02
Summary: "slurmstepd: error: _get_joules_task: can't get info from slurmd" on Some nod...
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 23.02.0
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-03-21 05:10 MDT by Ahmed Elmazaty
Modified: 2023-03-29 12:07 MDT

See Also:
Site: EBI
Version Fixed: 23.02.2


Attachments
hl-codon-04-02 slurmd.log (316.32 KB, text/plain)
2023-03-21 08:10 MDT, Ahmed Elmazaty
hl-codon-37-02 slurmd.log (458.82 KB, text/plain)
2023-03-22 08:50 MDT, Ahmed Elmazaty
bug16331_2302_test_v1.patch (1.10 KB, patch)
2023-03-23 07:27 MDT, Felip Moll
hl-codon-37-02 slurmd.log after applying patch (47.83 KB, text/plain)
2023-03-23 09:04 MDT, Ahmed Elmazaty

Description Ahmed Elmazaty 2023-03-21 05:10:18 MDT
Hello,

We have two types of nodes in our cluster: Dell and Gigabyte
We use "AcctGatherEnergyType=acct_gather_energy/ipmi" in our slurm.conf.

Everything was working fine in 22.05.6.

Since we upgraded to 23.02.0, the Dell nodes still work fine, but the Gigabyte nodes report many errors with each submitted job.

slurmstepd: error: _get_joules_task: can't get info from slurmd
slurmstepd: error: slurm_get_node_energy: Zero Bytes were transmitted or received


On a Gigabyte node:
[root@codon-dm-08 ~]# ipmitool mc info
Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 12.49
IPMI Version              : 2.0
Manufacturer ID           : 15370
Manufacturer Name         : Unknown (0x3C0A)
Product ID                : 1882 (0x075a)
Product Name              : Unknown (0x75A)
Device Available          : yes
Provides Device SDRs      : yes
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    IPMB Event Generator
    Chassis Device
Aux Firmware Rev Info     :
    0x06
    0x00
    0x00
    0x00

Can you please help with this?

Thanks
Ahmed
Comment 1 Felip Moll 2023-03-21 07:32:20 MDT
(In reply to Ahmed Elmazaty from comment #0)

Hi Ahmed,

Can I see a slurmd log from these nodes?
Can I also see how you start slurmd?


Thanks
Comment 2 Ahmed Elmazaty 2023-03-21 08:10:33 MDT
Created attachment 29436 [details]
hl-codon-04-02 slurmd.log
Comment 3 Ahmed Elmazaty 2023-03-21 08:13:22 MDT
Hi Felip,

I've attached slurmd.log for one of the affected nodes.
I've commented out "AcctGatherEnergyType=acct_gather_energy/ipmi" for now to avoid getting more errors.


Here is our systemd unit file for starting slurmd on compute nodes:

[root@hl-codon-04-02 ~]# cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network-online.target remote-fs.target
Wants=network-online.target


[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
EnvironmentFile=-/etc/default/slurmd
ExecStart=/ebi/slurm/codon/install/slurm-23.02.0/sbin/slurmd -D -s $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity

# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null

[Install]
WantedBy=multi-user.target
Comment 4 Felip Moll 2023-03-21 09:07:51 MDT
From the NHC output we can see that the IPMI interface returns no data:

[2023-03-21T09:21:05.931] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  IPMI: No data available
Get Device ID command failed
Get Chassis Power Status failed: Invalid command
[2023-03-21T09:21:10.371] error: Can't get energy data. No power sensors are available. Try later.

I see no NHC errors after you started 23.02.0, but I am guessing NHC is not configured for this version, while it was for 22.05.

[2023-03-21T09:00:45.114] slurmd version 23.02.0 started

From the Slurm plugin we see that the power sensor is not found (if IPMI is not working, the sensor cannot be found):

[2023-03-21T08:45:14.563] acct_gather_energy/ipmi: _find_power_sensor: Power sensor not found.
[2023-03-21T08:45:14.563] error: ipmi_monitoring_sensor_readings_by_record_id: invalid parameters

Can you check whether IPMI is working on the node?
Comment 6 Ahmed Elmazaty 2023-03-21 09:26:20 MDT
Hi Felip,

IPMI is working on the node.

[root@hl-codon-04-02 ~]# ipmitool mc info
Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 12.60
IPMI Version              : 2.0
Manufacturer ID           : 15370
Manufacturer Name         : Unknown (0x3C0A)
Product ID                : 1882 (0x075a)
Product Name              : Unknown (0x75A)
Device Available          : yes
Provides Device SDRs      : yes
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    IPMB Event Generator
    Chassis Device
Aux Firmware Rev Info     :
    0x1b
    0x00
    0x00
    0x00


The NHC errors in the logs appeared when I tried a cold reset of the BMC to check whether that would fix the issue. However, it didn't help.

Best regards,
Ahmed
Comment 7 Felip Moll 2023-03-21 09:48:22 MDT
> The NHC errors in the logs appeared when I tried a cold reset of the BMC to
> check whether that would fix the issue. However, it didn't help.
> 
> Best regards,
> Ahmed

Can you query the sensor list?

ipmitool sdr list

Can you also upload here your acct_gather.conf?

Thanks
Comment 8 Ahmed Elmazaty 2023-03-21 09:55:46 MDT
(In reply to Felip Moll from comment #7)
> Can you query the sensor list?
> 
> ipmitool sdr list
[root@hl-codon-04-02 ~]# ipmitool sdr list
Watchdog         | 0x00              | ok
SEL              | 0x00              | ok
CPU0_Status      | 0x00              | ok
CPU1_Status      | 0x00              | ok
CPU0_TEMP        | 35 degrees C      | ok
CPU1_TEMP        | 33 degrees C      | ok
DIMMG0_TEMP      | 30 degrees C      | ok
DIMMG1_TEMP      | 32 degrees C      | ok
DIMMG2_TEMP      | 28 degrees C      | ok
DIMMG3_TEMP      | 29 degrees C      | ok
CPU0_DTS         | 63 degrees C      | ok
CPU1_DTS         | 65 degrees C      | ok
GPU_PROC         | no reading        | ns
M2_G1_AMB_TEMP   | no reading        | ns
MB_TEMP1         | 34 degrees C      | ok
MB_TEMP2         | 35 degrees C      | ok
NVMeG0_TEMP      | no reading        | ns
OCP20_TEMP       | no reading        | ns
PCIE_TEMP        | 47 degrees C      | ok
PCH_TEMP         | 36 degrees C      | ok
P_12V            | 11.77 Volts       | ok
P_1V05_AUX_PCH   | 1.02 Volts        | ok
P_1V8_AUX_PCH    | 1.77 Volts        | ok
P_3V3            | 3.30 Volts        | ok
P_5V             | 5.09 Volts        | ok
P_5V_STBY        | 5.06 Volts        | ok
P_VBAT           | 3.01 Volts        | ok
P_VCCIN_CPU0     | 1.75 Volts        | ok
P_VCCIN_CPU1     | 1.75 Volts        | ok
P_VCCIO_P0       | 0.97 Volts        | ok
P_VCCIO_P1       | 0.97 Volts        | ok
P_VNN_PCH_AUX    | 0.97 Volts        | ok
VR_P0_TEMP       | 32 degrees C      | ok
VR_P1_TEMP       | 31 degrees C      | ok
VR_DIMMG0_TEMP   | 33 degrees C      | ok
VR_DIMMG1_TEMP   | 33 degrees C      | ok
VR_DIMMG2_TEMP   | 27 degrees C      | ok
VR_DIMMG3_TEMP   | 30 degrees C      | ok
VR_P0_VOUT       | 1.82 Volts        | ok
VR_P1_VOUT       | 1.82 Volts        | ok
VR_DIMMG0_VOUT   | 1.25 Volts        | ok
VR_DIMMG1_VOUT   | 1.25 Volts        | ok
VR_DIMMG2_VOUT   | 1.26 Volts        | ok
VR_DIMMG3_VOUT   | 1.25 Volts        | ok


> 
> Can you also upload here your acct_gather.conf?

We do not have an acct_gather.conf.

Comment 9 Felip Moll 2023-03-21 10:37:45 MDT
There's no power sensor in this list, so that's probably the reason why Slurm cannot find it. How do you get the power readings for this node?

Is the sensor in this node a DCMI one? Please run:

ipmi-dcmi --get-system-power-statistics
ipmitool sdr type "Power Supply"

Please, also compare the sdr list with a node which is working well.
Comment 10 Ahmed Elmazaty 2023-03-21 10:51:46 MDT
(In reply to Felip Moll from comment #9)
> There's no power sensor in this list, so that's probably the reason why
> Slurm cannot find it. How do you get the power readings for this node?
> 
> Is the sensor in this node a DCMI one? Please run:
> 
> ipmi-dcmi --get-system-power-statistics
> ipmitool sdr type "Power Supply"

[root@hl-codon-04-02 ~]# ipmi-dcmi --get-system-power-statistics
Current Power                        : 108 Watts
Minimum Power over sampling duration : 52 watts
Maximum Power over sampling duration : 348 watts
Average Power over sampling duration : 126 watts
Time Stamp                           : 03/21/2023 - 16:40:47
Statistics reporting time period     : 895519704 milliseconds
Power Measurement                    : Active
[root@hl-codon-04-02 ~]# ipmitool sdr type "Power Supply"
The second command returns nothing.
> 
> Please, also compare the sdr list with a node which is working well.

The sdr list is much longer on the Dell nodes.
On a Dell node:
[root@hl-codon-100-01 ~]# ipmitool sdr type "Power Supply"
PSU Mismatch     | 17h | ns  | 144.96 | Disabled
Status           | 52h | ok  | 10.1 | Presence detected
Status           | 53h | ok  | 10.2 | Presence detected
PSU Redundancy   | 18h | ok  | 144.96 | Fully Redundant
Comment 11 Felip Moll 2023-03-21 11:02:43 MDT
(In reply to Ahmed Elmazaty from comment #10)

So here's the issue: these Gigabyte nodes expose a DCMI power sensor instead of regular power-supply sensors like the Dell nodes.

In bug 9629 some work was done to add support for DCMI sensors in Slurm. Fortunately it landed in the 23.02 release, so the only thing you should need is to configure acct_gather.conf with:

EnergyIPMIPowerSensors=Node=DCMI

If you are not interested in gathering power metrics, then commenting out the ipmi plugin, as you have already done, is the workaround to remove the error on these nodes.

Please read the acct_gather.conf man page for more information about this setting.
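
For reference, a minimal sketch of what the (previously non-existent) acct_gather.conf could contain for these nodes. EnergyIPMIFrequency is optional and shown only as an illustrative sampling interval; the line that matters for this ticket is the last one:

# acct_gather.conf
EnergyIPMIFrequency=30
EnergyIPMIPowerSensors=Node=DCMI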
Comment 12 Ahmed Elmazaty 2023-03-22 02:48:01 MDT
Thanks a lot Felip!
Setting EnergyIPMIPowerSensors to Node=DCMI does indeed seem to fix the issue.
Thanks again for your help!
Best regards,
Ahmed
Comment 13 Felip Moll 2023-03-22 03:18:40 MDT
I'm glad it helped!

Best regards,

Resolving the issue.
Comment 14 Ahmed Elmazaty 2023-03-22 05:56:17 MDT
Hi Felip, 
Opening this ticket again.
I didn't notice these errors at first because they don't interrupt jobs. However, since this change was applied, the log files on all nodes contain the following errors.

[2023-03-22T08:41:59.980] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-22T08:42:29.980] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-22T08:42:59.981] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-22T08:43:29.980] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-22T08:43:59.980] error: _get_dcmi_power_reading: IPMI DCMI context not initialized

Best regards,
Ahmed
Comment 15 Felip Moll 2023-03-22 06:56:28 MDT
(In reply to Ahmed Elmazaty from comment #14)

Hi Ahmed,

I understand you set this parameter *only* on the nodes which have DCMI, and that you are receiving this error on those nodes.

Also, did you restart slurmd on these nodes?
Comment 16 Ahmed Elmazaty 2023-03-22 07:00:56 MDT
(In reply to Felip Moll from comment #15)
> (In reply to Ahmed Elmazaty from comment #14)
> > Hi Felip, 
> > Opening this ticket again.
> > I didn't notice these errors because they didn't interrupt jobs. However log
> > files on all nodes contain the following errors since this change has been
> > applied.
> > 
> > [2023-03-22T08:41:59.980] error: _get_dcmi_power_reading: IPMI DCMI context
> > not initialized
> > [2023-03-22T08:42:29.980] error: _get_dcmi_power_reading: IPMI DCMI context
> > not initialized
> > [2023-03-22T08:42:59.981] error: _get_dcmi_power_reading: IPMI DCMI context
> > not initialized
> > [2023-03-22T08:43:29.980] error: _get_dcmi_power_reading: IPMI DCMI context
> > not initialized
> > [2023-03-22T08:43:59.980] error: _get_dcmi_power_reading: IPMI DCMI context
> > not initialized
> > 
> > Best regards,
> > Ahmed
> 
> Hi Ahmed,
> 
> I understand you set this parameter *only* in the nodes which have DCMI and
> that you are receiving this error in such nodes.
> 
> Also, did you restart slurmd on these nodes?

Hi Felip,

/etc/slurm is on a shared filesystem that is available on all nodes.
The error is generated on all nodes as well, both Gigabyte and Dell.

I restarted slurmd on all nodes after applying the change.
Comment 17 Felip Moll 2023-03-22 07:03:36 MDT
(In reply to Ahmed Elmazaty from comment #16)
> (In reply to Felip Moll from comment #15)
> > (In reply to Ahmed Elmazaty from comment #14)
> > > Hi Felip, 
> > > Opening this ticket again.
> > > I didn't notice these errors because they didn't interrupt jobs. However log
> > > files on all nodes contain the following errors since this change has been
> > > applied.
> > > 
> > > [2023-03-22T08:41:59.980] error: _get_dcmi_power_reading: IPMI DCMI context
> > > not initialized
> > > [2023-03-22T08:42:29.980] error: _get_dcmi_power_reading: IPMI DCMI context
> > > not initialized
> > > [2023-03-22T08:42:59.981] error: _get_dcmi_power_reading: IPMI DCMI context
> > > not initialized
> > > [2023-03-22T08:43:29.980] error: _get_dcmi_power_reading: IPMI DCMI context
> > > not initialized
> > > [2023-03-22T08:43:59.980] error: _get_dcmi_power_reading: IPMI DCMI context
> > > not initialized
> > > 
> > > Best regards,
> > > Ahmed
> > 
> > Hi Ahmed,
> > 
> > I understand you set this parameter *only* in the nodes which have DCMI and
> > that you are receiving this error in such nodes.
> > 
> > Also, did you restart slurmd on these nodes?
> 
> Hi Felip,
> 
> /etc/slurm is shared FS that is available on all nodes.
> The error is generated on all nodes as well. Both GBs and Dells.
> 
> I've restarted slurmd on all nodes after applying the change

Ah, that's an issue, because the Dell nodes don't need this setting. They can still work with it if they also have DCMI sensors. Can you check that?

ipmi-dcmi --get-system-power-statistics
Comment 18 Ahmed Elmazaty 2023-03-22 07:52:40 MDT
(In reply to Felip Moll from comment #17)

The command also works on the Dell nodes:

[root@hl-codon-100-01 ~]# ipmi-dcmi --get-system-power-statistics
Current Power                        : 175 Watts
Minimum Power over sampling duration : 5 watts
Maximum Power over sampling duration : 540 watts
Average Power over sampling duration : 194 watts
Time Stamp                           : 03/22/2023 - 13:51:49
Statistics reporting time period     : 1000 milliseconds
Power Measurement                    : Active
Comment 19 Felip Moll 2023-03-22 08:18:24 MDT
(In reply to Ahmed Elmazaty from comment #18)

Ok, that's good.

I will need the following:

Enable this on one node (just change slurm.conf and restart a specific slurmd):

SlurmdDebug=debug2
DebugFlags=Energy

Then run a job on this node and check whether the error still shows up. Then send me the slurmd log of the node. Afterwards you can restore these two parameters if you want.

I will investigate why it is happening.

Does a 'scontrol show node' show the power?
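
A minimal sketch of the requested change, assuming hl-codon-37-02 is the node used for the test. In slurm.conf:

SlurmdDebug=debug2
DebugFlags=Energy

Then on that node:

systemctl restart slurmd
srun -w hl-codon-37-02 sleep 60
scontrol show node hl-codon-37-02 | grep -i watts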
Comment 20 Ahmed Elmazaty 2023-03-22 08:50:22 MDT
(In reply to Felip Moll from comment #19)
> 
> Ok, that's good.
> 
> I will need the following:
> 
> Enable this on one node (just change slurm.conf and restart a specific
> slurmd):
> 
> SlurmdDebug=debug2
> DebugFlags=Energy
> 
> Then run a job on this node and check whether the error still shows up. Then
> send me the slurmd log of the node. Afterwards you can restore these two
> parameters if you want.
Yes, the error still appears in the logs:

[2023-03-22T14:43:20.786] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-22T14:43:24.226] error: _get_dcmi_power_reading: IPMI DCMI context not initialized

I've attached the log file

> 
> I will investigate why it is happening.
> 
> Does a 'scontrol show node' show the power?
Yes it does. 
# scontrol show node hl-codon-37-02
NodeName=hl-codon-37-02 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUEfctv=48 CPUTot=48 CPULoad=0.59
   AvailableFeatures=codon,gigabyte,local_170g,nogpu,nobmem,intel,cascadelake
   ActiveFeatures=codon,gigabyte,local_170g,nogpu,nobmem,intel,cascadelake
   Gres=(null)
   NodeAddr=hl-codon-37-02 NodeHostName=hl-codon-37-02 Version=23.02.0
   OS=Linux 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Thu Mar 10 20:59:28 UTC 2022
   RealMemory=380000 AllocMem=0 FreeMem=94243 Sockets=2 Boards=1
   CPUSpecList=
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=50 Owner=N/A MCS_label=N/A
   Partitions=ALL,standard
   BootTime=2023-03-09T12:40:18 SlurmdStartTime=2023-03-22T14:44:49
   LastBusyTime=2023-03-22T14:43:29 ResumeAfterTime=None
   CfgTRES=cpu=48,mem=380000M,billing=48
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=80 AveWatts=81
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment 21 Ahmed Elmazaty 2023-03-22 08:50:54 MDT
Created attachment 29462 [details]
hl-codon-37-02 slurmd.log
Comment 22 Felip Moll 2023-03-23 06:37:44 MDT
Ahmed,

Are you able to apply a small test patch to slurmd on one of the nodes?
Comment 25 Ahmed Elmazaty 2023-03-23 07:16:13 MDT
(In reply to Felip Moll from comment #22)
> Ahmed,
> 
> Are you able to apply a small test patch to slurmd on one of the nodes?

Hi Felip,
Yeah, sure.
Let me know how we can proceed.
Comment 26 Felip Moll 2023-03-23 07:27:01 MDT
Created attachment 29475 [details]
bug16331_2302_test_v1.patch

Ahmed,

The idea is to apply this patch to slurmd. You need to download it into the Slurm source code directory and apply it, then compile and install Slurm on the node. It is a minor patch which tries to initialize the DCMI context when it is not yet initialized. I think there is a bug: the context is not shared among threads and we are not initializing it in new threads, so I added an initialization of this context to the _thread_init() function.

1. Download this patch in slurm source code directory
2. Apply the patch with "git am -3 < bug16331_2302_test_v1.patch"
3. Compile.
4. Pick a node and set it to DRAIN and put it into a reservation. Ensure there are no jobs running.
5. Stop slurmd and install the new build on this node.
6. Start slurmd, check the logs for any errors. If everything seems happy, undrain the node.
7. Send a job to the reserved node. (A 'srun -w <the node> sleep 1000' should suffice).
8. Send me the slurmd logs.

Please ensure that SlurmdDebug=debug2 and DebugFlags=Energy are still set when you start slurmd.

Thanks
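
A sketch of steps 4, 6 and 7 using scontrol and srun, assuming hl-codon-37-02 is the chosen node and "bug16331" is an arbitrary reservation name:

scontrol update NodeName=hl-codon-37-02 State=DRAIN Reason="bug16331 test"
scontrol create reservation ReservationName=bug16331 Nodes=hl-codon-37-02 Users=root StartTime=now Duration=04:00:00 Flags=MAINT,IGNORE_JOBS
# after installing and starting the patched slurmd, undrain the node:
scontrol update NodeName=hl-codon-37-02 State=RESUME
srun --reservation=bug16331 -w hl-codon-37-02 sleep 1000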
Comment 27 Ahmed Elmazaty 2023-03-23 07:36:20 MDT
(In reply to Felip Moll from comment #26)

Hi Felip,

By "compile", you mean run configure, make and make install?
If so, shall I use the the same "prefix" I use for my current installation? Or I need to install it in another location?

Best regards,
Ahmed
Comment 28 Felip Moll 2023-03-23 07:39:07 MDT
Yes, you should use exactly the same prefix. You can even reuse the old source directory if you kept it, and in that case you could even skip the configure step.
Comment 29 Ahmed Elmazaty 2023-03-23 07:51:36 MDT
(In reply to Felip Moll from comment #28)

Thanks,
Will other nodes be affected? We build Slurm in a shared location.
Comment 30 Felip Moll 2023-03-23 08:09:18 MDT
(In reply to Ahmed Elmazaty from comment #29)

Well, if the binaries are in a shared location, yes.

Ideally you should change only that specific node. In that case my suggestion is to use a different prefix and start slurmd from that new location only on this node.
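
A rough sketch of building into a separate test prefix so the shared installation stays untouched, assuming the patch from comment 26 has already been applied to the source tree. The prefix below is only an example; the same configure options as the production build should be used:

cd slurm-23.02.0
./configure --prefix=/ebi/slurm/codon/install/slurm-23.02.0-bug16331 <same options as the production build>
make -j
make install
# then, on the test node only:
systemctl stop slurmd
/ebi/slurm/codon/install/slurm-23.02.0-bug16331/sbin/slurmd -D -s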
Comment 31 Ahmed Elmazaty 2023-03-23 08:12:53 MDT
I am getting this error while trying to apply the patch to the source directory.
I used "wget" to download the source.

# git am -3 < bug16331_2302_test_v1.patch
fatal: not a git repository (or any parent up to mount point /ebi)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Comment 32 Felip Moll 2023-03-23 08:37:55 MDT
(In reply to Ahmed Elmazaty from comment #31)

Ah, if you used wget and not a git clone, just apply it with the patch tool:

$ patch -p1 < bug16331_2302_test_v1.patch
Comment 33 Ahmed Elmazaty 2023-03-23 09:04:53 MDT
Created attachment 29478 [details]
hl-codon-37-02 slurmd.log after applying patch
Comment 34 Felip Moll 2023-03-23 15:40:09 MDT
(In reply to Ahmed Elmazaty from comment #33)
> Created attachment 29478 [details]
> hl-codon-37-02 slurmd.log after applying patch

If I am not wrong, after you restarted here:

[2023-03-23T14:48:17.396] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-23T14:48:18.408] [751034.batch] private-tmpdir: removed /slurm/751034.batch.0 (5 files) in 0.000195 seconds
[2023-03-23T14:48:18.414] [751034.batch] done with job
[2023-03-23T14:53:34.635] Slurmd shutdown completing

No more errors have shown up, right? It does seem that the patch fixes the issue, do you agree?
Comment 35 Ahmed Elmazaty 2023-03-24 03:18:50 MDT
(In reply to Felip Moll from comment #34)

Hi Felip,

True. Checking the log files, no more errors have been reported regarding this issue.
So what's next? Shall we migrate all other nodes (including the controllers) to this new installation? Or is it better to wait until it's added to one of your upcoming releases?

Thanks
Comment 36 Felip Moll 2023-03-27 05:51:57 MDT
(In reply to Ahmed Elmazaty from comment #35)

Good. 

The ideal plan would be: first, revert the installation on this node. Then I will prepare a patch, have it reviewed by our QA team, and make it available in the next Slurm release. You can then install that release when it comes out.

Thanks for testing!
Comment 39 Felip Moll 2023-03-29 12:07:57 MDT
Hi Ahmed,

The fix has been accepted and will be included in 23.02.2, the next upcoming release. 23.02.1 was released yesterday, but the patch didn't make it in time.

I'd suggest waiting until 23.02.2. In the meantime you can disable the ipmi plugin on the DCMI nodes or apply this patch manually to all of your slurmd installations.

commit d37eefe81dac1e2aa0b030c3de1dca41b787504d
Author: Felip Moll <felip.moll@schedmd.com>
Date:   Thu Mar 23 14:18:43 2023 +0100

    Fix IPMI DCMI sensor initialization
    
    The ipmi context needs to be unique to each thread because of its internal
    references to memory, this means each thread must initialize a new context.
    
    This is the same that ipmi_monitoring library does when using freeipmi.
    
    Bug 16331
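
A sketch of picking up just this commit in a git checkout of Slurm, assuming a clone of the SchedMD slurm repository with the slurm-23.02 branch and that the commit above is already reachable from it:

git fetch origin
git cherry-pick d37eefe81dac1e2aa0b030c3de1dca41b787504d

For a tarball-based source tree, the same change can be applied with the bug16331_2302_test_v1.patch attachment and patch -p1, as was done on the test node.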

I am resolving the bug.

Thanks for reporting!