Hello,

We have two types of nodes in our cluster: Dell and Gigabyte. We use "AcctGatherEnergyType=acct_gather_energy/ipmi" in our slurm.conf.

Everything was working fine in 22.05.6. After we upgraded to 23.02.0, the Dell nodes still work fine, but the Gigabyte nodes report lots of errors with each submitted job:

slurmstepd: error: _get_joules_task: can't get info from slurmd
slurmstepd: error: slurm_get_node_energy: Zero Bytes were transmitted or received

On a Gigabyte node:

[root@codon-dm-08 ~]# ipmitool mc info
Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 12.49
IPMI Version              : 2.0
Manufacturer ID           : 15370
Manufacturer Name         : Unknown (0x3C0A)
Product ID                : 1882 (0x075a)
Product Name              : Unknown (0x75A)
Device Available          : yes
Provides Device SDRs      : yes
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    IPMB Event Generator
    Chassis Device
Aux Firmware Rev Info     :
    0x06
    0x00
    0x00
    0x00

Can you please help with this?

Thanks,
Ahmed
(In reply to Ahmed Elmazaty from comment #0)

Hi Ahmed,

Can I see a slurmd log from these nodes? Can you also show me how you start slurmd?

Thanks
Created attachment 29436 [details] hl-codon-04-02 slurmd.log
Hi Felip,

I've attached the slurmd.log for one of the affected nodes. I've commented out "AcctGatherEnergyType=acct_gather_energy/ipmi" for now to avoid getting more errors.

Here is our systemd unit file for starting slurmd on compute nodes:

[root@hl-codon-04-02 ~]# cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network-online.target remote-fs.target
Wants=network-online.target

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
EnvironmentFile=-/etc/default/slurmd
ExecStart=/ebi/slurm/codon/install/slurm-23.02.0/sbin/slurmd -D -s $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity
# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null

[Install]
WantedBy=multi-user.target
From NHC we see the IPMI interface returns no data:

[2023-03-21T09:21:05.931] error: health_check failed: rc:1 output:ERROR: nhc: Health check failed: IPMI: No data available Get Device ID command failed Get Chassis Power Status failed: Invalid command
[2023-03-21T09:21:10.371] error: Can't get energy data. No power sensors are available. Try later.

I see no NHC errors after you started 23.02.0, but I am guessing NHC is not configured for this version even though it was for 22.05.

[2023-03-21T09:00:45.114] slurmd version 23.02.0 started

From the Slurm plugin we see the power sensor is not found (if IPMI is not working, then the sensor cannot be found):

[2023-03-21T08:45:14.563] acct_gather_energy/ipmi: _find_power_sensor: Power sensor not found.
[2023-03-21T08:45:14.563] error: ipmi_monitoring_sensor_readings_by_record_id: invalid parameters

Can you check whether IPMI is working on your node?
Hi Felip,

IPMI is working on the node:

[root@hl-codon-04-02 ~]# ipmitool mc info
Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 12.60
IPMI Version              : 2.0
Manufacturer ID           : 15370
Manufacturer Name         : Unknown (0x3C0A)
Product ID                : 1882 (0x075a)
Product Name              : Unknown (0x75A)
Device Available          : yes
Provides Device SDRs      : yes
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    IPMB Event Generator
    Chassis Device
Aux Firmware Rev Info     :
    0x1b
    0x00
    0x00
    0x00

The NHC errors in the logs were from when I tried to cold reset the BMC to check if that would fix the issue. However, it didn't help.

Best regards,
Ahmed
> The NHC errors in the logs were from when I tried to cold reset the BMC to
> check if that would fix the issue. However, it didn't help.

Can you query the sensor list?

ipmitool sdr list

Can you also upload your acct_gather.conf here?

Thanks
(In reply to Felip Moll from comment #7)
> Can you query the sensor list?
>
> ipmitool sdr list

[root@hl-codon-04-02 ~]# ipmitool sdr list
Watchdog         | 0x00            | ok
SEL              | 0x00            | ok
CPU0_Status      | 0x00            | ok
CPU1_Status      | 0x00            | ok
CPU0_TEMP        | 35 degrees C    | ok
CPU1_TEMP        | 33 degrees C    | ok
DIMMG0_TEMP      | 30 degrees C    | ok
DIMMG1_TEMP      | 32 degrees C    | ok
DIMMG2_TEMP      | 28 degrees C    | ok
DIMMG3_TEMP      | 29 degrees C    | ok
CPU0_DTS         | 63 degrees C    | ok
CPU1_DTS         | 65 degrees C    | ok
GPU_PROC         | no reading      | ns
M2_G1_AMB_TEMP   | no reading      | ns
MB_TEMP1         | 34 degrees C    | ok
MB_TEMP2         | 35 degrees C    | ok
NVMeG0_TEMP      | no reading      | ns
OCP20_TEMP       | no reading      | ns
PCIE_TEMP        | 47 degrees C    | ok
PCH_TEMP         | 36 degrees C    | ok
P_12V            | 11.77 Volts     | ok
P_1V05_AUX_PCH   | 1.02 Volts      | ok
P_1V8_AUX_PCH    | 1.77 Volts      | ok
P_3V3            | 3.30 Volts      | ok
P_5V             | 5.09 Volts      | ok
P_5V_STBY        | 5.06 Volts      | ok
P_VBAT           | 3.01 Volts      | ok
P_VCCIN_CPU0     | 1.75 Volts      | ok
P_VCCIN_CPU1     | 1.75 Volts      | ok
P_VCCIO_P0       | 0.97 Volts      | ok
P_VCCIO_P1       | 0.97 Volts      | ok
P_VNN_PCH_AUX    | 0.97 Volts      | ok
VR_P0_TEMP       | 32 degrees C    | ok
VR_P1_TEMP       | 31 degrees C    | ok
VR_DIMMG0_TEMP   | 33 degrees C    | ok
VR_DIMMG1_TEMP   | 33 degrees C    | ok
VR_DIMMG2_TEMP   | 27 degrees C    | ok
VR_DIMMG3_TEMP   | 30 degrees C    | ok
VR_P0_VOUT       | 1.82 Volts      | ok
VR_P1_VOUT       | 1.82 Volts      | ok
VR_DIMMG0_VOUT   | 1.25 Volts      | ok
VR_DIMMG1_VOUT   | 1.25 Volts      | ok
VR_DIMMG2_VOUT   | 1.26 Volts      | ok
VR_DIMMG3_VOUT   | 1.25 Volts      | ok

> Can you also upload your acct_gather.conf here?

We do not have an acct_gather.conf.
There's no power sensor in this list, so that's probably the reason why Slurm cannot find it. How do you get the power readings for this node?

Is the sensor in this node a DCMI one? Please run:

ipmi-dcmi --get-system-power-statistics
ipmitool sdr type "Power Supply"

Please also compare the sdr list with a node which is working well.
(In reply to Felip Moll from comment #9)
> ipmi-dcmi --get-system-power-statistics
> ipmitool sdr type "Power Supply"

[root@hl-codon-04-02 ~]# ipmi-dcmi --get-system-power-statistics
Current Power                        : 108 Watts
Minimum Power over sampling duration : 52 watts
Maximum Power over sampling duration : 348 watts
Average Power over sampling duration : 126 watts
Time Stamp                           : 03/21/2023 - 16:40:47
Statistics reporting time period     : 895519704 milliseconds
Power Measurement                    : Active

[root@hl-codon-04-02 ~]# ipmitool sdr type "Power Supply"

The second command returns nothing.

> Please also compare the sdr list with a node which is working well.

The sdr list is much longer on the Dell nodes. On a Dell node:

[root@hl-codon-100-01 ~]# ipmitool sdr type "Power Supply"
PSU Mismatch     | 17h | ns | 144.96 | Disabled
Status           | 52h | ok | 10.1   | Presence detected
Status           | 53h | ok | 10.2   | Presence detected
PSU Redundancy   | 18h | ok | 144.96 | Fully Redundant
(In reply to Ahmed Elmazaty from comment #10)

So here's the issue: these Gigabyte nodes have DCMI power sensors instead of the regular ones found on the Dell nodes.

In bug 9629 some work was done to add support for DCMI sensors in Slurm. Fortunately it made it into the 23.02 release, so the only thing you should need is to configure acct_gather.conf with:

EnergyIPMIPowerSensors=Node=DCMI

If you are not interested in getting power metrics, then commenting out the ipmi plugin is the workaround to remove the error on these nodes, as you've already done.

Please read the man page of acct_gather.conf for more information about this setting.
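For reference, a minimal acct_gather.conf sketch along these lines should be enough; the sampling frequency shown here is only an illustrative choice, not something this fix requires:

# acct_gather.conf -- sketch for nodes whose power reading comes from a DCMI sensor
EnergyIPMIPowerSensors=Node=DCMI
# Optional: how often (in seconds) the acct_gather_energy/ipmi plugin samples the sensor
EnergyIPMIFrequency=30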
Thanks a lot Felip!

Setting EnergyIPMIPowerSensors to Node=DCMI does indeed seem to fix the issue.

Thanks again for your help!

Best regards,
Ahmed
I'm glad it helped!

Best regards,

Resolving the issue.
Hi Felip,

Opening this ticket again. I didn't notice these errors earlier because they don't interrupt jobs, but since this change was applied the log files on all nodes contain the following errors:

[2023-03-22T08:41:59.980] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-22T08:42:29.980] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-22T08:42:59.981] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-22T08:43:29.980] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-22T08:43:59.980] error: _get_dcmi_power_reading: IPMI DCMI context not initialized

Best regards,
Ahmed
(In reply to Ahmed Elmazaty from comment #14)
> [2023-03-22T08:41:59.980] error: _get_dcmi_power_reading: IPMI DCMI context not initialized

Hi Ahmed,

I understand you set this parameter *only* on the nodes which have DCMI and that you are receiving this error on those nodes.

Also, did you restart slurmd on these nodes?
(In reply to Felip Moll from comment #15)

Hi Felip,

/etc/slurm is a shared filesystem available on all nodes, so the setting applies everywhere. The error is generated on all nodes as well, both Gigabyte and Dell.

I restarted slurmd on all nodes after applying the change.
(In reply to Ahmed Elmazaty from comment #16)

Ah, that's an issue, because the Dell nodes don't need this setting. They can still work with it if they also have DCMI sensors, though. Can you check that?

ipmi-dcmi --get-system-power-statistics
(In reply to Felip Moll from comment #17)
> ipmi-dcmi --get-system-power-statistics

The command works on the Dell nodes:

[root@hl-codon-100-01 ~]# ipmi-dcmi --get-system-power-statistics
Current Power                        : 175 Watts
Minimum Power over sampling duration : 5 watts
Maximum Power over sampling duration : 540 watts
Average Power over sampling duration : 194 watts
Time Stamp                           : 03/22/2023 - 13:51:49
Statistics reporting time period     : 1000 milliseconds
Power Measurement                    : Active
(In reply to Ahmed Elmazaty from comment #18)

Ok, that's good.

I will need the following. Enable this on one node (just change slurm.conf and restart that specific slurmd):

SlurmdDebug=debug2
DebugFlags=Energy

Then run a job on this node, check whether the error still shows up, and send me the slurmd log of the node. Afterwards you can restore these two parameters if you want.

I will investigate why it is happening.

Does 'scontrol show node' show the power?
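One possible sequence for a single node is sketched below; the hostname is just the one from the earlier logs and the slurmd log path is an assumption, adjust both to your site:

[root@hl-codon-37-02 ~]# systemctl restart slurmd                                    # pick up SlurmdDebug/DebugFlags
[root@hl-codon-37-02 ~]# scontrol show config | grep -Ei 'slurmddebug|debugflags'    # confirm the new settings
[root@hl-codon-37-02 ~]# srun -w hl-codon-37-02 sleep 60                             # trigger an energy-gathering job step
[root@hl-codon-37-02 ~]# grep -i dcmi /var/log/slurmd.log                            # look for the context error (log path may differ)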
(In reply to Felip Moll from comment #19)

Yes, the error still appears in the logs:

[2023-03-22T14:43:20.786] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-22T14:43:24.226] error: _get_dcmi_power_reading: IPMI DCMI context not initialized

I've attached the log file.

> Does 'scontrol show node' show the power?

Yes it does:

# scontrol show node hl-codon-37-02
NodeName=hl-codon-37-02 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUEfctv=48 CPUTot=48 CPULoad=0.59
   AvailableFeatures=codon,gigabyte,local_170g,nogpu,nobmem,intel,cascadelake
   ActiveFeatures=codon,gigabyte,local_170g,nogpu,nobmem,intel,cascadelake
   Gres=(null)
   NodeAddr=hl-codon-37-02 NodeHostName=hl-codon-37-02 Version=23.02.0
   OS=Linux 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Thu Mar 10 20:59:28 UTC 2022
   RealMemory=380000 AllocMem=0 FreeMem=94243 Sockets=2 Boards=1
   CPUSpecList=
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=50 Owner=N/A MCS_label=N/A
   Partitions=ALL,standard
   BootTime=2023-03-09T12:40:18 SlurmdStartTime=2023-03-22T14:44:49
   LastBusyTime=2023-03-22T14:43:29 ResumeAfterTime=None
   CfgTRES=cpu=48,mem=380000M,billing=48
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=80 AveWatts=81
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Created attachment 29462 [details] hl-codon-37-02 slurmd.log
Ahmed,

Are you able to apply a small test patch to slurmd on one of the nodes?
(In reply to Felip Moll from comment #22)
> Are you able to apply a small test patch to slurmd on one of the nodes?

Hi Felip,

Yeah, sure. Let me know how we can proceed.
Created attachment 29475 [details]
bug16331_2302_test_v1.patch

Ahmed,

The idea is to apply this patch to slurmd. You need to download it into the Slurm source code directory and apply it, then compile and install Slurm on the node. It is a minor patch which tries to initialize the DCMI context when it is not initialized. I think there's a bug because the context is not shared among threads and we are not initializing it in new threads, so I added an initialization of this context in the _thread_init() function.

1. Download this patch into the Slurm source code directory.
2. Apply the patch with "git am -3 < bug16331_2302_test_v1.patch".
3. Compile.
4. Pick a node, set it to DRAIN and put it into a reservation. Ensure there are no jobs running.
5. Stop slurmd and install the new build on this node.
6. Start slurmd and check the logs for any errors. If everything seems happy, undrain the node.
7. Send a job to the reserved node (an 'srun -w <the node> sleep 1000' should suffice).
8. Send me the slurmd logs (see the command sketch below for one possible sequence).

Please ensure that SlurmdDebug=debug2 and DebugFlags=Energy are still set when you start slurmd.

Thanks
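For reference, the steps above might look roughly like this on the command line. This is only a sketch: the source path, node name, reservation name and durations are placeholders, so adjust them to your site.

# 1-3: apply the patch in the source tree and rebuild with the same prefix as the current install
cd /path/to/slurm-23.02.0-source                          # placeholder path
git am -3 < bug16331_2302_test_v1.patch                   # or: patch -p1 < bug16331_2302_test_v1.patch
make -j && make install

# 4: isolate the test node
scontrol update nodename=hl-codon-37-02 state=drain reason="bug16331 DCMI test"
scontrol create reservation reservationname=bug16331 nodes=hl-codon-37-02 \
    starttime=now duration=120 users=root flags=maint     # duration in minutes

# 5-6: restart slurmd on the node so it runs the new build, then undrain
systemctl restart slurmd
scontrol update nodename=hl-codon-37-02 state=resume

# 7: run a test job inside the reservation
srun --reservation=bug16331 -w hl-codon-37-02 sleep 1000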
(In reply to Felip Moll from comment #26)

Hi Felip,

By "compile", do you mean run configure, make and make install? If so, shall I use the same "prefix" as my current installation, or do I need to install it in another location?

Best regards,
Ahmed
> By "compile", do you mean run configure, make and make install?
> If so, shall I use the same "prefix" as my current installation?

Yeah, you should use exactly the same one. You can even use the old source directory if you kept it, and in that case you could even skip the configure step.
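In other words, something like the following; this is a sketch where the prefix is taken from the slurmd.service shown earlier and the source path is a placeholder:

cd /path/to/slurm-23.02.0-source                                  # placeholder path
./configure --prefix=/ebi/slurm/codon/install/slurm-23.02.0      # skip if reusing an already-configured tree
make -j
make install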
(In reply to Felip Moll from comment #28)

Thanks. Will other nodes be affected? We build Slurm in a shared location.
(In reply to Ahmed Elmazaty from comment #29)

Well, if the binaries are in a shared location, yes. Ideally you should only change that specific node, so my suggestion is to use a different prefix for this test build and start slurmd from that new location only on this node.
I am getting this error while trying to apply it in the source directory. I used "wget" to download the source.

# git am -3 < bug16331_2302_test_v1.patch
fatal: not a git repository (or any parent up to mount point /ebi)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
(In reply to Ahmed Elmazaty from comment #31)
> # git am -3 < bug16331_2302_test_v1.patch
> fatal: not a git repository (or any parent up to mount point /ebi)

Ah, if you used wget and not a git clone, just apply it with the patch tool:

$ patch -p1 < bug16331_2302_test_v1.patch
Created attachment 29478 [details] hl-codon-37-02 slurmd.log after applying patch
(In reply to Ahmed Elmazaty from comment #33)
> Created attachment 29478 [details]
> hl-codon-37-02 slurmd.log after applying patch

If I am not wrong, after you restarted here:

[2023-03-23T14:48:17.396] error: _get_dcmi_power_reading: IPMI DCMI context not initialized
[2023-03-23T14:48:18.408] [751034.batch] private-tmpdir: removed /slurm/751034.batch.0 (5 files) in 0.000195 seconds
[2023-03-23T14:48:18.414] [751034.batch] done with job
[2023-03-23T14:53:34.635] Slurmd shutdown completing

no more errors have shown up, right? It does seem that the patch fixes the issue, do you agree?
(In reply to Felip Moll from comment #34)

Hi Felip,

True. Checking the log files, no more errors have been reported regarding this issue.

So what's next? Shall we migrate all other nodes (including the controllers) to this new installation, or is it better to wait until the fix is added to one of your upcoming releases?

Thanks
(In reply to Ahmed Elmazaty from comment #35)

Good. The ideal plan would be: first, revert the installation on this node. Then I will prepare a patch that will be reviewed by our QA team and made available in the next Slurm release, and you can install that release when it comes out.

Thanks for testing!
Hi Ahmed,

The fix has been accepted and included in 23.02.2, the next upcoming release. Yesterday 23.02.1 was released, but the patch didn't arrive in time. I'd suggest waiting for .2; in the meantime you can disable the ipmi plugin on the DCMI nodes or apply this patch manually to all of your slurmds.

commit d37eefe81dac1e2aa0b030c3de1dca41b787504d
Author: Felip Moll <felip.moll@schedmd.com>
Date:   Thu Mar 23 14:18:43 2023 +0100

    Fix IPMI DCMI sensor initialization

    The ipmi context needs to be unique to each thread because of its
    internal references to memory, this means each thread must initialize
    a new context. This is the same that ipmi_monitoring library does when
    using freeipmi.

    Bug 16331

I am resolving the bug. Thanks for reporting!
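Once 23.02.2 (or a manually patched build) is running, a quick way to confirm the fix on a DCMI node could look like the sketch below; the node name and slurmd log path are assumptions based on the earlier comments:

srun -w hl-codon-37-02 sleep 60                                       # run a short job on the DCMI node
scontrol show node hl-codon-37-02 | grep -i watts                     # CurrentWatts/AveWatts should show real values
grep -c "IPMI DCMI context not initialized" /var/log/slurmd.log       # count should stop growing (log path may differ)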