Ticket 17938 - slurmd should log _get_dcmi_power_reading errors only once at startup
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 23.02.6
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Carlos Tripiana Montes
Reported: 2023-10-18 06:08 MDT by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2023-10-31 03:08 MDT
Site: DTU Physics
Version Fixed: 23.02.7, 23.11.0rc1
Description Ole.H.Nielsen@fysik.dtu.dk 2023-10-18 06:08:11 MDT
We're testing the IPMI power monitoring in slurm.conf with AcctGatherEnergyType=acct_gather_energy/ipmi and an acct_gather.conf file:

EnergyIPMIPowerSensors=Node=DCMI
EnergyIPMIFrequency=60
EnergyIPMICalcAdjustment=yes

When we use the FreeIPMI development version 1.7.0 this works well as discussed in bug 17639.

However, we have 196 old Huawei XH620 V3 nodes whose BMC doesn't seem to support the IPMI DCMI extensions, as shown by this command:

$ ipmi-dcmi --get-system-power-statistics
ipmi_cmd_dcmi_get_power_reading: command invalid or unsupported

When we use the above acct_gather_energy/ipmi in local config files on the Huawei nodes, slurmd.log correctly logs errors:

[2023-10-18T13:34:10.185] error: _get_dcmi_power_reading: get DCMI power reading failed

The problem arises for us when slurmd keeps logging this error every EnergyIPMIFrequency=60 seconds!  It is not as if IPMI DCMI is expected to suddenly start working (it's a permanent error), so the repeated log lines quickly turn into spam! 

Request for a fix:

1. The _get_dcmi_power_reading error should be printed to slurmd.log only once when slurmd is started,
2. or printed if an increased debug level has been set.
3. The CurrentWatts value for the node should be set to N/A or zero if the _get_dcmi_power_reading error occurs.

The impact of this issue is *medium* severity because we can't really implement acct_gather_energy/ipmi in the cluster as long as there are any nodes which keep spamming the slurmd.log.

Another observation is that "scontrol show node" keeps showing the old power Watts numbers from our previous RAPL power monitoring, but that is a separate issue which is being addressed in bug 17759.

FYI: Another nearby Slurm site has brand new Xfusion FusionOne HPC 1288H V6 servers (essentially rebranded Huawei servers) which display the issue with malfunctioning IPMI DCMI extensions, even though the server's BMC is documented to support DCMI 1.5!  So new hardware can also have issues with IPMI DCMI.

Thanks,
Ole
Comment 1 Carlos Tripiana Montes 2023-10-19 08:42:06 MDT
Hey Ole,

I see... I'll check. I understand your point, and I think it makes sense.

I'll get back to you tomorrow with more information. I am now checking what can be done.

Regards,
Carlos.
Comment 4 Carlos Tripiana Montes 2023-10-24 04:34:18 MDT
Hi Ole,

Is this a production cluster, or is this being tested on a test cluster?

I ask because I have a patch proposal ready for review. But obviously, I have no access to real IPMI DCMI hardware, so I cannot test it easily.

If, and only if, you can apply this patch on test hardware and check whether the fix works, I'll give you an early-access patch, subject to future formal review.

It is a fairly simple patch, but I need to be 100% sure you aren't going to break production.

Are you willing to test it in a test environment?

Thanks,
Carlos.
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2023-10-24 04:48:58 MDT
Hi Carlos,

(In reply to Carlos Tripiana Montes from comment #4)
> Is this a production cluster, or is this being tested on a test
> cluster?

Unfortunately we don't have a test-cluster available.  We only have the production cluster.

> I ask because I have a patch proposal ready for review. But
> obviously, I have no access to real IPMI DCMI hardware, so I cannot
> test it easily.
> 
> If, and only if, you can apply this patch on test hardware and check
> whether the fix works, I'll give you an early-access patch, subject to
> future formal review.
> 
> It is a fairly simple patch, but I need to be 100% sure you aren't going to
> break production.

My Slurm installation is RPM based, so I don't even know how to apply a patch and build RPMs :-(  If I could build RPMs, I could install it only on a single test node.

Does SchedMD have some internal test servers with IPMI which could be used?

Thanks,
Ole
Comment 6 Carlos Tripiana Montes 2023-10-24 05:01:25 MDT
Thanks Ole,

We will get back to you as soon as the fixes are pushed to the repo.

We have a test machine in Lehi, but I don't know yet whether it supports DCMI.

Anyway, we will figure it out. Don't worry.

Regards,
Carlos.
Comment 7 Carlos Tripiana Montes 2023-10-27 03:56:20 MDT
> 3. The CurrentWatts value for the node should be set to N/A or zero if the
> _get_dcmi_power_reading error occurs.

What do you have here right now? Which value is exposed?
Comment 8 Ole.H.Nielsen@fysik.dtu.dk 2023-10-30 03:38:39 MDT
(In reply to Carlos Tripiana Montes from comment #7)
> > 3. The CurrentWatts value for the node should be set to N/A or zero if the
> > _get_dcmi_power_reading error occurs.
> 
> What do you have here right now? Which value is exposed?

I tested a Huawei node where IPMI DCMI doesn't work and made a local node slurm.conf with AcctGatherEnergyType=acct_gather_energy/ipmi.
We then get a (spam) line in slurmd.log every minute:

[2023-10-30T10:35:22.151] error: _get_dcmi_power_reading: get DCMI power reading failed

Actually, the power reading is correct with zero values:

$ scontrol show node x006
NodeName=x006 Arch=x86_64 CoresPerSocket=12 
   CPUAlloc=24 CPUEfctv=24 CPUTot=24 CPULoad=24.03
(lines deleted)
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

It appears that no fix is required for CurrentWatts and AveWatts after all!

Thanks,
Ole
Comment 9 Carlos Tripiana Montes 2023-10-30 06:11:00 MDT
Ole,

> Actually, the power reading is correct with zero values:
> 
> $ scontrol show node x006
> NodeName=x006 Arch=x86_64 CoresPerSocket=12 
>    CPUAlloc=24 CPUEfctv=24 CPUTot=24 CPULoad=24.03
> (lines deleted)
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> It appears that no fix is required for CurrentWatts and AveWatts after all!

This is the answer I was looking for. I had checked the code and felt that no fix was needed. Thanks for confirming.
Comment 13 Carlos Tripiana Montes 2023-10-31 02:54:52 MDT
Good morning Ole,

This is fixed in the master and 23.02 branches, commits c49cd5a8a7..a8b07aea38, and will be part of the next 23.02.7 and future 23.11.0 releases.

The fix is fairly simple, and matches the behaviour of the current non-DCMI logic.

Let's close this bug as resolved. If you experience any issues, please, reopen it.

Regards,
Carlos.
Comment 14 Ole.H.Nielsen@fysik.dtu.dk 2023-10-31 03:02:43 MDT
Hi Carlos,

(In reply to Carlos Tripiana Montes from comment #13)
> This is fixed in master and 23.02 branches, commits c49cd5a8a7..a8b07aea38,
> and will be part of next 23.02.7 and future 23.11.0 releases.
> 
> The fix is fairly simple, and matches the behaviour of the current non-DCMI
> logic.

Thanks a lot, this is a great fix to the DCMI logging issue.  I look forward to using it with 23.02.7.

Best regards,
Ole
Comment 15 Carlos Tripiana Montes 2023-10-31 03:08:51 MDT
Thanks,

The final version uses an approach similar to the non-DCMI code: it logs the repeated issue up to 5 times and then stops. It now also logs the specific internal issue.
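
For readers who want to picture the "log up to 5 times, then stop" behaviour, here is a minimal sketch in C. It is not the code from the commits above. The names MAX_LOG_ERRORS, error_limiter_t, limiter_should_log, limiter_reset and poll_power are hypothetical, fprintf(stderr, ...) stands in for slurmd's logging, and resetting the counter after a successful read is an assumption of this sketch rather than something stated in this ticket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical cap, chosen to match the "up to 5 times" described above. */
#define MAX_LOG_ERRORS 5

/* Hypothetical helper: counts how many failures have been logged so far. */
typedef struct {
	unsigned int failures;
} error_limiter_t;

/* Return true while a failure should still be written to the log. */
static bool limiter_should_log(error_limiter_t *lim)
{
	assert(lim != NULL);
	if (lim->failures < MAX_LOG_ERRORS) {
		lim->failures++;
		return true;
	}
	return false;
}

/* Assumption of this sketch: a successful read re-arms the limiter. */
static void limiter_reset(error_limiter_t *lim)
{
	assert(lim != NULL);
	lim->failures = 0;
}

/*
 * Stand-in for one polling cycle (e.g. every EnergyIPMIFrequency seconds).
 * read_ok simulates whether the BMC answered the power reading request.
 * Returns true if an error line was emitted during this cycle.
 */
static bool poll_power(error_limiter_t *lim, bool read_ok)
{
	if (read_ok) {
		limiter_reset(lim);
		return false;
	}
	if (!limiter_should_log(lim))
		return false; /* cap reached: stay silent */
	fprintf(stderr, "error: power reading failed (%u/%d)\n",
		lim->failures, MAX_LOG_ERRORS);
	return true;
}
```

With a limiter like this, a node whose BMC never supports DCMI would produce at most five error lines per slurmd run instead of one line every polling interval.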