We're testing the IPMI power monitoring in slurm.conf with

    AcctGatherEnergyType=acct_gather_energy/ipmi

and an acct_gather.conf file:

    EnergyIPMIPowerSensors=Node=DCMI
    EnergyIPMIFrequency=60
    EnergyIPMICalcAdjustment=yes

When we use the FreeIPMI development version 1.7.0, this works well, as discussed in bug 17639.

However, we have 196 old Huawei XH620 V3 nodes whose BMC doesn't seem to support the IPMI DCMI extensions, as shown by this command:

    $ ipmi-dcmi --get-system-power-statistics
    ipmi_cmd_dcmi_get_power_reading: command invalid or unsupported

When we use the above acct_gather_energy/ipmi in local config files on the Huawei nodes, slurmd.log correctly logs errors:

    [2023-10-18T13:34:10.185] error: _get_dcmi_power_reading: get DCMI power reading failed

The problem for us is that slurmd keeps logging this error every EnergyIPMIFrequency=60 seconds! IPMI DCMI is not going to suddenly start working (it's a permanent error), so the repeated log lines quickly turn into spam!

Request for a fix:

1. The _get_dcmi_power_reading error should be printed to slurmd.log only once when slurmd is started,
2. or printed only if an increased debug level has been set.
3. The CurrentWatts value for the node should be set to N/A or zero if the _get_dcmi_power_reading error occurs.

The impact of this issue is *medium* severity because we can't really deploy acct_gather_energy/ipmi in the cluster as long as there are nodes which keep spamming slurmd.log.

Another observation is that "scontrol show node" keeps showing the old power Watts numbers from our previous RAPL power monitoring, but that is a separate issue which is being addressed in bug 17759.

FYI: Another nearby Slurm site has brand new Xfusion FusionOne HPC 1288H V6 servers (essentially rebranded Huawei servers) which show the same issue with malfunctioning IPMI DCMI extensions, even though the server's BMC is documented to support DCMI 1.5! So new hardware can also have issues with IPMI DCMI.

Thanks,
Ole
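To put a number on the log spam described above, the following is a minimal Python sketch that pulls the timestamps of the `_get_dcmi_power_reading` error lines out of slurmd.log text. It assumes the log line layout shown in the report (`[timestamp] error: _get_dcmi_power_reading: ...`). The function name `dcmi_error_times` is hypothetical, and the second sample line is made up for illustration (only the first one appears in the report).

```python
import re
from datetime import datetime

# Matches lines such as:
# [2023-10-18T13:34:10.185] error: _get_dcmi_power_reading: get DCMI power reading failed
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] error: _get_dcmi_power_reading: ")


def dcmi_error_times(lines):
    """Return the timestamps of all DCMI power reading errors in `lines`."""
    times = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            times.append(datetime.fromisoformat(match.group("ts")))
    return times


sample_log = [
    "[2023-10-18T13:34:10.185] error: _get_dcmi_power_reading: get DCMI power reading failed",
    # Hypothetical follow-up line, one EnergyIPMIFrequency interval later:
    "[2023-10-18T13:35:10.185] error: _get_dcmi_power_reading: get DCMI power reading failed",
]
times = dcmi_error_times(sample_log)
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
print(len(times), "errors, gaps in seconds:", gaps)
```

On a real node, the lines would come from the slurmd log file instead of `sample_log`; a steady gap equal to EnergyIPMIFrequency is the pattern reported here.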
Hey Ole, I see... I'll check. I understand your point, and I think it makes sense. I'll get back to you tomorrow with more information. I am now checking what can be done. Regards, Carlos.
Hi Ole, Is this a production cluster, or is it being tested on a test cluster? I am asking because I have a patch proposal ready for review. But obviously, I have no access to real DCMI IPMI hardware, so I cannot test it easily. IF, and only if, you can apply this patch on test hardware and check that the fix works, then I'll deliver an early access patch to you, subject to future formal review. It is a fairly simple patch, but I need to be 100% sure you aren't going to break production. Are you willing to test it in a test environment? Thanks, Carlos.
Hi Carlos,

(In reply to Carlos Tripiana Montes from comment #4)
> The cluster is production-like one? Or this is being tested in a test
> cluster?

Unfortunately we don't have a test cluster available. We only have the production cluster.

> I am saying so because I have a patch proposal ready for review. But
> obviously, I have no access to a real DCMI IPMI-like hardware, so I cannot
> test it easily.
>
> IF, and only if, you can apply this patch in a test HW, and see if the fix
> works fine, then I'll deliver you an early access patch, subjected to future
> formal review.
>
> It is a fairly simple patch, but I need to be 100% sure you aren't going to
> break production.

My Slurm installation is RPM based, so I don't even know how to apply a patch and build RPMs :-( If I could build RPMs, I could install them on a single test node only.

Does SchedMD have some internal test servers with IPMI which could be used?

Thanks,
Ole
Thanks Ole, We will get back to you as soon as the fixes are pushed to the repo. We have a test machine in Lehi, but I don't know yet whether it supports DCMI. Anyway, we will figure it out. Don't worry. Regards, Carlos.
> 3. The CurrentWatts value for the node should be set to N/A or zero if the
> _get_dcmi_power_reading error occurs.

What do you have here right now? Which value is exposed?
(In reply to Carlos Tripiana Montes from comment #7)
> > 3. The CurrentWatts value for the node should be set to N/A or zero if the
> > _get_dcmi_power_reading error occurs.
>
> What do you have here right now? Which value is exposed?

I tested a Huawei node where IPMI DCMI doesn't work and made a local node slurm.conf with

    AcctGatherEnergyType=acct_gather_energy/ipmi

We then get a (spam) line in slurmd.log every minute:

    [2023-10-30T10:35:22.151] error: _get_dcmi_power_reading: get DCMI power reading failed

Actually, the power reading is correct with zero values:

    $ scontrol show node x006
    NodeName=x006 Arch=x86_64 CoresPerSocket=12
       CPUAlloc=24 CPUEfctv=24 CPUTot=24 CPULoad=24.03
       (lines deleted)
       CurrentWatts=0 AveWatts=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

It appears that no fix is required for CurrentWatts and AveWatts after all!

Thanks,
Ole
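To run the same CurrentWatts/AveWatts check across many nodes, the `scontrol show node` output can be split into key/value pairs. This is a rough Python sketch, not part of Slurm: the helper names `parse_node_fields` and `power_is_zero` are hypothetical, and the parser assumes values contain no whitespace, which holds for the fields shown above but not for every field scontrol can print.

```python
import re


def parse_node_fields(text):
    """Split `scontrol show node` output into a {key: value} dict.

    Assumes whitespace-free values; fields whose values contain spaces
    would be truncated at the first space.
    """
    return dict(re.findall(r"(\S+?)=(\S*)", text))


def power_is_zero(fields):
    """True if both power fields report 0, as on the Huawei node above."""
    return fields.get("CurrentWatts") == "0" and fields.get("AveWatts") == "0"


# Output copied from the comment above (with its "(lines deleted)" marker).
sample = """NodeName=x006 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=24 CPUEfctv=24 CPUTot=24 CPULoad=24.03
   (lines deleted)
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s"""

fields = parse_node_fields(sample)
print(fields["NodeName"], fields["CurrentWatts"], fields["AveWatts"])
```

Note that a zero reading by itself does not say whether DCMI failed or the node really reports 0 W; the slurmd.log error is still the reliable indicator.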
Ole,

> Actually, the power reading is correct with zero values:
>
> $ scontrol show node x006
> NodeName=x006 Arch=x86_64 CoresPerSocket=12
> CPUAlloc=24 CPUEfctv=24 CPUTot=24 CPULoad=24.03
> (lines deleted)
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> It appears that no fix is required for CurrentWatts and AveWatts after all!

This is the answer I was looking for. I had checked the code and felt that no fix was needed. Thanks for confirming.
Good morning Ole, This is fixed in the master and 23.02 branches, commits c49cd5a8a7..a8b07aea38, and will be part of the upcoming 23.02.7 and 23.11.0 releases. The fix is fairly simple, and matches the behaviour of the current non-DCMI logic. Let's close this bug as resolved. If you experience any issues, please reopen it. Regards, Carlos.
Hi Carlos,

(In reply to Carlos Tripiana Montes from comment #13)
> This is fixed in master and 23.02 branches, commits c49cd5a8a7..a8b07aea38,
> and will be part of next 23.02.7 and future 23.11.0 releases.
>
> The fix is fairly simple, and matches the behaviour of the current non-DCMI
> logic.

Thanks a lot, this is a great fix to the DCMI logging issue. I look forward to using it with 23.02.7.

Best regards,
Ole
Thanks, The final version uses an approach similar to the non-DCMI code. It logs the repeated issue up to 5 times and then stops. It now also logs the specific internal error.
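The "log up to 5 times, then stop" behaviour described above can be illustrated with a short Python sketch. This is not the actual Slurm patch (which is C code in the acct_gather_energy/ipmi plugin); the class name, the constant name, and the message format are assumptions made for illustration. The thread also does not say whether the counter resets after a successful reading, so this sketch never resets it.

```python
import logging

MAX_LOGGED_FAILURES = 5  # limit taken from the comment above; the name is hypothetical


class ReadingFailureLogger:
    """Log a recurring failure a limited number of times, then stay quiet."""

    def __init__(self, logger, limit=MAX_LOGGED_FAILURES):
        self.logger = logger
        self.limit = limit
        self.failures = 0

    def report(self, detail):
        """Record one failure; return True if it was logged, False if suppressed."""
        self.failures += 1
        if self.failures > self.limit:
            return False
        # Include the underlying cause, mirroring "logs the specific internal issue".
        self.logger.error(
            "_get_dcmi_power_reading: get DCMI power reading failed: %s", detail
        )
        return True


# Simulate eight polling intervals on a node whose BMC lacks DCMI support.
reporter = ReadingFailureLogger(logging.getLogger("slurmd-sketch"))
outcomes = [reporter.report("command invalid or unsupported") for _ in range(8)]
print(outcomes)
```

With this pattern a permanently failing node produces a bounded number of log lines, while the first few errors still make the failure visible to the administrator.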