Summary: | slurmd should log _get_dcmi_power_reading errors only once at startup | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | slurmd | Assignee: | Carlos Tripiana Montes <tripiana> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | felip.moll |
Version: | 23.02.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DTU Physics | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | 23.02.7, 23.11.0rc1 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Ole.H.Nielsen@fysik.dtu.dk
2023-10-18 06:08:11 MDT
Hey Ole,

I see... I'll check. I understand your point, and I think it makes sense. I'll get back to you tomorrow with more information. I am now checking what can be done.

Regards,
Carlos.

Hi Ole,

Is the cluster a production one? Or is this being tested in a test cluster?

I am asking because I have a patch proposal ready for review. But obviously, I have no access to real DCMI IPMI-like hardware, so I cannot test it easily.

IF, and only if, you can apply this patch on test HW and see if the fix works fine, then I'll deliver you an early access patch, subject to future formal review.

It is a fairly simple patch, but I need to be 100% sure you aren't going to break production. Are you willing to test it in a test environment?

Thanks,
Carlos.

Hi Carlos,

(In reply to Carlos Tripiana Montes from comment #4)
> Is the cluster a production one? Or is this being tested in a test
> cluster?

Unfortunately we don't have a test cluster available. We only have the production cluster.

> I am asking because I have a patch proposal ready for review. But
> obviously, I have no access to real DCMI IPMI-like hardware, so I cannot
> test it easily.
>
> IF, and only if, you can apply this patch on test HW and see if the fix
> works fine, then I'll deliver you an early access patch, subject to future
> formal review.
>
> It is a fairly simple patch, but I need to be 100% sure you aren't going to
> break production.

My Slurm installation is RPM based, so I don't even know how to apply a patch and build RPMs :-( If I could build RPMs, I could install it only on a single test node.

Does SchedMD have some internal test servers with IPMI which could be used?

Thanks,
Ole

Thanks Ole,

We will get back to you as soon as the fixes are pushed to the repo. We have a test machine in Lehi, but I don't know right now whether it supports DCMI or not. Anyway, we will figure it out. Don't worry.

Regards,
Carlos.

> 3. The CurrentWatts value for the node should be set to N/A or zero if the
> _get_dcmi_power_reading error occurs.
What do you have here right now? Which value is exposed?
(In reply to Carlos Tripiana Montes from comment #7)
> > 3. The CurrentWatts value for the node should be set to N/A or zero if the
> > _get_dcmi_power_reading error occurs.
>
> What do you have here right now? Which value is exposed?

I tested a Huawei node where IPMI DCMI doesn't work and made a local node slurm.conf with

AcctGatherEnergyType=acct_gather_energy/ipmi

Then we get a (spam) line in slurmd.log every minute:

[2023-10-30T10:35:22.151] error: _get_dcmi_power_reading: get DCMI power reading failed

Actually, the power reading is correct with zero values:

$ scontrol show node x006
NodeName=x006 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=24 CPUEfctv=24 CPUTot=24 CPULoad=24.03
   (lines deleted)
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

It appears that no fix is required for CurrentWatts and AveWatts after all!

Thanks,
Ole

Ole,
> Actually, the power reading is correct with zero values:
>
> $ scontrol show node x006
> NodeName=x006 Arch=x86_64 CoresPerSocket=12
> CPUAlloc=24 CPUEfctv=24 CPUTot=24 CPULoad=24.03
> (lines deleted)
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> It appears that no fix is required for CurrentWatts and AveWatts after all!
This is the answer I was looking for. I had checked the code and felt that no fix was needed. Thanks for confirming.
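For anyone who wants to repeat this check across several nodes, here is a minimal sketch that pulls the energy fields out of `scontrol show node` output. It assumes the output is a series of whitespace-separated `KEY=VALUE` tokens, as in the excerpt quoted above; the function name `parse_node_fields` is hypothetical, and fields whose values contain spaces are not handled.

```python
import re


def parse_node_fields(text):
    """Collect KEY=VALUE tokens from `scontrol show node` output into a dict.

    Assumption: values contain no whitespace, which holds for the
    energy-related fields quoted in this report.
    """
    return dict(re.findall(r"(\w+)=(\S+)", text))


# Sample text copied from the excerpt in this report.
sample = """NodeName=x006 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=24 CPUEfctv=24 CPUTot=24 CPULoad=24.03
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s"""

fields = parse_node_fields(sample)
print(fields["CurrentWatts"], fields["AveWatts"])
```

On a live system the `sample` string could be replaced by the captured output of `scontrol show node <nodename>`.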
Good morning Ole,

This is fixed in master and 23.02 branches, commits c49cd5a8a7..a8b07aea38, and will be part of the next 23.02.7 and future 23.11.0 releases.

The fix is fairly simple, and matches the behaviour of the current non-DCMI logic.

Let's close this bug as resolved. If you experience any issues, please reopen it.

Regards,
Carlos.

Hi Carlos,

(In reply to Carlos Tripiana Montes from comment #13)
> This is fixed in master and 23.02 branches, commits c49cd5a8a7..a8b07aea38,
> and will be part of the next 23.02.7 and future 23.11.0 releases.
>
> The fix is fairly simple, and matches the behaviour of the current non-DCMI
> logic.

Thanks a lot, this is a great fix to the DCMI logging issue. I look forward to using it with 23.02.7.

Best regards,
Ole

Thanks,

The final version uses a similar approach to the non-DCMI code. It logs the repeated issue up to 5 times and then stops. It also now logs the specific internal issue as well.
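The "log up to 5 times, then stop" behaviour described above can be illustrated with a small sketch. This is not the Slurm patch (slurmd is written in C, and the actual change is in the commits referenced earlier). It is only an illustration of the capped-logging idea in Python, and the names `CappedErrorLogger`, `poll_power` and `MAX_LOGGED_FAILURES` are hypothetical.

```python
import logging

MAX_LOGGED_FAILURES = 5  # cap mentioned in the thread; the constant name is hypothetical


class CappedErrorLogger:
    """Log a recurring error at most `limit` times, then stay silent."""

    def __init__(self, logger, limit=MAX_LOGGED_FAILURES):
        self.logger = logger
        self.limit = limit
        self.failures = 0  # total failures seen, whether logged or not

    def report(self, message):
        """Record one failure; return True if it was written to the log."""
        self.failures += 1
        if self.failures > self.limit:
            return False
        self.logger.error(message)
        return True


def poll_power(read_power, capped):
    """Call a (hypothetical) sensor reader; on failure, report it and return 0 watts."""
    try:
        return read_power()
    except OSError as exc:
        # Include the underlying cause so the log says why the reading failed.
        capped.report(f"get DCMI power reading failed: {exc}")
        return 0
```

With a reader that fails on every poll, only the first five failures reach the log, while the caller keeps receiving a zero reading on each poll. Whether the counter should reset after a successful reading is a design choice this sketch leaves out.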