Summary: | slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen | |
---|---|---|---
Product: | Slurm | Reporter: | Steve Ford <fordste5>
Component: | slurmstepd | Assignee: | Gavin D. Howard <gavin>
Status: | RESOLVED FIXED | QA Contact: |
Severity: | 4 - Minor Issue | Priority: | ---
CC: | aptivhpcsupport, cinek, marc.caubet, marshall, nate | Version: | 19.05.1
Hardware: | Linux | OS: | Linux
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8656, https://bugs.schedmd.com/show_bug.cgi?id=8763 | |
Site: | MSU | Version Fixed: | 20.02.3
Attachments: | lac-014 slurmd log; cgroup.conf | |
Hey Steve,

"This should never happen" bugs are always fun. These errors mean that some of the accounting information from cgroups isn't being gathered by jobacct_gather, but it's not impacting the actual job in any way. From your slurmd log file, it's mostly the extern and batch steps, but a handful of cases are step 0 of some job.

I haven't been able to reproduce it yet. I suspect a race condition, possibly triggered by something configuration-specific. Can you upload your cgroup.conf? I have your slurm.conf file from a recent ticket (7580), so I don't need that.

- Marshall
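For reference, the per-task accounting that jobacct_gather/cgroup reads comes from the cpuacct and memory cgroup hierarchies. The sketch below is one way to check whether those per-task directories exist on a node; it assumes cgroup v1 mounted under /sys/fs/cgroup and Slurm's usual uid_*/job_*/step_*/task_* layout, which may differ on your system.

```sh
# Sketch only: list a few of the per-task cpuacct and memory cgroup
# directories that jobacct_gather/cgroup polls (cgroup v1 paths assumed).
for ctl in cpuacct memory; do
    echo "== $ctl =="
    find "/sys/fs/cgroup/$ctl" -type d -name 'task_*' -path '*slurm*' 2>/dev/null | head -n 5
done
# Inside an existing task directory, the usage files the plugin reads
# look like cpuacct.usage and memory.usage_in_bytes.
```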
Created attachment 11324 [details]
cgroup.conf

Just a quick update: I can reproduce this occasionally simply by submitting a bunch of jobs. I still don't know why it's happening, though. The good news is that it only appears to happen at the start of the job, it's almost always the extern step or batch step, only the cgroup accounting information is missed, and that information is gathered on every subsequent attempt. So you aren't really losing any data. Feel free to ignore these error messages; I'll keep trying to track down and fix the bug.
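In the same spirit of "submitting a bunch of jobs", a rough way to try to trigger the race on a test system is sketched below; the job count, wrapped command, and log path are placeholders rather than SchedMD's actual reproduction steps.

```sh
# Hypothetical reproduction attempt: flood the scheduler with short jobs,
# then look for the error in the node's slurmd log.
for i in $(seq 1 200); do
    sbatch --wrap='sleep 5' >/dev/null
done
# The log location depends on SlurmdLogFile in slurm.conf.
grep 'Could not find task_' /var/log/slurm/slurmd.log
```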
Hello, I just wanted to let you know that I have taken charge of this bug and am looking into it. We haven't forgotten.

Just a note that I am still working on this bug. It has proven hard to reproduce since it is a race condition, but I have not forgotten about it.

We are seeing the same problem on our cluster at PSI. Just let us know if you need any extra information regarding configuration or logs. The relevant settings in slurm.conf are:

ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
TaskPlugin=task/affinity,task/cgroup
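The attached cgroup.conf is not reproduced in this report. Purely as an illustration of the kind of cgroup v1 configuration that typically accompanies the settings above (not the reporter's actual file), it might look something like:

```
# Illustrative cgroup.conf only -- not the attachment from this ticket.
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
```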
Apologies for the long delay, but this has now been fixed and committed. See https://github.com/SchedMD/slurm/commit/4c030c03778e65534178449cedca9bbe483bd0ec.

Created attachment 11323 [details]
lac-014 slurmd log

We are seeing some unusual errors on one of our nodes:

slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen

Any idea what could cause these?