| Summary: | sacct -o ReqMem output wrong when running slurm-17-02 and outputting data collected with slurm-16-05 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Josh Samuelson <josh> |
| Component: | Accounting | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 17.02.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Nebraska–Lincoln | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 17.02.8 17.11.0-pre3 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
|
Description
Josh Samuelson
2017-09-12 15:36:43 MDT
(In reply to Josh Samuelson from comment #0)

> Greetings,
>
> Once or twice a year, we'll query the Slurm accounting database to collect various stats on how our cluster is used for administrative reporting purposes. One of our staff brought to my attention that many jobs' historic memory requests were off the chart, wrong. After looking at the date-sorted records for a while, we realized the problem seemed to correct itself the day we updated from slurm-16-05 to slurm-17-02.
>
> I did some digging, and I think the once-good historic data that now reports incorrectly is related to this work:
>
> https://github.com/SchedMD/slurm/commit/acc75cd1897269dac648c94b0d633aac26a164b4
>
> In prior versions of Slurm, if a job used --mem-per-cpu, Slurm would add the flag MEM_PER_CPU (0x80000000) to that value. If the 2^31 bit was set in the job's memory record, Slurm knew it was a per-CPU memory request and not a per-node one. After the update, the 2^63 bit (the new value of MEM_PER_CPU) isn't set on the old records, so the 2^31 bit is assumed to be part of a legitimate per-node memory request.
>
> It's possible I missed something in the RELEASE_NOTES about handling the accounting database when I did the upgrade. I had assumed that slurmdbd would handle any schema and data transformations if needed.
>
> I searched to see if anyone else has run into this but found no matches, so perhaps it was something I did.
>
> So my questions:
>
> 1) Obvious one, did I miss an update step?

No - looks like I did when doing the conversion work. I missed the flag being used in the accounting database, and should have worked out a conversion process to run automatically on upgrade.

> 2) In a test VM with a copy of our database, I ran the following, which seemed to correct the data. Does this appear safe to run on our production database?
> update clustername_job_table set mem_req = 0x8000000000000000 | (mem_req ^
> 0x80000000) where (mem_req & 0xffffffff80000000) = 0x80000000;

Assuming you have no systems with > 2TB of RAM, that should be fine.

This is cosmetically fixed in 17.02 with commit 7bf6ade891f3. It was completely fixed in 17.11 in commit 989a92827bc17. Thanks for the SQL; we used it when fixing this correctly.

Please reopen if you notice anything I missed.