Ticket 12961

Summary: Can't find parent id 1234 for assoc 56789
Product: Slurm
Reporter: Martin Siegert <siegert>
Component: Database
Assignee: Scott Hilton <scott>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---
Version: 20.11.7
Hardware: Linux
OS: Linux
Site: Simon Fraser University

Description Martin Siegert 2021-12-03 10:36:48 MST
Every time we restart slurmctld we get error messages:
[2021-11-26T16:04:48.841] error: Can't find parent id 7462 for assoc 68974, this should never happen.
[2021-11-26T16:04:48.841] error: Can't find parent id 7462 for assoc 68974, this should never happen.
[2021-11-26T16:04:50.519] error: Can't find parent id 3197 for assoc 68973, this should never happen.
[2021-11-26T16:04:50.519] error: Can't find parent id 3197 for assoc 68973, this should never happen.
Therefore, I ran "slurmdbd -Dvvvvvv -R cedar" on our test system, and after many hours it failed with:

call get_parent_limits('assoc_table', 'root', 'slurmredolftrgttemp', 0); select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos, @prio;
slurmdbd: error: Could not execute statement 14 Can't change size of file (Errcode: 1140912112 "Unknown error 1140912112")
slurmdbd: debug4: accounting_storage/as_mysql: _get_parent_id: 0(as_mysql_assoc.c:654) query
select id_assoc from "slurmredolftrgttemp_assoc_table" where user='' and deleted = 0 and acct='def-kmanau_cpu';
slurmdbd: error: no association for parent def-kmanau_cpu on cluster slurmredolftrgttemp
slurmdbd: debug4: accounting_storage/as_mysql: _set_assoc_limits_for_add: 0(as_mysql_assoc.c:733) query
call get_parent_limits('assoc_table', 'def-kmanau_cpu', 'slurmredolftrgttemp', 0); select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos, @prio;
slurmdbd: error: mysql_query failed: 14 Can't change size of file (Errcode: 1140912112 "Unknown error 1140912112")
UPDATE "slurmredolftrgttemp_assoc_table" SET rgt = rgt+2 WHERE rgt > 2 && deleted < 2;UPDATE "slurmredolftrgttemp_assoc_table" SET lft = lft+2 WHERE lft > 2 && deleted < 2;UPDATE "slurmredolftrgttemp_assoc_table" SET deleted = 0 WHERE deleted = 2;
slurmdbd: error: Couldn't do update

What now?

- Martin
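
The UPDATE statements in the log above shift `lft` and `rgt` by 2 for every row past a threshold, which is the usual way to open a gap for a new node in a nested-set tree. The following is a minimal sketch of that pattern in Python, not Slurm's actual code; the in-memory `rows` dictionary and the `insert_child` helper are hypothetical and exist only for illustration.

```python
# Illustrative nested-set insert. This is NOT slurmdbd's implementation;
# `rows` and `insert_child` are hypothetical names for this sketch.

def insert_child(rows, parent_name, child_name):
    """Insert child_name as the first child of parent_name.

    Mirrors the logged pattern:
      UPDATE ... SET rgt = rgt+2 WHERE rgt > <threshold>
      UPDATE ... SET lft = lft+2 WHERE lft > <threshold>
    where the threshold is assumed to be the parent's lft value.
    """
    threshold = rows[parent_name]["lft"]
    for row in rows.values():
        if row["rgt"] > threshold:
            row["rgt"] += 2
        if row["lft"] > threshold:
            row["lft"] += 2
    # The new node occupies the two-slot gap opened just inside the parent.
    rows[child_name] = {"lft": threshold + 1, "rgt": threshold + 2}
    return rows


rows = {
    "root": {"lft": 1, "rgt": 4},
    "acct": {"lft": 2, "rgt": 3},
}
insert_child(rows, "acct", "user")
```

In this model a single insert can touch most rows of the table, which may be part of why rebuilding these values for a large association table takes so long.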
Comment 1 Martin Siegert 2021-12-03 10:45:01 MST
Sorry, I should have checked first: the db partition filled up:
/dev/mapper/data-lv1                    146749440  146688684       60756 100% /mariadb
Thus, I guess this can't be fixed, correct?
Comment 3 Scott Hilton 2021-12-06 11:59:25 MST
Martin,

You need to either move the database to a location with more space, add space, or free up some space.

It is expected that transactions will fail if there is no space to write those transactions.

-Scott
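
A minimal sketch of a pre-flight check along these lines, assuming Python is available on the database host. The `has_free_space` helper and the 50 GiB figure are hypothetical; `/mariadb` is the mount point from the `df` output in Comment 1.

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` of free space available."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= required_bytes


if __name__ == "__main__":
    # Hypothetical threshold; choose one that fits the expected growth.
    required = 50 * 1024**3
    try:
        ok = has_free_space("/mariadb", required)
    except FileNotFoundError:
        ok = False  # mount point does not exist on this machine
    print("enough space" if ok else "not enough space; do not start the rebuild")
```

Running a check like this before a multi-hour operation avoids discovering a full partition only after the transaction fails.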
Comment 4 Scott Hilton 2021-12-20 09:57:37 MST
Martin, 

Did that work? Are you still experiencing issues with this?

-Scott
Comment 5 Martin Siegert 2021-12-20 12:13:37 MST
Hi Scott,

we estimate that we need about three times as much space as we have on the machine. We can roughly double the space, but that would not be enough. We also estimate that "slurmdbd -Dvvvvvv -R cedar" will run for about three days, i.e., it will require a massive downtime that we need to schedule.
Besides that, you can close the ticket.
And a warning to everybody who runs into the "Can't find parent id 1234 for assoc 98765" problem: this is a huge problem that unfortunately does happen.

- Martin
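
The space estimate above can be worked through as a short calculation. This sketch assumes the `df` output in Comment 1 is in 1 KiB blocks (the common default) and takes the "three times" and "double" factors directly from the comment; the helper names are hypothetical.

```python
# Partition size from the df output in Comment 1, assumed to be 1 KiB blocks.
CURRENT_KIB = 146749440


def scaled_kib(current_kib, factor):
    """Size in KiB after scaling the current partition by `factor`."""
    return current_kib * factor


def kib_to_gib(kib):
    """Convert KiB to GiB."""
    return kib / (1024 * 1024)


needed = scaled_kib(CURRENT_KIB, 3)      # estimated requirement
available = scaled_kib(CURRENT_KIB, 2)   # after doubling the partition
shortfall = needed - available           # still missing after doubling

print(f"needed:    {kib_to_gib(needed):.1f} GiB")
print(f"available: {kib_to_gib(available):.1f} GiB")
print(f"shortfall: {kib_to_gib(shortfall):.1f} GiB")
```

Under these assumptions, the shortfall after doubling equals the size of the current partition.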
Comment 6 Scott Hilton 2021-12-27 10:06:04 MST
Martin,

Ok, good luck with the change. 

-Scott