| Summary: | Can't find parent id 1234 for assoc 56789 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Martin Siegert <siegert> |
| Component: | Database | Assignee: | Scott Hilton <scott> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Simon Fraser University | ||
Sorry, I should have checked first: the db partition filled up:

    /dev/mapper/data-lv1 146749440 146688684 60756 100% /mariadb

Thus, I guess this can't be fixed, correct?

Martin,

You need to either move it to a location with more space, add space or free up some space. It is expected that transactions will fail if there is no space to write those transactions.

-Scott

Martin,

Did that work? Are you still experiencing issues with this?

-Scott

Hi Scott,

we estimate that we need about three times as much space as we have on the machine. We can about double the space but that would not be enough. We also estimate that the "slurmdbd -Dvvvvvv -R cedar" will run for about three days, i.e., will require a massive downtime that we need to schedule. But beside that you can close the ticket.

And a warning to everybody who runs into the "Can't find parent id 1234 for assoc 98765" problem: this is a huge problem that unfortunately does happen.

- Martin

Martin,

Ok, good luck with the change.

-Scott
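The space estimate discussed above can be checked with a few lines of arithmetic. The following is a minimal sketch, not anything Slurm or MariaDB provides: it assumes the `df` figures quoted in the thread are in 1 KiB blocks (the usual `df` default, but not stated in the ticket), and the names `DfEntry`, `parse_df_line`, and `shortfall_kib` are hypothetical helpers introduced only for illustration.

```python
from typing import NamedTuple


class DfEntry(NamedTuple):
    """One data row of `df` output (sizes assumed to be in 1 KiB blocks)."""
    filesystem: str
    total_kib: int
    used_kib: int
    avail_kib: int
    use_percent: int
    mountpoint: str


def parse_df_line(line: str) -> DfEntry:
    """Split a single whitespace-separated `df` row into typed fields."""
    fs, total, used, avail, pct, mount = line.split()
    return DfEntry(fs, int(total), int(used), int(avail),
                   int(pct.rstrip("%")), mount)


def shortfall_kib(entry: DfEntry, required_factor: int,
                  planned_factor: int) -> int:
    """Return how much space is still missing after the planned expansion.

    required_factor: estimated need as a multiple of the current size
    planned_factor:  size after expansion as a multiple of the current size
    """
    required = entry.total_kib * required_factor
    planned = entry.total_kib * planned_factor
    return max(0, required - planned)


# The row quoted in the ticket.
entry = parse_df_line(
    "/dev/mapper/data-lv1 146749440 146688684 60756 100% /mariadb"
)

# "about three times as much space" needed, "about double" available.
missing = shortfall_kib(entry, required_factor=3, planned_factor=2)
print(f"{entry.mountpoint}: {entry.avail_kib} KiB free, "
      f"{missing} KiB still missing after doubling")
```

Under these assumptions, doubling the partition still leaves a gap equal to the full current partition size, which matches the reporter's conclusion that doubling "would not be enough".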
Every time we restart slurmctld we get error messages:

    [2021-11-26T16:04:48.841] error: Can't find parent id 7462 for assoc 68974, this should never happen.
    [2021-11-26T16:04:48.841] error: Can't find parent id 7462 for assoc 68974, this should never happen.
    [2021-11-26T16:04:50.519] error: Can't find parent id 3197 for assoc 68973, this should never happen.
    [2021-11-26T16:04:50.519] error: Can't find parent id 3197 for assoc 68973, this should never happen.

Therefore I ran "slurmdbd -Dvvvvvv -R cedar" on our test system and after many hours this failed with:

    call get_parent_limits('assoc_table', 'root', 'slurmredolftrgttemp', 0);
    select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos, @prio;
    slurmdbd: error: Could not execute statement 14 Can't change size of file (Errcode: 1140912112 "Unknown error 1140912112")
    slurmdbd: debug4: accounting_storage/as_mysql: _get_parent_id: 0(as_mysql_assoc.c:654) query
    select id_assoc from "slurmredolftrgttemp_assoc_table" where user='' and deleted = 0 and acct='def-kmanau_cpu';
    slurmdbd: error: no association for parent def-kmanau_cpu on cluster slurmredolftrgttemp
    slurmdbd: debug4: accounting_storage/as_mysql: _set_assoc_limits_for_add: 0(as_mysql_assoc.c:733) query
    call get_parent_limits('assoc_table', 'def-kmanau_cpu', 'slurmredolftrgttemp', 0);
    select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos, @prio;
    slurmdbd: error: mysql_query failed: 14 Can't change size of file (Errcode: 1140912112 "Unknown error 1140912112")
    UPDATE "slurmredolftrgttemp_assoc_table" SET rgt = rgt+2 WHERE rgt > 2 && deleted < 2;
    UPDATE "slurmredolftrgttemp_assoc_table" SET lft = lft+2 WHERE lft > 2 && deleted < 2;
    UPDATE "slurmredolftrgttemp_assoc_table" SET deleted = 0 WHERE deleted = 2;
    slurmdbd: error: Couldn't do update

What now?

- Martin
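To see how many distinct associations are affected before scheduling a rebuild, the slurmctld log can be scanned for the error shown above. This is a rough sketch only: the regular expression is derived from the four log lines quoted in this ticket, so other Slurm versions may word the message differently, and `find_orphans` is a hypothetical helper name, not a Slurm tool.

```python
import re
from collections import Counter

# Pattern taken from the log lines quoted in this ticket (an assumption
# about the message format, not a documented Slurm interface).
ORPHAN_RE = re.compile(r"error: Can't find parent id (\d+) for assoc (\d+)")


def find_orphans(log_lines):
    """Count occurrences of each (parent_id, assoc_id) pair in the log."""
    counts = Counter()
    for line in log_lines:
        match = ORPHAN_RE.search(line)
        if match:
            parent_id, assoc_id = int(match.group(1)), int(match.group(2))
            counts[(parent_id, assoc_id)] += 1
    return counts


# Sample input: the lines reported above. For a real log, pass an open
# file object instead of this list.
sample = [
    "[2021-11-26T16:04:48.841] error: Can't find parent id 7462 for assoc 68974, this should never happen.",
    "[2021-11-26T16:04:48.841] error: Can't find parent id 7462 for assoc 68974, this should never happen.",
    "[2021-11-26T16:04:50.519] error: Can't find parent id 3197 for assoc 68973, this should never happen.",
    "[2021-11-26T16:04:50.519] error: Can't find parent id 3197 for assoc 68973, this should never happen.",
]

orphans = find_orphans(sample)
for (parent_id, assoc_id), seen in sorted(orphans.items()):
    print(f"assoc {assoc_id}: missing parent {parent_id} (logged {seen}x)")
```

Deduplicating matters here because each message is logged twice per restart, so the raw line count overstates the number of broken associations.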