Every time we restart slurmctld we get error messages:

[2021-11-26T16:04:48.841] error: Can't find parent id 7462 for assoc 68974, this should never happen.
[2021-11-26T16:04:48.841] error: Can't find parent id 7462 for assoc 68974, this should never happen.
[2021-11-26T16:04:50.519] error: Can't find parent id 3197 for assoc 68973, this should never happen.
[2021-11-26T16:04:50.519] error: Can't find parent id 3197 for assoc 68973, this should never happen.

Therefore I ran "slurmdbd -Dvvvvvv -R cedar" on our test system and after many hours this failed with:

call get_parent_limits('assoc_table', 'root', 'slurmredolftrgttemp', 0); select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos, @prio;
slurmdbd: error: Could not execute statement 14 Can't change size of file (Errcode: 1140912112 "Unknown error 1140912112")
slurmdbd: debug4: accounting_storage/as_mysql: _get_parent_id: 0(as_mysql_assoc.c:654) query
select id_assoc from "slurmredolftrgttemp_assoc_table" where user='' and deleted = 0 and acct='def-kmanau_cpu';
slurmdbd: error: no association for parent def-kmanau_cpu on cluster slurmredolftrgttemp
slurmdbd: debug4: accounting_storage/as_mysql: _set_assoc_limits_for_add: 0(as_mysql_assoc.c:733) query
call get_parent_limits('assoc_table', 'def-kmanau_cpu', 'slurmredolftrgttemp', 0); select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos, @prio;
slurmdbd: error: mysql_query failed: 14 Can't change size of file (Errcode: 1140912112 "Unknown error 1140912112")
UPDATE "slurmredolftrgttemp_assoc_table" SET rgt = rgt+2 WHERE rgt > 2 && deleted < 2;
UPDATE "slurmredolftrgttemp_assoc_table" SET lft = lft+2 WHERE lft > 2 && deleted < 2;
UPDATE "slurmredolftrgttemp_assoc_table" SET deleted = 0 WHERE deleted = 2;
slurmdbd: error: Couldn't do update

What now?

- Martin
Sorry, I should have checked first: the db partition filled up:

/dev/mapper/data-lv1 146749440 146688684 60756 100% /mariadb

Thus, I guess this can't be fixed, correct?
Martin, You need to either move the database to a location with more space, add space, or free up some space. Transactions are expected to fail if there is no space left to write them. -Scott
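For anyone who wants to verify free space before starting a long rebuild like this one, a minimal sketch in Python follows. It uses only the standard library's shutil.disk_usage. The helper names has_room and describe are hypothetical, and the path and required size are assumptions you would replace with your own database directory (for example /mariadb) and your own estimate.

```python
import shutil
import sys


def has_room(path, required_bytes):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` free (hypothetical helper, not part of Slurm)."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_bytes


def describe(path):
    """Summarize free space and percent used for the filesystem at `path`."""
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized pseudo filesystem.
    pct = 100 * usage.used / usage.total if usage.total else 0.0
    return f"{path}: {usage.free // 1024} KiB free, {pct:.0f}% used"


if __name__ == "__main__":
    # Pass the database directory as the first argument; defaults to ".".
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(describe(target))
    # Assumed threshold for illustration only: 1 GiB.
    if not has_room(target, 1024 ** 3):
        print("warning: less than 1 GiB free, a long rebuild may fail")
```

This only checks the state at one point in time; a rebuild that runs for hours can still fill the partition later, so it may be worth repeating the check while the job runs.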
Martin, Did that work? Are you still experiencing issues with this? -Scott
Hi Scott, we estimate that we need about three times as much space as we have on the machine. We can roughly double the space, but that would not be enough. We also estimate that "slurmdbd -Dvvvvvv -R cedar" will run for about three days, i.e., it will require a massive downtime that we need to schedule. Besides that, you can close the ticket. And a warning to everybody who runs into the "Can't find parent id 1234 for assoc 98765" problem: this is a huge problem that unfortunately does happen. - Martin
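To put rough numbers on that estimate, here is a small back-of-the-envelope sketch in Python. It assumes the df line quoted earlier in this ticket reports 1K blocks (the usual df default on Linux) and that "three times as much space" refers to the size of the /mariadb partition; neither is confirmed in the thread.

```python
KIB = 1024
GIB = 1024 ** 3

# Total size of /mariadb taken from the df line in this ticket,
# assuming the figure is in 1K blocks.
total_kib = 146749440


def gib(kib):
    """Convert a size in KiB to GiB."""
    return kib * KIB / GIB


current = gib(total_kib)      # what the partition has today
doubled = 2 * current         # what doubling the space would give
needed = 3 * current          # the "about three times" estimate
shortfall = needed - doubled  # what would still be missing

print(f"current:   {current:.1f} GiB")
print(f"doubled:   {doubled:.1f} GiB")
print(f"needed:    {needed:.1f} GiB")
print(f"shortfall: {shortfall:.1f} GiB")
```

Under these assumptions the gap left after doubling is about the size of the current partition, which is why doubling alone would not be enough.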
Martin, Ok, good luck with the change. -Scott