Ticket 16968

Summary: Which MySQL versions does the latest Slurm (23.02) support?
Product: Slurm Reporter: Agathees <durairaa>
Component: Database Assignee: Tim McMullan <mcmullan>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 2 - High Impact    
Priority: --- CC: mcmullan
Version: 23.02.2   
Hardware: Linux   
OS: Linux   
Site: Genentech (Roche) Alineos Sites: ---

Description Agathees 2023-06-14 12:26:55 MDT
May I know whether Slurm 23.02 supports MySQL 8?

[2023-06-14T17:49:53.169] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 8.0.23
[2023-06-14T17:49:53.177] error: Database settings not recommended values: innodb_lock_wait_timeout
[2023-06-14T17:49:53.363] accounting_storage/as_mysql: init: Accounting storage MYSQL plugin failed
[2023-06-14T17:49:53.366] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2023-06-14T17:49:53.366] error: cannot create accounting_storage context for accounting_storage/mysql
[2023-06-14T17:49:53.366] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
Comment 1 Ben Roberts 2023-06-14 12:39:36 MDT
Hi Agathees,

MySQL 8.0 is a modern, currently supported version and should be ok.
https://slurm.schedmd.com/platforms.html#database

There is an error message about the innodb_lock_wait_timeout.  Do you have the recommended settings configured in your my.cnf file?
https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
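
For reference, a minimal Python sketch of how the database settings could be compared against the recommended values before starting slurmdbd. The thresholds below are assumptions recalled from the accounting guide linked above and should be verified against that page; `below_recommended` is a hypothetical helper, and the input dict stands in for the output of `SHOW VARIABLES` on the MySQL server.

```python
# Recommended values (ASSUMPTION: recalled from the Slurm accounting guide;
# check the linked documentation for the values matching your release).
RECOMMENDED = {
    "innodb_buffer_pool_size": 4096 * 1024**2,   # 4096M
    "innodb_log_file_size": 64 * 1024**2,        # 64M
    "innodb_lock_wait_timeout": 900,             # seconds
    "max_allowed_packet": 16 * 1024**2,          # 16M
}


def below_recommended(variables):
    """Return the names of settings whose current value is below the recommendation.

    `variables` maps MySQL variable names to their current values
    (strings or ints), e.g. collected from `SHOW VARIABLES`.
    Missing variables are treated as 0 and therefore reported.
    """
    problems = []
    for name, minimum in RECOMMENDED.items():
        current = int(variables.get(name, 0))
        if current < minimum:
            problems.append(name)
    return problems


# Example: everything is set except the lock wait timeout,
# which is left at the MySQL default of 50 seconds.
current = {
    "innodb_buffer_pool_size": "4294967296",
    "innodb_log_file_size": "67108864",
    "innodb_lock_wait_timeout": "50",
    "max_allowed_packet": "16777216",
}
print(below_recommended(current))  # ['innodb_lock_wait_timeout']
```

Any setting reported here would be raised in the `[mysqld]` section of my.cnf, followed by a restart of the MySQL server.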
Comment 2 Ben Roberts 2023-06-14 14:41:52 MDT
Are you still having problems getting slurmdbd running?  Did you make the recommended configuration changes, and did that have an effect on things?  I haven't heard any follow-up questions, so I'll lower the severity of this ticket for now.  Let us know if you're still having problems.
Comment 3 Agathees 2023-06-15 09:18:00 MDT
The innodb_lock_wait_timeout issue has been fixed, thanks! However, I am still facing the accounting_storage/as_mysql plugin issue. The output of slurmdbd.log is pasted below.

[2023-06-14T17:49:53.363] accounting_storage/as_mysql: init: Accounting storage MYSQL plugin failed
[2023-06-14T17:49:53.366] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2023-06-14T17:49:53.366] error: cannot create accounting_storage context for accounting_storage/mysql
[2023-06-14T17:49:53.366] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
[2023-06-15T15:13:16.372] pidfile not locked, assuming no running daemon
[2023-06-15T15:13:16.390] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 8.0.23
[2023-06-15T15:13:16.438] accounting_storage/as_mysql: init: Accounting storage MYSQL plugin failed
[2023-06-15T15:13:16.442] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2023-06-15T15:13:16.442] error: cannot create accounting_storage context for accounting_storage/mysql
[2023-06-15T15:13:16.442] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
Comment 4 Tim McMullan 2023-06-16 07:15:28 MDT
Can you please attach your slurmdbd.conf file (please sanitize the password field)?

Could you also run this again, but setting "DebugLevel=debug4" in the slurmdbd.conf and attach the full log?

Thanks,
--Tim
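
A minimal Python sketch of one way to sanitize the password before attaching the file, assuming the password is stored under the `StoragePass` key of slurmdbd.conf. `sanitize_conf` is a hypothetical helper, not part of Slurm.

```python
import re

# Matches a whole "StoragePass=..." line. Only spaces and tabs are allowed
# around the key and "=", so an empty value never swallows the next line.
_PASSWORD_LINE = re.compile(
    r"^([ \t]*StoragePass[ \t]*=[ \t]*).*$",
    re.IGNORECASE | re.MULTILINE,
)


def sanitize_conf(text):
    """Return the config text with the StoragePass value replaced by a placeholder."""
    return _PASSWORD_LINE.sub(r"\1<redacted>", text)


# Example with made-up values; DebugLevel=debug4 is the setting requested above.
conf = "DbdHost=localhost\nStoragePass=secret\nDebugLevel=debug4\n"
print(sanitize_conf(conf))
```

All other lines, including `DebugLevel=debug4`, pass through unchanged, so the attached file still reflects the running configuration.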
Comment 6 Tim McMullan 2023-06-16 16:06:05 MDT
Reducing severity until we can get more information.
Comment 7 Tim McMullan 2023-06-20 05:54:51 MDT
Hi,

I wanted to check in and see if you could get the logs I requested!

Thanks,
--Tim
Comment 8 Tim McMullan 2023-06-23 11:57:38 MDT
I've been unable to reproduce this issue locally and will require further input to be able to resolve this.

Since we haven't heard from you in a while, I'm going to time this out.  If the issue persists and you can upload the requested logs, please let us know and we will continue to troubleshoot this!

Thanks,
--Tim
Comment 9 Agathees 2023-08-24 03:29:37 MDT
Hi Team,

I am getting the following DNS record error from the slurmctld service.

● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2023-08-24 09:16:45 UTC; 2s ago
    Process: 1552600 ExecStart=/shared/slurm_SLURM-MASTER-USW2-HPC-SB/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 1552600 (code=exited, status=1/FAILURE)

Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: No node state file (/var/spool/slurm/ctld/node_state.old) to recover
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: error: Could not open job state file /var/spool/slurm/ctld/job_state: No such file or directory
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: No job state file (/var/spool/slurm/ctld/job_state.old) to recover
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: error: _find_node_record: lookup failure for node "dphimgh138-usw2"
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: build_part_bitmap: invalid node name dphimgh138-usw2 in partition
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: fatal: Invalid node names in partition C-72Cpu-139GB
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com systemd[1]: slurmctld.service: Failed with result 'exit-code'.


However, I am able to look up the DNS records from the instance, and I cannot work out what the issue is. The lookup results are below. Please help fix the issue.

root@dphimgh137-usw2:/shared/slurm_SLURM-MASTER-USW2-HPC-SB/etc# nslookup dphimgh138-usw2
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
Name:   dphimgh138-usw2.aws.science.roche.com
Address: 10.158.70.138

root@dphimgh137-usw2:/shared/slurm_SLURM-MASTER-USW2-HPC-SB/etc# nslookup 10.158.70.138
138.70.158.10.in-addr.arpa      name = dphimgh138-usw2.aws.science.roche.com.

Authoritative answers can be found from:

Thanks,
Agathees
Comment 10 Tim McMullan 2023-08-24 05:13:49 MDT
Can you please open a new ticket for this problem?  This does not appear to be related to the issue initially stated in this ticket.

In the new ticket, please also include the full slurmctld log file and your slurm.conf.

Thanks,
--Tim
Comment 11 Tim McMullan 2023-08-28 09:21:52 MDT
I'm going to resolve this ticket again since it's unrelated to the new issue.

If you continue to see the new issue, please open a new ticket with updated information, the Slurm configuration files, and the full slurmctld log file, and we will help get it resolved!

Thanks,
--Tim
Comment 12 Agathees 2023-09-04 04:05:22 MDT
Please close this ticket. Thanks
Comment 13 Tim McMullan 2023-09-05 06:25:23 MDT
Closing this now.