| Summary: | what version of mysql does slurm support to latest slurm 23.02 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Agathees <durairaa> |
| Component: | Database | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | CC: | mcmullan |
| Version: | 23.02.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Genentech (Roche) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |

Description
Agathees
2023-06-14 12:26:55 MDT
Hi Agathees, MySQL 8.0 is a modern, currently supported version and should be ok: https://slurm.schedmd.com/platforms.html#database. There is an error message about innodb_lock_wait_timeout. Do you have the recommended settings configured in your my.cnf file? https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build

Are you still having problems getting slurmdbd running? Did you make the recommended configuration changes, and did that have an effect? I haven't heard any follow-up questions, so I'll lower the severity of this ticket for now. Let us know if you're still having problems.

The innodb_lock_wait_timeout issue has been fixed. Thanks! But I am still facing the accounting_storage/as_mysql plugin issue. The slurmdbd.log output is pasted below:

[2023-06-14T17:49:53.363] accounting_storage/as_mysql: init: Accounting storage MYSQL plugin failed
[2023-06-14T17:49:53.366] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2023-06-14T17:49:53.366] error: cannot create accounting_storage context for accounting_storage/mysql
[2023-06-14T17:49:53.366] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
[2023-06-15T15:13:16.372] pidfile not locked, assuming no running daemon
[2023-06-15T15:13:16.390] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 8.0.23
[2023-06-15T15:13:16.438] accounting_storage/as_mysql: init: Accounting storage MYSQL plugin failed
[2023-06-15T15:13:16.442] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2023-06-15T15:13:16.442] error: cannot create accounting_storage context for accounting_storage/mysql
[2023-06-15T15:13:16.442] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin

Can you please attach your slurmdbd.conf file (please sanitize the password field)? Could you also run this again with "DebugLevel=debug4" set in slurmdbd.conf and attach the full log? Thanks, --Tim

Reducing severity until we can get more information.

Hi, I wanted to check in and see if you could get the logs I requested! Thanks, --Tim

I've been unable to reproduce this issue locally and will require further input to resolve it. Since we haven't heard from you in a while, I'm going to time this out. If the issue persists and you can upload the requested logs, please let us know and we will continue to troubleshoot this! Thanks, --Tim
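For reference, the accounting guide linked above configures InnoDB through my.cnf before slurmdbd is started; a minimal sketch along those lines, where the buffer and log sizes are illustrative and should be tuned to the memory available on the database host:

```
# my.cnf ([mysqld] section) -- InnoDB settings along the lines of the
# Slurm accounting guide; sizes here are illustrative, not prescriptive.
[mysqld]
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
```

The extra verbosity requested above is a single line in slurmdbd.conf; slurmdbd needs to be restarted for it to take effect:

```
# slurmdbd.conf -- raise logging verbosity while reproducing the failure.
DebugLevel=debug4
```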
Hi Team,

I am getting the following DNS record error on the slurmctld service:
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2023-08-24 09:16:45 UTC; 2s ago
Process: 1552600 ExecStart=/shared/slurm_SLURM-MASTER-USW2-HPC-SB/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 1552600 (code=exited, status=1/FAILURE)
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: No node state file (/var/spool/slurm/ctld/node_state.old) to recover
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: error: Could not open job state file /var/spool/slurm/ctld/job_state: No such file or directory
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: No job state file (/var/spool/slurm/ctld/job_state.old) to recover
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: error: _find_node_record: lookup failure for node "dphimgh138-usw2"
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: build_part_bitmap: invalid node name dphimgh138-usw2 in partition
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com slurmctld[1552600]: slurmctld: fatal: Invalid node names in partition C-72Cpu-139GB
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Aug 24 09:16:45 dphimgh137-usw2.aws.science.roche.com systemd[1]: slurmctld.service: Failed with result 'exit-code'.
But I am able to look up the DNS records from the instance, so I cannot find what the issue is. Please find the lookup output below. Please help to fix the issue.
root@dphimgh137-usw2:/shared/slurm_SLURM-MASTER-USW2-HPC-SB/etc# nslookup dphimgh138-usw2
Server: 127.0.0.53
Address: 127.0.0.53#53
Non-authoritative answer:
Name: dphimgh138-usw2.aws.science.roche.com
Address: 10.158.70.138
root@dphimgh137-usw2:/shared/slurm_SLURM-MASTER-USW2-HPC-SB/etc# nslookup 10.158.70.138
138.70.158.10.in-addr.arpa name = dphimgh138-usw2.aws.science.roche.com.
Authoritative answers can be found from:
Thanks,
Agathees
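As background on the fatal error in the log above: slurmctld raises "Invalid node names in partition" when a name listed on a PartitionName= line has no matching NodeName= definition in slurm.conf; this check is against the Slurm configuration, not DNS, so a working nslookup does not rule it out. A hypothetical slurm.conf excerpt showing the pairing (the CPU, memory, and state values are placeholders, not taken from this system):

```
# Every node referenced by a partition must also appear in a NodeName= line.
NodeName=dphimgh138-usw2 CPUs=72 RealMemory=142336 State=UNKNOWN
PartitionName=C-72Cpu-139GB Nodes=dphimgh138-usw2 MaxTime=INFINITE State=UP
```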
Can you please open a new ticket for this problem? This does not appear to be related to the issue initially stated in this ticket. In the new ticket, please also include the full slurmctld log file and your slurm.conf. Thanks, --Tim

I'm going to resolve this ticket again since it's unrelated to the new issue. If you continue to see the new issue, please open a new ticket with updated information, the Slurm configuration files, and the full slurmctld log file, and we will help get it resolved! Thanks, --Tim

Please close this ticket. Thanks.

Closing this now.