| Summary: | Upgrading the Slurm system from 14.11.6 to 15.08.12 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Hjalti Sveinsson <hjalti.sveinsson> |
| Component: | Other | Assignee: | Danny Auble <da> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | kolbeinn.josepsson |
| Version: | 16.05.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | deCODE | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurmdbd log file | ||
Looks generally correct. If possible, I'd suggest getting your slurmdbd server up and running ahead of time. You'll also need to add the cluster to slurmdbd with sacctmgr before you start slurmctld.

I'd also strongly encourage you to use 16.05 rather than 15.08; the 16.05.5 release is due out this afternoon and includes some important changes to how cgroups behave with RHEL7. Mainly, you will no longer need the ReleaseAgent setting; systemd has occasionally removed slurmd's release_agent mount option, which can then leave a lot of stray entries under the cpuset and device cgroup controllers. This is resolved by patches that are in 16.05.5 and above.

Thank you for your prompt response.

If I do use 16.05.5, won't I lose jobs or other state information, since I am upgrading from 14.11.6?

I will see if I can find new hardware that I can install the OS on and install slurmdbd on beforehand.

How do I add the cluster to slurmdbd with sacctmgr? Can I find some documentation regarding that?

The slurmdbd uses the same slurm database as slurmctld, right? So I would then restore my current slurm database onto the new server, start slurmdbd on that machine, and then add the cluster with sacctmgr?

regards,
Hjalti Sveinsson

(In reply to Hjalti Sveinsson from comment #2)
> Thank you for your prompt response.
>
> If I do use 16.05.5, won't I lose jobs or other state information, since I
> am upgrading from 14.11.6?

No. Each release can be upgraded to from the two prior versions, so 14.11 is fine. (Releases were 14.11, 15.08, 16.05.)

> I will see if I can find new hardware that I can install the OS on and
> install slurmdbd on beforehand.
>
> How do I add the cluster to slurmdbd with sacctmgr? Can I find some
> documentation regarding that?

'sacctmgr add cluster foo'

Please see the 'Database Configuration' section of http://slurm.schedmd.com/accounting.html; I'd suggest reviewing that before your upgrade.

> The slurmdbd uses the same slurm database as slurmctld, right? So I would
> then restore my current slurm database onto the new server, start slurmdbd
> on that machine, and then add the cluster with sacctmgr?

I will need to check on this, I have not done that conversion myself.

Okay, thank you. Please let me know when you have the information regarding the database for SlurmDBD, whether it can use the existing Slurm database we have or whether we need to create a new one. If we need to create a new database, will we use that one for all Slurm accounting information and stop using the old one?

kveðja/regards
Hjalti
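For reference, a minimal sketch of the registration step suggested above, assuming slurmdbd is already installed and configured on the new head node and the ClusterName in slurm.conf is 'lphc' (the service invocation is illustrative):

# Start the accounting daemon before the controller.
systemctl start slurmdbd

# Register the cluster under the exact ClusterName from slurm.conf;
# -i commits immediately instead of prompting.
sacctmgr -i add cluster lphc

# Confirm the registration before starting slurmctld.
sacctmgr list clusters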
Did you check if the DB conversion is possible, going from AccountingStorageType=accounting_storage/mysql to AccountingStorageType=accounting_storage/slurmdbd and using the existing MySQL database?

Will that work, or are we going to lose everything and start from scratch if we go over to SlurmDBD?

regards,
Hjalti Sveinsson

(In reply to Hjalti Sveinsson from comment #6)
> Did you check if the DB conversion is possible, going from
> AccountingStorageType=accounting_storage/mysql to
> AccountingStorageType=accounting_storage/slurmdbd and using the existing
> MySQL database?
>
> Will that work, or are we going to lose everything and start from scratch
> if we go over to SlurmDBD?

Yes, your existing data should come across without problems. Sorry for the delay, I just wanted to confirm this before updating you.

Is there anything else I can address?
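A hedged sketch of what the plugin switch discussed above looks like on the configuration side, assuming slurmdbd runs on the head node and keeps pointing at the existing database (the host name is illustrative; 'slurm_acc' is the database name that appears later in this thread):

# slurm.conf (slurmctld):
#   old: AccountingStorageType=accounting_storage/mysql   (slurmctld writes to MySQL directly)
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=lhpc-head

# slurmdbd.conf (on the slurmdbd host), reusing the existing database:
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageLoc=slurm_acc
StorageUser=slurm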
Hi,

I did the upgrade from 14.11.6 to 16.05.5 like you suggested. Everything worked okay except for one problem that came up. When running the sacctmgr command to add the cluster, I used the cluster name from the slurm.conf file.

ClusterName in slurm.conf was lphc, so I typed in:

'sacctmgr add cluster lphc'

That resulted in errors and I was unable to use this name. So I decided to use a different cluster name:

'sacctmgr add cluster lhpc'

That worked, but now we do not have any history when we run the sacct command. The only way for us to get old information is to go to the old server, start the slurm service there, and run the command there.

But we would of course like to be able to run the sacct command on our upgraded system and get all the old information as well.

Is there any way for us to run a command on the database that updates the records so the correct cluster name is set, e.g.

'UPDATE table_name SET column1=value1 WHERE some_column=some_value;'

Please let me know if we can somehow fix this.

regards,
Hjalti Sveinsson

(In reply to Hjalti Sveinsson from comment #8)
> Hi,
>
> I did the upgrade from 14.11.6 to 16.05.5 like you suggested. Everything
> worked okay except for one problem that came up. When running the sacctmgr
> command to add the cluster, I used the cluster name from the slurm.conf file.
>
> ClusterName in slurm.conf was lphc, so I typed in:
>
> 'sacctmgr add cluster lphc'
>
> That resulted in errors and I was unable to use this name. So I decided to
> use a different cluster name:
>
> 'sacctmgr add cluster lhpc'
>
> That worked, but now we do not have any history when we run the sacct
> command. The only way for us to get old information is to go to the old
> server, start the slurm service there, and run the command there.
>
> But we would of course like to be able to run the sacct command on our
> upgraded system and get all the old information as well.
>
> Is there any way for us to run a command on the database that updates the
> records so the correct cluster name is set, e.g.
>
> 'UPDATE table_name SET column1=value1 WHERE some_column=some_value;'
>
> Please let me know if we can somehow fix this.

I'm going to look into what may have failed when trying to use the original cluster name, although it sounds like you're past that being an active issue.

If you can live with the 'old' data being split in the database, you can use the -M flag with sacct to query the old cluster's data:

sacct -M lphc <rest of options>

Note that you do not need to have slurmctld running under that old cluster name - sacct queries slurmdbd directly, and you only need the single instance of that.

Recombining the records into a single cluster is going to be difficult. Job records are stored in independent tables ($clustername_job_table), and each table has its own auto-incrementing primary key (job_db_inx) that would need to be remapped to prevent collisions between the old data and the new. That key is also referenced on the controller. It's theoretically possible, but not something I've tried before, and you'd likely need to remove all pending and running jobs from the database to avoid potential accounting data corruption.
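A usage sketch of the -M approach described above, assuming the old records live under the cluster name 'lphc' and the new ones under 'lhpc' (the date range and format fields are only examples):

# Query only the old cluster's records:
sacct -M lphc -a -S 2015-01-01 -E 2016-10-01

# -M/--clusters also accepts a comma-separated list, so old and new history
# can be reported in one call:
sacct -M lphc,lhpc -a -S 2016-01-01 --format=JobID,Cluster,User,State,Elapsed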
Hi, thanks for your response.

When I type in "sacct -M lphc" I get this error:
[root@ru-lhpc-head ~]# sacct -M lphc
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
sacct: error: Unknown error 1054
If we can get that working somehow, it would be great; then we don't need to change the cluster name on all the database records.

However, we saw that the JobIDs kept on incrementing and did not go back to zero and start again.
regards,
Hjalti Sveinsson
(In reply to Hjalti Sveinsson from comment #10)
> Hi, thanks for your response.
>
> When I type in "sacct -M lphc" I get this error:
>
> [root@ru-lhpc-head ~]# sacct -M lphc
> JobID JobName Partition Account AllocCPUS State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> sacct: error: Unknown error 1054

Is there anything in the slurmdbd.log corresponding to that query? Unfortunately I think this is going to require some direct intervention in MySQL to sort out.

Can you run a few queries?

show tables;
select * from cluster_table;

I suspect that lphc is missing from cluster_table, but that its tables should all be in place.

> If we can get that working somehow, it would be great; then we don't need
> to change the cluster name on all the database records.
>
> However, we saw that the JobIDs kept on incrementing and did not go back to
> zero and start again.

That's normal - JobIDs come from slurmctld, and aren't directly linked to the job_db_inx values which the MySQL database auto-generates.

Hi Tim,

following is the error from slurmdbd.log:

[2016-10-13T15:19:39.084] error: mysql_query failed: 1054 Unknown column 't1.req_cpufreq_min' in 'field list'
select t1.id_step, t1.time_start, t1.time_end, t1.time_suspended, t1.step_name, t1.nodelist, t1.node_inx, t1.state, t1.kill_requid, t1.exit_code, t1.nodes_alloc, t1.task_cnt, t1.task_dist, t1.user_sec, t1.user_usec, t1.sys_sec, t1.sys_usec, t1.max_disk_read, t1.max_disk_read_task, t1.max_disk_read_node, t1.ave_disk_read, t1.max_disk_write, t1.max_disk_write_task, t1.max_disk_write_node, t1.ave_disk_write, t1.max_vsize, t1.max_vsize_task, t1.max_vsize_node, t1.ave_vsize, t1.max_rss, t1.max_rss_task, t1.max_rss_node, t1.ave_rss, t1.max_pages, t1.max_pages_task, t1.max_pages_node, t1.ave_pages, t1.min_cpu, t1.min_cpu_task, t1.min_cpu_node, t1.ave_cpu, t1.act_cpufreq, t1.consumed_energy, t1.req_cpufreq_min, t1.req_cpufreq, t1.req_cpufreq_gov, t1.tres_alloc from "lphc_step_table" as t1 where t1.job_db_inx=1336592
[2016-10-13T15:19:39.084] error: Problem getting jobs for cluster lphc
[2016-10-13T15:19:39.084] error: Processing last message from connection 8(172.17.147.210) uid(0)

Here are the results from the mysql queries:

MariaDB [slurm_acc]> show tables;
+------------------------------+
| Tables_in_slurm_acc          |
+------------------------------+
| acct_coord_table             |
| acct_table                   |
| clus_res_table               |
| cluster_table                |
| lhpc_assoc_table             |
| lhpc_assoc_usage_day_table   |
| lhpc_assoc_usage_hour_table  |
| lhpc_assoc_usage_month_table |
| lhpc_event_table             |
| lhpc_job_table               |
| lhpc_last_ran_table          |
| lhpc_resv_table              |
| lhpc_step_table              |
| lhpc_suspend_table           |
| lhpc_usage_day_table         |
| lhpc_usage_hour_table        |
| lhpc_usage_month_table       |
| lhpc_wckey_table             |
| lhpc_wckey_usage_day_table   |
| lhpc_wckey_usage_hour_table  |
| lhpc_wckey_usage_month_table |
| lphc_assoc_table             |
| lphc_assoc_usage_day_table   |
| lphc_assoc_usage_hour_table  |
| lphc_assoc_usage_month_table |
| lphc_event_table             |
| lphc_job_table               |
| lphc_last_ran_table          |
| lphc_resv_table              |
| lphc_step_table              |
| lphc_suspend_table           |
| lphc_usage_day_table         |
| lphc_usage_hour_table        |
| lphc_usage_month_table       |
| lphc_wckey_table             |
| lphc_wckey_usage_day_table   |
| lphc_wckey_usage_hour_table  |
| lphc_wckey_usage_month_table |
| qos_table                    |
| res_table                    |
| table_defs_table             |
| tres_table                   |
| txn_table                    |
| user_table                   |
+------------------------------+

MariaDB [slurm_acc]> select * from cluster_table;
+---------------+------------+---------+------+----------------+--------------+-----------+-------------+----------------+------------+------------------+-------+
| creation_time | mod_time   | deleted | name | control_host   | control_port | last_port | rpc_version | classification | dimensions | plugin_id_select | flags |
+---------------+------------+---------+------+----------------+--------------+-----------+-------------+----------------+------------+------------------+-------+
| 1475531478    | 1475531594 |       0 | lhpc | 172.17.147.210 |         6817 |      6817 |        7680 |              0 |          1 |              101 |     0 |
+---------------+------------+---------+------+----------------+--------------+-----------+-------------+----------------+------------+------------------+-------+

Rgds,
Kolbeinn

Seems like this case has lost attention?

It looks like the lphc tables haven't been automatically converted to the latest format, which is then causing the MySQL queries to fail. I believe slurmdbd hasn't converted the tables because that cluster isn't included in the cluster list.

Can you try to add the 'lphc' cluster again through sacctmgr? I'm assuming that will fail again, but if you can attach the slurmdbd.log when you do, it'd help isolate the main issue.

[root@ru-lhpc-head ~]# sacctmgr add cluster lphc
 Adding Cluster(s)
  Name = lphc
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
Database is busy or waiting for lock from other user.

There was some output missing:

[root@ru-lhpc-head ~]# sacctmgr add cluster lphc
 Adding Cluster(s)
  Name = lphc
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
Database is busy or waiting for lock from other user.
sacctmgr: slurmdbd: reopening connection
sacctmgr: error: slurmdbd: Getting response to message type 1405
Problem adding clusters: Unspecified error

I have attached the slurmdbd.log as well.

regards,
Hjalti Sveinsson

Created attachment 3626 [details]
slurmdbd log file
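Given the 'Unknown column t1.req_cpufreq_min' error earlier in the thread, a hedged way to check from the MySQL side whether the old cluster's tables have been converted yet (database and table names are taken from the output above):

# Does the newer step-table column exist in the old cluster's tables yet?
mysql slurm_acc -e "SHOW COLUMNS FROM lphc_step_table LIKE 'req_cpufreq_min';"

# Is the old cluster name registered at all?
mysql slurm_acc -e "SELECT name, deleted FROM cluster_table;"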
Any update on this? It would be great to get this resolved.

regards,
Hjalti Sveinsson

Hi Hjalti,

I think the problem here is the upgrade combined with the switch from writing to MySQL directly to using the dbd. It looks like this could have been avoided by adding the cluster to the database beforehand; that registration isn't required when slurmctld writes directly to the database, but it is when going through slurmdbd.

It looks like after you added the cluster things started working though:

[2016-10-21T09:54:30.434] adding column req_cpufreq_min after consumed_energy in table "lphc_step_table"

Based on the logs I would expect lphc to now be doing the correct thing. Is that the case? If it isn't, could you send the output of 'sacctmgr list clusters' again, or the output from the direct mysql query 'select * from cluster_table;'?

It looks like the reason your 'sacctmgr add cluster lphc' failed is that the query took too long because of the table upgrades. Based on the logs it took almost an hour:

[2016-10-21T09:15:55.003] dropping key sacct from table "lphc_job_table"
[2016-10-21T10:08:10.786] dropping column consumed_energy from table "lphc_wckey_usage_month_table"

We can update the documentation to point out that the cluster should be added before switching plugins. I think that would have avoided the problem.

Let us know if there is something outstanding on this.

Documentation has been updated in commit df00db73d.

Hjalti, is there anything else you need on this?

Please reopen if you have anything else needed on this.
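A quick verification sketch matching the request above, once the table conversion has finished (output will of course differ on your system; the sacct date range is illustrative):

# Both clusters should now be listed:
sacctmgr list clusters

# Direct check against the database:
mysql slurm_acc -e "SELECT * FROM cluster_table;"

# And the original goal - old history reachable from the upgraded system:
sacct -M lphc -a -S 2016-01-01 -E 2016-10-01 | head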
Hi, we are planning on upgrading from 14.11.6 to 15.08.12 next Monday and upgrading the OS on the machine from RHEL6 to RHEL7. We want to add SlurmDBD as well during the upgrade, so I have made an action plan that I wanted to ask if it is correct. Here are the actions I am planning to take:

1. Create rpm packages from the slurm-15.08.12 source tarball on a RHEL7 machine (make sure to include all development libraries).
2. Shut down slurm services on lhpc-head (head node).
3. Recursively copy /etc/slurm/* to an NFS mount.
4. Copy /etc/munge/munge.key to an NFS mount.
5. Back up the mysql slurm_acc database to an NFS mount using mysqldump.
6. Recursively copy /var/lib/slurm/* to an NFS mount.
7. Recursively copy /var/log/slurm/* to an NFS mount.
8. Deploy the RHEL7 template on the Red Hat Satellite server for the lhpc-head server.
9. Change lhpc-head's network to the PXE network.
10. Shut down lhpc-head.
11. Move lhpc-head from chassis 18 to chassis 3.
12. Power on lhpc-head and let the RHEL7 OS installation run on the server.
13. Install the slurm-15.08.12 rpm packages on lhpc-head after the OS installation completes (slurm, slurm-plugins, slurm-munge, slurm-slurmdbd, slurm-sql, slurm-perlapi, slurm-sjstat).
14. Recursively copy /etc/slurm/* from the NFS mount point to lhpc-head:/etc/slurm/.
15. Copy /etc/munge/munge.key from the NFS mount point to lhpc-head:/etc/munge/munge.key.
16. Create the group slurm with id 598.
17. Create the user slurm with id 598.
18. Create the directories /var/{lib,log}/slurm/.
19. Change the owner:group on these directories to slurm:slurm.
20. Recursively copy /var/lib/slurm/* from the NFS mount point to lhpc-head:/var/lib/slurm/.
21. Recursively copy /var/log/slurm/* from the NFS mount point to lhpc-head:/var/log/slurm/.
22. Restore the slurm_acc database onto the lhpc-head server from the NFS mount point.
23. Change slurm.conf so it uses the SlurmDBD service?
24. Start the slurm services.
25. Check if everything works.

If something is not correct here, can you please provide me with information on what I need to change in my action plan?

regards,
Hjalti Sveinsson
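A hedged sketch of the backup portion of the plan above (steps 2-7), assuming the NFS mount is at /mnt/nfs/slurm-upgrade and MySQL credentials are available; the mount point, dump file name, and init script name are only examples:

# Step 2: stop Slurm on the head node (RHEL6 init script name is an assumption).
service slurm stop

BACKUP=/mnt/nfs/slurm-upgrade      # assumed NFS mount point
mkdir -p "$BACKUP"

# Steps 3-4, 6-7: configuration, munge key, state directory, logs.
cp -a /etc/slurm           "$BACKUP/etc-slurm"
cp -p /etc/munge/munge.key "$BACKUP/munge.key"
cp -a /var/lib/slurm       "$BACKUP/var-lib-slurm"
cp -a /var/log/slurm       "$BACKUP/var-log-slurm"

# Step 5: dump the accounting database.
mysqldump -u root -p slurm_acc > "$BACKUP/slurm_acc.sql"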