Ticket 5265 - sacct does not work after upgrade from 17.2.9 to 17.11.7
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 17.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-06-06 06:50 MDT by Nikolas Luke
Modified: 2018-06-08 09:33 MDT

See Also:
Site: Hessen


Attachments
slurmdbd.log (2.22 KB, text/x-log)
2018-06-07 06:36 MDT, Nikolas Luke
slurmConfigure.out from rpmbuild (1.30 MB, text/plain)
2018-06-07 08:07 MDT, Nikolas Luke
slurmConfigure.err from rpmbuild (21.56 KB, text/plain)
2018-06-07 08:07 MDT, Nikolas Luke

Description Nikolas Luke 2018-06-06 06:50:42 MDT
After upgrading from Slurm 17.2.9 to 17.11.7, the sacct command does not work. I have lost the spool directory data, so the job numbers were reset to 1, 2, 3. That could be the problem, because data for these job numbers is already in the database. I know the last job number before the update. Here is the error for every user I tried:


sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
sacct: error: slurmdbd: Operation not permitted
Comment 1 Felip Moll 2018-06-06 12:36:49 MDT
Hi Nikolas,

While losing the Slurm state directory can lead to several issues, it has nothing to do with sacct.

A Job ID can be reused; this happens, for example, when a job is requeued. The database keys jobs on an internal index (job_inx), which is used internally.

I think your problem is more in the access permissions:

sacct: error: slurmdbd: Operation not permitted

Can you send slurmdbd logs? Have you modified the database and user grants?
Is your slurmdbd.conf up-to-date with any possible user/pass/db change?
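For reference, these are the slurmdbd.conf settings that control database access. A minimal illustrative fragment (every value below is a placeholder, not taken from this ticket):

```
# /etc/slurm/slurmdbd.conf -- illustrative fragment; all values are placeholders
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
LogFile=/var/log/slurm/slurmdbd.log
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=CHANGE_ME
StorageLoc=slurm_acct_db
```

If any of the Storage* values no longer match the MySQL user grants, slurmdbd will fail to connect to the database.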
Comment 2 Nikolas Luke 2018-06-07 05:46:19 MDT
slurmdbd is now down after a "systemctl restart slurmdbd".

slurmdbd.log:
[2018-06-04T14:20:37.621] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
[2018-06-07T13:26:17.768] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
Comment 3 Felip Moll 2018-06-07 06:20:37 MDT
(In reply to Nikolas Luke from comment #2)
> The slurmdbd is now down after a systemctl restart slurmdbd.
> 
> slurmdbd.log:
> [2018-06-04T14:20:37.621] fatal: Unable to initialize
> accounting_storage/mysql accounting storage plugin
> [2018-06-07T13:26:17.768] fatal: Unable to initialize
> accounting_storage/mysql accounting storage plugin

Is it possible you compiled without the mysql-devel package?
Can you show me your config.log?

And also send me your full slurmdbd log please.
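A quick way to check the configure output for MySQL support is to grep it. The excerpt below is an assumption of what a successful detection looks like; on the real system you would grep slurmConfigure.out instead. A sketch:

```shell
# Sketch: grep the configure output for the MySQL detection lines.
# Assumption: a successful build prints lines like the two in this fabricated
# excerpt; the real check would be `grep -i mysql slurmConfigure.out`.
excerpt=$(mktemp)
cat > "$excerpt" <<'EOF'
checking for mysql_config... /usr/bin/mysql_config
MySQL test program built properly.
EOF
hits=$(grep -ci 'mysql' "$excerpt")   # count lines mentioning mysql
echo "$hits"
rm -f "$excerpt"
```

If the count is 0, or configure reports that mysql_config was not found, Slurm was built without the accounting_storage/mysql plugin.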
Comment 4 Nikolas Luke 2018-06-07 06:36:36 MDT
Created attachment 7016 [details]
slurmdbd.log
Comment 5 Nikolas Luke 2018-06-07 06:43:19 MDT
I built the packages with:

rpmbuild -ta slurm-17.11.7.tar.bz2 >slurmConfigure.out 2>slurmConfigure.err

This results in the files

slurm-17.11.7-1.el7.centos.x86_64.rpm
slurm-contribs-17.11.7-1.el7.centos.x86_64.rpm
slurm-devel-17.11.7-1.el7.centos.x86_64.rpm
slurm-example-configs-17.11.7-1.el7.centos.x86_64.rpm
slurm-libpmi-17.11.7-1.el7.centos.x86_64.rpm
slurm-openlava-17.11.7-1.el7.centos.x86_64.rpm
slurm-pam_slurm-17.11.7-1.el7.centos.x86_64.rpm
slurm-perlapi-17.11.7-1.el7.centos.x86_64.rpm
slurm-slurmctld-17.11.7-1.el7.centos.x86_64.rpm
slurm-slurmd-17.11.7-1.el7.centos.x86_64.rpm
slurm-slurmdbd-17.11.7-1.el7.centos.x86_64.rpm
slurm-torque-17.11.7-1.el7.centos.x86_64.rpm

I was surprised that there was no slurm-sql, slurm-munge, or slurm-plugins package this time.

Which config.log do you mean? slurmctld.log?
Comment 6 Felip Moll 2018-06-07 07:04:32 MDT
(In reply to Nikolas Luke from comment #5)

> I wondered there was no slurm-sql, no slurm-munge and no slurm-plugins
> packet this time.
> 

That's correct. These have been removed.


> I build the packages with:
> 
> rpmbuild -ta slurm-17.11.7.tar.bz2 >slurmConfigure.out 2>slurmConfigure.err
> ...
> Which config.log do you mean? slurmctld.log?

I mean this slurmConfigure.out and .err



Also try running slurmdbd manually as SlurmUser (as defined in slurmdbd.conf), like this:

slurmdbd -Dvvv

And send me the output together with the slurmConfigure.*.
Comment 7 Nikolas Luke 2018-06-07 08:07:10 MDT
Created attachment 7018 [details]
slurmConfigure.out from rpmbuild
Comment 8 Nikolas Luke 2018-06-07 08:07:30 MDT
Created attachment 7019 [details]
slurmConfigure.err from rpmbuild
Comment 9 Nikolas Luke 2018-06-07 08:15:22 MDT
Output of slurmdbd -Dvvv as user "root":

slurmdbd: debug:  Log file re-opened
slurmdbd: debug:  Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: error: mysql_query failed: 1 Can't create/write to file '/tmp/#sql_367_0.MAI' (Errcode: 2)
show columns from convert_version_table
slurmdbd: Accounting storage MYSQL plugin failed
slurmdbd: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
slurmdbd: error: cannot create accounting_storage context for accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin


Output of slurmdbd -Dvvv as user "slurm" (SlurmUser=slurm):

slurmdbd: error: s_p_parse_file: unable to read "/etc/slurm/slurmdbd.conf": Permission denied
slurmdbd: fatal: Could not open/read/parse slurmdbd.conf file /etc/slurm/slurmdbd.conf
Comment 10 Felip Moll 2018-06-07 08:38:07 MDT
(In reply to Nikolas Luke from comment #9)
> Output of slurmdbd -Dvvv as user "root":
> 
> slurmdbd: debug:  Log file re-opened
> slurmdbd: debug:  Munge authentication plugin loaded
> slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
> slurmdbd: error: mysql_query failed: 1 Can't create/write to file
> '/tmp/#sql_367_0.MAI' (Errcode: 2)
> show columns from convert_version_table
> slurmdbd: Accounting storage MYSQL plugin failed
> slurmdbd: error: Couldn't load specified plugin name for
> accounting_storage/mysql: Plugin init() callback failed
> slurmdbd: error: cannot create accounting_storage context for
> accounting_storage/mysql
> slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting
> storage plugin
> 
> 
> Output of slurmdbd -Dvvv as user "slurm" (SlurmUser=slurm):
> 
> slurmdbd: error: s_p_parse_file: unable to read "/etc/slurm/slurmdbd.conf":
> Permission denied
> slurmdbd: fatal: Could not open/read/parse slurmdbd.conf file
> /etc/slurm/slurmdbd.conf


Well, you can see the problems here.

A) It seems the slurm user cannot read /etc/slurm/slurmdbd.conf.

B) Moreover, it seems that /tmp/ is not writable by mysql. This error comes from the mysql server:

> slurmdbd: error: mysql_query failed: 1 Can't create/write to file
> '/tmp/#sql_367_0.MAI' (Errcode: 2)

For example, see:
https://stackoverflow.com/questions/11997012/mysql-cant-create-write-to-file-tmp-sql-3c6-0-myi-errcode-2-what-does


First fix A); if it still doesn't work, fix B). Check with slurmdbd -Dvvv as the slurm user.
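Both fixes amount to restoring file permissions. A safe-to-run sketch of the idea, using scratch paths so it touches nothing real (assumptions: SlurmUser is slurm, the config is /etc/slurm/slurmdbd.conf, and the MariaDB tmpdir is /tmp; the real fix would be chown/chmod on those actual paths):

```shell
# A) Restore owner-only read/write on a stand-in for slurmdbd.conf.
#    (On the real system, assuming SlurmUser=slurm:
#     chown slurm: /etc/slurm/slurmdbd.conf && chmod 600 /etc/slurm/slurmdbd.conf)
conf=$(mktemp)
printf 'StorageUser=slurm\n' > "$conf"
chmod 000 "$conf"            # simulate the lost permissions
chmod 600 "$conf"            # readable/writable by the owner (SlurmUser) only
mode=$(stat -c '%a' "$conf")
echo "A: mode=$mode"
rm -f "$conf"

# B) Confirm the assumed MariaDB tmpdir is writable and searchable.
tmpdir=/tmp
if [ -w "$tmpdir" ] && [ -x "$tmpdir" ]; then tmpok="ok"; else tmpok="broken"; fi
echo "B: tmpdir $tmpok"
```

After both permissions are restored, rerunning slurmdbd -Dvvv as the slurm user should get past the config-read and mysql_query errors.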
Comment 11 Nikolas Luke 2018-06-08 09:20:23 MDT
The slurmdbd.conf had lost its permissions. Now the user slurm can read it, and the right paths are used for the log file instead of the defaults. Good idea to use the "slurmdbd -Dvvv" command when there is nothing in the log. The same goes for the mysql tempdir: it had lost its permissions, too.

slurmdbd is running and sacct is working now.

Many thanks!
Comment 12 Felip Moll 2018-06-08 09:33:35 MDT
You're welcome :)

Closing the issue now.