Ticket 1412 - Unable to initiate HDF5 profiling plugin
Summary: Unable to initiate HDF5 profiling plugin
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 14.11.1
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: David Bigagli
Reported: 2015-01-30 05:58 MST by Daniel Milroy
Modified: 2015-02-03 03:16 MST

Site: University of Colorado

Attachments
slurm.conf (5.98 KB, text/plain)
2015-01-30 09:12 MST, Daniel Milroy
acct_gather.conf (207 bytes, text/plain)
2015-01-30 09:13 MST, Daniel Milroy
slurmctld.log (2.42 MB, text/x-log)
2015-01-30 09:15 MST, Daniel Milroy

Description Daniel Milroy 2015-01-30 05:58:31 MST
I’ve been attempting to test the HDF5 Lustre IO profiling plugin without success.  I have confirmed that SLURM was built with HDF5 by inspecting the configure log.  Slurmctld logs (at debug level 9) indicate that both the HDF5 and Lustre plugins are loaded upon daemon start (acct_gather_filesystem_lustre.so and acct_gather_profile_hdf5.so are loaded successfully).  However, when I try to run a job, I receive the following errors:

/curc/slurm/test/current/bin/sbatch --partition=dantest test.sh
sbatch: error: spank: "/curc/slurm/test/14.11.3/lib/slurm/acct_gather_filesystem_lustre.so" exports 0 symbols
sbatch: error: spank: /curc/slurm/test/etc/plugstack.conf:2: Failed to load plugin /curc/slurm/test/14.11.3/lib/slurm/acct_gather_filesystem_lustre.so. Aborting.
sbatch: error: Failed to initialize plugin stack
Comment 1 David Bigagli 2015-01-30 08:53:52 MST
This looks like a build or configuration error. Have you configured:

AcctGatherProfileType=acct_gather_profile/hdf5 

in your slurm.conf and:

ProfileHDF5Dir=<full path to the hdf5 directories>
ProfileHDF5Default=all # or any desired profile mode.

in your acct_gather.conf?

Please send us your configuration files for review. 
Also please send us the log files from slurmctld and slurmd after you try to
submit a job.

David
Comment 2 David Bigagli 2015-01-30 08:59:08 MST
The error messages are coming from SPANK. Did you put anything in the plugstack.conf file? Unless you are using a SPANK plugin, this file should be
empty or entirely commented out.

David
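
One quick way to see whether plugstack.conf contains any active lines is to filter out blanks and comments. This is only a minimal sketch: the function name `active_plugstack_lines` is made up for illustration, and the path in the usage comment is the one that appears in the error message above.

```shell
#!/bin/sh
# Print every line of a plugstack.conf-style file that is neither blank
# nor a comment. No output means nothing is configured in the file.
# (active_plugstack_lines is a hypothetical helper, not part of Slurm.)
active_plugstack_lines() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Example usage, with the path taken from the error message above:
#   active_plugstack_lines /curc/slurm/test/etc/plugstack.conf
```

If this prints a line naming an acct_gather plugin, that line is a likely cause of the "exports 0 symbols" error, since those plugins are not SPANK plugins.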
Comment 3 Daniel Milroy 2015-01-30 09:12:36 MST
Created attachment 1594
slurm.conf
Comment 4 Daniel Milroy 2015-01-30 09:13:26 MST
Created attachment 1595
acct_gather.conf
Comment 5 Daniel Milroy 2015-01-30 09:15:58 MST
Created attachment 1596
slurmctld.log
Comment 6 Daniel Milroy 2015-01-30 09:18:08 MST
I had included the following line in plugstack.conf: required acct_gather_filesystem_lustre.so.  Commenting it out allows me to run jobs with the --profile=lustre switch, but no output files are written.  I've included the configuration files and slurmctld.log, but slurmd.log hasn't been modified since 1/8/2015.
Comment 7 David Bigagli 2015-01-30 09:27:24 MST
Ok, that's good. Indeed, you should not put anything in plugstack.conf unless
you are using a SPANK plugin, which I don't think you are.

I don't have Lustre installed myself, but using your configuration I noticed
that the directory, which in your case should be /work/jontest/<user name>,
is not created. This could be a problem.

Could you try the following configuration please:

o) slurm.conf

comment out AcctGatherFilesystemType and leave only AcctGatherProfileType.

o) acct_gather.conf

Change ProfileHDF5Default=lustre
to ProfileHDF5Default=all.

Then restart your test cluster. The hdf5 directory should be created and populated
when you run a job.

I looked at your slurmctld.log and there are several issues that need to be addressed.

Your clocks don't seem to be synchronized:

[2015-01-30T16:09:33.008] error: The clock of the file system and this computer appear to not be synchronized
[2015-01-30T16:09:38.002] error: The modification time of /curc/slurm/test/state/job_state moved backwards by 38 seconds

Your hosts are using different slurm.conf files:

[2015-01-30T16:01:26.918] error: Node node1607 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

All hosts should have synchronized clocks and run using the same slurm.conf.

David
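
A rough manual check for the slurm.conf mismatch is to copy a node's file next to the controller's and compare the two. The sketch below assumes both copies are available locally. `conf_matches` is a hypothetical helper and the file names in the usage comment are placeholders. Slurm performs its own comparison internally, so a byte-for-byte test may be stricter than what the daemon requires.

```shell
#!/bin/sh
# Report whether two copies of slurm.conf are byte-for-byte identical.
# (conf_matches is a hypothetical helper, not part of Slurm.)
conf_matches() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}

# Example usage (both file names are placeholders):
#   conf_matches slurm.conf.controller slurm.conf.node1607
```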
Comment 8 Daniel Milroy 2015-01-30 09:52:28 MST
I made the changes you recommended and restarted slurmctld, including pointing slurmd to the correct configuration file.  I've also confirmed that there isn't a clock skew between the slurm master and the filesystem.  There still aren't any profile files or directories created.
Comment 9 David Bigagli 2015-01-30 09:57:40 MST
Can you send me the slurmd log file please.

With the settings you have, there should be a directory in ProfileHDF5Dir named after
the user running the job. Is the directory writable?

David
Comment 10 Daniel Milroy 2015-01-30 10:00:17 MST
Yes, the directory permissions are 770, but it's not created.
Comment 11 David Bigagli 2015-02-02 07:49:45 MST
Hi, the code that creates the directory is quite straightforward, and it seems strange that it does not work in your setup. The code also handles failures, so there should be messages in your slurmd log files.

Could you please double check the slurm.conf where we should have this variable:

AcctGatherProfileType=acct_gather_profile/hdf5

and your acct_gather.conf

ProfileHDF5Dir=<full path to an existing directory>
ProfileHDF5Default=all

then restart your slurmctld and slurmd and submit a sample job like 'srun cat'.

If the directory with the username still does not get created, I will send you an instrumentation patch to trace the code.

David
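
Before submitting the sample job, it can help to confirm that the ProfileHDF5Dir path exists and is writable. This is a minimal sketch from the current user's point of view. Which account actually needs write access depends on how the daemon creates the per-user directory, which this ticket does not spell out. `profile_dir_ready` is a hypothetical helper, and /work/jontest is the directory mentioned in comment 7.

```shell
#!/bin/sh
# Succeed only if the given directory exists and is writable by the
# current user. (profile_dir_ready is a hypothetical helper.)
profile_dir_ready() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Example usage; /work/jontest is the directory mentioned in comment 7:
#   profile_dir_ready /work/jontest && srun cat </dev/null
#   ls -l "/work/jontest/$USER"
```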
Comment 12 Daniel Milroy 2015-02-02 08:46:18 MST
I double-checked both slurm.conf and acct_gather.conf (both were correct), restarted slurmctld and slurmd, and ran the simple "cat" job.  Slurm created the directory and wrote an HDF5 file as expected.

I ran my slightly more complicated test job again with srun instead of sbatch, and the profile file is now created successfully.  Is there any reason sbatch wouldn't work with the profile option?
Comment 13 David Bigagli 2015-02-02 08:51:29 MST
I had the same experience; I suspect that last week we did not restart the daemons. I checked the code carefully today and ran some more tests, and it
works as expected.

To trace Lustre, a couple more configuration steps are needed in slurm.conf:

AcctGatherFilesystemType=acct_gather_filesystem/lustre

to enable the plugin and set the collection frequency:

JobAcctGatherFrequency=filesystem=X

where X is the frequency of sampling.

David
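
To confirm that all three settings discussed so far are actually active in slurm.conf, one could print their uncommented values. This is a sketch only: `conf_value` is a hypothetical helper that matches the key with exact case and does not strip trailing comments, and the path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Print the uncommented value of a key from a slurm.conf-style file,
# or nothing if the key is not set. (conf_value is a hypothetical helper.)
conf_value() {
    # $1 = file, $2 = key
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1"
}

# Example usage (the path is a placeholder):
#   for key in AcctGatherProfileType AcctGatherFilesystemType JobAcctGatherFrequency; do
#       echo "$key=$(conf_value /path/to/slurm.conf "$key")"
#   done
```

An empty value for any of the three keys means the corresponding step above has not taken effect in that file.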
Comment 14 Daniel Milroy 2015-02-02 09:00:45 MST
It appears that the Lustre profiling HDF5 files are written now, too.  However, this only works with srun.  If I submit a job script with sbatch the profiling file(s) and directory are not created.  I restarted slurmctld and slurmd last week, too.
Comment 15 David Bigagli 2015-02-02 09:11:15 MST
Indeed, I double-checked the collection code and it only works for srun.
So if you run a batch step with srun inside, you will see the data:

sbatch myjob
myjob:

#!/bin/sh

srun job

Without the srun, the code skips the collection and the creation of the
directory. It was designed that way, but it seems it can be changed.
Do you plan to run without srun?
Also, please remember to set the collection frequencies for the data that
you want to collect, either in slurm.conf or on the command line.

David
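
Putting the two points together, a batch script that requests Lustre profiling and sets the sampling frequency on the command line might look like the sketch below. The names `myjob` and `job` are the placeholders used in the comment above, and the 30-second interval is an arbitrary assumption. The `--profile` and `--acctg-freq` options are standard sbatch options, but whether they reach the srun step unchanged should be verified on the target Slurm version.

```shell
#!/bin/sh
# Write a batch script whose payload is launched through srun, so that
# it runs as a job step and is covered by the profiling code.
# "myjob" and "job" are the placeholder names used in the comment above;
# the 30-second sampling interval is an assumption.
cat > myjob <<'EOF'
#!/bin/sh
#SBATCH --profile=lustre
#SBATCH --acctg-freq=filesystem=30

srun job
EOF

# Submit with:
#   sbatch myjob
```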
Comment 16 Daniel Milroy 2015-02-02 09:17:27 MST
No, running the executable with srun in a submission script should be just fine.  One more question: is there any way to set the profiling off by default, only running it upon request via, e.g. --profile=lustre?
Comment 17 David Bigagli 2015-02-02 09:59:28 MST
To do this, you have to unset the parameters in acct_gather.conf and
leave only those in slurm.conf. In this case the data will be collected
only if requested via the command line.

David
Comment 18 Daniel Milroy 2015-02-02 10:24:02 MST
Great, thank you for your help,

Dan Milroy
Comment 19 David Bigagli 2015-02-02 10:25:50 MST
You are welcome. I will close this ticket then.

David
Comment 20 David Bigagli 2015-02-03 03:16:21 MST
Actually, to be precise, comment out only ProfileHDF5Default, which tells Slurm the
default profiles, but leave ProfileHDF5Dir, which says where the hdf5 files are located.

David
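
The final recommendation amounts to a one-line change in acct_gather.conf. A minimal sketch of that change follows. `disable_default_profile` is a hypothetical helper that writes the edited file to standard output rather than modifying it in place.

```shell
#!/bin/sh
# Emit a copy of an acct_gather.conf-style file with ProfileHDF5Default
# commented out and every other line, including ProfileHDF5Dir, unchanged.
# (disable_default_profile is a hypothetical helper.)
disable_default_profile() {
    sed 's/^[[:space:]]*ProfileHDF5Default[[:space:]]*=/#&/' "$1"
}

# Example usage (review the output before replacing the real file):
#   disable_default_profile acct_gather.conf > acct_gather.conf.new
```

With the default commented out and the daemons restarted, profiling should then run only for jobs that ask for it, for example with --profile=lustre.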