I've been attempting to test the HDF5 Lustre I/O profiling plugin without success. I have confirmed that SLURM was built with HDF5 by inspecting the configure log. The slurmctld logs (at debug level 9) indicate that both the HDF5 and Lustre plugins are loaded when the daemon starts (acct_gather_filesystem_lustre.so and acct_gather_profile_hdf5.so are loaded successfully). However, when I try to run a job, I receive the following errors:

/curc/slurm/test/current/bin/sbatch --partition=dantest test.sh
sbatch: error: spank: "/curc/slurm/test/14.11.3/lib/slurm/acct_gather_filesystem_lustre.so" exports 0 symbols
sbatch: error: spank: /curc/slurm/test/etc/plugstack.conf:2: Failed to load plugin /curc/slurm/test/14.11.3/lib/slurm/acct_gather_filesystem_lustre.so. Aborting.
sbatch: error: Failed to initialize plugin stack
This looks like a build or configuration error. Have you configured the following in your slurm.conf:

AcctGatherProfileType=acct_gather_profile/hdf5

and the following in your acct_gather.conf:

ProfileHDF5Dir=<full path to the hdf5 directories>
ProfileHDF5Default=all # or any desired profile mode.

Please send us your configuration files for review. Please also send us the log files from slurmctld and slurmd after you try to submit a job.

David
The error messages are coming from SPANK. Did you put anything in the plugstack.conf file? Unless you are using a SPANK plugin, this file should be empty or entirely commented out.

David
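One way to check whether plugstack.conf still has anything active in it is to print only the lines that are neither blank nor comments. The sketch below is a generic text filter, not a Slurm tool; the function name active_plugstack_lines is hypothetical.

```shell
# Print the active (non-blank, non-comment) lines of a plugstack.conf-style
# file. The function name is hypothetical; this is plain grep, not a Slurm
# command, and it does not validate the plugins themselves.
active_plugstack_lines() {
    # -v inverts the match: drop lines that are empty, whitespace-only,
    # or whose first non-blank character is '#'.
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}
```

For example, active_plugstack_lines /curc/slurm/test/etc/plugstack.conf should print nothing when no SPANK plugin is configured.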
Created attachment 1594: slurm.conf
Created attachment 1595: acct_gather.conf
Created attachment 1596: slurmctld.log
I had included the following line in plugstack.conf:

required acct_gather_filesystem_lustre.so

Commenting it out allows me to run jobs with the --profile=lustre switch, but no output files are written. I've included the configuration files and slurmctld.log, but slurmd.log hasn't been modified since 1/8/2015.
OK, that's good. Indeed, you should not put anything in plugstack.conf unless you are using a SPANK plugin, which I don't think you are.

I don't have Lustre installed myself, but using your configuration I see that the directory, which in your case should be /work/jontest/<user name>, is indeed not created. This could be a problem. Could you try the following configuration, please:

o) slurm.conf: comment out AcctGatherFilesystemType and leave only AcctGatherProfileType.
o) acct_gather.conf: change ProfileHDF5Default=lustre to ProfileHDF5Default=all.

Then restart your test cluster. The HDF5 directory should be created and populated when you run a job.

I looked at your slurmctld.log and there are several issues that need to be addressed.

Your clocks don't seem to be synchronized:

[2015-01-30T16:09:33.008] error: The clock of the file system and this computer appear to not be synchronized
[2015-01-30T16:09:38.002] error: The modification time of /curc/slurm/test/state/job_state moved backwards by 38 seconds

Your hosts are using different slurm.conf files:

[2015-01-30T16:01:26.918] error: Node node1607 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

All hosts should have synchronized clocks and run using the same slurm.conf.

David
I made the changes you recommended, pointed slurmd to the correct configuration file, and restarted slurmctld. I've also confirmed that there is no clock skew between the Slurm master and the filesystem. Still, no profile files or directories are created.
Can you send me the slurmd log file, please? With the settings you have, there should be a directory in ProfileHDF5Dir named after the user running the job. Is the ProfileHDF5Dir directory writable?

David
Yes, the permissions on the ProfileHDF5Dir directory are 770, but the per-user directory is not created.
Hi,

the code creating the directory is quite straightforward, and it seems strange that it does not work in your setup. The code also handles failures, so there should be messages in your slurmd log files. Could you please double-check your slurm.conf, where we should have this variable:

AcctGatherProfileType=acct_gather_profile/hdf5

and your acct_gather.conf:

ProfileHDF5Dir=<full path to an existing directory>
ProfileHDF5Default=all

Then restart your slurmctld and slurmd and submit a sample job like 'srun cat'. If the directory with the username still does not get created, I will send you an instrumentation patch to trace the code.

David
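After the sample job, the per-user directory check can be scripted. The sketch below reads ProfileHDF5Dir from an acct_gather.conf-style file and tests for a subdirectory named after the user. Both function names are hypothetical, and the parsing assumes the key is spelled exactly as shown and that the path contains no spaces.

```shell
# Extract the ProfileHDF5Dir value from an acct_gather.conf-style file.
# Assumption: the key is written exactly as "ProfileHDF5Dir" and the path
# contains no whitespace; anything after the path (e.g. a comment) is dropped.
profile_hdf5_dir() {
    sed -n 's/^[[:space:]]*ProfileHDF5Dir[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1" | head -n 1
}

# Succeed if <ProfileHDF5Dir>/<user> exists as a directory.
# $1: path to acct_gather.conf, $2: user name.
user_profile_dir_exists() {
    conf=$1
    user=$2
    base=$(profile_hdf5_dir "$conf")
    [ -n "$base" ] && [ -d "$base/$user" ]
}
```

A call such as user_profile_dir_exists /path/to/acct_gather.conf "$USER" then succeeds only once the directory has been created; the configuration path here is a placeholder.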
I double-checked both slurm.conf and acct_gather.conf (both were correct), restarted slurmctld and slurmd, and ran the simple "cat" job. Slurm created the directory and wrote an HDF5 file as expected. I ran my slightly more complicated test job again with srun instead of sbatch, and the profile file is now created successfully. Is there any reason sbatch wouldn't work with the profile option?
I had the same experience; I suspect that last week we did not restart the daemons. I checked the code carefully today and ran some more tests, and it works as expected.

To trace Lustre, a couple more configuration steps are needed in slurm.conf. Enable the plugin:

AcctGatherFilesystemType=acct_gather_filesystem/lustre

and set the collection frequency:

JobAcctGatherFrequency=filesystem=X

where X is the sampling interval in seconds.

David
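Since Lustre tracing depends on three slurm.conf settings being active at once, a quick check for missing ones can save a restart cycle. This is a minimal sketch using grep; the function name missing_lustre_profile_keys is hypothetical, and it only checks that each key has an uncommented line, not that its value is valid.

```shell
# Print each slurm.conf key needed for Lustre profiling that has no active
# (uncommented) line in the given file. The function name is hypothetical.
# Keys are matched case-insensitively; values are not validated.
missing_lustre_profile_keys() {
    for key in AcctGatherProfileType AcctGatherFilesystemType JobAcctGatherFrequency; do
        grep -Eiq "^[[:space:]]*${key}[[:space:]]*=" "$1" || printf '%s\n' "$key"
    done
}
```

An empty result means all three keys are present; remember that slurmctld and slurmd still need a restart after the file changes.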
It appears that the Lustre profiling HDF5 files are written now, too. However, this only works with srun. If I submit a job script with sbatch the profiling file(s) and directory are not created. I restarted slurmctld and slurmd last week, too.
Indeed, I double-checked the collection code and it only works for srun. So if you run a batch job with an srun step inside, you will see the data:

sbatch myjob

myjob:
#!/bin/sh
srun job

Without the srun, the code skips the collection and the creation of the directory. It was designed that way, but it seems it can be changed. Do you plan to run without srun?

Also, please remember to set the collection frequencies for the data that you want to collect, either in slurm.conf or on the command line.

David
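To make the srun wrapping harder to forget, the batch script can be generated. The sketch below writes a script that requests the lustre profile, sets a filesystem sampling interval on the command-line side, and launches the given command through srun. The function name write_profiled_job and the 30-second interval are assumptions for illustration.

```shell
# Write a minimal batch script that launches a command through srun, so the
# job step (and not only the batch shell) is profiled.
# $1: path of the batch script to create; remaining arguments: the command.
# Assumptions: the function name is hypothetical and the 30-second
# filesystem sampling interval is an arbitrary example value.
write_profiled_job() {
    out=$1
    shift
    {
        printf '#!/bin/sh\n'
        printf '#SBATCH --profile=lustre\n'
        printf '#SBATCH --acctg-freq=filesystem=30\n'
        printf 'srun %s\n' "$*"
    } > "$out"
}
```

Typical use would be write_profiled_job myjob ./my_app followed by sbatch myjob, where ./my_app stands in for the real executable.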
No, running the executable with srun in a submission script should be just fine. One more question: is there any way to set the profiling off by default, only running it upon request via, e.g. --profile=lustre?
To do this, you have to remove the default profile setting from acct_gather.conf and leave the parameters in slurm.conf as they are. In this case the data will be collected only if requested via the command line.

David
Great, thank you for your help, Dan Milroy
You are welcome. I will close this ticket then.

David
Actually, to be precise: comment out only ProfileHDF5Default, which tells Slurm the default profiles, but leave ProfileHDF5Dir, which tells Slurm where the HDF5 files are located.

David
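The resulting opt-in state of acct_gather.conf (ProfileHDF5Dir active, ProfileHDF5Default commented out or absent) can be verified with a small check. This is a sketch under the assumption that each setting sits on its own line; the function name profiling_is_opt_in is hypothetical.

```shell
# Succeed if the given acct_gather.conf-style file keeps ProfileHDF5Dir
# active but has no active ProfileHDF5Default line, i.e. profiling runs
# only when requested with --profile. The function name is hypothetical.
profiling_is_opt_in() {
    grep -Eiq '^[[:space:]]*ProfileHDF5Dir[[:space:]]*=' "$1" &&
    ! grep -Eiq '^[[:space:]]*ProfileHDF5Default[[:space:]]*=' "$1"
}
```

With this state in place, a job submitted with --profile=lustre is profiled, while a job submitted without the option is not.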