Ticket 1384

Summary: Slurmdb perl api
Product: Slurm
Reporter: Josko Plazonic <plazonic>
Component: Other
Assignee: Brian Christiansen <brian>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: brian, da, dmcr
Version: 14.11.3
Hardware: Linux
OS: Linux
Site: Princeton (PICSciE)
Version Fixed: 15.08.0pre2
Target Release: ---
Attachments: perl api test script
perl api test output
14.11 jobs_get patch
updated perl api test script
updated test output

Description Josko Plazonic 2015-01-20 02:22:42 MST
Good morning,

we are trying to use the Slurmdb perl api to pull accounting info directly from Slurmdb (rather than running sacct and parsing its output) but we can't seem to get it working.

As there are few examples we've been relying on tests, e.g.:
https://github.com/SchedMD/slurm/blob/master/contribs/perlapi/libslurmdb/perl/t/01-clusters_get.t
so we tried:

#!/usr/bin/perl
use Slurm ':all';
use Slurmdb ':all';
my $db_conn = Slurmdb::connection_get();
my %hv = ();
my $clusters = Slurmdb::clusters_get($db_conn, \%hv);
for (my $i = 0; $i < @$clusters; $i++) {
#     print "accounting_list $clusters->[$i]{'accounting_list'}\n";
      print "classification $clusters->[$i]{'classification'}\n";
      print "control_host   $clusters->[$i]{'control_host'}\n";
      print "control_port   $clusters->[$i]{'control_port'}\n";
      print "cpu_count      $clusters->[$i]{'cpu_count'}\n";
      print "name           $clusters->[$i]{'name'}\n";
      print "nodes          $clusters->[$i]{'nodes'}\n"
          if exists $clusters->[$i]{'nodes'};
#     print "root_assoc     $clusters->[$i]{'root_assoc'}\n";
      print "rpc_version    $clusters->[$i]{'rpc_version'}\n\n";
}

my $rc = Slurmdb::connection_close($db_conn);

But no luck - we get segfaults on connection_close and $clusters is an empty array. We played around with having use Slurm and not having it (which can change where it crashes) but that did not help.  strace shows that it did talk to munge and the db.

Any hints on how to do this?  Thanks!
Comment 1 David Bigagli 2015-01-20 03:53:10 MST
This is pretty old code contributed by Don Lipari. Did you try to contact him?
He is usually pretty responsive. Why don't you like sacct? Is there a missing feature?

David
Comment 2 Josko Plazonic 2015-01-20 05:57:43 MST
I'll check tomorrow with Dennis for details but I think it boils down to trying to avoid a middleman - his cluster stats program is perl based so rather than invoke sacct we were trying to have the program pull data out of the db directly.  He also had some errors when we moved from 14.03 to 14.11 - again - more details tomorrow.

Now I see that that might be a lot harder than I thought.  Turns out that all of the Slurmdb users (sreport, sacct and sacctmgr, sview maybe too) do not link libslurmdb dynamically but statically.  Indeed my simple test C program (that did little else but get a db connection and then close it) only started working once I linked with libslurmdb.a rather than the .so.

Of course the perl Slurmdb.so is dynamically linked and that's probably just the start of its problems (linking libslurmdb statically did not help - plugins are still not initialized correctly which is the real reason for failure).

So I'll check with Dennis and see if he wants to pursue this, e.g. by contacting Don.
Comment 3 Dennis McRitchie 2015-01-21 03:58:44 MST
Hi David,

Dennis McRitchie commenting here since Josko wanted my response to your question of why I didn't want to use sacct for my reporting purposes. It boils down to the fact that sacct produces human-readable output, which is not always optimal when one needs machine-readable output: e.g.,

1) parsing the actual memory size output with its variable suffixes (K, M, G, T, or none) is a bit tricky.

2) Ditto for the requested memory size with its additional c or n suffix.

3) Multiple jobs can be collapsed on one line, e.g., 12345[1-10]
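
To make points 1) and 2) concrete, here is a minimal sketch of the kind of suffix handling that parsing sacct output requires. The sub name `parse_mem` is hypothetical, and the sketch assumes binary (1024-based) multiples and that a bare number is a byte count; both assumptions should be checked against the sacct version in use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Multipliers for the size suffixes mentioned above.
# Assumption: binary multiples; a bare number is treated as bytes.
my %MULT = ('' => 1, K => 1024, M => 1024**2, G => 1024**3, T => 1024**4);

# Parse a value such as "1024K", "2.5G" or "4000Mc".
# Returns (bytes, scope), where scope is 'c', 'n' or '' (no scope suffix).
# Returns an empty list when the text does not match the expected shape.
sub parse_mem {
    my ($text) = @_;
    return unless defined $text
        && $text =~ /^(\d+(?:\.\d+)?)([KMGT]?)([cn]?)$/;
    return ($1 * $MULT{$2}, $3);
}

my ($bytes, $scope) = parse_mem('4000Mc');
printf "%d bytes, scope '%s'\n", $bytes, $scope;
```

Every script that consumes sacct output needs some variant of this, which is the motivation for wanting numeric values straight from the API.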

Since I am retrieving this data to create tools as well as daily and monthly reports of our own custom design, I am hoping that I will get more machine-readable output by using the Slurm database API directly. Since my tools are in Perl, I am trying to use the Slurmdb module.

Now that Slurmdb.so is statically linked against libslurmdb.a, I believe I was able to solve one of the 2 problems we've so far encountered when running the Perl Slurmdb test suite. These tests are old, and the interface to slurmdb_connection_close() has likely changed since they were written. So I had to change the Perl code a bit to get it to work.

Once I have worked my way through the examples, I'll let Don know what I found and changed. 

Regards,
Dennis
Comment 4 David Bigagli 2015-01-21 07:48:44 MST
Hi Dennis,
thanks for the information. For machine-readable output, I think you could have a look at the -p and -P options of sacct, which generate parsable output.
The -n option could also be useful to you. The documentation can be found
here: http://slurm.schedmd.com/sacct.html.

The parsing of the suffixes can indeed be tricky. What if we add a new option
that always prints the fields in megabytes? Which fields are you specifically interested in?

Multiple jobs can be collapsed into one line even if you read them from the db,
the job arrays are one example since when they are submitted in 14.11 they are
just one record.

David
Comment 5 Dennis McRitchie 2015-01-22 00:25:05 MST
Hi David,

Thanks for your helpful reply.

Yes, I have been using sacct's -P and -n options. They do help, however to create efficient and robust code, I always prefer having an API that allows me to populate the variables in my code directly, rather than performing an external invocation of a utility, capturing its output, and then parsing it.

I'm hoping that the DB API returns the memory sizes in fixed units (e.g., kb or mb); but if it doesn't, then it would be very helpful to have an option at the API level to request that data be returned in fixed units, which would allow me to directly perform arithmetic on the returned values. The sacct memory-related values I am currently using are: REQMEM, MaxRSS, and MaxVMSize.

Similarly, I use the time fields Submit, Eligible, Start, and End, and I'm hoping the DB API will return some sort of epoch or machine time.

And finally, I use the elapsed time fields TotalCPU, TIMELIMIT, and Elapsed. It would be helpful for these to be returned in seconds, though finer granularity is OK too.

In summary, I'm looking for an interface that returns numeric values in a computational form.
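
For comparison, this is roughly what converting a textual duration to seconds looks like when working from sacct output. It is a sketch only: the sub name `duration_to_seconds` is hypothetical, and it assumes durations of the form [DD-][HH:]MM:SS with optional fractional seconds. Non-numeric values such as an unlimited time limit simply yield undef.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert "[DD-][HH:]MM:SS[.fff]" to seconds.
# Returns undef when the text does not match that shape.
sub duration_to_seconds {
    my ($text) = @_;
    return undef unless defined $text
        && $text =~ /^(?:(\d+)-)?(?:(\d+):)?(\d+):(\d+(?:\.\d+)?)$/;
    # Days and hours are optional and default to zero.
    my ($days, $hours, $mins, $secs) = ($1 // 0, $2 // 0, $3, $4);
    return (($days * 24 + $hours) * 60 + $mins) * 60 + $secs;
}

for my $sample ('1-02:03:04', '02:03:04', '00:01.500') {
    printf "%-12s => %s seconds\n", $sample, duration_to_seconds($sample);
}
```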

Does the Slurm Database API provide this?

> Multiple jobs can be collapsed into one line even if you read them from the db,
> the job arrays are one example since when they are submitted in 14.11 they are
> just one record.

OK, that makes sense. I guess I'll just have to work around that one.
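
One possible workaround, sketched in plain Perl: expand a collapsed array expression into individual task IDs. The sub name `expand_array_id` is hypothetical, and the accepted forms ("12345[1-10]" or "12345_[1-10]", with comma-separated ranges inside the brackets) are an assumption about the notation rather than a complete grammar.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expand "12345[1-3,7]" (or "12345_[1-3,7]") into
# ("12345_1", "12345_2", "12345_3", "12345_7").
# Anything without a bracket expression is returned unchanged.
sub expand_array_id {
    my ($text) = @_;
    return ($text) unless $text =~ /^(\d+)_?\[([\d,\-]+)\]$/;
    my ($job, $spec) = ($1, $2);
    my @ids;
    for my $part (split /,/, $spec) {
        my ($lo, $hi) = split /-/, $part;
        $hi //= $lo;    # a single index such as "7" is a one-element range
        push @ids, map { "${job}_$_" } $lo .. $hi;
    }
    return @ids;
}

print join(' ', expand_array_id('12345[1-3,7]')), "\n";
```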

Thanks again.

Dennis
Comment 6 David Bigagli 2015-01-22 07:20:21 MST
Hi Dennis,
one of my colleagues found a bug in the perl code, and the fix was committed to the 14.11.4 branch. This branch is not released yet, but the commit is 0bc3c68b9c, which you can have a look at. The problem was in returning a value instead of a reference.

-my $rc = Slurmdb::connection_close($db_conn);
+my $rc = Slurmdb::connection_close(\$db_conn);

Another issue we have found is that a program cannot include

use Slurm ':all';

and

use Slurmdb ':all'; 

as there is a conflict of symbols. We have logged an internal bug to fix this.

You should be able to run the battery test under the contribs directory:

david@prometeo ~/clusters/1411/linux/build/contribs/perlapi/libslurmdb/perl $ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-use.t ...................................... ok
t/01-clusters_get.t ............................. ok
t/02-report_cluster_account_by_user.t ........... ok
t/03-report_cluster_user_by_account.t ........... ok
t/04-report_job_sizes_grouped_by_top_account.t .. ok
t/05-report_user_top_usage.t .................... ok
All tests successful.
Files=6, Tests=12,  1 wallclock secs ( 0.01 usr  0.00 sys +  0.09 cusr  0.03 csys =  0.13 CPU)
Result: PASS

David
Comment 7 Josko Plazonic 2015-01-22 08:05:35 MST
Hi David, thanks - Dennis independently already found out that it should be using \$db_conn.  I just found out that while the test 1 still doesn't do anything for us (i.e. runs but gets back no data) test 2, 3 and 4 do (5 does not).

Already pulled your patch and merged into our rpm.  I'll let Dennis see if he has enough to start pulling data this way.

Thanks!
Comment 8 Dennis McRitchie 2015-01-26 03:59:32 MST
Hi David,

Looking further into why t/01-clusters_get.t does nothing, I drilled down with DDT, and saw that the Slurmdb::clusters_get() (which translates to slurmdb_clusters_get()) does get a DBD_GOT_CLUSTERS resp.msg_type several calls deeper and returns what appears to be a non-empty list. When Slurmdb.xs tries to iterate through that list it finds nothing and returns an empty array to the Perl code. At this point I'm suspecting that there is a problem with the *.xs code, but have not gone any further at this time. Perhaps I should report this to Don?

It also looks like I'm going to be able to make good use of a Perl interface to db_api/extra_get_functions.c's slurmdb_jobs_get(), but that interface does not exist yet. So I may need to add that so I can easily query single job statistics for my seff utility.

Best,
Dennis
Comment 9 David Bigagli 2015-01-26 05:20:02 MST
Hi Dennis,
my colleague Brian is taking over this ticket as he is the in-house Perl expert. You will hear from him shortly.

David
Comment 10 Brian Christiansen 2015-01-26 10:22:17 MST
Slurmdb::clusters_get() is fixed in the following commit:
https://github.com/SchedMD/slurm/commit/72de52fd812e3ca2ca950d473344f4ac63d16f5e

Let me know if this doesn't fix it for you. Also test 05 returns information for me. I've attached my test program and outputs.

Thanks,
Brian
Comment 11 Brian Christiansen 2015-01-26 10:23:01 MST
Created attachment 1586 [details]
perl api test script
Comment 12 Brian Christiansen 2015-01-26 10:23:55 MST
Created attachment 1587 [details]
perl api test output
Comment 13 Dennis McRitchie 2015-01-26 11:46:10 MST
Thanks Brian! That was quick work.

No doubt Josko will apply the fix to our installation, maybe tomorrow depending on how bad the storm is here in NJ.  :-)

We'll let you know how it works. And I'll try to reproduce your test 5.

Thanks again.

Dennis
Comment 14 Josko Plazonic 2015-01-27 05:20:18 MST
Added the patch and it indeed fixes clusters_get call - thanks!

##############Test 02:
print "Test 02:\n";
my %assoc_cond = ();
$clusters = Slurmdb::report_cluster_account_by_user($db_conn, \%assoc_cond);
print Dumper($clusters);

This returns an empty array, but it did work for me once I gave it an assoc_cond with usage_start and usage_end set.

As far as test 5 - turns out that it was my mistake - I gave it wrong params :(.

Dennis - update is installed on della3 so you can give it a go there.

Thanks again!
Comment 15 Brian Christiansen 2015-01-27 09:11:34 MST
Good to hear. Are there any outstanding issues on this ticket or can we close it?

Thanks,
Brian
Comment 16 Dennis McRitchie 2015-01-27 23:59:32 MST
Works for me as well. Thanks Brian.

Remaining is to get Perl API support for db_api/extra_get_functions.c's slurmdb_jobs_get(), which I need for several scripts (job efficiency, daily reports, monthly reports, quarterly spreadsheets) that use collected data on previously-running jobs.

I'm going to take a crack at this, but if you were willing to provide such an interface, I wouldn't say no!  :-)

Thanks again.

Dennis
Comment 17 Brian Christiansen 2015-01-29 03:07:05 MST
Created attachment 1589 [details]
14.11 jobs_get patch

You're in luck :). I added Slurmdb::jobs_get() in 15.08:
https://github.com/SchedMD/slurm/commit/8ef80db6bef179a5092037ec4321e27c5c781e42

I've also attached the patch for 14.11 that you can apply. Let me know how this works for you.

Thanks,
Brian

ex.
my %job_cond = ();
$job_cond{cluster_list} = ["compy"];
$job_cond{userid_list}  = [1003];
$job_cond{usage_start}  = time() - (24*60*60);
$job_cond{usage_end}    = 0;
$job_cond{without_usage_truncation} = 1;

my $jobs = Slurmdb::jobs_get($db_conn, \%job_cond);
print Dumper($jobs);
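
A small, hedged sketch of how a script might wrap the condition hash above so the time window is computed in one place. The helper name `build_job_cond` and its named arguments are hypothetical; only the hash keys come from the example above, and the helper does not call Slurmdb itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Build a job condition hash covering the last $arg{hours} hours.
# Keys mirror the example above; usage_end => 0 is copied from it as-is.
sub build_job_cond {
    my (%arg) = @_;
    my %cond = (
        cluster_list             => [ $arg{cluster} ],
        userid_list              => [ $arg{uid} ],
        usage_start              => time() - $arg{hours} * 60 * 60,
        usage_end                => 0,
        without_usage_truncation => 1,
    );
    return \%cond;
}

my $job_cond = build_job_cond(cluster => 'compy', uid => 1003, hours => 24);
print Dumper($job_cond);
# The result would then be passed as the second argument to
# Slurmdb::jobs_get(), as in the example above.
```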
Comment 18 Dennis McRitchie 2015-01-29 03:26:59 MST
Wow! Thanks so much, Brian. I can see I would have been struggling with this for quite a while.  :-)

We'll apply the 14.11 patch and let you know how we make out.

One question about usage: am I correct that to specify a single job id, I would set:

$job_cond{step_list}  = [2626729];

And to set specific states, I would set:

$job_cond{state_list}  = ['CA','CD','F','NF','PR','TO'];

Thanks again.

Dennis
Comment 19 Brian Christiansen 2015-01-29 06:50:37 MST
I need to add some plumbing to make the step_list work. I'll get back to you.
Comment 20 Dennis McRitchie 2015-01-30 01:48:57 MST
Thanks Brian.

Looking more closely at how state-list is set in sacct, I think I'll need to pass the job state numbers (in string form) rather than the state names. So I'll need to call:

$num = $slurm->job_state_num($str);

for each state string, and then pass:

$job_cond{state_list}  = ["$num1","$num2",...];

Since David reported that there is a symbol conflict between 'use Slurm;' and 'use Slurmdb;', for now I will need to wrap the job_state_num() call in a function in a separate Perl script, to be called by my main script.

Does that sound right?

Best regards,
Dennis
Comment 21 Brian Christiansen 2015-02-04 04:30:06 MST
Created attachment 1610 [details]
updated perl api test script

Hey Dennis,

The following commit allows you to specify the jobs/steps like you can with "sacct -j" in the "step_list" condition.
ex.
$job_cond{step_list} = "2547,2549,2550.1";

https://github.com/SchedMD/slurm/commit/66e3d9c2a373ad1cea2981582af892041498787d
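
If the job and step IDs are already held in Perl data structures, the comma-separated string can be assembled with a small helper. This is a sketch only; `format_step_list` is a hypothetical name, and a step is represented here as a [jobid, stepid] pair purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a mix of plain job IDs and [jobid, stepid] pairs into the
# "sacct -j" style string shown above.
sub format_step_list {
    my (@items) = @_;
    return join ',', map { ref $_ ? join('.', @$_) : $_ } @items;
}

my $step_list = format_step_list(2547, 2549, [2550, 1]);
print "$step_list\n";    # prints "2547,2549,2550.1"
```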

In my tests, I'm not hitting the symbol conflict when using job_state_num(). The conflict is still there, but it's not an issue in this case. I've attached my updated test script and output.

Thanks,
Brian
Comment 22 Brian Christiansen 2015-02-04 04:31:03 MST
Created attachment 1611 [details]
updated test output
Comment 23 Brian Christiansen 2015-02-11 01:36:30 MST
Please reopen if you find anything.
Comment 24 Dennis McRitchie 2015-02-20 02:24:49 MST
Hi Brian,

Sorry to reopen the ticket, but I just wanted to close the loop and let you know that your 'step_list' support works beautifully.

Thanks so much for all your help.

Best,
Dennis
Comment 25 Brian Christiansen 2015-02-20 02:29:30 MST
Glad to hear! Thanks for letting me know.
Comment 26 Dennis McRitchie 2015-03-31 04:50:53 MDT
Hi Brian,

The use to which I am putting jobs_get() includes extracting the QOS information. The perl API provides the qosid field, but to turn that into a qos string, I need the qos list.

This in turn requires the qos_get() function, which does not appear to be currently part of the Perl API.
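
To illustrate the lookup in question, here is a sketch that runs on mock data. It assumes that a QOS list arrives as an array of hash references with `id` and `name` keys and that each job record carries `jobid` and `qosid` keys; apart from `qosid`, those key names are assumptions, and `qos_name_map` is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an id => name lookup table from a list of QOS records.
sub qos_name_map {
    my ($qos_list) = @_;
    my %name_for;
    $name_for{ $_->{id} } = $_->{name} for @$qos_list;
    return \%name_for;
}

# Mock data standing in for API results (values are made up).
my $qos_list = [ { id => 1, name => 'normal' }, { id => 2, name => 'short' } ];
my $jobs     = [ { jobid => 2626729, qosid => 2 }, { jobid => 2626730, qosid => 7 } ];

my $name_for = qos_name_map($qos_list);
for my $job (@$jobs) {
    # Fall back to a placeholder when the id is not in the QOS list.
    my $qos = $name_for->{ $job->{qosid} } // "unknown($job->{qosid})";
    print "$job->{jobid} $qos\n";
}
```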

Would it be possible to add this interface?

The Perl API support you put in for jobs_get() is working beautifully otherwise.

Thanks very much.
Dennis
Comment 27 Brian Christiansen 2015-03-31 08:09:52 MDT
I'll look into it.
Comment 28 Dennis McRitchie 2015-04-08 04:58:11 MDT
Hi Brian,

Just checking back if you might have an ETA on support for the qos_get() function in the Perl API. I am ready to use it, and an ETA would help me plan my schedule for the next few weeks.

Thanks for all your help.

Dennis
Comment 29 Brian Christiansen 2015-04-08 06:37:16 MDT
Hey Dennis,

I'll try to get to it in the next two weeks. Like the last one it will probably go into 15.08 but you can apply the patch. 

Thanks,
Brian
Comment 30 Dennis McRitchie 2015-04-09 00:50:48 MDT
Thanks Brian. That's helpful to know.

Dennis
Comment 31 Brian Christiansen 2015-04-20 05:06:13 MDT
Hey Dennis,

I've added the qos_get function in:
https://github.com/SchedMD/slurm/commit/343f67b4ae82654a70d7f47a9e7858bdbe69ab2b

Let me know how it works for you.

Thanks,
Brian
Comment 32 Dennis McRitchie 2015-04-21 01:12:26 MDT
Thanks so much, Brian.

Will check it out shortly and let you know how it goes.

Best,
Dennis
Comment 33 Dennis McRitchie 2015-04-21 04:40:16 MDT
Works beautifully, Brian!

Thanks very much, as always, for your good work. I believe I now have everything I need to move all my scripts (those that parsed sacct output) over to the Perl API.

Best,
Dennis