| Summary: | Slurmdb perl api | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Josko Plazonic <plazonic> |
| Component: | Other | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | brian, da, dmcr |
| Version: | 14.11.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Princeton (PICSciE) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 15.08.0pre2 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | perl api test script, perl api test output, 14.11 jobs_get patch, updated perl api test script, updated test output | | |
Description
Josko Plazonic
2015-01-20 02:22:42 MST
This is pretty old code, contributed by Don Lipari. Did you try to contact him? He is usually pretty responsive. Why don't you like sacct? Is there any feature missing?

David

I'll check tomorrow with Dennis for details, but I think it boils down to trying to avoid a middleman: his cluster stats program is Perl based, so rather than invoke sacct we were trying to have the program pull data out of the db directly. He also had some errors when we moved from 14.03 to 14.11; more details tomorrow.

Now I see that this might be a lot harder than I thought. It turns out that none of the Slurmdb users (sreport, sacct, sacctmgr, and maybe sview too) link libslurmdb dynamically; they link it statically. Indeed, my simple C test program (which did little else but get a db connection and then close it) only started working once I linked with libslurmdb.a rather than the .so. Of course the Perl Slurmdb.so is dynamically linked, and that's probably just the start of its problems (linking libslurmdb statically did not help; plugins are still not initialized correctly, which is the real reason for the failure). So I'll check with Dennis and see if he wants to pursue this, e.g. by contacting Don.

Hi David,

Dennis McRitchie commenting here, since Josko wanted my response to your question of why I didn't want to use sacct for my reporting purposes. It boils down to the fact that sacct produces human-readable output, which is not always optimal when one needs machine-readable output. For example:
1) Parsing the actual memory size output with its variable suffixes (K, M, G, T, or none) is a bit tricky.
2) Ditto for the requested memory size with its additional c or n suffix.
3) Multiple jobs can be collapsed onto one line, e.g., 12345[1-10].

Since I am retrieving this data to create tools as well as daily and monthly reports of our own custom design, I am hoping that I will get more machine-readable output by using the Slurm database API directly. Since my tools are in Perl, I am trying to use the Slurmdb module. Now that Slurmdb.so is linked statically against libslurmdb.a, I believe I was able to solve one of the two problems we've so far encountered when running the Perl Slurmdb test suite. These tests are old, and the interface to slurmdb_connection_close() has likely changed since they were written, so I had to change the Perl code a bit to get it to work. Once I have worked my way through the examples, I'll let Don know what I found and changed.

Regards,
Dennis
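The suffix handling Dennis describes can be sketched in a few lines of Perl. `mem_to_bytes` below is a hypothetical helper, not part of any Slurm API; it also assumes a bare number means bytes, which may not match sacct's actual default unit for every field.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a Slurm API): split an sacct memory value such
# as "4Gn" or "123456K" into a byte count plus the optional per-core ('c')
# or per-node ('n') marker. A bare number is assumed to be bytes here.
sub mem_to_bytes {
    my ($val) = @_;
    return unless defined $val && $val =~ /^([0-9.]+)([KMGT]?)([cn]?)$/;
    my ($num, $suffix, $per) = ($1, $2, $3);
    my %scale = ('' => 1, K => 2**10, M => 2**20, G => 2**30, T => 2**40);
    return ($num * $scale{$suffix}, $per);
}
```

Note that a trailing 'c' or 'n' on ReqMem still has to be multiplied out by the job's core or node count, which this sketch leaves to the caller.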
Hi Dennis,
thanks for the information. For machine-readable output I think you could have a look at the -p and -P options of sacct, which generate parsable output.
The -n option could also be useful to you. The documentation can be found
here: http://slurm.schedmd.com/sacct.html.
Parsing the suffixes can indeed be tricky; what if we add a new option
that always prints those fields in megabytes? Which fields are you specifically interested in?
Multiple jobs can be collapsed into one line even if you read them from the db;
job arrays are one example, since in 14.11 they are submitted as just one
record.
David
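The -P output David mentions is pipe-delimited with one record per line, so a script can split it directly. A minimal sketch; the sample line and the field list are fabricated for illustration, and a real script would read from a pipe such as `sacct -P -n -o JobID,ReqMem,MaxRSS`:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative only: pretend this line came from `sacct -P -n`.
my $line = "12345|4Gn|123456K";
chomp $line;

# The -1 limit keeps trailing empty fields, which sacct emits
# for columns that are unset on a given record.
my ($jobid, $reqmem, $maxrss) = split /\|/, $line, -1;
print "$jobid $reqmem $maxrss\n";
```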
Hi David,
Thanks for your helpful reply.
Yes, I have been using sacct's -P and -n options. They do help; however, to create efficient and robust code, I always prefer having an API that allows me to populate the variables in my code directly, rather than performing an external invocation of a utility, capturing its output, and then parsing it.
I'm hoping that the DB API returns memory sizes in fixed units (e.g., KB or MB); if it doesn't, it would be very helpful to have an option at the API level to request that data be returned in fixed units, which would allow me to perform arithmetic directly on the returned values. The sacct memory-related fields I am currently using are REQMEM, MaxRSS, and MaxVMSize.
Similarly, I use the time fields Submit, Eligible, Start, and End, and I'm hoping the DB API will return some sort of epoch or machine time.
And finally, I use the elapsed time fields TotalCPU, TIMELIMIT, and Elapsed. It would be helpful for these to be returned in seconds, though finer granularity is OK too.
In summary, I'm looking for an interface that returns numeric values in a computational form.
Does the Slurm Database API provide this?
> Multiple jobs can be collapsed into one line even if you read them from the db,
> the job arrays are one example since when they are submitted in 14.11 they are
> just one record.
OK, that makes sense. I guess I'll just have to work around that one.
Thanks again.
Dennis
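Until an API returns seconds directly, the [D-]HH:MM:SS strings sacct prints for Elapsed, TotalCPU, and TIMELIMIT can be converted with a small helper. This is a hypothetical sketch, not a Slurm API, and it assumes the only formats seen are [D-]HH:MM:SS and MM:SS[.fff]:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a Slurm API): convert sacct's [D-]HH:MM:SS or
# MM:SS[.fff] duration strings into seconds.
sub duration_to_seconds {
    my ($s) = @_;
    my $days = ($s =~ s/^(\d+)-//) ? $1 : 0;   # strip optional "D-" prefix
    my @parts = split /:/, $s;
    return if @parts < 2 || @parts > 3;
    unshift @parts, 0 while @parts < 3;        # MM:SS -> 0:MM:SS
    my ($h, $m, $sec) = @parts;
    return $days * 86400 + $h * 3600 + $m * 60 + $sec;
}
```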
Hi Dennis,
one of my colleagues found a bug in the Perl code; the fix was committed to the 14.11.4 branch. This branch is not released yet, but you can have a look at commit 0bc3c68b9c. The problem was returning a value instead of a reference:
-my $rc = Slurmdb::connection_close($db_conn);
+my $rc = Slurmdb::connection_close(\$db_conn);
Another issue we have found is that a program cannot include
use Slurm ':all';
and
use Slurmdb ':all';
as there is a conflict of symbols. We have logged an internal bug to fix this.
You should be able to run the test battery under the contribs directory:
david@prometeo ~/clusters/1411/linux/build/contribs/perlapi/libslurmdb/perl $ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-use.t ...................................... ok
t/01-clusters_get.t ............................. ok
t/02-report_cluster_account_by_user.t ........... ok
t/03-report_cluster_user_by_account.t ........... ok
t/04-report_job_sizes_grouped_by_top_account.t .. ok
t/05-report_user_top_usage.t .................... ok
All tests successful.
Files=6, Tests=12, 1 wallclock secs ( 0.01 usr 0.00 sys + 0.09 cusr 0.03 csys = 0.13 CPU)
Result: PASS
David
Hi David,
thanks - Dennis independently already found out that it should be using \$db_conn. I just found out that while test 1 still doesn't do anything for us (i.e. it runs but gets back no data), tests 2, 3, and 4 now do (5 does not). Already pulled your patch and merged it into our rpm. I'll let Dennis see if he has enough to start pulling data this way. Thanks!

Hi David,
Looking further into why t/01-clusters_get.t does nothing, I drilled down with DDT and saw that Slurmdb::clusters_get() (which translates to slurmdb_clusters_get()) does get a DBD_GOT_CLUSTERS resp.msg_type several calls deeper and returns what appears to be a non-empty list. But when Slurmdb.xs tries to iterate through that list, it finds nothing and returns an empty array to the Perl code. At this point I suspect there is a problem with the *.xs code, but have not gone any further at this time. Perhaps I should report this to Don?
It also looks like I'm going to be able to make good use of a Perl interface to db_api/extra_get_functions.c's slurmdb_jobs_get(), but that interface does not exist yet. So I may need to add it so I can easily query single-job statistics for my seff utility.
Best,
Dennis
Hi Dennis,
my colleague Brian is taking over this ticket as he is the in house Perl
expert. You will hear from him shortly.
David
Slurmdb::clusters_get() is fixed in the following commit: https://github.com/SchedMD/slurm/commit/72de52fd812e3ca2ca950d473344f4ac63d16f5e
Let me know if this doesn't fix it for you. Also, test 05 returns information for me. I've attached my test program and outputs.
Thanks,
Brian

Created attachment 1586 [details]
perl api test script

Created attachment 1587 [details]
perl api test output
Thanks, Brian! That was quick work. No doubt Josko will apply the fix to our installation, maybe tomorrow depending on how bad the storm is here in NJ. :-) We'll let you know how it works. And I'll try to reproduce your test 5.
Thanks again.
Dennis

Added the patch and it indeed fixes the clusters_get call - thanks!
Test 02:
    print "Test 02:\n";
    my %assoc_cond = ();
    $clusters = Slurmdb::report_cluster_account_by_user($db_conn, \%assoc_cond);
    print Dumper($clusters);
returns an empty array, but it did work for me once I gave it an assoc_cond (usage_start and usage_end). As far as test 5 goes - turns out that it was my mistake - I gave it wrong params :(. Dennis - the update is installed on della3 so you can give it a go there. Thanks again!

Good to hear. Are there any outstanding issues on this ticket or can we close it?
Thanks,
Brian

Works for me as well. Thanks, Brian. Remaining is to get Perl API support for db_api/extra_get_functions.c's slurmdb_jobs_get(), which I need for several scripts (job efficiency, daily reports, monthly reports, quarterly spreadsheets) that use collected data on previously-run jobs. I'm going to take a crack at this, but if you were willing to provide such an interface, I wouldn't say no! :-)
Thanks again.
Dennis

Created attachment 1589 [details]
14.11 jobs_get patch

You're in luck :). I added Slurmdb::jobs_get() in 15.08: https://github.com/SchedMD/slurm/commit/8ef80db6bef179a5092037ec4321e27c5c781e42
I've also attached the patch for 14.11 that you can apply. Let me know how this works for you.
Thanks,
Brian

ex.
    my %job_cond = ();
    $job_cond{cluster_list} = ["compy"];
    $job_cond{userid_list} = [1003];
    $job_cond{usage_start} = time() - (24*60*60);
    $job_cond{usage_end} = 0;
    $job_cond{without_usage_truncation} = 1;
    my $jobs = Slurmdb::jobs_get($db_conn, \%job_cond);
    print Dumper($jobs);

Wow! Thanks so much, Brian. I can see I would have been struggling with this for quite a while. :-)
We'll apply the 14.11 patch and let you know how we make out.
One question about usage: am I correct that to specify a single job id, I would set:
$job_cond{step_list} = [2626729];
And to set specific states, I would set:
$job_cond{state_list} = ['CA','CD','F','NF','PR','TO'];
Thanks again.
Dennis
I need to add some plumbing to make the step_list work. I'll get back to you.
Thanks,
Brian
Looking more closely at how state_list is set in sacct, I think I'll need to pass the job state numbers (in string form) rather than the state names. So I'll need to call:
$num = $slurm->job_state_num($str);
for each state string, and then pass:
$job_cond{state_list} = ["$num1","$num2",...];
Since David reported that there is a symbol conflict between 'use Slurm;' and 'use Slurmdb;', for now I will need to wrap the job_state_num() call in a function in a separate Perl script, to be called by my main script.
Does that sound right?
Best regards,
Dennis
Created attachment 1610 [details]
updated perl api test script

Hey Dennis,
The following commit allows you to specify jobs/steps in the "step_list" condition like you can with "sacct -j":
https://github.com/SchedMD/slurm/commit/66e3d9c2a373ad1cea2981582af892041498787d
ex.
    $job_cond{step_list} = "2547,2549,2550.1";
In my tests, I'm not hitting the symbol conflict when using job_state_num(). It's still there, but it's not an issue in this case. I've attached my updated test script and output.
Thanks,
Brian

Created attachment 1611 [details]
updated test output
Please reopen if you find anything.

Hi Brian,
Sorry to reopen the ticket, but I just wanted to close the loop and let you know that your 'step_list' support works beautifully. Thanks so much for all your help.
Best,
Dennis

Glad to hear! Thanks for letting me know.

Hi Brian,
The use to which I am putting jobs_get() includes extracting the QOS information. The Perl API provides the qosid field, but to turn that into a QOS string, I need the QOS list. This in turn requires the qos_get() function, which does not currently appear to be part of the Perl API. Would it be possible to add this interface? The Perl API support you put in for jobs_get() is otherwise working beautifully.
Thanks very much.
Dennis

I'll look into it.

Hi Brian,
Just checking back on whether you might have an ETA on support for the qos_get() function in the Perl API. I am ready to use it, and an ETA would help me plan my schedule for the next few weeks. Thanks for all your help.
Dennis

Hey Dennis,
I'll try to get to it in the next two weeks. Like the last one, it will probably go into 15.08, but you can apply the patch.
Thanks,
Brian

Thanks, Brian. That's helpful to know.
Dennis

Hey Dennis,
I've added the qos_get function in:
https://github.com/SchedMD/slurm/commit/343f67b4ae82654a70d7f47a9e7858bdbe69ab2b
Let me know how it works for you.
Thanks,
Brian

Thanks so much, Brian. Will check it out shortly and let you know how it goes.
Best,
Dennis

Works beautifully, Brian! Thanks very much, as always, for your good work. I believe I now have everything I need to move all my scripts (those that parsed sacct output) over to the Perl API.
Best,
Dennis