Ticket 12273

Summary: "scontrol show job" / SlurmctldProlog envvars equivalence
Product: Slurm    Reporter: Kilian Cavalotti <kilian>
Component: User Commands    Assignee: Oriol Vilarrubi <jvilarru>
Status: RESOLVED WONTFIX    QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bart, marshall
Version: 20.11.8   
Hardware: Linux   
OS: Linux   
Site: Stanford

Description Kilian Cavalotti 2021-08-13 17:00:43 MDT
Hi SchedMD,

For accounting and diagnostics purposes, we need to record and archive as much information as possible about past jobs. To that end (and we know it's bad practice), we've been calling `scontrol show job` in our SlurmctldProlog.

It's been working OK so far, most of the time, but when large bursts of jobs start at the same time, we can clearly see that slurmctld is suffering. So we're looking at getting rid of that scontrol call and finding another way to record the information provided by `scontrol show job`.

Unfortunately, what's available as environment variables in SlurmctldProlog doesn't cover all the fields provided in `scontrol show job`. I've started working on an equivalence list:

# scontrol show job / slurmctld prolog environment variables

JobId           $SLURM_JOB_ID / $SLURM_ARRAY_JOB_ID_$SLURM_ARRAY_TASK_ID
JobName         $SLURM_JOB_NAME
UserId          $SLURM_JOB_USER($SLURM_JOB_UID) 
GroupId         $SLURM_JOB_GROUP($SLURM_JOB_GID)
Priority        not relevant in SlurmctldProlog
Nice            $SLURM_PRIO_PROCESS
Account         $SLURM_JOB_ACCOUNT 
QOS             $SLURM_JOB_QOS
JobState        not relevant in SlurmctldProlog
Reason          not relevant in SlurmctldProlog
Dependency      ???
Requeue         ??? 
Restarts        ??? 
BatchFlag       ???
Reboot          ???
ExitCode        not relevant in SlurmctldProlog
DerivedExitCode not relevant in SlurmctldProlog
RunTime         not relevant in SlurmctldProlog
TimeLimit       ???
TimeMin         ???
SubmitTime      ??? 
EligibleTime    ???
AccrueTime      ???
StartTime       ??? 
EndTime         not relevant in SlurmctldProlog 
Deadline        ???
PreemptEligibleTime ???
PreemptTime     ???
SuspendTime     ???
SecsPreSuspend  ???
LastSchedEval   ???
Partition       $SLURM_JOB_PARTITION
AllocNode       $SLURM_SUBMIT_HOST
ReqNodeList     ??? 
ExcNodeList     ???
NodeList        $SLURM_JOB_NODELIST
BatchHost       ???
NumNodes        $SLURM_JOB_NUM_NODES
NumCPUs         ???
NumTasks        ???
CPUs/Task       ???
ReqB            ???
TRES            ???
Socks/Node      ???
NtasksPerN      ???
CoreSpec        ???
JOB_GRES        ???
Per-node resource allocation: Nodes CPU_IDs Mem GRES ???
MinCPUsNode     ???
MinMemoryCPU    ??? 
MinTmpDiskNode  ???
Features        $SLURM_JOB_CONSTRAINTS
DelayBoot       ???
OverSubscribe   ??? 
Contiguous      ???
Licenses        ???
Network         ???
Command         ???
WorkDir         $SLURM_JOB_WORK_DIR (new in 21.08)
StdErr          $SLURM_JOB_STDERR (new in 21.08)
StdIn           $SLURM_JOB_STDIN (new in 21.08)
StdOut          $SLURM_JOB_STDOUT (new in 21.08)
Power=          not relevant in SlurmctldProlog
NtasksPerTRES   ???

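For reference, the prolog side of what we're doing could be sketched like this: a hypothetical SlurmctldProlog fragment (not anything shipped with Slurm) that records whatever SLURM_* variables the prolog environment does provide, one file per job. The spool path is an assumption, chosen here so the sketch runs anywhere.

```shell
#!/bin/sh
# Hypothetical SlurmctldProlog sketch: dump every SLURM_* variable the
# prolog environment offers into a per-job file. The spool path is an
# assumption; a real site would use something like /var/spool/slurm.
SPOOL_DIR="${SPOOL_DIR:-/tmp/slurm-jobinfo}"
mkdir -p "$SPOOL_DIR"
# SLURM_JOB_ID is set by slurmctld when the prolog runs; default only so
# the sketch can run outside Slurm.
env | grep '^SLURM_' | sort > "$SPOOL_DIR/${SLURM_JOB_ID:-0}.env"
```

The point of the exercise in this ticket is exactly that the variable list above is all such a script can ever capture.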

Would you mind checking that list and let me know if I missed something?

And would it be possible to add the missing information? I think the most critical parts that are missing (at least for us) are:

- Dependency
- Requeue
- TimeLimit
- ReqNodeList
- ExcNodeList
- all the job resource requests information (NumCPUs, NumTasks, TRES)
- the actual resources allocated, with CPU and GPU ids per node
- Licenses
- Command

Of course, if there's a better way to do this, please let me know. What we're looking for is really capturing the most details about how a job has been submitted, and what resources have been allocated (things that are not in the accounting database).

Thanks!
--
Kilian
Comment 3 Oriol Vilarrubi 2021-08-23 10:21:43 MDT
Hi Kilian,

For that task I would suggest using a Job Completion plugin; there are several of them:

elasticsearch: If you are already using elasticsearch for something else, I would recommend this one, as you then have all the data in a structured form and it is easy to build dashboards from it. The list of stored values is the following: account,alloc_node,array_job_id,array_task_id,cluster,container,cpu_hours,cpus_per_task,derived_ec,elapsed,@eligible,@end,excluded_nodes,exit_code,group_id,groupname,het_job_id,het_job_offset,jobid,job_name,nodes,ntasks,ntasks_per_node,ntasks_per_tres,orig_dependency,pack_job_id,pack_job_offset,parent_accounts,partition,qos,@queue_wait,reservation_name,script,@start,state,std_err,std_in,std_out,@submit,time_limit,total_cpus,total_nodes,tres_alloc,tres_req,user_id,username,wc_key,work_dir

filetxt: You will most probably find it insufficient for your needs, as not much data is stored with it. This is an example from a test job I launched earlier: JobId=7052 UserId=jvilarru(1000) GroupId=jvilarru(1000) Name=hostname JobState=COMPLETED Partition=debug TimeLimit=UNLIMITED StartTime=2021-08-23T16:13:10 EndTime=2021-08-23T16:13:10 NodeList=centos NodeCnt=1 ProcCnt=1 WorkDir=/home/jvilarru ReservationName= Tres=cpu=1,mem=200M,node=1,billing=1 Account=users QOS=normal WcKey= Cluster=cluster SubmitTime=2021-08-23T16:13:10 EligibleTime=2021-08-23T16:13:10 DerivedExitCode=0:0 ExitCode=0:0

script: This plugin executes the script you set in JobCompLoc, with the environment populated with the job's variables. You can find the full list of those here: https://github.com/SchedMD/slurm/blob/master/src/plugins/jobcomp/script/README. But again, this list does not contain all the data you want to extract from slurm.
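As an illustration, a minimal JobCompLoc handler might look like this. This is a sketch, not an official example: JOBID and JOBSTATE are among the variables the plugin exports (per the README linked above), and the log path is an assumption chosen so the sketch runs anywhere.

```shell
#!/bin/sh
# Sketch of a JobCompLoc handler for the jobcomp/script plugin.
# JOBID and JOBSTATE are set by the plugin when a job completes; the
# defaults and the log path are assumptions for illustration only.
LOG="${LOG:-/tmp/jobcomp.log}"
echo "$(date -u +%FT%TZ) jobid=${JOBID:-?} state=${JOBSTATE:-?}" >> "$LOG"
```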

mysql: This plugin stores the data in mysql; the columns are the following: | jobid | uid  | user_name | gid  | group_name | name     | state | partition | timelimit | starttime  | endtime    | nodelist | nodecnt | proc_cnt | connect_type | reboot | rotate | maxprocs | geometry | start | blockid |
Again, this is missing some of the fields you need for your request.

And finally you have the lua plugin. This one lets you access the internal job_record structure (filled in here: https://github.com/SchedMD/slurm/blob/master/src/lua/slurm_lua.c#L340), so you can reach all the data from inside slurm. I'm preparing and testing an example for you.

In order to configure it you need to set JobCompType=jobcomp/lua and create jobcomp.lua in the same directory as slurm.conf. In this file you need to define the function slurm_jobcomp_log_record, taking one parameter (the job record). So the minimal script would be the following:

jobcomp.lua:
-- Minimal jobcomp.lua: slurm_jobcomp_log_record is called once per completed job.
function slurm_jobcomp_log_record(job_rec)
    -- job_rec exposes the implemented job fields; do the recording here.
    return slurm.SUCCESS
end
-- The file-level return value tells slurmctld the script loaded correctly.
return slurm.SUCCESS

Also take into account that in 21.08 the submission line is stored in the database, so you can access it with sacct, for example:

[root@centos ~]# sacct -P -j 7055 -Xo JobID,SubmitLine
JobID|SubmitLine
7055|srun --mem=200 hostname

And you can also store the job script itself in the database with AccountingStoreFlags=job_script in slurm.conf.
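If my memory serves, on 21.08+ with those flags set, the stored data can be pulled back through sacct as well. A sketch, reusing the job id from the example above (verify the exact option names against your sacct version):

```
# Dump the stored batch script / environment for job 7055
# (requires AccountingStoreFlags=job_script and job_env, 21.08+).
sacct -j 7055 --batch-script
sacct -j 7055 --env-vars
```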

I'll come back to you as soon as I've finished the lua script example.
Comment 4 Kilian Cavalotti 2021-08-24 16:29:26 MDT
Hi Oriol,

Thanks for the thorough answer, much appreciated!

Out of all the options you presented, the most promising and interesting to us seems to be the jobcomp/lua script.

> Also take into account that in 21.08 the submission line is stored in the
> database, so you can access it with sacct, example:
> 
> [root@centos ~]# sacct -P -j 7055 -Xo JobID,SubmitLine
> JobID|SubmitLine
> 7055|srun --mem=200 hostname
>
> And you can also store the job script itself in the database with
> AccountingStoreFlags=job_script in slurm.conf

Ah good, that will be very helpful as well.

> I'll come back to you as soon as I've finished the lua script example.

Thanks! Looking forward to it.

Cheers,
--
Kilian
Comment 5 Oriol Vilarrubi 2021-09-02 07:25:58 MDT
Hi Kilian,

Some colleagues of mine told me that you have a lot of jobs in your environment. Knowing that, the lua job completion plugin is not the best option, as with that many jobs it might degrade slurm's performance. So I'm now finishing testing another idea, which is the following:

I do not know if you are aware that slurm now offers the possibility of querying it through a REST API. My idea is that in the SlurmctldProlog you write only the jobid into a file, and a script constantly checks that file and queries the REST API to get the information about the jobs, saving it to a DB, JSON files, etc.
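A rough sketch of that poller, with plenty of assumptions: slurmrestd listening on localhost:6820, API version v0.0.37 (match yours), a JWT in $SLURM_JWT, and the queue/output paths invented for illustration.

```shell
#!/bin/sh
# Sketch: drain a file of job ids (appended by SlurmctldProlog) and save
# each job's JSON record from slurmrestd. All paths, the port, and the
# API version are assumptions; adjust for your site.
QUEUE_FILE="${QUEUE_FILE:-/var/spool/slurm/jobid.queue}"
OUT_DIR="${OUT_DIR:-/var/spool/slurm/jobinfo}"

fetch_job() {
    # Query the REST API for one job and save the JSON record.
    jobid="$1"
    curl -s -H "X-SLURM-USER-NAME: $(id -un)" \
         -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" \
         "http://localhost:6820/slurm/v0.0.37/job/${jobid}" \
         > "${OUT_DIR}/${jobid}.json"
}

poll_queue() {
    # Fetch each pending job id once, then truncate the queue file.
    [ -f "$QUEUE_FILE" ] || return 0
    while read -r jobid; do
        [ -n "$jobid" ] && fetch_job "$jobid"
    done < "$QUEUE_FILE"
    : > "$QUEUE_FILE"
}
```

Run poll_queue from cron or a loop with a short sleep; the prolog itself then only does a cheap append instead of an scontrol call.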

Now I'm testing this to see if it would provide all the necessary data for you.

I'll keep you updated.
Comment 6 Kilian Cavalotti 2021-09-02 09:47:19 MDT
Hi Oriol, 

(In reply to Oriol Vilarrubi from comment #5)
> Some colleagues of mine told me that you have a lot of jobs in your
> environment, 

That's true: we're averaging around 20,000 jobs in queue at any given moment, with job submission rates in the 100,000s/day.

> after knowing that the lua jobcompletion component is not the
> best option as with many jobs it might degrade the performance of slurm

We're using a decently sized job_submit.lua script that gets executed at every job submission, and this seems to be working fine. Do you think a job completion lua script would have more impact? If anything, it should run less often, since the job_submit script even runs for jobs that are eventually rejected, which never reach the job completion phase.

So I'm curious about the impact of a lua jobcompletion script vs a job_submit lua script.

> I do not know if you are aware that slurm now offers the possibility of
> querying it using a rest API, so my idea is that in the slurmCtldProlog you
> write only the jobid into a file, and that a script constantly checks that
> file and queries the REST api in order to get the information about the jobs
> and saving that in a DB, json files, etc...

Oh I see, that's an interesting idea. Although, if we go with an external process querying jobs, we could probably use `scontrol show job` directly instead of the extra slurmrestd layer, right? I assume slurmrestd generates the same RPCs and takes potentially the same locks as `scontrol show job`, so in terms of load on the controller that's probably equivalent, correct?

Thanks!
--
Kilian
Comment 8 Oriol Vilarrubi 2021-09-13 06:25:10 MDT
Hello Kilian,

I've been doing some tests on the lua jobcomp plugin, and unfortunately not all the fields that you need are there. You can get a list of all the currently implemented fields by looking at the function slurm_lua_job_record_field in the source code: https://github.com/SchedMD/slurm/blob/master/src/lua/slurm_lua.c#L340

But it's not all bad news: I've also been "playing around" with the filetxt completion plugin, and even though it does not have all the fields that you need, it is pretty easy to add them. The same goes for the lua one.

So, how do you want to proceed? With the lua script or the filetxt plugin? I'm inferring that the strategy is to move the process from the slurmctld prolog into the job completion plugin. I'm also taking for granted that, for anything that can be obtained from the accounting DB, you do not want to use the job completion plugin.

Greetings.
Comment 9 Kilian Cavalotti 2021-09-13 09:25:29 MDT
Hi Oriol,

(In reply to Oriol Vilarrubi from comment #8)
> I've been doing some tests on the jobcomp plugin of lua and unfortunately
> not all the fields that you need are there. You can get a list of all the
> currently implemented fields by looking at the function
> slurm_lua_job_record_field in the source code:
> https://github.com/SchedMD/slurm/blob/master/src/lua/slurm_lua.c#L340
> 
> But not everything are bad news, I've also been "playing around" with the
> filetxt completion plugin and even though it does not have all the fields
> that you need, it is pretty easy to add them. Also the same for the filetxt
> one.

That sounds great!

> So, how do you want to proceed? with the script or the filetxt? 

I think that the lua job completion approach may be the better of the two, as it would provide more flexibility to define the recording format we want. For instance, my understanding is that the filetxt completion plugin stores all job information in a single file, while we would need each job's information stored in a separate file.

> I'm
> inferring that the strategy is to move the process from the slurmctld prolog
> into the jobcompletion plugin? I'm also taking for granted that all that can
> be obtained using the accounting DB you do not want to use the jobcompletion
> plugin for it.

Yes, and actually, thinking more about this approach, there are a few questions, I guess:

1. recording that information through a jobcomp plugin actually seems a bit redundant with the accounting database. I know that the DB has recently been extended to store more information about jobs (like the submission script, workdir, etc.), but parts are still missing. So instead of extending the information recorded by the jobcomp mechanism, wouldn't it make more sense to continue adding the missing bits to the accounting database, and thus have a single point of reference for all job information?
Fragmenting the information across the accounting DB and a job completion plugin doesn't seem optimal in that respect.

2. our current scontrol-based system runs during the SlurmctldProlog, when the job *starts*. With a jobcomp plugin, it would run when the job *ends*, meaning that for the whole duration of the job, that information would not be available. We routinely rely on that information while jobs are running, so moving to a jobcomp plugin wouldn't actually work for us, since the information wouldn't be available until a job has ended.
On the other hand, job information becomes available in the accounting database as soon as the job starts.
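For example, a quick sacct query (hypothetical job id; standard format fields) already returns data while the job is still running:

```
# Works as soon as the job starts, not only after completion.
sacct -j 12345 -Xo JobID,State,Submit,Start,NodeList
```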

Both those points make me think that expanding the job accounting database to store the missing information would be better than using a separate jobcomp plugin.


What do you think?

Thanks!
--
Kilian
Comment 10 Oriol Vilarrubi 2021-09-14 12:52:23 MDT
Hi Kilian,

I will reply inline below.

> I think that the lua job completion approach may be best of the two, as it
> would provide more flexibility for users to define the recording format they
> want. For instance, my understanding is that the filetxt completion plugin
> will store all job information in a single file, while we would need to have
> each job's information stored in a separate file.

I agree with the point that the lua one will provide more flexibility, also as you said, the filetxt one stores it in the same file, so that would be problematic if you want to read it while it is being written.

> 
> 1. recording that information through a jobcomp plugin actually seems a bit
> redundant with the accounting database. I know that the DB has been recently
> extended to store more information about jobs (like the submission script,
> workdir, etc) but parts are still missing. So instead of extending the
> information recorded by the jobcomp mechanism, wouldn't it make more sense
> to continue adding the missing bits in the accounting database, and thus
> have a single point of reference for all job information?
> Fragmenting the information across the accounting DB and a job completion
> plugin doesn't seem optimal in that respect.

That is a difficult topic: some sites have really, really big databases, to the point where adding a single field would make their database grow a lot, and might have some impact on performance.
Also, if we store all this data separately (Dependency, exclude/include nodelist, etc.), we might be duplicating data. Let me explain: as you know, in 21.08 the submission line is stored in the DB, and there is an option to also store the job script itself. For the vast majority of jobs, that contains the totality of the submission data, either inside the job script or in the parameters of the submit line.
Even so, I understand your point that you want this data directly accessible without needing to parse the job script and the submit line, but as said before, we need to be very careful in how we modify the DB fields.

> 2. our current scontrol-based system occurs during the SlurmctldProlog, when
> the job *starts*. With a jobcomp plugin, it would occurs when the job
> *ends*. Meaning that during the whole duration of the job, that information
> would not be  available. And we routinely rely on that information while
> jobs are running, so moving to a jobcomp plugin wouldn't actually work for
> this, since the information wouldn't be available until a job has ended.
> On the other hand, job information becomes available in the accounting
> database as soon as the job starts.

That is a very valid point: you would have no data while the job is running (if you move entirely to the jobcomp plugin).

> Both those points make me think that expanding the job accounting database
> to store the missing information would be best than using a separate jobcomp
> plugin.

As said before, I am reluctant to modify the fields of what is stored in the database.
Can I try to convince you to let me include the missing env vars in the SlurmctldProlog instead? I cannot guarantee that this change would ship officially with the next slurm version, but I can provide you a patch for your local slurm installation.
Does that sound good to you?
Comment 11 Kilian Cavalotti 2021-09-14 13:03:30 MDT
Hi Oriol, 

(In reply to Oriol Vilarrubi from comment #10)
> As said before I am reluctant to modify the fields of what is stored in the
> database. 

And you made valid points about it, that's true. This is probably a larger issue than just this bug, and something that has likely been discussed many times, but the discrepancy (and sometimes redundancy) between all the different ways to look at a job (through squeue, scontrol show job, sstat or sacct) has always been, and remains, a great source of confusion for users and sysadmins alike.

> Can I try to convice you so that I include the missing ENV vars in the
> slurmctldprolog? 

Well, yes, that would be great! Having the missing bits available as environment variables in SlurmctldProlog would totally work for us.

> I cannot guarantee that this change would ship officially
> with the next slurm version, but I can provide you a patch for your local
> slurm installation to include it. 
> Does that sounds good to you?

I'm completely fine with testing and carrying a local patch.
It will very likely benefit other sites as well, so I'm pretty sure it would be useful to integrate it eventually.

Thank you!
--
Kilian
Comment 16 Jason Booth 2022-04-12 16:24:04 MDT
Kilian - I have been reviewing this issue with Oriol. We would be willing to expand the database; however, adding all the environment variables is something we are not interested in at this time, at least not without some type of sponsored development.

Recently, we added a flag that helps sites record more information about their jobs, which I am sure you are aware of.

New:
job_script

Current/Previously supported:
job_env
job_comment

https://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStoreFlags

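Putting those flags together, the relevant slurm.conf line would look something like this (a sketch; combine only the values you need):

```
# slurm.conf excerpt (21.08+): store comment, environment, and batch script
AccountingStoreFlags=job_comment,job_env,job_script
```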
Although this does not tick every box or cover every feature you are after, it does offer some added details about your jobs that were not previously stored.