Ticket 9331 - Expand and sync prolog/epilog environment variables
Summary: Expand and sync prolog/epilog environment variables
Status: RESOLVED DUPLICATE of ticket 12110
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.2
Hardware: Linux
Severity: 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-07-02 16:24 MDT by Anthony DelSorbo
Modified: 2022-04-28 15:29 MDT
CC List: 5 users

See Also:
Site: NOAA
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: NESCC
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Juno slurm.conf file (7.95 KB, text/plain)
2020-07-03 11:48 MDT, Anthony DelSorbo
Details
sdiag out from 0628 through 0702 (3.98 KB, application/x-compressed-tar)
2020-07-03 11:52 MDT, Anthony DelSorbo
Details
perf data file (about 10 min) (10.31 MB, application/x-compressed-tar)
2020-08-17 11:45 MDT, Anthony DelSorbo
Details
file created via: perf archive perf.data (8.25 MB, application/x-bzip)
2020-08-17 11:47 MDT, Anthony DelSorbo
Details
Refreshed sdiag as of this morning directly after perf run (4.47 KB, text/plain)
2020-08-17 11:48 MDT, Anthony DelSorbo
Details
All my pro/epilog files (30.00 KB, application/x-tar)
2020-08-18 12:52 MDT, Anthony DelSorbo
Details
perf data file refreshed 2020-08-18 (36.21 MB, application/x-compressed-tar)
2020-08-18 12:54 MDT, Anthony DelSorbo
Details
perf data file refreshed 2020-08-24 (21.14 MB, application/x-compressed-tar)
2020-08-24 08:26 MDT, Anthony DelSorbo
Details

Description Anthony DelSorbo 2020-07-02 16:24:03 MDT
I'm seeing some odd behavior as of late with jobs failing with either exit code 0:9 or exit code 1:0.  This occurs during a stress test we run to ensure the system still functions as it should and is able to handle a large number of job requests.  

In the case where we see exit code 0:9, for example, 

Jul  1 21:26:43 j1c10 EPILOG-ROOT: Job 3638483: (86) Finished: Job 3638483 (serial) finished for user Raghu.Reddy in partition juno with exit code 0:9
Jul  1 21:26:43 j1c10 EPILOG-ROOT: Job 3638491: (86) Finished: Job 3638491 (serial) finished for user Raghu.Reddy in partition juno with exit code 0:9
Jul  1 21:26:43 j1c10 EPILOG-ROOT: Job 3638492: (86) Finished: Job 3638492 (serial) finished for user Raghu.Reddy in partition juno with exit code 0:9
Jul  1 21:26:43 j1c13 EPILOG-ROOT: Job 3638480: (86) Finished: Job 3638480 (serial) finished for user Raghu.Reddy in partition juno with exit code 0:9
Jul  1 21:26:43 j1c13 EPILOG-ROOT: Job 3638481: (86) Finished: Job 3638481 (serial) finished for user Raghu.Reddy in partition juno with exit code 0:9


there are some corresponding entries in the slurmctld at the time of the above messages:

61539703 [2020-07-01T21:26:43.207] debug:  Note large processing time from _slurm_rpc_epilog_complete: usec=1858446 began=21:26:41.348
...
61539713 [2020-07-01T21:26:43.209] debug:  Note large processing time from _slurm_rpc_dump_job_single: usec=1691571 began=21:26:41.516

These jobs do tax the slurmctld: they are short, and they chain themselves by submitting a copy of themselves via the sbatch command just before they end.  The jobs with exit code 0:9 appear to fail at the point where the sbatch command is fired off to launch the chained job.  Each job uses only one core, and other jobs of the same type share the node, up to 40 jobs (one per core).  Out of 85K jobs processed, we have experienced about 80 failures.  While 85K jobs in about 20 hours seems like a lot, we have run up to 150K jobs in 24 hours in the past without issue.
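
For illustration, a minimal sketch of such a self-chaining stress-test job (job name, workload, and time limit are hypothetical):

#!/bin/bash
#SBATCH --job-name=chain_stress
#SBATCH --ntasks=1
#SBATCH --time=00:02:00

# short workload stand-in
sleep 30

# Just before exiting, submit a copy of this script so the chain continues.
# If this sbatch call fails or is killed (e.g. the controller is too busy),
# the chain breaks and the failure shows up in the epilog logging.
sbatch "$0"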

I do believe this is a communication issue among the nodes.  BUT, I have tuned the system via the high-throughput guidelines SchedMD has provided and this is a very small cluster - two server nodes and only 14 compute nodes in action.  What limit have I reached that may be causing this intermittent issue?  

Please let me know what additional information I can provide you to assist in tuning the system.
Comment 1 Dominik Bartkiewicz 2020-07-03 10:30:41 MDT
Hi

Can you send me your slurm.conf and the output from sdiag for this system?

Dominik
Comment 2 Anthony DelSorbo 2020-07-03 11:48:22 MDT
Created attachment 14906 [details]
Juno slurm.conf file
Comment 3 Anthony DelSorbo 2020-07-03 11:52:57 MDT
Created attachment 14907 [details]
sdiag out from 0628 through 0702

This file contains 5 sdiag files.  We collect these nightly at 23:59 UTC.
Comment 4 Dominik Bartkiewicz 2020-07-10 09:36:36 MDT
Hi

Sorry for the late response.

In sdiag I noticed:
REQUEST_SUBMIT_BATCH_JOB seems to be heavy. Could this be caused by JobSubmitPlugins=lua?

There are a lot of REQUEST_NODE_INFO and REQUEST_JOB_INFO_SINGLE RPCs. Do these come from the prolog/epilog?
If so, what info do you require in these scripts?

Can you grab some perf data over ~10 minutes (both perf.data.tar.bz2 and perf.data)?
Maybe this will show us the bottleneck.

eg.:
perf record  -s --call-graph dwarf -p `pidof slurmctld`
perf archive perf.data

then send both perf.data.tar.bz2 and perf.data

Dominik
Comment 5 Dominik Bartkiewicz 2020-08-11 05:19:56 MDT
Hi

Any news?

Dominik
Comment 6 Tony 2020-08-11 10:12:07 MDT
(In reply to Dominik Bartkiewicz from comment #5)
Hi Dominik,

Sorry for the delay.  Had some personal medical issues to attend to since your reply and haven't had an opportunity to get back to this.  

We're going to update our system to 20.02.4 today and rerun our stress tests.  I'll collect that info during that time and send it to you.

Best,

Tony.
Comment 7 Anthony DelSorbo 2020-08-14 13:44:47 MDT
(In reply to Dominik Bartkiewicz from comment #4)
> Can you grab some perf data from ~10 minutes (both perf.data.tar.bz2 and
> perf.data)?
> Maybe this will show us some bottleneck.
> 
> eg.:
> perf record  -s --call-graph dwarf -p `pidof slurmctld`
> perf archive perf.data

Dominik,

Not very familiar with perf.  I'll come up to speed on it.  But we didn't even have it on the system.  Just installed it via yum install perf.  Regardless, I don't believe it's working:

[root@bqs7 ~]# perf record  -s --call-graph dwarf -p `pidof slurmctld`
WARNING: Ignored open failure for pid 74924
WARNING: Ignored open failure for pid 74936
WARNING: Ignored open failure for pid 74950
WARNING: Ignored open failure for pid 74957
WARNING: Ignored open failure for pid 74960

And, even though perf.data exists with a size of 552K, there is not a perf.data.tar.bz2 file.  And trying to read perf.data:

[root@bqs7 ~]# perf script -i perf.data > perf.data.txt
WARNING: The perf.data file's data size field is 0 which is unexpected.

Please let me know how you'd like to proceed.

Tony
Comment 8 Dominik Bartkiewicz 2020-08-17 09:13:05 MDT
Hi

To generate perf.data.tar.bz2 you need to run "perf archive perf.data".

Dominik
Comment 9 Anthony DelSorbo 2020-08-17 11:45:11 MDT
Created attachment 15470 [details]
perf data file (about 10 min)

Compressed the perf.data file using "tar -czvf perf.data.tgz perf.data"; this was necessary to fit within the attachment size limit.
Comment 10 Anthony DelSorbo 2020-08-17 11:47:16 MDT
Created attachment 15471 [details]
file created via: perf archive perf.data
Comment 11 Anthony DelSorbo 2020-08-17 11:48:14 MDT
Created attachment 15472 [details]
Refreshed sdiag as of this morning directly after perf run
Comment 12 Dominik Bartkiewicz 2020-08-18 11:29:45 MDT
Hi

I am afraid that perf.data is broken (reports online suggest this can happen with large processes). Could you capture perf.data one more time? The perf.data.tar.bz2 you already sent is fine.

Could you send me prolog and epilog from slurmctld and slurmd?

Dominik
Comment 13 Anthony DelSorbo 2020-08-18 12:52:22 MDT
Created attachment 15490 [details]
All my pro/epilog files
Comment 14 Anthony DelSorbo 2020-08-18 12:54:16 MDT
Created attachment 15491 [details]
perf data file refreshed 2020-08-18

See if this works out.
Comment 15 Dominik Bartkiewicz 2020-08-24 06:40:54 MDT
Hi

Unfortunately, I also can't read this perf data.
Could you check whether you can generate any useful output locally?
eg.:
perf report --hierarchy -T -i  perf.data

You can also try to use LBR to grab perf.data (it only works on newer Intel CPUs).
eg.:
perf record --call-graph=lbr -p `pidof slurmctld`

I noticed that every prolog and epilog generates one "scontrol show job <job_id>".
Using the s-tools (squeue, scontrol) in a script can always cause performance issues on a high-throughput system. The ideal is to use data from environment variables.
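
For example, a slurmd-side prolog can often log what it needs using only the environment Slurm already exports to it; a minimal sketch, assuming SLURM_JOB_ID, SLURM_JOB_USER, and SLURM_JOB_PARTITION are present (availability varies by version and between prolog and epilog):

#!/bin/bash
# prolog sketch: log the job start without any RPC back to slurmctld
logger -t PROLOG-ROOT "Job ${SLURM_JOB_ID}: starting for user ${SLURM_JOB_USER} in partition ${SLURM_JOB_PARTITION:-unknown}"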

Dominik
Comment 16 Anthony DelSorbo 2020-08-24 07:49:19 MDT
(In reply to Dominik Bartkiewicz from comment #15)

> Could you check if you can locally generate any useful output?
> eg.:
> perf report --hierarchy -T -i  perf.data

I get the same as you:

[root@bqs7 ~]# perf report --hierarchy -T -i  perf.data
0x3407f2a0 [0xffff]: failed to process type: -1596379697
Warning:
71 out of order events recorded.
Error:
failed to process sample
# To display the perf.data header info, please use --header/--header-only options.


> You can also try to use LBR to grab perf.data (it only works on newer Intel
> CPU)
> eg.:
> perf record --call-graph=lbr -p `pidof slurmctld`

[root@bqs7 ~]# perf record --call-graph=lbr -p `pidof slurmctld`
Lowering default frequency rate to 3000.
Please consider tweaking /proc/sys/kernel/perf_event_max_sample_rate.
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.245 MB perf.data (803 samples) ]


I've run this for less than a minute on a quiet system and it does seem to provide a call-graph report, but again, this was on a quiet system:

[root@bqs7 ~]# perf report --hierarchy -T -i  perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 803  of event 'cycles:ppp'
# Event count (approx.): 44852584
#
....

After seeing that, I also retried "perf record -s --call-graph dwarf -p `pidof slurmctld`", and it too was successful for a short period on a quiet system.  Let me restart the job load to see what happens.

> I noticed that every prolog and epilog generate one "scontrol show job
> <job_id>"
> Using s-tools (squeue, scontrol) in a script can always cause some
> performance issues in a high throughput system. The ideal is to use data
> from environment variables.

That would be great if the data were there.  But I have no way of knowing whether or not the job is on a shared node.  This is needed so that we can determine whether or not the cleanup procedure can be run on the node (both in prolog and in epilog).  If you know of another way to detect this condition, I would gladly give up the "scontrol sho job" calls.  Ideally, there should be a way for all job information to be available at the node(s) without having to call scontrol.  I'd be interested in hearing recommendations from SchedMD.
Comment 17 Anthony DelSorbo 2020-08-24 08:26:51 MDT
Created attachment 15561 [details]
perf data file refreshed 2020-08-24

This seems to work on my end.  See if it is useful to you.
Comment 18 Dominik Bartkiewicz 2020-08-25 09:11:32 MDT
Hi

The PrologSlurmctld and EpilogSlurmctld scripts seem to do only logging.
Could you check if disabling PrologSlurmctld and EpilogSlurmctld helps?

Have you considered using pam_slurm_adopt instead of the cleaning procedure in prolog/epilog?
I have tried to locate all the data needed by those scripts that is currently taken from scontrol:

isShared(): NUMTASKS, NUMCPUS, PARTITION (an env var for this currently exists only in the prolog)
epilog.root.sh: BATCHHOST

Do you think it is possible to change this approach slightly to eliminate scontrol from these scripts?

If you describe which data you need, perhaps we will be able to add them to the prolog and epilog environment in 20.11.

Have you checked the high throughput guide?
https://slurm.schedmd.com/high_throughput.html

Dominik
Comment 19 Anthony DelSorbo 2020-08-25 13:51:58 MDT
(In reply to Dominik Bartkiewicz from comment #18)
> Hi
> 
> PrologSlurmctld and EpilogSlurmctld scripts seem to do just logging.
> Could you check if disabling PrologSlurmctld and EpilogSlurmctld help?

I can do that.  Will let you know what happens.  But, I'm having trouble understanding why this would be troublesome since it should be a "local" RPC call.

> Have you considered using pam_slurm_adopt instead of the cleaning procedure
> in prolog/epilog?

We are already using pam_slurm_adopt.  How is this helpful?  As I understand it, that simply prevents other users from getting onto a node if those users don't already have a job there.  The more pressing issue is that there may be leftover processes or memory (/dev/shm) or shared storage (/tmp) still allocated (especially from an ill-terminated job).  Our past experience has forced us in this direction to ensure nodes are clean after each job or prior to the next job.
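
A minimal sketch of the kind of cleanup meant here, assuming the script has already decided the node is not shared (node_is_shared is a hypothetical helper; paths and patterns are illustrative):

# epilog cleanup sketch - only safe when no other job remains on the node
if ! node_is_shared; then
    pkill -9 -u "$SLURM_JOB_USER" 2>/dev/null   # kill leftover user processes
    rm -rf /dev/shm/*                           # reclaim shared memory segments
    rm -rf /tmp/job_scratch.*                   # reclaim scratch space (pattern is hypothetical)
fi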

> I have tried to locate all data needed by those scripts that are currently
> taken from scontrol:
> 
> isShared() NUMTASKS, NUMCPUS, PARTITION (now env exists only in prolog)
> epilog.root.sh BATCHHOST

Since I need to clean up after a job and before a job, I need the information in the prolog.root and epilog.root phases.  Our way of determining if a job is on a shared node is to naively test:

[[ $NUMTASKS -eq 1 ]] && [[ $NUMCPUS -eq 1 ]] && [[ $NUMNODES -eq 1 ]] 

since we permit single-core jobs to be placed on shared nodes.  Jobs requesting 2 or more cores get exclusive access to a node.  If there's a better way of determining this, I wouldn't have any reservations about dropping my approach.  In fact, we really want a more robust method of determining whether or not a job is on a node being shared with other jobs, since the customer would prefer to have multiple jobs on nodes if the resources are available.
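
A sketch of how that heuristic typically gets wrapped so both prolog.root and epilog.root can share it (function name hypothetical; NUMTASKS/NUMCPUS/NUMNODES come from the scontrol parsing discussed above):

job_has_exclusive_node() {
    # single-task, single-core, single-node jobs may share a node;
    # anything larger gets the node exclusively under the current policy
    ! { [[ $NUMTASKS -eq 1 ]] && [[ $NUMCPUS -eq 1 ]] && [[ $NUMNODES -eq 1 ]]; }
}

if job_has_exclusive_node; then
    : # safe to run node-wide cleanup here
fi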


> Do you think it is possible to change this approach slightly to eliminate
> scontrol from this script?

Our goal is to provide the user clean nodes to work on.  I would love to give up scontrol if I could be provided with the necessary information at each node in both the prolog and epilog.  We do use the output of "scontrol sho job" for troubleshooting purposes by logging that info to the syslog.  It would be helpful if SchedMD would simply provide that info to each node upon allocation.  That would solve this issue altogether.


> If you describe which data you need, perhaps we will be able to add them to
> prolog and epilog environment in 20.11.  

See above.


> Have you checked the high throughput guide?
> https://slurm.schedmd.com/high_throughput.html

For the "most" part, I've been pretty cognizant of this document - at least for the proc filesystem:

cat /proc/sys/fs/file-max
19468965

cat /proc/sys/net/ipv4/tcp_max_syn_backlog
2048

cat /proc/sys/net/ipv4/tcp_syncookies
1

cat /proc/sys/net/ipv4/tcp_synack_retries
5

cat /proc/sys/net/core/somaxconn
1024

cat /proc/sys/net/ipv4/ip_local_port_range
32768	60999

The exception is the txqueuelen, which at the moment remains at 1000.  If you think this is a contributor to my issues, we would have to make a cluster-wide change as I would think this should apply to all slurmd nodes as well as the controllers.
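
For reference, a sketch of persisting the values shown above so they survive reboots (file name and interface are hypothetical; the recommended targets should come from the high-throughput guide):

# /etc/sysctl.d/90-slurm-throughput.conf - mirrors the current settings above
fs.file-max = 19468965
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 5
net.core.somaxconn = 1024
net.ipv4.ip_local_port_range = 32768 60999

# apply without a reboot:
#   sysctl --system
# and, if txqueuelen is to be raised, something like:
#   ip link set dev eth0 txqueuelen 10000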

As for slurmctld limits, most are set to unlimited (as you can see), but open file count might need some tweaking. On the other hand, it is set in systemd to LimitNOFILE=65536.  Is this sufficient?  Do you see anything else that should be tuned?

[root@bqs7 ~]# cat /proc/2645/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             766912               766912               processes 
Max open files            65536                65536                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       766912               766912               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us   

I'll take some time to review the slurm.conf configuration.  Most of the settings we have were reviewed by Tim Wickberg about 2 years ago.  But it's time for a review now that we have more experience under our belt.

As for slurmdbd, we don't purge any of our data.  Customer wants the data to be persistent for the life of the system.  If you think this is a major contributor, we can have that discussion with the customer - please let me know.  We don't use CommitDelay.  But may consider setting it to 1 for testing.  Customer is very risk averse and does not want to lose any data.
Comment 21 Dominik Bartkiewicz 2020-08-27 04:55:02 MDT
Hi

Slurm doesn't have "local" RPCs. All requests are treated the same way.

"scontrol show job <>"  request requires jobs read lock and it is interrupting for scheduler/backfill.

Besides, PrologSlurmctld creates a separate slurmctld thread for every job start, which also requires the acquisition of a job write lock. This is an expensive operation that can severely limit scheduler throughput.
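
For reference, the scripts involved are configured in slurm.conf; a sketch (paths hypothetical) of keeping the node-side prolog/epilog while disabling the controller-side ones:

# slurm.conf sketch
Prolog=/etc/slurm/prolog.root.sh            # runs on the compute node via slurmd
Epilog=/etc/slurm/epilog.root.sh
#PrologSlurmctld=/etc/slurm/ctld_prolog.sh  # disabled: each job start would spawn a
#EpilogSlurmctld=/etc/slurm/ctld_epilog.sh  # slurmctld thread and take job write locks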

Dominik
Comment 22 Dominik Bartkiewicz 2020-09-02 05:01:10 MDT
Hi

I don't think slurmdbd is causing any problem in this case.

Enabling the slurmctld prolog can certainly limit scheduling throughput significantly.

We have a long-term plan to extend and unify the set of environment variables available in the prolog/epilog scripts. But for now, I recommend limiting the use of "scontrol show job" in scripts to a minimum.

The cons_tres plugin can also be notably slower (e.g. with preempt/qos) than cons_res.
Do you plan to use features exclusively available in cons_tres?

Does disabling slurmctld prolog help?

Dominik
Comment 23 Anthony DelSorbo 2020-09-02 08:00:35 MDT
(In reply to Dominik Bartkiewicz from comment #22)
Hi Dominik,

> For sure, enabling slurmctld prolog can significantly limit scheduling
> throughput.
> Does disabling slurmctld prolog help?

Yes - I've disabled both the prolog and epilog slurmctld scripts.  It did provide a more stable environment.  Thank you for your hints there.  

> We have a long term plan to extend and unified set of env available in
> prolog epilog scripts. But for now, I recommend limit using "scontrol show
> job" in scripts to a minimum.

Yes. It is now limited to only the root prolog and epilog, and we use it only once per script.  This is still troublesome: when many jobs hit the controller at the same time, or a single job with many nodes has each node calling the slurmctld at the same time, there is a noticeable impact on the responsiveness of the controller.

Will any of the "long term plan ... env" be available in 20.11?  Our inability to efficiently determine whether a job is on a shared node (in prolog and epilog) is really affecting us.  It would be very helpful to have this capability sooner than later.

> cons_tres plugin can also be notable slower (eg:. preempt/qos) then cons_res.
> Do you plan to use features exclusively available in cons_tres?

We do use preemption on a limited basis.  We expect the need for this to increase in the future.  I know the customer wants to take advantage of many more of Slurm's features.

Tony.
Comment 24 Dominik Bartkiewicz 2020-09-03 06:52:13 MDT
(In reply to Anthony DelSorbo from comment #23)
> 
> Yes. It is now limited to only the root prolog and epilog. We use it only
> once per script.  This is still troublesome as when there are many jobs
> trying to get at the controller at the same time, or even a job with many
> nodes - each making a call to the slurmctld at the same time, there is a
> noticeable impact on the responsiveness of the controller.

Could you send me the output from sdiag and capture another perf trace?

> 
> Will any of the "long term plan ... env" be available in 20.11?

We can add particular fields to 20.11, but a bigger rewrite of this code will be done in 21.08.

preempt/qos is available in both select plugins, but it can be much slower with cons_tres.
Do you know which of the features they are considering using?


Dominik
Comment 25 Dominik Bartkiewicz 2020-09-29 04:28:33 MDT
Hi

Any news?
Do you still notice problems during the stress tests? If so, could you send me the data mentioned in the previous comment?

Dominik
Comment 26 Dominik Bartkiewicz 2020-10-12 05:48:16 MDT
Hi

Any news?
If there is no update here, I will close this ticket in a few days as timed out.

Dominik
Comment 27 Anthony DelSorbo 2020-10-12 07:10:27 MDT
(In reply to Dominik Bartkiewicz from comment #26)
> Hi
> 
> Any news?
Dominik - Apologies for the delay.  The test system is still being used to test the 20.11 pre-release for the enhancements we requested.  I'm unsure when I will get the system back.

I can tell you that since we prevented the slurmctld pro/epi scripts from running (since they do "scontrol sho job", too), the system performs much better.  But again, I am now unable to use a feature of Slurm because of this performance degradation.

Is there any news on your end on providing the necessary job information to the pro/epi scripts at the nodes (and the controller) so that we don't have to query the controller to get it?

Best,

Tony.
Comment 28 Dominik Bartkiewicz 2020-10-13 08:27:56 MDT
Hi

If you give me a list of the fields/information required in the prolog, I will try to add them to the available env vars. Unfortunately, adding env vars to the epilog is more complicated, so if you can find an alternative way to obtain this info, that would be best.

Dominik
Comment 29 Dominik Bartkiewicz 2020-10-13 08:41:53 MDT
Hi

You can get a list of the jobs and steps currently running on a slurmd host with 'scontrol show slurmd'. This connects directly to slurmd and has zero impact on slurmctld.

eg.:
SLURMD_NODENAME=<host_name> scontrol show slurmd |grep 'Active Steps'
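
A sketch building on that, for deciding in a prolog/epilog whether any other job still has steps on the node (the format of the 'Active Steps' line is an assumption):

ACTIVE=$(SLURMD_NODENAME=$(hostname -s) scontrol show slurmd | grep 'Active Steps' | cut -d= -f2 | tr -d ' ')
OTHER=$(echo "$ACTIVE" | tr ',' '\n' | grep -v "^${SLURM_JOB_ID}\." | grep -vc -e '^$' -e 'NONE')
if [ "$OTHER" -gt 0 ]; then
    logger "Job ${SLURM_JOB_ID}: node is shared, skipping node-wide cleanup"
fi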

Dominik
Comment 30 Anthony DelSorbo 2020-10-13 16:11:49 MDT
(In reply to Dominik Bartkiewicz from comment #29)
> Hi
> 
> You can get a list of job and step currently running on slurmd host by
> 'scontrol show slurmd'. This connects directly to slurmd and has zero impact
> on slurmctld.

Now, that's interesting.  I did not know that.  I thought any "scontrol" command went through the slurmctld.  I'll need to investigate that.  Thanks for that valuable information!

Here's what we currently key on in the pro/epilogs:

STDOUT=${jobData['StdOut']}

EXITCODE=${jobData['ExitCode']}

PARTITION=${jobData['Partition']}
NUMNODES=${jobData['NumNodes']}
NUMCPUS=${jobData['NumCPUs']}
NUMTASKS=${jobData['NumTasks']}

If these can be made available in both prolog and epilog, then we would be able to forgo the "scontrol sho job" call, even though all the other job info is useful for troubleshooting.
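
For reference, a sketch of how a jobData array like the one above is typically filled from a single "scontrol show job" call (the parsing is illustrative and assumes space-separated Key=Value pairs):

declare -A jobData
while read -r pair; do
    jobData[${pair%%=*}]=${pair#*=}
done < <(scontrol show job "$SLURM_JOB_ID" | tr ' ' '\n' | grep '=')

PARTITION=${jobData['Partition']}
NUMNODES=${jobData['NumNodes']}
NUMCPUS=${jobData['NumCPUs']}
NUMTASKS=${jobData['NumTasks']}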

If there is one piece of information I really need, it is whether or not the job is on a shared node.  If that can be made available, it would make what we do so much easier.

Tony.
Comment 31 S Senator 2020-10-13 16:29:03 MDT
Just to chime in: we would find it useful, and it could significantly reduce the complexity of our prolog scripts, if the following fields from a job's attributes were available to the script without a call to 'scontrol show job':
   AdminComment - to pass site-specific json to job script
   AllocNode - to determine entry point to the cluster, implying security policy
   Feature - to determine secondary job allocation that may not be part of the common scheduling policy
   Requeue - to affect behavior on errors encountered within the prolog
   Reservation - to influence decisions based on the above

Ideally, if a job were running within a reservation, reservation attributes would also be extremely useful, especially:
   State - to determine if we kept running after the reservation finished
   MaxStartDelay - to influence overlap into a reservation / cancel policy
   EndTime - to influence the signal catching & checkpoint behavior remaining
Comment 34 Dominik Bartkiewicz 2020-10-28 10:27:27 MDT
Hi

Sorry for the late response. 
Unfortunately, after talking with Tim, we decided that we can't add any new env vars to 20.11. As I mentioned before, we plan to rewrite this part of the code in 21.08. After that, both prolog and epilog will have the same, larger set of environment variables.

Dominik
Comment 35 Dominik Bartkiewicz 2020-11-20 08:09:32 MST
Hi

Please let me know if there is anything else I can do to help or if this is ok to close.

Dominik
Comment 36 Anthony DelSorbo 2020-11-20 08:37:51 MST
(In reply to Dominik Bartkiewicz from comment #35)
Dominik

We were disappointed to read your response in comment 34, to be sure.  We need your support in providing systems engineers with the job information needed in the prologs and epilogs in order to make appropriate decisions on how to manage the nodes.  If you have a roadmap of when those features would be available, so that we can inform our management team, that would be helpful.

Tony.
Comment 38 Tim Wickberg 2020-12-12 20:35:56 MST
Hi Tony -

I'm updating some details on this ticket to reflect an outstanding request to expand the details available in the prolog/epilog, and to ensure those are synced up where appropriate.

While this may seem like a simple request, it's unfortunately a bit complicated due to our architecture and the current RPC patterns. We cannot expand the prolog without some significant refactoring.

While I'd like to tackle that at some point, without sponsorship I can make no guarantee of a timeframe for any related changes.

If sponsoring such work is of interest to NOAA, I can get an SoW over to you sometime in January that would ensure this is done by the 21.08 release next August.

- Tim
Comment 39 Anthony DelSorbo 2020-12-13 12:57:45 MST
(In reply to Tim Wickberg from comment #38)
> 
> If sponsoring such work is of interest to NOAA, I can get an SoW over to you
> sometime in January that would ensure this is done by the 21.08 release next
> August.
 
Tim,

Thanks for getting back to us on this.  We'd be interested, but we will need time to get it through the process.  So the sooner you can get it to us, so that it can be reviewed by our team with a bit of coordination and review by your team, the better.  As you know, the process here takes time for approval and fund allocation.

Best,

Tony
Comment 40 Anthony DelSorbo 2021-07-08 09:00:52 MDT
(In reply to Anthony DelSorbo from comment #39)
> (In reply to Tim Wickberg from comment #38)
> > 
> > If sponsoring such work is of interest to NOAA, I can get an SoW over to you
> > sometime in January that would ensure this is done by the 21.08 release next

Tim,

If you sent me an SOW on this, I must have missed it.  But, we're still interested.  Would you be able to resend it so that we can try to work this into our next fiscal cycle?

Thanks,

Tony.
Comment 41 Tim Wickberg 2022-04-28 15:29:56 MDT
Work related to this ticket was performed for another customer as part of bug 12110, and will be available in 22.05 when released.

Marking this closed as a duplicate of that later ticket.

- Tim

*** This ticket has been marked as a duplicate of ticket 12110 ***