Ticket 11494 - srun cli_filter defaults inside allocation
Summary: srun cli_filter defaults inside allocation
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.11.6
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Carlos Tripiana Montes
 
Reported: 2021-04-29 14:29 MDT by Matt Ezell
Modified: 2021-06-10 01:35 MDT
Site: ORNL-OLCF
Version Fixed: 20.11.8 21.08.0pre1


Description Matt Ezell 2021-04-29 14:29:52 MDT
srun does not process cli_filter defaults inside an allocation:

https://github.com/SchedMD/slurm/blob/9c9ea1f6246c6e0fdf8f7e5ce94c4f43910468bb/src/srun/libsrun/opt.c#L369

What is the reason for this logic? Is it documented anywhere?
Comment 1 Carlos Tripiana Montes 2021-05-03 03:57:28 MDT
Hi Matt,

If I'm not wrong, from https://slurm.schedmd.com/cli_filter_plugins.html:

"int cli_filter_p_setup_defaults(slurm_opt_t *options, bool early)

Description:

This function is called by the salloc, sbatch, or srun command line interface (CLI) programs shortly before processing any options from the environment, command line, or script (#SBATCH). The hook may be run multiple times per job component, once for an early pass (if implemented by the CLI), and again for the main pass. The options and early arguments are meant to be passed to slurm_option_set() which will set the option if it is in the appropriate pass. Failures to set an option may be a symptom of trying to set the option on the wrong pass. Given that you should not return SLURM_ERROR simply because of a failure to set an option."

So, according to the docs, this function should be called for each job component, unconditionally; whereas, looking at the code (the line you pointed to), it is not called when srun runs inside a salloc allocation.

My guess is that this is because srun inherits settings from salloc, and those could be overridden or lost if we allowed the defaults to be set up *again* for each srun inside the allocation.

Right now, I have no answer for this behaviour. Give me some time to investigate it in detail. It may be justified, but then we need to clarify the docs.

Regards.
Comment 2 Carlos Tripiana Montes 2021-05-03 08:02:45 MDT
Matt,

Here it goes: https://github.com/SchedMD/slurm/commit/39ad5f4e00fb6793b93a7998588a5e922c24f38a

This belongs to https://bugs.schedmd.com/show_bug.cgi?id=3745#c54; the attached tar.gz is a patchset, and patch 0016 in it is the one matching the above commit.

This was part of the development of cli_filter as a contribution, before it was officially included and supported as part of the production releases.

So, in the end, this is for an important reason. And we should then clarify this point in the documentation.

Regards.
Comment 4 Matt Ezell 2021-05-04 08:19:32 MDT
(In reply to Carlos Tripiana Montes from comment #2)
> So, in the end, this is for an important reason. And we should then clarify
> this point in the documentation.

I understand that srun should take some "defaults" from the environment, but I'm still not sure why that should be skipped. It seems that if a site wants to shoot themselves in the foot, shouldn't they be able to? The cli_filter can do the same check (am I in an active allocation?) and act appropriately.

We see lots of benefits from cli_filter, but since we intend to support multi-cluster cross-submission, we need to try to make sure our defaults get set no matter where the job was submitted from. The only place this can be enforced is either in job_submit or the cli_filter for a srun inside a batch script that is running on the target cluster.
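
The check described above ("am I in an active allocation?") could be sketched roughly as follows. This is only an illustration and not Slurm code: `struct site_opts`, `job_id_is_set()`, `in_allocation()`, `apply_site_defaults()` and the default values `"batch"` and `"cores"` are all hypothetical, and the sketch assumes that the presence of `SLURM_JOB_ID` in the environment is a good enough signal that an allocation already exists. A real cli_filter plugin would work on `slurm_opt_t` through `slurm_option_set()` instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the options a site wants to default.
 * This is NOT Slurm's slurm_opt_t. NULL means "not set by the user". */
struct site_opts {
	const char *partition; /* only meaningful when creating an allocation */
	const char *cpu_bind;  /* meaningful for every step launch */
};

/* True when job_id is a non-empty string. */
static bool job_id_is_set(const char *job_id)
{
	return job_id != NULL && job_id[0] != '\0';
}

/* Assumption: SLURM_JOB_ID is present in the environment of a salloc
 * shell or a batch script, so its presence marks an active allocation. */
static bool in_allocation(void)
{
	return job_id_is_set(getenv("SLURM_JOB_ID"));
}

/* Fill in unset options only. Allocation-level defaults are skipped when
 * an allocation already exists; step-level defaults apply either way. */
static void apply_site_defaults(struct site_opts *opts, bool inside_alloc)
{
	assert(opts != NULL);

	if (!inside_alloc && opts->partition == NULL)
		opts->partition = "batch"; /* hypothetical site default */

	if (opts->cpu_bind == NULL)
		opts->cpu_bind = "cores";  /* hypothetical site default */
}
```

A caller would use it as `apply_site_defaults(&opts, in_allocation());`. Splitting the defaults into an allocation-level group and a step-level group is one possible policy; the point is only that the plugin itself, rather than srun, could decide what to skip.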
Comment 5 Carlos Tripiana Montes 2021-05-04 08:34:56 MDT
I see... I'd say this is not a bug, but rather a feature that cli_filter does not support right now.

I have a doc patch to make a clear statement but, in any case, let me discuss this internally first.

I'll be back once I have more information.

Regards.
Comment 10 Carlos Tripiana Montes 2021-06-04 02:37:36 MDT
Hi Matt,

We are going to clarify this in the documentation. There's a patch going through review right now.

Cheers.
Comment 12 Carlos Tripiana Montes 2021-06-10 01:29:01 MDT
Hi Matt,

The patch has been pushed to both the master and 20.11 branches. See https://github.com/SchedMD/slurm/commit/503b1aa61956fe7bfe46587f6cbff49946e7081a, which will be included in the upcoming 20.11.8 release.

We're going to close this bug as fixed. Please, reopen if necessary.

Regards.