Summary: | Limit srun allocation to one node | |
---|---|---|---
Product: | Slurm | Reporter: | Torkil Svensgaard <torkil>
Component: | Configuration | Assignee: | Marcin Stolarek <cinek>
Status: | RESOLVED INFOGIVEN | QA Contact: |
Severity: | 3 - Medium Impact | |
Priority: | --- | CC: | cinek, rkv
Version: | 20.11.5 | |
Hardware: | Linux | |
OS: | Linux | |
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=11411 | |
Site: | DRCMR | |
Description

Torkil Svensgaard 2021-04-28 01:30:23 MDT

Torkil,

The natural way of doing that is by adding -N1 to the srun options.

If you want to make it a site-wide default, you can achieve this easily with the cli_filter plugin [1].

cheers,
Marcin

[1] https://slurm.schedmd.com/cli_filter_plugins.html

(In reply to Marcin Stolarek from comment #1)
> The natural way of doing that is by adding -N1 to the srun options.
>
> If you want to make it a site-wide default, you can achieve this easily with
> the cli_filter plugin [1].

Thanks. I want to enforce it rather than just set it as a default, so I have to do it in a job_submit plugin, I guess? Would you happen to have code that does something like that at hand?

Mvh.
Torkil

(In reply to Torkil Svensgaard from comment #2)
> Thanks. I want to enforce it rather than just set it as a default, so I have to do it in a job_submit plugin, I guess?

Do you mean you want to forbid multinode jobs?

cheers,
Marcin
(In reply to Marcin Stolarek from comment #3)
> Do you mean you want to forbid multinode jobs?

With srun, yes. It's only used for interactive Matlab, where more than one node makes no sense.

Mvh.
Torkil

Torkil,
>With srun, yes. It's only used for interactive matlab where more than one node makes no sense.
Please be aware that certain tools use srun behind the scenes (Open MPI, for instance): srun is not only used for interactive "allocate and run" scenarios, but perhaps even more often to create steps and launch tasks within an existing allocation.
Another important fact is that, from a job_submit plugin's perspective, you can't really distinguish which tool the end-user used to prepare the job description. The plugin runs on the slurmctld side, and a job may even be submitted directly through the Slurm API (or slurmrestd), without sbatch/srun/salloc at all.
cheers,
Marcin
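Because a job_submit plugin sees every job regardless of submission tool, one way to enforce (rather than merely default) a single-node limit is to clamp max_nodes there, on the slurmctld side. A minimal sketch, assuming the job_submit/lua plugin interface, where NO_VAL (4294967294) marks an unset 32-bit field; the log message and the unconditional clamp are illustrative choices, not Slurm's prescribed behavior:

```lua
-- job_submit.lua sketch: clamp every job to a single node, regardless of
-- whether it arrived via srun, sbatch, salloc, or the REST API.
-- Runs inside slurmctld's job_submit/lua plugin environment (not standalone).
local NO_VAL = 4294967294  -- Slurm's "unset" sentinel for 32-bit fields

function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.max_nodes == NO_VAL or job_desc.max_nodes > 1 then
        job_desc.max_nodes = 1
        slurm.log_info("job_submit: clamping max_nodes to 1 for uid %d",
                       submit_uid)
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    -- No changes on modification in this sketch.
    return slurm.SUCCESS
end
```

Note that, as Marcin points out above, this applies to all jobs on the cluster (or would need additional partition/account checks), which is why the cli_filter approach discussed next is the more targeted place for an srun-only default.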
Hi Marcin

Ok, we'll stick with just setting a default. I did this in job_submit.lua; is that what you had in mind?

"
if job_desc.max_nodes == 4294967294 then
    job_desc.max_nodes = 1
    slurm.log_info("Setting max_nodes to 1")
end
"

Mvh.
Torkil

(In reply to Torkil Svensgaard from comment #6)
> Ok, we'll stick with just setting a default. I did this in job_submit.lua; is that what you had in mind?

No, it's not what I'd recommend. I think the more suitable place is the cli_filter plugin [1], with code like:

"
function slurm_cli_setup_defaults(options, early_pass)
    --[[
    -- Set -N1 for srun as a default
    ]]--
    if options['type'] == 'srun' then
        options['nodes'] = 1
    end
    return slurm.SUCCESS
end
"

cheers,
Marcin

[1] https://slurm.schedmd.com/cli_filter_plugins.html

(In reply to Marcin Stolarek from comment #7)
> No, it's not what I'd recommend. I think the more suitable place is the
> cli_filter plugin [1], with code like: [...]

Ah, cool. I read [1] and didn't understand it, so I went with job_submit. I can see I missed the reference to where to put cli_filter.lua.

It looks like "scontrol reconfigure" doesn't copy the file to the nodes, and even if it did, the submit hosts do not run slurmd. Do I have to copy it manually to all hosts? And what is the default script dir on hosts with no /etc/slurm?

Mvh.
Torkil

(In reply to Torkil Svensgaard from comment #8)
> It looks like "scontrol reconfigure" doesn't copy the file to the nodes, and even if it did, the submit hosts do not run slurmd.

That's correct: configless copies neither job_submit.lua nor cli_filter.lua as of today.

> Do I have to copy it manually to all hosts?
> And what is the default script dir on hosts with no /etc/slurm?

From the code perspective it's

"
static const char lua_script_path[] = DEFAULT_SCRIPT_DIR "/cli_filter.lua";
"

where DEFAULT_SCRIPT_DIR is set during the build (based on options passed to configure). So if /etc/slurm is the default location for your config (on slurmctld), you'll have to create that directory for the job_submit/cli_filter Lua scripts. This is something we're looking into in Bug 11411; it may change/improve in the next major release of Slurm.

Let me know if you have further questions.

cheers,
Marcin

(In reply to Marcin Stolarek from comment #9)
> So if /etc/slurm is the default location for your config (on slurmctld),
> you'll have to create that directory for the job_submit/cli_filter Lua scripts.

I created /etc/slurm and dumped cli_filter.lua there, but it doesn't work or isn't found. How do I debug? Running srun with -v yielded no clues.

Thanks,
Torkil

(In reply to Torkil Svensgaard from comment #10)
> I created /etc/slurm and dumped cli_filter.lua there, but it doesn't work or
> isn't found. How do I debug? Running srun with -v yielded no clues.

I also copied slurm.conf, to which I added CliFilterPlugins=cli_filter.lua. Neither of these two locations for cli_filter.lua seems to work:

/etc/slurm/cli_filter.lua
/etc/slurm/cli_filter/cli_filter.lua

"
torkil@bill:/home/torkil$ srun -vvv --pty -n 96 bash
srun: error: Couldn't find the specified plugin name for cli_filter/cli_filter.lua looking at all files
srun: error: cannot find cli_filter plugin for cli_filter/cli_filter.lua
srun: error: cannot create cli_filter context for cli_filter/cli_filter.lua
srun: error: cli_filter plugin terminated with error
"

What am I missing? The installed Slurm packages are compiled with rpmbuild with no modifications.

Mvh.
Torkil

(In reply to Torkil Svensgaard from comment #11)
> Neither of these two locations for cli_filter.lua seems to work [...]

I think I got it; posting it here for posterity.

On the submit node, create /etc/slurm and copy slurm.conf there. Add this line to slurm.conf:

CliFilterPlugins=lua

Complete cli_filter.lua (put in /etc/slurm):

"
function slurm_cli_pre_submit(cli, opts)
    return slurm.SUCCESS
end

function slurm_cli_post_submit(cli, opts)
    return slurm.SUCCESS
end

function slurm_cli_setup_defaults(options, early_pass)
    --[[
    -- Set -N1 for srun as a default
    ]]--
    if options['type'] == 'srun' then
        options['nodes'] = 1
    end
    return slurm.SUCCESS
end
"

Do I need the full slurm.conf on these submit hosts? I tried starting from scratch with an almost empty one, but hit one error after another.

Mvh.
Torkil

(In reply to Torkil Svensgaard from comment #12)
> Do I need the full slurm.conf on these submit hosts? I tried starting from scratch with an almost empty one, but hit one error after another.
In general I'd recommend keeping all the configs in sync on all hosts.
cheers,
Marcin
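Torkil's cli_filter.lua above only changes the default; a hard client-side limit could additionally be imposed by rejecting srun invocations that explicitly ask for more than one node. A sketch only, assuming the cli_filter/lua interface (parameter names as in the Slurm docs, option values arriving as strings, and slurm.ERROR aborting the command); node ranges like "1-4" are not handled here:

```lua
-- cli_filter.lua fragment (hypothetical policy): refuse multi-node srun.
-- Runs inside srun's cli_filter/lua plugin environment (not standalone).
function slurm_cli_pre_submit(options, pack_offset)
    if options['type'] == 'srun' then
        -- tonumber() copes with both string and numeric option values;
        -- a range spec such as "1-4" yields nil and is not checked here.
        local n = tonumber(options['nodes'])
        if n ~= nil and n > 1 then
            slurm.log_error("srun is limited to a single node on this site")
            return slurm.ERROR
        end
    end
    return slurm.SUCCESS
end
```

Keep in mind Marcin's earlier caveat: this only guards the srun command line itself, not jobs submitted through the API or other tools.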
(In reply to Marcin Stolarek from comment #13)
> Do I need the full slurm.conf on these submit hosts? I tried starting from scratch with an almost empty one, but hit one error after another.
> In general I'd recommend keeping all the configs in sync on all hosts.

We were very happy with the configless option, since sync was taken care of automatically, but for this we are back to needing a slurm.conf in puppet. Not the end of the world keeping it in sync, but it would have been nice if slurm.conf for login/submit nodes could have consisted of just:

CliFilterPlugins=lua

Then no sync issues at all, since that parameter isn't used on the master. Btw, where do these nodes get their configuration from when they run with no slurm.conf, which they did up til now?

Mvh.
Torkil

> We were very happy with the configless option, since sync was taken care of automatically [...]

It's not something I can commit to now, but we're considering a configless modification in Slurm 21.08 that will allow .lua scripts to be sent together with the configuration files supported today.

> [...] it would have been nice if slurm.conf for login/submit nodes could have consisted of just:
> CliFilterPlugins=lua

This will most likely not be possible, since we don't merge different configuration sources. When a slurm.conf source is found, we simply use it, so having only CliFilterPlugins=... won't be enough.

> Btw, where do these nodes get their configuration from when they run with no slurm.conf, which they did up til now?

I guess you have _slurmctld._tcp SRV records in DNS [1]?

cheers,
Marcin

[1] https://slurm.schedmd.com/configless_slurm.html

(In reply to Marcin Stolarek from comment #15)
> It's not something I can commit to now, but we're considering a configless
> modification in Slurm 21.08 that will allow .lua scripts to be sent together
> with the configuration files supported today.
That would be nice =)

> > Btw, where do these nodes get their configuration from when they run with no slurm.conf, which they did up til now?
> I guess you have _slurmctld._tcp SRV records in DNS [1]?

Of course I have; totally forgot about that. Thanks, feel free to close the ticket.

Mvh.
Torkil
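For reference, the _slurmctld._tcp record Marcin mentions is how configless clients locate the controller. A hypothetical BIND zone-file fragment (names are illustrative; 6817 is slurmctld's default port):

```
; Hosts with no local slurm.conf query this SRV record to find slurmctld
; and then fetch their configuration from it (configless mode).
_slurmctld._tcp.cluster.example.com. 3600 IN SRV 10 0 6817 ctld.cluster.example.com.
```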