| Summary: | Slurm configuration | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | ellingjd.ctr |
| Component: | Configuration | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AFRL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
ellingjd.ctr
2018-08-22 08:08:38 MDT
Hi Jay,
You will need to set limits in the slurm.conf and in the accounting database to achieve what you are after.
> -maximum time limit is 48 hours
PartitionName=normal Nodes=ALL Default=NO MaxTime=48:00:00 State=UP
You may also want to consider setting "DefaultTime" so that users who submit a job without a time limit are not rejected when their job time limit is set to INFINITY.
> -maximum number of nodes per user in the queue is 20
To enforce a per-user maximum, you would need to enable Slurm accounting. Additionally, you can specify who has access to a partition by adding an "AllowQos" parameter to the partition.
https://slurm.schedmd.com/accounting.html
These are the parameters you would need to set; a more complete explanation is found in the link above.
AccountingStorageTRES
AccountingStorageType
AccountingStorageHost
AccountingStorageEnforce
To then add the per-user node maximum, you would add the association into the accounting database.
sacctmgr modify user set MaxTRES=NODE=2 where user=jason
> -setup a debug queue with 1 hour time limit and 2 nodes max
MaxNodes
PartitionName=debug Nodes=ALL MaxNodes=2 Default=NO MaxTime=1:00:00 State=UP
https://slurm.schedmd.com/sacctmgr.html
https://slurm.schedmd.com/tres.html
https://slurm.schedmd.com/slurm.conf.html
-Jason
So currently my slurm.conf file looks like this:
ControlMachine=grault-adm01
ClusterName=hpe_stack
authType=auth/munge
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
SlurmctldDebug=3
InactiveLimit=0
MpiDefault=pmi2
ReturnToService=2
NodeName=grault-l[01-02],grault-node[01-60] ThreadsPerCore=1 CoresPerSocket=18 Sockets=2
PartitionName=defaultx Nodes=grault-node[01-60] Default=yes
PropagateResourceLimits=NONE
#Epilog=/opt/hpe/hpc/slurm/default/etc/slurm.epilog.clean
# End of CST wizard configuration of slurm.conf
So to set the wall time I would change:
PartitionName=defaultx Nodes=grault-node[01-60] Default=yes
to:
PartitionName=defaultx Nodes=ALL Default=NO MaxTime=48:00:00 State=UP
Also to add the second queue:
PartitionName=debug Nodes=ALL MaxNodes=2 Default=NO MaxTime=1:00:00
After making the changes, do I need to restart slurmctld on the admin and slurmd on the compute nodes?
Jay
Hi Jay,
> After making the changes, do I need to restart slurmctld on the admin and slurmd on the compute nodes?
It is highly suggested that you restart both the slurmctld and the slurmd processes after making changes to your slurm.conf. Please also make sure that the compute nodes have the updated slurm.conf.
If this is not possible, then you can use scontrol to modify these attributes.
scontrol update partition=defaultx maxtime=48:00:00
scontrol update partition=debug MaxNodes=2 MaxTime=1:00:00
-Jason
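
A note on the time values used in this thread: per the slurm.conf man page, MaxTime accepts minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, and days-hours:minutes:seconds. A two-field value such as 48:00 is therefore read as minutes:seconds, so a 48-hour limit needs three fields (48:00:00) or a day prefix (2-0). The sketch below is not Slurm code; `slurm_time_to_minutes` is a hypothetical helper written only to illustrate the field rules, and it does not handle the UNLIMITED or INFINITE keywords.

```python
def slurm_time_to_minutes(value: str) -> float:
    """Convert a Slurm time-limit string to minutes (illustrative helper only)."""
    days = 0
    if "-" in value:
        # A day prefix means the remaining fields are hours[:minutes[:seconds]].
        day_part, rest = value.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"too many fields in {value!r}")
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in value.split(":")]
        if len(fields) == 1:        # "minutes"
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:      # "minutes:seconds"
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:      # "hours:minutes:seconds"
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"too many fields in {value!r}")
    return days * 24 * 60 + hours * 60 + minutes + seconds / 60


if __name__ == "__main__":
    for limit in ("48:00", "48:00:00", "2-0", "1:00:00"):
        print(limit, "->", slurm_time_to_minutes(limit), "minutes")
```

After changing a partition, `scontrol show partition <name>` reports the MaxTime the controller actually applied, which is a quick way to confirm the limit is what was intended.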
Is using scontrol to modify the attributes permanent, or do the attributes go away after a reboot?
Do you have an example slurm.conf of accounting enabled? The link you mentioned mainly focuses on using a database. I would probably use a text file.
Jay
Hi Jay,
> Is using scontrol to modify the attributes permanent, or do the attributes go away after a reboot?
The changes go away after a restart. You would use scontrol update to change the running configuration and avoid a restart; however, you will still have to add those attributes to the slurm.conf so they persist across reboots/restarts.
> Do you have an example slurm.conf of accounting enabled? The link you mentioned mainly focuses on using a database. I would probably use a text file.
The main slurm.conf options are below. Note that with 'AccountingStorageType' set as shown, you will also need to set up the slurmdbd. Please also refer to the accounting documentation and the man page for slurm.conf for a more complete explanation of what these parameters do.
https://slurm.schedmd.com/accounting.html
AccountingStorageUser=slurm
AccountingStorageEnforce=safe,associations,limits,qos
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=cpu,node,mem,energy
An example slurmdbd.conf looks like the following:
AuthType=auth/munge
DbdAddr=192.168.1.1
SlurmUser=slurm
DebugFlags=DB_QUERY,DB_ASSOC,DB_JOB,DB_RESOURCE,DB_USAGE
DebugLevel=1
LogFile=/var/log/slurmdbd
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=<PASSWORD>
StorageUser=<USER>
StorageLoc=slurm_acct_db
DbdHost=slurmserver
-Jason
When I create the debug queue as the following:
PartitionName=debug Nodes=ALL MaxNodes=2 Default=NO MaxTime=1:00:00 State=UP
It shows my node count at 62. I only have 60 nodes, and I just want the debug queue to utilize 2 of the 60. Do I need to specify which two nodes?
Jay
Hi Jay,
Yes, you would need to configure which nodes you wish to limit the queue to.
For example:
PartitionName=debug Nodes=node[01-02] MaxNodes=2 Default=NO MaxTime=1:00:00 State=UP
The MaxNodes parameter becomes redundant since this partition can only use the two nodes you have configured above, so you can choose to keep 'MaxNodes' or remove it.
-Jason
I am still struggling with enabling the accounting portion. You mentioned a Slurm database, but what if I just want to use a text file?
AccountingStorageType controls how detailed job and job step information is recorded. You can store this information in a text file or into SlurmDBD.
How would I set this up?
Jay
Hi Jay,
Accounting does more than just store job information. This feature also stores QOS, accounts, clusters, users, and limits. It is the limits you are after, and you will need them to restrict specific users.
> I am still struggling with enabling the accounting portion.
The documentation for accounting does walk you through the setup process, so if you could be more specific as to the problem you are seeing, then perhaps I can suggest a resolution to the problem you encountered.
https://slurm.schedmd.com/accounting.html
-Jason
The documentation assumes you will be using a database to record data. According to https://slurm.schedmd.com/accounting.html:
Slurm Accounting Configuration After Build
For simplicity sake we are going to reference everything as if you are running with the SlurmDBD.
However, I would like to use a text file. So do I just need to use this instead:
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm/accounting
Jay
> If this is not possible, then you can use scontrol to modify these attributes.
> scontrol update partition=defaultx maxtime=48:00:00
> scontrol update partition=debug MaxNodes=2 MaxTime=1:00:00
So after making a change to slurm.conf, I can issue a scontrol update, to read the new slurm.conf file?
Jay
Hi Jay,
I was out of the office on Friday last week so I missed your update.
> The documentation assumes you will be using a database to record data.
> For simplicity sake we are going to reference everything as if you are running with the SlurmDBD.
> However, I would like to use a text file. So do I just need to use this instead:
> AccountingStorageType=accounting_storage/filetxt
> AccountingStorageLoc=/var/log/slurm/accounting
This is not possible, since the slurmctld stores these limits and associations in the database and not in these flat files. You will see errors like the following if you try to use just "accounting_storage/filetxt" with TRES and Fairshare.
slurmctld: fatal: Unless running with a database you can only run with certain TRES, gres/gpu is not one of them. Either set up a database preferably with a slurmdbd or remove this TRES from your configuration.
Fairshare can only be calculated with either 'accounting_storage/slurmdbd' or 'accounting_storage/mysql' enabled. If you want multifactor priority without fairshare ignore this message.
> So after making a change to slurm.conf, I can issue a scontrol update, to read the new slurm.conf file?
You can make changes to the in-memory configuration with scontrol; however, if you want the changes to persist after a restart, you will need to add the options to the slurm.conf. There are two commands of interest here.
scontrol update
scontrol reconfigure
I believe you are referring to reconfigure in your statement about reading the new slurm.conf.
reconfigure
Instruct all Slurm daemons to re-read the configuration file. This command does not restart the daemons. This mechanism would be used to modify configuration parameters (Epilog, Prolog, SlurmctldLogFile, slurmdLogFile, etc.). The Slurm controller (slurmctld) forwards the request to all other daemons (slurmd daemon on each compute node). Running jobs continue execution. Most configuration parameters can be changed by just running this command; however, Slurm daemons should be shutdown and restarted if any of these parameters are to be changed: AuthType, BackupAddr, BackupController, ControlAddr, ControlMach, PluginDir, StateSaveLocation, SlurmctldPort or SlurmdPort. The slurmctld daemon and all slurmd daemons must be restarted if nodes are added to or removed from the cluster.
-Jason
Basically what you're saying is I will not be able to use flat file accounting and limit the users to a max of 20 nodes? So I will have to install Mysql server and configure a database.
Jay
Hi Jay,
> Basically what you're saying is I will not be able to use flat file accounting and limit the users to a max of 20 nodes? So I will have to install Mysql server and configure a database.
This is correct. You will need to set up a database where the associations can be stored. This is not possible with flat file accounting.
-Jason
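
On the earlier question of why the debug partition reported 62 nodes: the NodeName line in the posted slurm.conf defines grault-l[01-02] as well as grault-node[01-60], and Nodes=ALL selects every defined node, so the count appears to be 2 + 60. The sketch below reproduces that arithmetic. `expand_hostlist` is a hypothetical helper written for this note, not part of Slurm, and it only handles one bracket group per name; on a real system `scontrol show hostnames` performs the expansion.

```python
import re


def expand_hostlist(expr: str) -> list[str]:
    """Expand a simple hostlist such as 'a[01-03],b7' (illustrative helper only)."""
    hosts = []
    # Split on commas that sit outside a bracket group.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?[^,]*", expr):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if not match:
            hosts.append(part)  # plain host name, nothing to expand
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            low, _, high = item.partition("-")
            high = high or low          # single index such as "[05]"
            width = len(low)            # keep zero padding, e.g. 01
            for n in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    all_nodes = expand_hostlist("grault-l[01-02],grault-node[01-60]")
    compute_only = expand_hostlist("grault-node[01-60]")
    print(len(all_nodes), len(compute_only))
```

Listing the two intended hosts explicitly in the partition's Nodes= value, as suggested earlier in the thread, keeps the extra grault-l entries out of the debug queue.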
Hi Jay,
Do you need any further assistance with this issue?
-Jason
Hi Jay,
I have not heard any further information from you since the 27th, so I am going to go ahead and close this issue out as info given. Should you need any further assistance, please feel free to re-open this issue or log an additional ticket.
-Jason |