Ticket 6271

Summary: job_submit filter rejects batch scripts named 'batch'
Product: Slurm Reporter: pbisbal
Component: Other    Assignee: Nate Rini <nate>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: brian.gilmer
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5848
https://bugs.schedmd.com/show_bug.cgi?id=6278
Site: Princeton Plasma Physics Laboratory (PPPL)
Version Fixed: 18.08.5, 19.05
Attachments: job_submit.lua file.
slurm.conf file
my mpihello.sbatch file

Description pbisbal 2018-12-19 11:38:17 MST
Created attachment 8711 [details]
job_submit.lua file.

I'm not sure whether this is a bug in sbatch, my job_submit.lua script, or the lua plugin.

Yesterday I upgraded from 18.08.3 to 18.08.4. Since that upgrade, sbatch or my job_submit.lua script rejects job scripts named 'batch'. For example:

$ cat  mpihello.sbatch 
#!/bin/bash

#SBATCH -n 4 
#SBATCH --mem=63000M
#SBATCH -p ellis
#SBATCH -t 00:01:00
#SBATCH -J mpihello
#SBATCH -o mpihello-%j.out
#SBATCH -e mpihello-%j.err
#SBATCH --mail-type=ALL

module load gcc/7.3.0
module load openmpi/3.0.0
mpiexec ./mpihello

This script works when named "mpihello.sbatch":

$ sbatch mpihello.sbatch 
Submitted batch job 398504

But it fails when renamed to "batch":

$ mv mpihello.sbatch batch

$ sbatch batch
sbatch: error: ERROR: A time limit must be specified
sbatch: error: Batch job submission failed: Time limit specification required, but not provided

It also fails if it's a symlink named "batch":

$ mv batch mpihello.sbatch

$ ln -s mpihello.sbatch batch

$ ls -l batch
lrwxrwxrwx 1 pbisbal users 15 Dec 19 13:27 batch -> mpihello.sbatch

$ sbatch batch
sbatch: error: ERROR: A time limit must be specified
sbatch: error: Batch job submission failed: Time limit specification required, but not provided


At the same time as the upgrade, I added the following lines to my cgroup.conf file:

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainKmemSpace=no
TaskAffinity=no

I don't expect that to be related to this bug; I'm just offering that info in the spirit of full disclosure.

This appears to be specific to 18.08.4; the issue did not occur with 18.08.3. The same user who initially reported this problem used the same batch script to submit a job under 18.08.3 less than 24 hours before the upgrade. We upgraded to 18.08.3 on 11/20/18, so we had been running it for about a month with no issues.

I've attached my job_submit.lua script. According to ls, it was last modified on 10/16/2018.
Comment 2 Nate Rini 2018-12-19 13:41:22 MST
(In reply to pbisbal from comment #0)
> This appears to be specific to 18.08.4. This issue did not occur with
> 18.08.3. The same user who initially reported this problem used the same
> batch script to submit a job to 18.08.3 less than 24 hours before the
> upgrade. We upgraded to 18.08.3 on 11/20/18, so we've been using 18.08.3 for
> about 1 month with no issues.

Having a batch script named batch works on my test setup with your job_submit.lua:
> $ sbatch batch
> sbatch: NOTICE: Job assigned to ellis partition due to low number of nodes or CPUs requested.
> Submitted batch job 7
> $ sbatch -V
> slurm 18.08.4

I trimmed the attachment to the cat command.

Can you please attach your slurm.conf?

--Nate
Comment 3 pbisbal 2018-12-19 13:44:40 MST
Created attachment 8714 [details]
slurm.conf file
Comment 4 pbisbal 2018-12-19 13:45:07 MST
Slurm.conf attached.
Comment 5 Nate Rini 2018-12-19 14:08:15 MST
(In reply to pbisbal from comment #4)
> Slurm.conf attached.

On an unrelated topic, while reviewing your config:
> TaskPlugin=task/cgroup

From https://slurm.schedmd.com/slurm.conf.html:
>NOTE: It is recommended to stack task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf.
Comment 6 Nate Rini 2018-12-19 14:18:45 MST
(In reply to pbisbal from comment #4)
> Slurm.conf attached.

I get an error now:

>$ sbatch batch
> sbatch: error: NOTICE: Job assigned to ellis partition due to low number of nodes or CPUs requested.
> sbatch: error: Batch job submission failed: Invalid qos specification

Can you please provide the output of the following:
> $ sacctmgr -p show qos

--Nate
Comment 7 Nate Rini 2018-12-19 14:24:10 MST
(In reply to Nate Rini from comment #6)
> (In reply to pbisbal from comment #4)

I may have replicated the issue:

$ sbatch batch
sbatch: error: ERROR: A time limit must be specified
sbatch: error: Batch job submission failed: Time limit specification required, but not provided

$ cat batch
#!/bin/bash
#SBATCH -n 4 
#SBATCH --mem=100M
#SBATCH -p debug
#SBATCH -J mpihello
#SBATCH -o mpihello-%j.out
#SBATCH -e mpihello-%j.err

srun echo test
#module load gcc/7.3.0
#module load openmpi/3.0.0
#mpiexec ./mpihello

Can you please try this:
> sbatch -t 00:01:00 batch

Can you please attach your batch script too? Maybe I'm missing some whitespace issue.

--Nate
Comment 8 pbisbal 2018-12-19 14:48:08 MST
(In reply to Nate Rini from comment #5)
> (In reply to pbisbal from comment #4)
> > Slurm.conf attached.
> 
> In an unrelated topic while reviewing your config:
> > TaskPlugin=task/cgroup
> 
> From https://slurm.schedmd.com/slurm.conf.html:
> >NOTE: It is recommended to stack task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf.

Thanks for that unrelated tip. I assume that to "stack" them, I list them both like this in slurm.conf:

TaskPlugin=task/affinity,task/cgroup

Is that correct? 

Prentice
Comment 9 pbisbal 2018-12-19 14:51:14 MST
Created attachment 8716 [details]
my mpihello.sbatch file

This is the exact sbatch file I've been using for my testing.
Comment 10 Nate Rini 2018-12-19 14:53:01 MST
(In reply to pbisbal from comment #8)
> (In reply to Nate Rini from comment #5)
> > (In reply to pbisbal from comment #4)
> > > Slurm.conf attached.
> > 
> > In an unrelated topic while reviewing your config:
> > > TaskPlugin=task/cgroup
> > 
> > From https://slurm.schedmd.com/slurm.conf.html:
> > >NOTE: It is recommended to stack task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf.
> 
> Thanks for that unrelated tip. I assume to "stack" them together, I list
> them both like this in slurm.conf:
> 
> TaskPlugin=task/affinity,task/cgroup
> 
> Is that correct? 
Yes, that is correct.
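Combined with the cgroup.conf advice quoted from slurm.conf(5) above, the recommended settings look like this (a sketch showing only the relevant lines; everything else in both files is omitted):

```
# slurm.conf
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf
TaskAffinity=no
ConstrainCores=yes
```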
Comment 11 Nate Rini 2018-12-19 15:01:39 MST
(In reply to pbisbal from comment #9)
> Created attachment 8716 [details]
> my mpihello.sbatch file
> 
> This is the exact sbatch file I've been using for my testing.

Using the same file:
> $ ls -la batch
> lrwxrwxrwx 1 nate nate 15 Dec 19 14:53 batch -> mpihello.sbatch
> $ md5sum batch
> d915e0d5a5eac4b702b6f11777c30bb5  batch
> $ sbatch mpihello.sbatch 
> Submitted batch job 11
> $ sbatch batch 
> Submitted batch job 12
> $ md5sum etc/job_submit.lua 
> 6a5d57f4fb01c6c37e2dbffc65864fe8  etc/job_submit.lua

I added the QOSs listed in job_submit.lua to avoid the invalid QOS error.

Can you please call:
> sbatch -v batch

--Nate
Comment 12 pbisbal 2018-12-19 15:03:29 MST
My sbatch file does specify a time limit in the file itself: 

$ cat mpihello.sbatch 
#!/bin/bash

#SBATCH -n 4 
#SBATCH --mem=63000M
#SBATCH -p ellis
#SBATCH -t 00:01:00
#SBATCH -J mpihello
#SBATCH -o mpihello-%j.out
#SBATCH -e mpihello-%j.err
#SBATCH --mail-type=ALL

module load gcc/7.3.0
module load openmpi/3.0.0
#echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
#srun --mpi=pmi2 ./mpihello
mpiexec ./mpihello

But specifying -t on the command line definitely fixes the problem:

$ ls -l batch 
lrwxrwxrwx 1 pbisbal users 15 Dec 19 16:52 batch -> mpihello.sbatch

$ sbatch batch
sbatch: error: ERROR: A time limit must be specified
sbatch: error: Batch job submission failed: Time limit specification required, but not provided


$ sbatch -t 00:01:00 batch 
sbatch: NOTICE: Job assigned to ellis partition due to low number of nodes or CPUs requested.
Submitted batch job 398549

That notice about the job being assigned to the ellis partition comes from my job_submit.lua script. In this case, since I'm already requesting the ellis partition, the message should not be printed: per the script, it should only appear when a different partition was requested, or none at all. Here's the code that does this:

    -- time limits up to 8-08:00:00 are allowed on ellis, so move logic for ellis to here. 
    -- if a job requests <= 15 CPUs and 1 node it goes to ellis
    if ( job_desc.min_cpus <= 15 and job_desc.time_limit <= 12000 and (job_desc.min_nodes == 0xfffffffe or job_desc.min_nodes == 1 )) then
        if job_desc.partition ~= 'ellis' then
            slurm.user_msg("NOTICE: Job assigned to ellis partition due to low number of nodes or CPUs requested.")
        end
        job_desc.partition = 'ellis'
        job_desc.qos = 'ellis'
        return slurm.SUCCESS
    elseif ( job_desc.min_cpus <= 15 and job_desc.time_limit >= 12000 ) then
        slurm.user_msg("Job rejected. Max. time limit for jobs in ellis partition is 8-08:00:00 (12000 minutes)" )
        return 2051 -- ESLURM_INVALID_TIME_LIMIT
    end

This logic worked the last time I tested it, but I think that was on 17.11.4 or 17.11.5. I don't think I've personally tested it since upgrading to 18.08.3.
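For reference, the branch logic in that job_submit.lua snippet can be mirrored outside Slurm (a hypothetical Python stand-in for illustration; `route` and `NO_VAL` are names introduced here, 0xfffffffe being Slurm's sentinel for a field the submitter did not set):

```python
NO_VAL = 0xfffffffe  # Slurm's sentinel meaning "field not set by the submitter"

def route(min_cpus, time_limit, min_nodes, partition):
    """Illustrative mirror of the job_submit.lua branches quoted above.

    Returns (assigned_partition, notice_printed); None means rejected.
    """
    if min_cpus <= 15 and time_limit <= 12000 and min_nodes in (NO_VAL, 1):
        # Job fits ellis; the NOTICE fires only when another partition was requested.
        notice = partition != 'ellis'
        return 'ellis', notice
    if min_cpus <= 15 and time_limit >= 12000:
        return None, False  # rejected: over the ellis time limit
    return partition, False  # left untouched
```

Note that a job already requesting `-p ellis` with a valid time limit takes the first branch silently, matching the reporter's expectation; the NOTICE in the `-t 00:01:00` run above fired because sbatch submitted `/usr/bin/batch` and never parsed the `#SBATCH` lines, so no partition reached the filter.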
Comment 13 pbisbal 2018-12-19 15:04:55 MST
In response to your earlier request: 

$  sacctmgr -p show qos
Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
normal|0|00:00:00||cluster|||1.000000||||||||||||||||||
dawson|100|00:00:00||cluster|||1.000000|||||||cpu=1024||||cpu=1024|30|||||cpu=16|
ellis|100|00:00:00||cluster|||1.000000|||||||cpu=15,node=1||||cpu=80|45||||||
kruskal|100|00:00:00||cluster|||1.000000|||||||cpu=512||||cpu=512|8|||||cpu=16|
mque|100|00:00:00||cluster|||1.000000|||||||cpu=128,node=4|||||20||||||
default|0|00:00:00||cluster|||1.000000||||||||||||40||||||
mccune|100|00:00:00||cluster|||1.000000|||||||cpu=256||||cpu=256|||||||
sque|100|00:00:00||cluster|||1.000000|||||||cpu=512,node=100|||||||||||
fenx|100|00:00:00||cluster|||1.000000|||||||cpu=40,node=16||||cpu=40,node=16|||||||
fielder|100|00:00:00||cluster|||1.000000|||||||cpu=512,node=96||||cpu=96,node=12|||||||
gque|100|00:00:00||cluster|||1.000000|||||||cpu=32,node=1|||||||||||
jassby|100|00:00:00||cluster|||1.000000|||||||cpu=96,node=6||||cpu=96,node=6|||||||
greene|100|00:00:00||cluster|||1.000000|||||||cpu=512,node=32||||cpu=512,node=32|||||||
pswift|100|00:00:00||cluster|||1.000000||||||||||||||||||
interactive|100|00:00:00||cluster|||1.000000||||||||||12:00:00||||||||
beast|100|00:00:00||cluster|||1.000000|||||||cpu=32|||||||||||
general|100|00:00:00||cluster|||1.000000||||||||||||||||||
interruptible|1|00:00:00||cluster|||1.000000||||||||||||||||||
Comment 14 pbisbal 2018-12-19 15:06:08 MST
$ sbatch -v batch 
sbatch: defined options for program `sbatch'
sbatch: ----------------- ---------------------
sbatch: user              : `pbisbal'
sbatch: uid               : 41266
sbatch: gid               : 589
sbatch: cwd               : /u/pbisbal/testing/mpihello
sbatch: ntasks            : 1 (default)
sbatch: nodes             : 1 (default)
sbatch: jobid             : 4294967294 (default)
sbatch: partition         : default
sbatch: profile           : `NotSet'
sbatch: job name          : `batch'
sbatch: reservation       : `(null)'
sbatch: wckey             : `(null)'
sbatch: distribution      : unknown
sbatch: verbose           : 1
sbatch: overcommit        : false
sbatch: nice              : -2
sbatch: account           : (null)
sbatch: comment           : (null)
sbatch: dependency        : (null)
sbatch: qos               : (null)
sbatch: constraints       : 
sbatch: reboot            : yes
sbatch: network           : (null)
sbatch: array             : N/A
sbatch: cpu_freq_min      : 4294967294
sbatch: cpu_freq_max      : 4294967294
sbatch: cpu_freq_gov      : 4294967294
sbatch: mail_type         : NONE
sbatch: mail_user         : (null)
sbatch: sockets-per-node  : -2
sbatch: cores-per-socket  : -2
sbatch: threads-per-core  : -2
sbatch: ntasks-per-node   : 0
sbatch: ntasks-per-socket : -2
sbatch: ntasks-per-core   : -2
sbatch: mem-bind          : default
sbatch: plane_size        : 4294967294
sbatch: propagate         : NONE
sbatch: switches          : -1
sbatch: wait-for-switches : -1
sbatch: core-spec         : NA
sbatch: burst_buffer      : `(null)'
sbatch: burst_buffer_file : `(null)'
sbatch: remote command    : `/usr/bin/batch'
sbatch: power             : 
sbatch: wait              : yes
sbatch: cpus-per-gpu      : 0
sbatch: gpus              : (null)
sbatch: gpu-bind          : (null)
sbatch: gpu-freq          : (null)
sbatch: gpus-per-node     : (null)
sbatch: gpus-per-socket   : (null)
sbatch: gpus-per-task     : (null)
sbatch: mem-per-gpu       : 0
sbatch: Cray node selection plugin loaded
sbatch: Serial Job Resource Selection plugin loaded with argument 17
sbatch: Linear node selection plugin loaded with argument 17
sbatch: Consumable Resources (CR) Node Selection plugin loaded with argument 17
sbatch: error: ERROR: A time limit must be specified
sbatch: error: Batch job submission failed: Time limit specification required, but not provided
Comment 15 pbisbal 2018-12-19 15:14:50 MST
From the sbatch -v batch output, it looks like this could be part of the problem: 

sbatch: remote command    : `/usr/bin/batch'

This command, /usr/bin/batch, is part of the at package:

$ rpm -qf /usr/bin/batch 
at-3.1.10-49.el6.x86_64

It's probably present on just about every RHEL-based system, and it has definitely been on my systems for as long as I can remember.

When I do sbatch ./batch, this problem goes away: 

$ sbatch -v ./batch
sbatch: defined options for program `sbatch'
sbatch: ----------------- ---------------------
sbatch: user              : `pbisbal'
sbatch: uid               : 41266
sbatch: gid               : 589
sbatch: cwd               : /u/pbisbal/testing/mpihello
sbatch: ntasks            : 4 (set)
sbatch: nodes             : 1 (default)
sbatch: jobid             : 4294967294 (default)
sbatch: partition         : ellis
sbatch: profile           : `NotSet'
sbatch: job name          : `mpihello'
sbatch: reservation       : `(null)'
sbatch: wckey             : `(null)'
sbatch: distribution      : unknown
sbatch: verbose           : 1
sbatch: overcommit        : false
sbatch: time_limit        : 1
sbatch: nice              : -2
sbatch: account           : (null)
sbatch: comment           : (null)
sbatch: dependency        : (null)
sbatch: qos               : (null)
sbatch: constraints       : mem=63000M 
sbatch: reboot            : yes
sbatch: network           : (null)
sbatch: array             : N/A
sbatch: cpu_freq_min      : 4294967294
sbatch: cpu_freq_max      : 4294967294
sbatch: cpu_freq_gov      : 4294967294
sbatch: mail_type         : BEGIN,END,FAIL,REQUEUE,STAGE_OUT
sbatch: mail_user         : (null)
sbatch: sockets-per-node  : -2
sbatch: cores-per-socket  : -2
sbatch: threads-per-core  : -2
sbatch: ntasks-per-node   : 0
sbatch: ntasks-per-socket : -2
sbatch: ntasks-per-core   : -2
sbatch: mem-bind          : default
sbatch: plane_size        : 4294967294
sbatch: propagate         : NONE
sbatch: switches          : -1
sbatch: wait-for-switches : -1
sbatch: core-spec         : NA
sbatch: burst_buffer      : `(null)'
sbatch: burst_buffer_file : `(null)'
sbatch: remote command    : `./batch'
sbatch: power             : 
sbatch: wait              : yes
sbatch: cpus-per-gpu      : 0
sbatch: gpus              : (null)
sbatch: gpu-bind          : (null)
sbatch: gpu-freq          : (null)
sbatch: gpus-per-node     : (null)
sbatch: gpus-per-socket   : (null)
sbatch: gpus-per-task     : (null)
sbatch: mem-per-gpu       : 0
sbatch: Cray node selection plugin loaded
sbatch: Serial Job Resource Selection plugin loaded with argument 17
sbatch: Linear node selection plugin loaded with argument 17
sbatch: Consumable Resources (CR) Node Selection plugin loaded with argument 17
Submitted batch job 398552

I hope that helps.
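The PATH-resolution behavior described above can be reproduced outside Slurm with `shutil.which`, used here only as a stand-in for sbatch's internal lookup (the directory names are throwaway examples):

```python
import os
import shutil
import tempfile

# A PATH directory holding an executable named "batch" (standing in for
# /usr/bin/batch from the 'at' package), plus a working directory holding
# a job script that happens to share the name.
top = tempfile.mkdtemp()
bindir = os.path.join(top, "bin")
os.makedirs(bindir)
stub = os.path.join(bindir, "batch")
with open(stub, "w") as f:
    f.write("#!/bin/sh\n")
os.chmod(stub, 0o755)

os.chdir(top)
with open("batch", "w") as f:
    f.write("#!/bin/bash\n#SBATCH -t 00:01:00\n")
os.chmod("batch", 0o755)

# A bare name is resolved through PATH, so the stub wins...
print(shutil.which("batch", path=bindir))  # the stub under bindir
# ...while an explicit relative path bypasses the PATH search entirely.
print(shutil.which("./batch"))             # -> ./batch
```

This matches the `remote command` lines in the two verbose runs above: `/usr/bin/batch` for the bare name, `./batch` for the relative path.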
Comment 18 Nate Rini 2018-12-19 15:54:49 MST
(In reply to pbisbal from comment #15)
> From the sbatch -v batch output, it looks like this could be part of the
> problem: 
> 
> sbatch: remote command    : `/usr/bin/batch'
> 
> This command /usr/bin/batch, is part of the at package: 
> 
> $ rpm -qf /usr/bin/batch 
> at-3.1.10-49.el6.x86_64
> 
> Which is probably on just about every RHEL-based system, and has definitely
> been on my systems for as long as I can tell. 
> 
> When I do sbatch ./batch, this problem goes away: 

The behavior changed with this patch in 18.08.4:
https://github.com/SchedMD/slurm/commit/ccafaf7b60090155639edcbdbf4a3ab5e36967c6

Looking into what the correct behavior should be with sbatch as opposed to srun/salloc.

--Nate
Comment 25 pbisbal 2018-12-20 09:12:24 MST
Nate, 

Thanks. That bugfix makes perfect sense, both as to why you added it, and why it's causing problems for my user. 

I look forward to hearing if there's any difference between sbatch and srun or salloc. 

Prentice
Comment 49 Nate Rini 2019-01-03 13:37:39 MST
The behavior change to sbatch has been corrected with this patch:

https://github.com/SchedMD/slurm/commit/aca40d167b21835b0eeb61b18a2eaaef68c33865

Closing this ticket; please reply to re-open it.

Thanks
--Nate
Comment 50 Nate Rini 2019-02-06 15:55:08 MST
*** Ticket 6396 has been marked as a duplicate of this ticket. ***