Ticket 4997

Summary: killing job array returning Invalid job id specified
Product: Slurm    Reporter: Mauricio van den Berg <mauriciob>
Component: Scheduling    Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---    CC: bart, jennyw, kaylea.nelson
Version: 17.11.4   
Hardware: Linux   
OS: Linux   
Site: DownUnder GeoSolutions
Version Fixed: 17.11.6

Description Mauricio van den Berg 2018-03-28 02:18:59 MDT
I've come across a minor Slurm bug in 17.11 that's causing one of our tests to fail.

> [vagrant@node1 tests]$ sbatch --wrap='sleep 100' --partition=teamdev --array=1-5
> Submitted batch job 130
> [vagrant@node1 tests]$ squeue
> PARTITION   PRIORITY   NAME                     USER ST       TIME TIME_LEFT  NODES NODELIST(REASON) JOBID
> teamdev     100        wrap                  vagrant PD       0:00   1:00:00      1 (Resources)     130_[2-5]
> teamdev     100        wrap                  vagrant  R       0:01     59:59      1 node1           130_1
> [vagrant@node1 tests]$ scancel 130_[2,5] -v
> scancel: Terminating job 130_[2,5]
> scancel: error: Kill job error on job id 130_[2,5]: Invalid job id specified

The tasks are actually killed. The same command completes successfully in 17.02 and 16.05. It seems to fail only on job arrays.

We're actually using slurm_kill_job2, which returns non-zero in this scenario.
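For reference, a minimal sketch of the call path our test exercises, assuming Slurm's C API headers (slurm/slurm.h) and linking with -lslurm; the job id string is the example from this report, and SIGKILL with flags=0 is an assumption mirroring scancel's default behaviour:

```c
/* Sketch only: requires a Slurm installation to compile and a running
 * cluster to exercise. The job id "130_[2,5]" is the example above. */
#include <signal.h>
#include <stdio.h>
#include <slurm/slurm.h>
#include <slurm/slurm_errno.h>

int main(void)
{
    int rc = slurm_kill_job2("130_[2,5]", SIGKILL, 0);
    if (rc != SLURM_SUCCESS) {
        /* On 17.11.4 this branch is taken ("Invalid job id specified")
         * even though the pending array tasks are in fact cancelled. */
        fprintf(stderr, "slurm_kill_job2: %s\n",
                slurm_strerror(slurm_get_errno()));
        return 1;
    }
    return 0;
}
```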

Thanks
Comment 1 Marshall Garey 2018-03-28 09:57:32 MDT
You're right, this is unique to job arrays. When a job array is submitted, only one job record is created initially; a separate job record is created for each array task only when that task is launched. Pending array tasks therefore don't have unique job records, so the "Invalid job id specified" error makes sense - the job record doesn't exist yet. However, it's interesting that the jobs are still killed anyway.

I'm able to reproduce:

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      531800_[2-5]     debug     wrap marshall PD       0:00     10 (Resources)
          531800_1     debug     wrap marshall  R       0:01     10 v[1-10]

$ scancel 531800_[2,5] -v
scancel: Terminating job 531800_[2,5]
scancel: error: Kill job error on job id 531800_[2,5]: Invalid job id specified

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      531800_[3-4]     debug     wrap marshall PD       0:00     10 (Resources)
          531800_1     debug     wrap marshall  R       0:13     10 v[1-10]


I'll work on a fix so it doesn't error when it succeeds in killing the job.
Comment 2 Jenny Williams 2018-03-30 17:03:04 MDT
We see the same error under version 17.11.3-2, but for us the signal did not terminate the jobs - they ran to completion.

Jenny
Comment 8 Marshall Garey 2018-04-06 17:38:56 MDT
Hi Mauricio, I just wanted to give an update. I found the problem and have a fix that is pending review. The error itself is harmless, but the fix stops it from being reported.

Jenny, this bug was only with signalling PENDING job array jobs. If you're signalling a RUNNING job and it isn't getting killed, that's something else.
Comment 9 Mauricio van den Berg 2018-04-09 19:07:34 MDT
Hi Marshall,

Thanks for the update. Good to know the error is harmless.
Comment 12 Marshall Garey 2018-04-19 15:41:38 MDT
This has been fixed in commit 8432f9f64cb40d9291bdb43bb3e12864641778b3. It will be in 17.11.6.

Closing as resolved/fixed.