I've come across a minor Slurm bug in 17.11 that's causing one of our tests to fail.

> [vagrant@node1 tests]$ sbatch --wrap='sleep 100' --partition=teamdev --array=1-5
> Submitted batch job 130
> [vagrant@node1 tests]$ squeue
> PARTITION PRIORITY NAME USER ST TIME TIME_LEFT NODES NODELIST(REASON JOBID
> teamdev 100 wrap vagrant PD 0:00 1:00:00 1 (Resources) 130_[2-5]
> teamdev 100 wrap vagrant R 0:01 59:59 1 node1 130_1
> [vagrant@node1 tests]$ scancel 130_[2,5] -v
> scancel: Terminating job 130_[2,5]
> scancel: error: Kill job error on job id 130_[2,5]: Invalid job id specified

The tasks are actually killed. The same command completes successfully in 17.02 and 16.05. It seems to fail only on job arrays.

We're actually using slurm_kill_job2, which returns non-zero in this scenario.

Thanks
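Since the call reports failure even though the tasks are gone, a test could check the squeue listing for the targeted tasks and not rely on the return code alone. The following is a minimal sketch of expanding a task expression such as `130_[2,5]` into individual task IDs for that comparison. The helper name `expand_array_spec` is hypothetical and not part of Slurm, and it handles only the comma and range forms shown in the output above, not step or throttle suffixes.

```python
import re


def expand_array_spec(spec):
    """Expand an array task expression into (job_id, [task_ids]).

    Hypothetical helper for a test harness. Supports only the forms
    seen in this report: '130_1', '130_[2,5]' and '130_[2-5]'.
    """
    match = re.fullmatch(r"(\d+)_(?:\[([\d,\-]+)\]|(\d+))", spec)
    if match is None:
        raise ValueError(f"unsupported task expression: {spec!r}")

    job_id = int(match.group(1))

    # Single task form, e.g. '130_1'.
    if match.group(3) is not None:
        return job_id, [int(match.group(3))]

    # Bracketed form: comma-separated task IDs and inclusive ranges.
    tasks = []
    for part in match.group(2).split(","):
        first, _, last = part.partition("-")
        tasks.extend(range(int(first), int(last or first) + 1))
    return job_id, tasks
```

A test could then assert that none of the returned task IDs still appear in the queue after cancelling, whatever the return code was.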
You're right, this is unique to job arrays. That's because when a job array is submitted, only one job record is created initially. When each job array job is launched, then a new job record is created for that job. The pending job array jobs don't have unique job records, so the error "Invalid job id specified" makes sense, because the job record doesn't exist. However, it's interesting that it still kills the jobs anyway.

I'm able to reproduce:

$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
531800_[2-5] debug wrap marshall PD 0:00 10 (Resources)
531800_1 debug wrap marshall R 0:01 10 v[1-10]
$ scancel 531800_[2,5] -v
scancel: Terminating job 531800_[2,5]
scancel: error: Kill job error on job id 531800_[2,5]: Invalid job id specified
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
531800_[3-4] debug wrap marshall PD 0:00 10 (Resources)
531800_1 debug wrap marshall R 0:13 10 v[1-10]

I'll work on a fix so it doesn't error when it succeeds in killing the job.
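The record behaviour described above can be pictured with a toy model. This is only an illustrative sketch: the class `ToyController` and all of its methods are hypothetical and do not reflect Slurm's actual data structures or code. It shows pending array tasks sharing one record, a per-task record appearing only at launch, and a cancel that removes pending tasks even though they have no record of their own.

```python
class ToyController:
    """Toy model of job array bookkeeping (hypothetical, not Slurm code)."""

    def __init__(self):
        # One shared record per array job: job id -> set of pending task IDs.
        self.pending = {}
        # Per-task records, created only when a task is launched.
        self.running = {}

    def submit_array(self, job_id, task_ids):
        # Submission creates a single record covering every task.
        self.pending[job_id] = set(task_ids)

    def launch(self, job_id, task_id):
        # Launching splits the task out into its own record.
        self.pending[job_id].remove(task_id)
        self.running[(job_id, task_id)] = {"state": "RUNNING"}

    def has_own_record(self, job_id, task_id):
        # Pending tasks have no individual record to look up.
        return (job_id, task_id) in self.running

    def cancel_pending(self, job_id, task_ids):
        # Cancelling pending tasks just trims the shared record.
        self.pending[job_id] -= set(task_ids)


ctl = ToyController()
ctl.submit_array(130, range(1, 6))
ctl.launch(130, 1)

found_running = ctl.has_own_record(130, 1)   # task 1 was launched
found_pending = ctl.has_own_record(130, 2)   # task 2 is still pending

ctl.cancel_pending(130, [2, 5])
remaining = sorted(ctl.pending[130])         # tasks 3 and 4 stay pending
```

In this model a lookup for an individual pending task finds nothing, while the cancel itself still takes effect, which matches the symptom in the squeue output: the error is printed, yet only tasks 3 and 4 remain pending.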
We see the same error under version 17.11.3-2. In our case the signal did not terminate the jobs; they ran to completion.

Jenny
Hi Mauricio, I just wanted to give an update. I found the problem and have a fix that is pending review. It appears that this error is harmless, though I did fix it so it doesn't error anymore. Jenny, this bug was only with signalling PENDING job array jobs. If you're signalling a RUNNING job and it isn't getting killed, that's something else.
Hi Marshall, Thanks for the update. Good to know the error is harmless.
This has been fixed in commit 8432f9f64cb40d9291bdb43bb3e12864641778b3. It will be in 17.11.6. Closing as resolved/fixed.