Ticket 1381

Summary: scontrol update job can result in confusing error message
Product: Slurm
Reporter: Bill Brophy <bill.brophy>
Component: Other
Assignee: David Bigagli <david>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: brian, da
Version: 15.08.x
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4685
Site: Atos/Eviden Sites
Version Fixed: 15.08.0pre2
Target Release: ---
Attachments:
  pre-patch test results
  post-patch test results
  patch to src/slurmctld/job_mgr.c
  real pre-patch test results file
  slurm.conf file

Description Bill Brophy 2015-01-19 02:56:17 MST

    
Comment 1 Bill Brophy 2015-01-19 03:02:52 MST
Created attachment 1556
pre-patch test results
Comment 2 Bill Brophy 2015-01-19 03:03:32 MST
Created attachment 1557
post-patch test results
Comment 3 Bill Brophy 2015-01-19 03:05:51 MST
Created attachment 1558
patch to src/slurmctld/job_mgr.c
Comment 4 Bill Brophy 2015-01-19 03:08:54 MST
The user ran the command "scontrol update job=103 excnodelist=n1", which displayed the message:

Job violates accounting/QOS policy (job submit limit, user's size and/or time limits) for job 103

The user thought the update had failed; however, "scontrol show job 103" showed that the job's excnodelist had been updated successfully. The attached patch skips the error message if the requested update does not change the state of the job.
Best Regards,
Bill
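
The manual check described in this comment (comparing what "scontrol update" printed against what "scontrol show job" reports) could be scripted roughly as below. This is only a sketch and not part of the attached patch: the helper names parse_job_fields and excluded_nodes are hypothetical, the sample text is an abbreviated, made-up illustration of the key=value output format, and splitting on whitespace will mishandle any field whose value contains spaces.

```python
import subprocess


def parse_job_fields(show_job_output):
    """Split `scontrol show job` style output into a {key: value} dict.

    Assumption: fields are whitespace-separated KEY=VALUE tokens.
    Values that contain spaces are not handled by this sketch.
    """
    fields = {}
    for token in show_job_output.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not KEY=VALUE
            fields[key] = value
    return fields


def excluded_nodes(job_id):
    """Return the ExcNodeList value currently reported for job_id."""
    out = subprocess.run(
        ["scontrol", "show", "job", str(job_id)],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_job_fields(out).get("ExcNodeList")


# Abbreviated, made-up sample (not taken from the ticket attachments).
SAMPLE = """JobId=103 JobName=sleepme
   JobState=PENDING
   ReqNodeList=(null) ExcNodeList=n1
"""

if __name__ == "__main__":
    print(parse_job_fields(SAMPLE)["ExcNodeList"])
```

On a host with Slurm installed, comparing excluded_nodes(103) before and after the update shows whether the change was applied, regardless of the message that scontrol printed.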
Comment 5 David Bigagli 2015-01-20 09:34:48 MST
Hi Bill,
do you know how to reproduce the issue? I configured a submission limit
in my QOS, but I can still modify the job without problems.

david@prometeo ~/slurm/work $ sbatch -H -o /dev/null sleepme 3600
Submitted batch job 6769
david@prometeo ~/slurm/work $ sbatch -H -o /dev/null sleepme 3600
Submitted batch job 6770
david@prometeo ~/slurm/work $ sbatch -H -o /dev/null sleepme 3600
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
david@prometeo ~/slurm/work $ scontrol update job=6770 excnodelist=prometeo

Thanks,

David
Comment 6 David Bigagli 2015-01-20 09:42:12 MST
I am also a bit confused about the pre-patch and post-patch test results files.
The two files appear to be identical; diff returns 0.

David
Comment 7 Bill Brophy 2015-01-21 00:15:53 MST
Created attachment 1562
real pre-patch test results file

David,
I am sorry, I made a mistake. The new attachment shows how I set up the conditions that cause the problem, along with the test results prior to applying my patch.
Best Regards,
Bill
Comment 8 David Bigagli 2015-01-21 03:54:48 MST
Hi Bill,
I cannot reproduce this in either 14.11 or 15.08. Which version is your customer using? 15.08? I think I also need your configuration to try to
reproduce this again.

David
Comment 9 Bill Brophy 2015-01-21 04:47:02 MST
Created attachment 1565
slurm.conf file

David,
The customer was running Slurm 2.6.0, but I reproduced the problem on Slurm 15.08.0-0pre1. My slurm.conf is attached.
Best Regards,
Bill
Comment 10 David Bigagli 2015-01-22 07:12:14 MST
Thanks, Bill. We have fixed the issue in commit 9cf314e6054d.
As the diff shows, we chose a different approach,
because your patch completely disabled the accounting check.

David
Comment 11 Bill Brophy 2015-01-23 03:36:56 MST
David,
I like your fix.  Thank you for addressing this problem.
Best Regards,
Bill