Ticket 106

Summary: Fix broadcasting actions using sbcast from aborting after 20 minutes.
Product: Slurm
Reporter: Bill Brophy <bill.brophy>
Component: Other
Assignee: Danny Auble <da>
Status: RESOLVED FIXED
QA Contact:
Severity: 1 - System not usable
Priority: ---
CC: da
Version: 2.4.x
Hardware: Linux
OS: Linux
Site: CEA
Alineos Sites: ---
Attachments: patch to change validity period for sbcast credential

Description Bill Brophy 2012-08-09 08:36:58 MDT
Created attachment 105
patch to change validity period for sbcast credential

When using sbcast to broadcast a large file to the allocated nodes, the
 transfer is aborted after 20 minutes with the following errors
 printed on the sbcast client side:

 sbcast: error REQUEST_FILE_BCAST(leaf2800): invalid job credential
 sbcast: error REQUEST_FILE_BCAST(leaf2825): invalid job credential
 sbcast: error REQUEST_FILE_BCAST(leaf2819): invalid job credential

 On the slurmd side, the following lines are printed:

 [date] sbcast reqreq_uid=1025 fname=/tmp/bigdata block_no=1
 .....
 [date] sbcast reqreq_uid=1025 fname=/tmp/bigdata block_no=798
 [date] error: Security violation: invalid sbcast_cred from uid 1025

 Looking at the function create_sbcast_cred in src/common/slurm_cred.c,
 this is because the default validity period for an sbcast credential
 is DEFAULT_EXPIRATION_WINDOW, which equals 1200 seconds (20 minutes).
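
 The failure mode can be sketched roughly as follows. This is an
 illustration only, not the Slurm source and not the attached patch: the
 struct and the sketch_* names are hypothetical, the 1200-second constant
 is assumed to mirror DEFAULT_EXPIRATION_WINDOW, and the exact comparison
 slurmd performs is an assumption. sketch_issue_until shows one possible
 direction (a caller-chosen expiration such as the job's end time); the
 ticket text does not say which value the patch actually uses.

```c
#include <stdbool.h>
#include <time.h>

/* Assumed to mirror DEFAULT_EXPIRATION_WINDOW from the ticket (1200 s). */
#define SKETCH_EXPIRATION_WINDOW 1200

/* Hypothetical stand-in for an sbcast credential; only the expiry matters. */
struct sketch_cred {
    time_t expiration;
};

/* Issue a credential that expires a fixed window after creation. */
struct sketch_cred sketch_issue_fixed(time_t now)
{
    struct sketch_cred cred = { .expiration = now + SKETCH_EXPIRATION_WINDOW };
    return cred;
}

/* Issue a credential with a caller-chosen expiration (assumption: for
 * example, the end time of the job that owns the allocation). */
struct sketch_cred sketch_issue_until(time_t expiration)
{
    struct sketch_cred cred = { .expiration = expiration };
    return cred;
}

/* A file block is accepted only while the credential has not expired. */
bool sketch_cred_valid(const struct sketch_cred *cred, time_t now)
{
    return now <= cred->expiration;
}
```

 With the fixed window, a block that arrives 1201 seconds after the
 credential was issued fails the check, which would match a transfer
 that proceeds normally for 20 minutes and then reports an invalid
 credential. With an expiration chosen to outlast the transfer, the same
 block passes.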

(This fix was provided by CEA.)
Comment 1 Danny Auble 2012-08-09 09:00:40 MDT
Thanks Bill, this patch is a great idea.  It will be in the next 2.4 release.  Was ctx->expiry_window looked at as well?  I haven't looked at the code closely yet, but it appears to be suspicious as well.
Comment 2 Danny Auble 2012-09-17 09:31:03 MDT
ping?
Comment 3 Bill Brophy 2012-09-18 02:18:54 MDT
Danny,
Sorry I missed your question.  (I was just glad to see you liked the patch.)  I just took a quick look at ctx->expiry_window and it sure looks like it has the potential for the same problem, but is it appropriate to update this on a per-job basis since it is a public key?
Best Regards,
Bill
Comment 4 Moe Jette 2012-09-18 05:58:05 MDT
I just studied this code. The job credentials for sbcast are handled differently from those for job steps. The expiry_window variable that Danny references is not used with sbcast credentials. Bill's patch addresses the problem for long-running jobs, and no additional changes should be required.