Ticket 106 - Fix broadcasting actions using sbcast from aborting after 20 minutes.
Summary: Fix broadcasting actions using sbcast from aborting after 20 minutes.
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 2.4.x
Hardware: Linux Linux
: 1 - System not usable
Assignee: Danny Auble
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2012-08-09 08:36 MDT by Bill Brophy
Modified: 2012-09-18 05:58 MDT

See Also:
Site: CEA


Attachments
patch to change validity period for sbcast credential (2.60 KB, application/octet-stream)
2012-08-09 08:36 MDT, Bill Brophy

Description Bill Brophy 2012-08-09 08:36:58 MDT
Created attachment 105
patch to change validity period for sbcast credential

When using sbcast to broadcast a large file to the allocated nodes, the 
 transfer is aborted after 20 minutes with the following errors 
 printed on the sbcast client side:

 sbcast: error REQUEST_FILE_BCAST(leaf2800): invalid job credential
 sbcast: error REQUEST_FILE_BCAST(leaf2825): invalid job credential
 sbcast: error REQUEST_FILE_BCAST(leaf2819): invalid job credential

 On the slurmd side, the following lines are printed :

 [date] sbcast req_uid=1025 fname=/tmp/bigdata block_no=1
 .....
 [date] sbcast req_uid=1025 fname=/tmp/bigdata block_no=798
 [date] error: Security violation: invalid sbcast_cred from uid 1025

 Looking at the function create_sbcast_cred in src/common/slurm_cred.c, 
 this is because the default validity period for an sbcast credential 
 is DEFAULT_EXPIRATION_WINDOW, which equals 1200 seconds (20 minutes).

(This fix was provided by CEA.)
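
The failure mode described above can be illustrated with a minimal sketch. It is not the attached patch and not Slurm's actual code: the struct and the demo_* functions are hypothetical names invented for illustration, and tying the expiry to the job's end time is an assumption about the kind of change involved. Only DEFAULT_EXPIRATION_WINDOW and its value of 1200 seconds come from the ticket.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Value quoted in the ticket: 1200 seconds, i.e. 20 minutes. */
#define DEFAULT_EXPIRATION_WINDOW 1200

/* Hypothetical, stripped-down credential: only the expiry matters here. */
struct demo_sbcast_cred {
    time_t expiration;  /* absolute time after which the credential is rejected */
};

/* Behaviour described in the ticket: a fixed window from creation time. */
struct demo_sbcast_cred demo_cred_fixed_window(time_t now)
{
    struct demo_sbcast_cred cred = { .expiration = now + DEFAULT_EXPIRATION_WINDOW };
    return cred;
}

/* Assumed alternative: the caller supplies the expiry, e.g. the job's end time. */
struct demo_sbcast_cred demo_cred_until(time_t now, time_t job_end)
{
    assert(job_end > now);  /* an already-expired credential would be useless */
    struct demo_sbcast_cred cred = { .expiration = job_end };
    return cred;
}

/* Receiver-side check: reject once the current time has passed the expiry. */
bool demo_cred_valid(const struct demo_sbcast_cred *cred, time_t now)
{
    return now <= cred->expiration;
}
```

With a fixed window, a transfer that is still sending blocks 1201 seconds after the credential was created fails the check, which would match the "invalid job credential" errors reported at the 20 minute mark. With an expiry supplied per job, the same block would pass as long as the job has not reached its end time.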
Comment 1 Danny Auble 2012-08-09 09:00:40 MDT
Thanks Bill, this patch is a great idea.  It will be in the next 2.4 release.  Was ctx->expiry_window looked at as well?  I haven't looked at the code closely yet, but it appears to be suspicious as well.
Comment 2 Danny Auble 2012-09-17 09:31:03 MDT
ping?
Comment 3 Bill Brophy 2012-09-18 02:18:54 MDT
Danny,
Sorry I missed your question.  (I was just glad to see you liked the patch.)  I just took a quick look at ctx->expiry_window and it sure looks like it has the potential for the same problem, but is it appropriate to update this on a per-job basis since it is a public key?
Best Regards,
Bill
Comment 4 Moe Jette 2012-09-18 05:58:05 MDT
I just studied this code. The job credentials for sbcast are handled differently than those for job steps. The expiry_window variable that Danny references is not used with sbcast credentials. Bill's patch addresses the problem for long-running jobs and no additional changes should be required.