| Summary: | Fix broadcasting actions using sbcast from aborting after 20 minutes. | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Bill Brophy <bill.brophy> |
| Component: | Other | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 1 - System not usable | ||
| Priority: | --- | CC: | da |
| Version: | 2.4.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CEA | Slinky Site: | --- |
| Attachments: | patch to change validity period for sbcast credential | ||

Thanks Bill, this patch is a great idea. It will be in the next 2.4 release. Was the ctx->expiry_window looked at as well? I haven't looked at the code closely yet, but it appears to be suspicious as well.

ping?

Danny,

Sorry I missed your question. (I was just glad to see you liked the patch.) I just took a quick look at ctx->expiry_window and it sure looks like it has the potential for the same problem, but is it appropriate to update this on a per-job basis since it is a public key?

Best Regards,
Bill

I just studied this code. The job credentials for sbcast are handled differently than those for job steps. The expiry_window variable that Danny references is not used with sbcast credentials. Bill's patch addresses the problem for long-running jobs and no additional changes should be required.

Created attachment 105
patch to change validity period for sbcast credential

When using sbcast to broadcast a large file to the allocated nodes, the transfer is aborted after 20 minutes with the following errors printed on the sbcast client side:

    sbcast: error REQUEST_FILE_BCAST(leaf2800): invalid job credential
    sbcast: error REQUEST_FILE_BCAST(leaf2825): invalid job credential
    sbcast: error REQUEST_FILE_BCAST(leaf2819): invalid job credential

On the slurmd side, the following lines are printed:

    [date] sbcast reqreq_uid=1025 fname=/tmp/bigdata block_no=1
    .....
    [date] sbcast reqreq_uid=1025 fname=/tmp/bigdata block_no=798
    [date] error: Security violation: invalid sbcast_cred from uid 1025

Looking at the function create_sbcast_cred in src/common/slurm_cred.c, this is because the default validity period for an sbcast credential is DEFAULT_EXPIRATION_WINDOW, which equals 1200 seconds (20 minutes).

(This fix was provided by CEA.)
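
To make the failure mode concrete, here is a minimal sketch in C of how a fixed validity window produces the 20-minute cutoff. It is not the Slurm source and not the attached patch (the patch contents are not shown in this report). The helper names `expiry_fixed_window`, `expiry_job_end`, and `cred_is_valid` are hypothetical, and tying the expiry to the job's end time is only one assumed way of lengthening the validity period.

```c
#include <stdbool.h>
#include <time.h>

/* Value quoted in the report: 1200 seconds, i.e. 20 minutes. */
#define DEFAULT_EXPIRATION_WINDOW 1200

/* Hypothetical: expiry computed as a fixed window after creation,
 * which is the behaviour the report describes. */
static time_t expiry_fixed_window(time_t created)
{
    return created + DEFAULT_EXPIRATION_WINDOW;
}

/* Hypothetical alternative (an assumption, not the actual patch):
 * let the credential live until the job allocation ends. */
static time_t expiry_job_end(time_t job_end_time)
{
    return job_end_time;
}

/* Hypothetical check applied to each incoming file block:
 * the credential is accepted only up to its expiration time. */
static bool cred_is_valid(time_t expiration, time_t now)
{
    return now <= expiration;
}
```

With a credential created at t = 0 and a block arriving 25 minutes (1500 seconds) into the transfer, `cred_is_valid(expiry_fixed_window(0), 1500)` is false, which corresponds to the "invalid job credential" errors above. Under the assumed alternative, a two-hour allocation gives `cred_is_valid(expiry_job_end(7200), 1500)`, which is true.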