Ticket 5000

Summary: Increase extern step sleep duration
Product: Slurm Reporter: Dylan Simon <dsimon>
Component: slurmstepd Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: marshall, tim
Version: 17.02.2   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9289
https://support.schedmd.com/show_bug.cgi?id=12407
Site: Simons Foundation & Flatiron Institute
Version Fixed: 17.11.6

Description Dylan Simon 2018-03-29 10:30:53 MDT
We have some jobs that run for more than 11-13:46:40, but the extern step seems to time out after that time (a million seconds): pam_slurm_adopt no longer works correctly and existing extern processes are killed.  Would it be possible to add a few zeros to the sleep 1000000 in slurmstepd/mgr.c _spawn_job_container?
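The limit quoted above is plain arithmetic: 1000000 seconds written in Slurm's days-hours:minutes:seconds notation. A minimal sketch in Python to check it (the helper name `to_slurm_time` is made up for this illustration and is not part of Slurm):

```python
def to_slurm_time(seconds: int) -> str:
    """Format a duration in seconds as D-HH:MM:SS."""
    days, rem = divmod(seconds, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)       # 3600 seconds per hour
    minutes, secs = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"


# The sleep value named in the description.
print(to_slurm_time(1_000_000))  # 11-13:46:40
```

Each zero appended to the sleep value multiplies this limit by ten.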
Comment 4 Marshall Garey 2018-03-29 15:05:38 MDT
Sure thing, we'll add two zeroes for 17.11 and see if we can do something a bit smarter for 18.08.

Until it's committed, you can probably also just add a couple zeroes to the sleep time yourself and recompile, if you're comfortable with that. You don't have to restart anything, since it only changes the slurmstepd binary. The change will be reflected in any future job launches.
Comment 12 Marshall Garey 2018-06-06 13:33:47 MDT
I was holding this bug open for the development work for 18.08, since we wanted a better fix, but I moved that into an internal bug (bug 5268). An extra two zeros have been added into the sleep for 17.11 in commit 140758ca77b4ad8 (committed a couple months ago).

Closing as resolved/fixed.
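
To see what the two extra zeros buy, the sketch below checks whether a job's time limit would outlive the extern step's sleep. It is an illustration only: it assumes the new value is the old literal with two zeros appended (100000000), it parses only the D-HH:MM:SS form of a Slurm time limit, and the helper names and the 14-day limit are hypothetical.

```python
OLD_SLEEP = 1_000_000        # seconds, the value named in the description
NEW_SLEEP = OLD_SLEEP * 100  # assumed: the same literal with two zeros appended


def slurm_time_to_seconds(text: str) -> int:
    """Parse a D-HH:MM:SS string (other Slurm time formats are not handled)."""
    days, clock = text.split("-", 1)
    hours, minutes, secs = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + secs


def outlives_sleep(time_limit: str, sleep_seconds: int) -> bool:
    """Return True if a job running to its limit would outlast the sleep."""
    return slurm_time_to_seconds(time_limit) > sleep_seconds


# Hypothetical 14-day job limit.
print(outlives_sleep("14-00:00:00", OLD_SLEEP))  # True
print(outlives_sleep("14-00:00:00", NEW_SLEEP))  # False
```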