We have some jobs that run for longer than 11-13:46:40 (one million seconds), and the extern step seems to time out at that point: pam_slurm_adopt no longer works correctly and existing extern processes are killed. Would it be possible to add a few zeros to the "sleep 1000000" in _spawn_job_container() in slurmstepd/mgr.c?
Sure thing, we'll add two zeros for 17.11 and see if we can do something a bit smarter for 18.08. Until that's committed, you can also just add a couple of zeros to the sleep time yourself and recompile, if you're comfortable with that. You don't have to restart anything, since the change only affects the slurmstepd binary; it will take effect for any future job launches.
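For reference, the durations discussed above can be checked with a short calculation. This is only a sketch in Python: the helper name slurm_duration is made up for illustration and is not part of Slurm. It converts a number of seconds into the D-HH:MM:SS notation used in the report, for the original sleep value and for the same value with two extra zeros.

```python
def slurm_duration(seconds: int) -> str:
    """Format a non-negative number of seconds as D-HH:MM:SS."""
    days, rem = divmod(seconds, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)       # 3600 seconds per hour
    minutes, secs = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"


OLD_SLEEP = 1_000_000          # sleep value quoted in the report
NEW_SLEEP = OLD_SLEEP * 100    # the same value with two extra zeros

print(slurm_duration(OLD_SLEEP))  # 11-13:46:40
print(slurm_duration(NEW_SLEEP))  # 1157-09:46:40
```

With two extra zeros the sleep lasts a little over three years, which is why it is described as a stopgap rather than the smarter fix planned for 18.08.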
I was holding this bug open for the 18.08 development work, since we wanted a better fix, but I have moved that into an internal bug (bug 5268). Two extra zeros were added to the sleep for 17.11 in commit 140758ca77b4ad8 (committed a couple of months ago). Closing as resolved/fixed.