Ticket 4348

Summary:	slurmstepd failed to launch job
Product:	Slurm	Reporter:	mengxing cheng <mxcheng>
Component:	slurmd	Assignee:	Tim Wickberg <tim>
Status:	RESOLVED FIXED	QA Contact:
Severity:	2 - High Impact
Priority:	---
Version:	16.05.11
Hardware:	Linux
OS:	Linux
Site:	University of Chicago	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	16.05.11
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	slurm.conf slurmd.log on compute compilation log

Description mengxing cheng 2017-11-08 15:46:03 MST

Dear slurm team,

We are upgrading 16.05.4 to 16.05.11 to patch the security bug. But when testing the 16.05.11 on a compute node with job submitted from a 16.05.4 login node and 16.05.4 controller. Job failed and the following errors come up. 

compute log
Nov  8 12:38:01 gibbs103 slurmd-gibbs103[31971]: _run_prolog: run job script took usec=5
Nov  8 12:38:01 gibbs103 slurmd-gibbs103[31971]: _run_prolog: prolog with lock for job 831130 ran for 0 seconds
Nov  8 12:38:01 gibbs103 slurmd-gibbs103[31971]: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
Nov  8 12:38:01 gibbs103 slurmd-gibbs103[31971]: debug3: slurmstepd rank 0 (gibbs103), parent rank -1 (NONE), children 0, depth 0, max_depth 0
Nov  8 12:38:01 gibbs103 slurmd-gibbs103[31971]: debug3: _send_slurmstepd_init: call to getpwuid_r
Nov  8 12:38:01 gibbs103 slurmd-gibbs103[31971]: debug3: _send_slurmstepd_init: return from getpwuid_r
Nov  8 12:38:01 gibbs103 slurmd-gibbs103[31971]: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
Nov  8 12:38:01 gibbs103 slurmd-gibbs103[31971]: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd



controller log
[2017-11-08T12:38:01.606] sched: _slurm_rpc_allocate_resources JobId=831130 NodeList=gibbs103 usec=8205
[2017-11-08T12:38:01.714] _pick_step_nodes: Configuration for job 831130 is complete
[2017-11-08T12:38:01.714] debug:  laying out the 1 tasks on 1 hosts gibbs103 dist 1
[2017-11-08T12:38:01.740] job_step_signal step 831130.0 not found
[2017-11-08T12:38:09.045] debug:  backfill: beginning
[2017-11-08T12:38:09.046] debug:  backfill: 261 jobs to backfill
[2017-11-08T12:38:09.304] debug:  sched: Running job scheduler
[2017-11-08T12:38:33.002] job_step_signal step 831130.0 not found
[2017-11-08T12:38:33.004] job_complete: JobID=831130 State=0x1 NodeCnt=1 WEXITSTATUS 255
[2017-11-08T12:38:33.007] job_complete: JobID=831130 State=0x8003 NodeCnt=1 done


client error
srun: error: Task launch for 831130.0 failed on node gibbs103: Unspecified error
srun: error: Application launch failed: Unspecified error
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Do you what is wrong?

Mengxing

Comment 1 Tim Wickberg 2017-11-08 15:55:45 MST

Can you attach your current slurm.conf ?

Comment 2 Tim Wickberg 2017-11-08 15:58:13 MST

Logs from the slurmstepd on the node would help isolate this.

My best guess is there's a problem with however you've installed the updated slurmd on the compute node - are you using RPM packages or some other install method?

Comment 3 mengxing cheng 2017-11-08 16:14:42 MST

Created attachment 5530 [details]
slurm.conf

Comment 4 mengxing cheng 2017-11-08 16:15:06 MST

Created attachment 5531 [details]
slurmd.log on compute

Comment 5 Tim Wickberg 2017-11-08 16:19:18 MST

Is there anything in 'dmesg' on the compute node?

It looks like the slurmstepd process dies almost immediately, and that you also have the core file size limit set to zero which precludes a core file from having been generated. If you're able to raise that limit, and get a backtrace from the presumed core file that would help considerably.

Comment 6 mengxing cheng 2017-11-08 16:21:03 MST

The 16.05.11 slurm is compiled from git source git://github.com/SchedMD/slurm.git branch 16.05. It was installed to gpfs /software/staging/slurm-16.05-el6-x86_64/. The legacy 16.05.4 was installed to the gpfs /software/slurm-16.05-el6-x86_64/.

I see some other errors in slurmd.log

[2017-11-08T17:01:43.718] debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
[2017-11-08T17:01:43.718] debug3: slurmstepd rank 0 (gibbs103), parent rank -1 (NONE), children 0, depth 0, max_depth 0
[2017-11-08T17:01:43.718] debug3: _send_slurmstepd_init: call to getpwuid_r
[2017-11-08T17:01:43.719] debug3: _send_slurmstepd_init: return from getpwuid_r
[2017-11-08T17:01:43.727] debug2: Cached group access list for mxcheng/2121976265
[2017-11-08T17:01:43.727] debug:  req.c:699: : safe_write (4 of 4) failed: Broken pipe
[2017-11-08T17:01:43.727] error: _send_slurmstepd_init failed
[2017-11-08T17:01:43.727] error: Unable to init slurmstepd
[2017-11-08T17:01:43.727] debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
[2017-11-08T17:02:15.009] debug3: in the service_connection
[2017-11-08T17:02:15.009] debug2: got this type of message 6011
[2017-11-08T17:02:15.009] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2017-11-08T17:02:15.009] debug:  _rpc_terminate_job, uid = 20006

Comment 7 Tim Wickberg 2017-11-08 16:24:03 MST

(In reply to mengxing cheng from comment #6)
> The 16.05.11 slurm is compiled from git source
> git://github.com/SchedMD/slurm.git branch 16.05. It was installed to gpfs
> /software/staging/slurm-16.05-el6-x86_64/. The legacy 16.05.4 was installed
> to the gpfs /software/slurm-16.05-el6-x86_64/.

Was this done using the --prefix option to configure, or through some other approach?

> I see some other errors in slurmd.log
> 

These are all fallout from the slurmstepd failing to launch. We need to isolate why the slurmstepd is apparently crashing immediately, but it does not look like details about this are making it in to the log file.

If you can find some kernel warning about a segfault, or change the core file size limit and get a backtrace out of the (presumably generated) slurmstepd code, that would help tremendously.

Comment 8 mengxing cheng 2017-11-08 16:37:50 MST

Created attachment 5532 [details]
compilation log

I just attached the build log of 16.05.11. Though --prefix=/software/slurm-16.05-el6-x86_64, our build system built the binary to /software/staging/slurm-16.05-el6-x86_64 for testing purpose. 

I don't see anything useful in dmesg and /var/log/messages. I attempted to adjust core file limit in both /etc/sysconfig/slurm and lively ulimit -c command and restarted slurmd on the compute node, but it still has the warning "Core limit is only 0 KB". Do you know how to tune the core limit for slurmd?

Comment 9 Tim Wickberg 2017-11-08 16:43:06 MST

(In reply to mengxing cheng from comment #8)
> Created attachment 5532 [details]
> compilation log
> 
> I just attached the build log of 16.05.11. Though
> --prefix=/software/slurm-16.05-el6-x86_64, our build system built the binary
> to /software/staging/slurm-16.05-el6-x86_64 for testing purpose. 

That is likely the source of the issue.

The path to the slurmstepd is hard-coded in, along with several library paths, and you cannot simply relocate the install and have things work properly.

While I'm not sure exactly what is happening, the slurmstepd is likely crashing due to a mismatch in versions in some library.

If you're installing to a single shared directory, we suggest doing something like:

/software/slurm/16.05.11

for the installation and --prefix value.

Use the explicit path to the slurmd / slurmctld / slurmdbd binaries at the version you want to start them.

For the user commands, a symlink of /software/slurm/current pointing to the preferred version can let you easily stage updates out over time - once all components in the system have been upgraded, change that current symlink to point to the latest version.

- Tim

Comment 10 mengxing cheng 2017-11-08 16:53:58 MST

Tim, 

Thank you for the prompt support. 

I have put the /software/staging/slurm-16.05-el6-x86_64 to the /etc/sysconfig/slurm that sourced by slurm init script to set environmental variable including LD_LIBRARY_PATH. Does that correctly set the environmental variable for slurmstepd? Do I have a way to set the slurmstepd environmental variable explicitly as PluginDir in slurm.conf?

Comment 11 Tim Wickberg 2017-11-08 16:56:23 MST

No, that and a number of other paths are hardcoded in the binary at compile time.

If you install to your staging area with the --prefix having been set differently, then you'll run into problems like you're currently facing.

Comment 12 mengxing cheng 2017-11-08 17:13:42 MST

Tim, thank you very much. I will recompile it. Could you keep this ticket open? 


Mengxing

Comment 13 mengxing cheng 2017-11-14 19:01:19 MST

Tim, we have upgraded to 16.05.11, which looks fine. You can close this ticket. Thank you very much for help!


Mengxing

Comment 14 Tim Wickberg 2017-11-14 19:15:24 MST

Glad to hear it's all set. Marking resolved.

- Tim

Comment 15 mengxing cheng 2017-11-14 21:19:44 MST

Tim, 

Actually, I didn't receive your email about the security patch, possibly because the email system mistakenly filtered it, until my colleague Jason Hedden who is also a SchedMD registered user told me about the patch. But Jason is gonna soon leaving the University. I wonder if it is possible to add a mailing list sysadm@rcc.uchicago.edu as a contact. Thank you!

Mengxing

Comment 16 mengxing cheng 2017-11-14 21:20:47 MST

sorry, it is sysadmin@rcc.uchicago.edu