| Summary: | slurmstepd failed to launch job | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | mengxing cheng <mxcheng> |
| Component: | slurmd | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | | |
| Version: | 16.05.11 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | University of Chicago | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 16.05.11 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, slurmd.log on compute, compilation log | | |
Description (mengxing cheng, 2017-11-08 15:46:03 MST)

Comment (Tim Wickberg):
Can you attach your current slurm.conf? Logs from the slurmstepd on the node would help isolate this. My best guess is that there's a problem with how you've installed the updated slurmd on the compute node. Are you using RPM packages or some other install method?

Comment (mengxing cheng):
Created attachment 5530 [details]
slurm.conf

Comment (mengxing cheng):
Created attachment 5531 [details]
slurmd.log on compute
Comment (Tim Wickberg):
Is there anything in 'dmesg' on the compute node? It looks like the slurmstepd process dies almost immediately, and you also have the core file size limit set to zero, which prevents a core file from being generated. If you're able to raise that limit and get a backtrace from the presumed core file, that would help considerably.

Comment (mengxing cheng):
The 16.05.11 Slurm is compiled from git source git://github.com/SchedMD/slurm.git, branch 16.05. It was installed to GPFS at /software/staging/slurm-16.05-el6-x86_64/. The legacy 16.05.4 was installed to GPFS at /software/slurm-16.05-el6-x86_64/.

I see some other errors in slurmd.log:

    [2017-11-08T17:01:43.718] debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
    [2017-11-08T17:01:43.718] debug3: slurmstepd rank 0 (gibbs103), parent rank -1 (NONE), children 0, depth 0, max_depth 0
    [2017-11-08T17:01:43.718] debug3: _send_slurmstepd_init: call to getpwuid_r
    [2017-11-08T17:01:43.719] debug3: _send_slurmstepd_init: return from getpwuid_r
    [2017-11-08T17:01:43.727] debug2: Cached group access list for mxcheng/2121976265
    [2017-11-08T17:01:43.727] debug:  req.c:699: : safe_write (4 of 4) failed: Broken pipe
    [2017-11-08T17:01:43.727] error: _send_slurmstepd_init failed
    [2017-11-08T17:01:43.727] error: Unable to init slurmstepd
    [2017-11-08T17:01:43.727] debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
    [2017-11-08T17:02:15.009] debug3: in the service_connection
    [2017-11-08T17:02:15.009] debug2: got this type of message 6011
    [2017-11-08T17:02:15.009] debug2: Processing RPC: REQUEST_TERMINATE_JOB
    [2017-11-08T17:02:15.009] debug:  _rpc_terminate_job, uid = 20006

Comment (Tim Wickberg):
(In reply to mengxing cheng from comment #6)
> The 16.05.11 slurm is compiled from git source
> git://github.com/SchedMD/slurm.git branch 16.05. It was installed to gpfs
> /software/staging/slurm-16.05-el6-x86_64/. The legacy 16.05.4 was installed
> to the gpfs /software/slurm-16.05-el6-x86_64/.

Was this done using the --prefix option to configure, or through some other approach?

> I see some other errors in slurmd.log

These are all fallout from the slurmstepd failing to launch. We need to isolate why the slurmstepd is apparently crashing immediately, but it does not look like details about this are making it into the log file. If you can find a kernel warning about a segfault, or change the core file size limit and get a backtrace out of the (presumably generated) slurmstepd core file, that would help tremendously.

Comment (mengxing cheng):
Created attachment 5532 [details]
compilation log
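Tim's request, raising the core limit and pulling a backtrace from the resulting core file, might look like the following sketch. The core file name and location are illustrative (the real location depends on kernel.core_pattern); the binary path is the staging install from this ticket.

```shell
# Illustrative paths: BIN is the staging install from this ticket,
# core.12345 is a hypothetical core file name.
BIN=/software/staging/slurm-16.05-el6-x86_64/sbin/slurmstepd
CORE=/var/spool/slurmd/core.12345

# Guarded so this is a no-op on machines without gdb or a core file yet.
if command -v gdb >/dev/null && [ -r "$CORE" ]; then
    # -batch exits after running the -ex command; a backtrace of every
    # thread is usually enough to see where slurmstepd died.
    gdb -batch -ex 'thread apply all bt full' "$BIN" "$CORE"
else
    echo "gdb or core file not available"
fi
```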
Comment (mengxing cheng):
I just attached the build log of 16.05.11. Although --prefix=/software/slurm-16.05-el6-x86_64 was given to configure, our build system installed the binaries to /software/staging/slurm-16.05-el6-x86_64 for testing purposes.
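The relocation problem can be observed directly: the configure-time --prefix is embedded in the built binaries, so a copy moved into the staging tree still references the original path. A small simulation, using a throwaway file in /tmp in place of the real slurmd binary:

```shell
# Simulate a relocated binary: a fake file standing in for sbin/slurmd
# that carries the configure-time prefix, as the real binary would.
printf 'stepd=/software/slurm-16.05-el6-x86_64/sbin/slurmstepd\n' \
    > /tmp/relocated-slurmd

# grep -a treats binary data as text; a nonzero match count means the
# old --prefix path is still baked in despite the move to staging/.
grep -ac '/software/slurm-16.05-el6-x86_64' /tmp/relocated-slurmd
```

Running the same grep against the real relocated slurmd or slurmstepd would show the same thing: the binaries still look for plugins and helpers under the original prefix.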
I don't see anything useful in dmesg or /var/log/messages. I tried to adjust the core file limit both in /etc/sysconfig/slurm and interactively with the ulimit -c command, then restarted slurmd on the compute node, but it still logs the warning "Core limit is only 0 KB". Do you know how to raise the core limit for slurmd?
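One common reason the "Core limit is only 0 KB" warning persists is that a ulimit issued in an interactive shell does not reach a daemon started by an init script. A sketch, assuming the init script sources /etc/sysconfig/slurm before starting slurmd:

```shell
# Placed in /etc/sysconfig/slurm (sourced by the slurm init script),
# this runs in the daemon's own startup shell rather than yours:
ulimit -c unlimited

# After restarting slurmd, confirm from the process's own point of view.
# Here we check the current shell ($$); for the daemon, read
# /proc/<slurmd-pid>/limits instead.
grep 'Max core file size' /proc/$$/limits
```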
Comment (Tim Wickberg):
(In reply to mengxing cheng from comment #8)
> Created attachment 5532 [details]
> compilation log
>
> I just attached the build log of 16.05.11. Though
> --prefix=/software/slurm-16.05-el6-x86_64, our build system built the binary
> to /software/staging/slurm-16.05-el6-x86_64 for testing purpose.

That is likely the source of the issue. The path to the slurmstepd is hard-coded in, along with several library paths, and you cannot simply relocate the install and have things work properly. While I'm not sure exactly what is happening, the slurmstepd is likely crashing due to a version mismatch in some library.

If you're installing to a single shared directory, we suggest something like /software/slurm/16.05.11 for both the installation directory and the --prefix value. Use the explicit path to the slurmd / slurmctld / slurmdbd binaries at the version you want to start. For the user commands, a symlink of /software/slurm/current pointing to the preferred version lets you stage updates out over time: once all components in the system have been upgraded, change that current symlink to point to the latest version.

- Tim

Comment (mengxing cheng):
Tim, thank you for the prompt support. I have added /software/staging/slurm-16.05-el6-x86_64 to /etc/sysconfig/slurm, which is sourced by the slurm init script to set environment variables, including LD_LIBRARY_PATH. Does that correctly set the environment for slurmstepd? Is there a way to set the slurmstepd paths explicitly, the way PluginDir is set in slurm.conf?

Comment (Tim Wickberg):
No, that and a number of other paths are hardcoded into the binary at compile time. If you install to your staging area with --prefix having been set differently, you'll run into problems like the ones you're currently facing.

Comment (mengxing cheng):
Tim, thank you very much. I will recompile it. Could you keep this ticket open?

Mengxing

Comment (mengxing cheng):
Tim, we have upgraded to 16.05.11, which looks fine. You can close this ticket. Thank you very much for the help!

Mengxing

Comment (Tim Wickberg):
Glad to hear it's all set. Marking resolved.
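Tim's versioned-layout suggestion can be sketched like this; the /tmp prefix and the "next-version" directory name are placeholders for real install paths and release numbers:

```shell
# One directory per release, matching the --prefix each was built with.
mkdir -p /tmp/software/slurm/16.05.11 /tmp/software/slurm/next-version

# User commands resolve through the "current" symlink.
ln -sfn /tmp/software/slurm/16.05.11 /tmp/software/slurm/current

# Daemons are started by explicit versioned path, e.g.
#   /tmp/software/slurm/16.05.11/sbin/slurmd
# Once every daemon has been upgraded, flip the symlink in one step:
ln -sfn /tmp/software/slurm/next-version /tmp/software/slurm/current
readlink /tmp/software/slurm/current
```

Because each tree is built with its own matching --prefix, nothing is ever relocated, and the symlink flip only changes what the user-facing commands resolve to.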
- Tim

Comment (mengxing cheng):
Tim, actually, I didn't receive your email about the security patch until my colleague Jason Hedden, who is also a registered SchedMD user, told me about it; the email system may have mistakenly filtered it. Jason will be leaving the University soon. Would it be possible to add the mailing list sysadm@rcc.uchicago.edu as a contact? Thank you!

Mengxing

Comment (mengxing cheng):
Sorry, it is sysadmin@rcc.uchicago.edu.