Dear Slurm team,

We are upgrading from 16.05.4 to 16.05.11 to patch the security bug. When testing 16.05.11 on a compute node, with the job submitted from a 16.05.4 login node and a 16.05.4 controller, the job failed and the following errors came up.

compute log:

Nov 8 12:38:01 gibbs103 slurmd-gibbs103[31971]: _run_prolog: run job script took usec=5
Nov 8 12:38:01 gibbs103 slurmd-gibbs103[31971]: _run_prolog: prolog with lock for job 831130 ran for 0 seconds
Nov 8 12:38:01 gibbs103 slurmd-gibbs103[31971]: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
Nov 8 12:38:01 gibbs103 slurmd-gibbs103[31971]: debug3: slurmstepd rank 0 (gibbs103), parent rank -1 (NONE), children 0, depth 0, max_depth 0
Nov 8 12:38:01 gibbs103 slurmd-gibbs103[31971]: debug3: _send_slurmstepd_init: call to getpwuid_r
Nov 8 12:38:01 gibbs103 slurmd-gibbs103[31971]: debug3: _send_slurmstepd_init: return from getpwuid_r
Nov 8 12:38:01 gibbs103 slurmd-gibbs103[31971]: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
Nov 8 12:38:01 gibbs103 slurmd-gibbs103[31971]: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd

controller log:

[2017-11-08T12:38:01.606] sched: _slurm_rpc_allocate_resources JobId=831130 NodeList=gibbs103 usec=8205
[2017-11-08T12:38:01.714] _pick_step_nodes: Configuration for job 831130 is complete
[2017-11-08T12:38:01.714] debug: laying out the 1 tasks on 1 hosts gibbs103 dist 1
[2017-11-08T12:38:01.740] job_step_signal step 831130.0 not found
[2017-11-08T12:38:09.045] debug: backfill: beginning
[2017-11-08T12:38:09.046] debug: backfill: 261 jobs to backfill
[2017-11-08T12:38:09.304] debug: sched: Running job scheduler
[2017-11-08T12:38:33.002] job_step_signal step 831130.0 not found
[2017-11-08T12:38:33.004] job_complete: JobID=831130 State=0x1 NodeCnt=1 WEXITSTATUS 255
[2017-11-08T12:38:33.007] job_complete: JobID=831130 State=0x8003 NodeCnt=1 done

client error:

srun: error: Task launch for 831130.0 failed on node gibbs103: Unspecified error
srun: error: Application launch failed: Unspecified error
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Do you know what is wrong?

Mengxing
Can you attach your current slurm.conf?
Logs from the slurmstepd on the node would help isolate this. My best guess is there's a problem with however you've installed the updated slurmd on the compute node - are you using RPM packages or some other install method?
Created attachment 5530 [details] slurm.conf
Created attachment 5531 [details] slurmd.log on compute
Is there anything in 'dmesg' on the compute node? It looks like the slurmstepd process dies almost immediately, and you also have the core file size limit set to zero, which prevents a core file from being generated. If you're able to raise that limit and get a backtrace from the presumed core file, that would help considerably.
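For reference, a minimal sketch of raising the soft core-file limit in the environment that launches a daemon and verifying that it took effect; the binary path and gdb invocation in the comments are illustrative, not taken from this system:

```shell
#!/bin/sh
# Raise the soft core-file size limit for this shell and its children
# (so a forked slurmstepd may dump core), then print the resulting
# limit so the change can be verified.
raise_core_limit() {
    ulimit -S -c unlimited 2>/dev/null || return 1
    ulimit -S -c
}

# Example (hypothetical path): run this in the environment that starts
# slurmd, then after a crash inspect the core file with gdb, e.g.:
#   gdb /software/staging/slurm-16.05-el6-x86_64/sbin/slurmstepd core -ex bt
```

This only raises the soft limit; if the hard limit is also zero, it must be raised by a privileged process first.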
The 16.05.11 Slurm was compiled from the git source (git://github.com/SchedMD/slurm.git, branch 16.05). It was installed to GPFS at /software/staging/slurm-16.05-el6-x86_64/. The legacy 16.05.4 was installed to GPFS at /software/slurm-16.05-el6-x86_64/.

I see some other errors in slurmd.log:

[2017-11-08T17:01:43.718] debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
[2017-11-08T17:01:43.718] debug3: slurmstepd rank 0 (gibbs103), parent rank -1 (NONE), children 0, depth 0, max_depth 0
[2017-11-08T17:01:43.718] debug3: _send_slurmstepd_init: call to getpwuid_r
[2017-11-08T17:01:43.719] debug3: _send_slurmstepd_init: return from getpwuid_r
[2017-11-08T17:01:43.727] debug2: Cached group access list for mxcheng/2121976265
[2017-11-08T17:01:43.727] debug: req.c:699: : safe_write (4 of 4) failed: Broken pipe
[2017-11-08T17:01:43.727] error: _send_slurmstepd_init failed
[2017-11-08T17:01:43.727] error: Unable to init slurmstepd
[2017-11-08T17:01:43.727] debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
[2017-11-08T17:02:15.009] debug3: in the service_connection
[2017-11-08T17:02:15.009] debug2: got this type of message 6011
[2017-11-08T17:02:15.009] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2017-11-08T17:02:15.009] debug: _rpc_terminate_job, uid = 20006
(In reply to mengxing cheng from comment #6)
> The 16.05.11 slurm is compiled from git source
> git://github.com/SchedMD/slurm.git branch 16.05. It was installed to gpfs
> /software/staging/slurm-16.05-el6-x86_64/. The legacy 16.05.4 was installed
> to the gpfs /software/slurm-16.05-el6-x86_64/.

Was this done using the --prefix option to configure, or through some other approach?

> I see some other errors in slurmd.log

These are all fallout from the slurmstepd failing to launch. We need to isolate why the slurmstepd is apparently crashing immediately, but it does not look like details about this are making it into the log file. If you can find a kernel warning about a segfault, or change the core file size limit and get a backtrace out of the (presumably generated) slurmstepd core, that would help tremendously.
Created attachment 5532 [details]
compilation log

I just attached the build log of 16.05.11. Although --prefix was set to /software/slurm-16.05-el6-x86_64, our build system installed the binaries to /software/staging/slurm-16.05-el6-x86_64 for testing purposes.

I don't see anything useful in dmesg or /var/log/messages. I attempted to adjust the core file limit both in /etc/sysconfig/slurm and live via the ulimit -c command, and restarted slurmd on the compute node, but it still shows the warning "Core limit is only 0 KB". Do you know how to tune the core limit for slurmd?
(In reply to mengxing cheng from comment #8)
> Created attachment 5532 [details]
> compilation log
>
> I just attached the build log of 16.05.11. Though
> --prefix=/software/slurm-16.05-el6-x86_64, our build system built the binary
> to /software/staging/slurm-16.05-el6-x86_64 for testing purpose.

That is likely the source of the issue. The path to the slurmstepd is hard-coded in, along with several library paths, and you cannot simply relocate the install and have things work properly. While I'm not sure exactly what is happening, the slurmstepd is likely crashing due to a mismatch in versions in some library.

If you're installing to a single shared directory, we suggest doing something like /software/slurm/16.05.11 for the installation and --prefix value. Use the explicit path to the slurmd / slurmctld / slurmdbd binaries at the version you want to start them. For the user commands, a symlink of /software/slurm/current pointing to the preferred version can let you easily stage updates out over time - once all components in the system have been upgraded, change that current symlink to point to the latest version.

- Tim
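The versioned-prefix-plus-symlink layout described above can be sketched as a small shell helper; the root directory and version here are illustrative, not the actual site paths:

```shell
#!/bin/sh
# Stage a versioned Slurm install under a shared root and point a
# "current" symlink at it: user commands resolve through the symlink,
# while daemons are started from explicit versioned paths.
stage_version() {
    root=$1    # e.g. /software/slurm (hypothetical)
    ver=$2     # e.g. 16.05.11
    mkdir -p "$root/$ver"
    # -n replaces the symlink itself instead of descending into it
    ln -sfn "$root/$ver" "$root/current"
}

# Usage (hypothetical):
#   stage_version /software/slurm 16.05.11
# then configure with --prefix=/software/slurm/16.05.11 and start the
# daemons via /software/slurm/16.05.11/sbin/slurmd, etc.
```

Flipping the `current` symlink after all daemons are upgraded is what makes the staged rollout atomic for the user commands.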
Tim,

Thank you for the prompt support. I have added /software/staging/slurm-16.05-el6-x86_64 to /etc/sysconfig/slurm, which is sourced by the Slurm init script to set environment variables including LD_LIBRARY_PATH. Does that correctly set the environment for slurmstepd? Is there a way to set the slurmstepd paths explicitly, the way PluginDir is set in slurm.conf?
No, that and a number of other paths are hardcoded into the binary at compile time. If you install to your staging area while --prefix was set to a different path, you'll run into problems like the ones you're currently facing.
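One way to sanity-check for this kind of mismatch (a sketch; the binary path in the comment is an example, not this system's) is to look for the configure-time prefix embedded in the slurmd binary:

```shell
#!/bin/sh
# Print any install-prefix-looking strings embedded in a binary. If the
# compiled-in prefix differs from where the tree actually lives, the
# daemon will look for slurmstepd and its libraries in the wrong place.
show_embedded_prefixes() {
    strings "$1" | grep -E '^/software/' | sort -u
}

# Example (hypothetical path):
#   show_embedded_prefixes /software/staging/slurm-16.05-el6-x86_64/sbin/slurmd
```

If this prints the original --prefix rather than the staging path, the relocation is the problem.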
Tim, thank you very much. I will recompile it. Could you keep this ticket open? Mengxing
Tim, we have upgraded to 16.05.11 and it looks fine. You can close this ticket. Thank you very much for your help! Mengxing
Glad to hear it's all set. Marking resolved. - Tim
Tim,

Actually, I didn't receive your email about the security patch, possibly because the email system mistakenly filtered it, until my colleague Jason Hedden, who is also a registered SchedMD user, told me about the patch. But Jason will soon be leaving the University. I wonder if it is possible to add a mailing list, sysadm@rcc.uchicago.edu, as a contact. Thank you!

Mengxing
sorry, it is sysadmin@rcc.uchicago.edu