Ticket 2166 - Orphaned dw_wlm_cli processes on slurmctld server
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Burst Buffers
Version: 15.08.3
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Moe Jette
Reported: 2015-11-17 10:49 MST by David Paul
Modified: 2015-11-18 04:35 MST
Site: NERSC
Machine Name: Cori
Version Fixed: 15.08.5

Description David Paul 2015-11-17 10:49:44 MST
There are two orphaned dw_wlm_cli processes lingering on the server.  Is any debug info desired before killing them off?  Any way to determine why they still exist?

root@ctl1==> ps -elf | grep slurm
5 S root     10614     1  9  80   0 - 414137 hrtime Nov16 ?       03:09:55 /opt/slurm/default/sbin/slurmctld

0 S root     13237     1  0  80   0 - 25701 poll_s Nov09 ?        00:00:21 /usr/bin/python /opt/cray/dw_wlm/default/bin/dw_wlm_cli --function data_in --token 19182 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.2/job.19182/script

0 S root     24730     1  0  80   0 - 25702 poll_s Nov09 ?        00:00:20 /usr/bin/python /opt/cray/dw_wlm/default/bin/dw_wlm_cli --function data_in --token 20669 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.9/job.20669/script
Comment 1 Moe Jette 2015-11-17 10:51:50 MST
Please attach the slurmctld log file entries related to jobs 19182 and 20669, plus any slurmctld daemon restart information. Thanks!
Comment 2 Moe Jette 2015-11-18 04:35:05 MST
This commit should fix the problem.
https://github.com/SchedMD/slurm/commit/7b270dc6d28284c67bd955fe8d2fa7375ae8050b

Until it is applied, there may be vestigial dw_wlm_cli processes left when the slurmctld daemon is shut down.
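
For sites that want to check for such leftovers before the fix is applied, the following is a minimal sketch, not part of Slurm or the Cray DataWarp tooling. It scans `ps -elf` output for dw_wlm_cli processes whose parent PID is 1, which is how the two processes in the description appear. The function name `find_orphans` and the `SAMPLE` text are illustrative assumptions; the sample lines are copied from the description, with a typical `ps -elf` header added. A parent PID of 1 only suggests that the original parent exited, so each hit should be checked against the slurmctld log before it is killed.

```python
# Sample `ps -elf` output; the process lines are taken from the ticket description.
SAMPLE = """\
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
5 S root     10614     1  9  80   0 - 414137 hrtime Nov16 ?       03:09:55 /opt/slurm/default/sbin/slurmctld
0 S root     13237     1  0  80   0 - 25701 poll_s Nov09 ?        00:00:21 /usr/bin/python /opt/cray/dw_wlm/default/bin/dw_wlm_cli --function data_in --token 19182 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.2/job.19182/script
0 S root     24730     1  0  80   0 - 25702 poll_s Nov09 ?        00:00:20 /usr/bin/python /opt/cray/dw_wlm/default/bin/dw_wlm_cli --function data_in --token 20669 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.9/job.20669/script
"""


def find_orphans(ps_output, needle="dw_wlm_cli"):
    """Return (pid, command) pairs for processes matching `needle` with PPID 1.

    Assumes the `ps -elf` column layout:
    F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
    """
    orphans = []
    for line in ps_output.splitlines():
        # 14 leading columns, then the full command line as the last field.
        fields = line.split(None, 14)
        if len(fields) < 15:
            continue
        # Skip the header and any line without numeric PID/PPID columns.
        if not (fields[3].isdigit() and fields[4].isdigit()):
            continue
        pid, ppid, cmd = int(fields[3]), int(fields[4]), fields[14].strip()
        if ppid == 1 and needle in cmd:
            orphans.append((pid, cmd))
    return orphans
```

In practice the text passed to `find_orphans` would come from running `ps -elf` on the slurmctld server. The `--token` value in each matching command line is the job ID used in the ticket (19182 and 20669), which makes it easy to find the related slurmctld log entries.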