Ticket 3730 - After upgrade to 17.02.2, spank-x11 plugin broken and nodes draining
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.02.2
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
Reported: 2017-04-23 08:15 MDT by HMS Research Computing
Modified: 2017-05-02 12:41 MDT (History)
Site: Harvard Medical School


Description HMS Research Computing 2017-04-23 08:15:27 MDT
Hi,

We recently upgraded Slurm from 16.05.4 to 17.02.2,
and for some reason the X11 display no longer works.

Previously we were able to run X11 jobs using --x11 and --x11=batch.

**This is the output of the job**
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `/usr/libexec/slurm-spank-x11 -u root -s "ssh" -o "" -f (null) -d localhost:11.0 -t compute-a-16-30.o2.rc.hms.harvard.edu -i 909796.4294967294 -cwg  &'
slurmstepd: error: x11: unable to get a DISPLAY value
slurmstepd: error: spank: required plugin x11.so: user_init() failed with rc=-6
slurmstepd: error: spank_user failed.
slurmstepd: error: Unable to return to working directory
slurmstepd: error: job_manager exiting abnormally, rc = 4020
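
The syntax error above is consistent with a literal "(null)" (what glibc's printf prints for a NULL string) ending up unquoted in a command passed to `sh -c`, where `(` is a shell metacharacter. The following is a minimal sketch of that class of failure, not the plugin's actual code. It assumes a POSIX `sh` is on the PATH, and `echo` is a harmless stand-in for /usr/libexec/slurm-spank-x11:

```python
import subprocess


def run_sh(command: str) -> subprocess.CompletedProcess:
    """Run a command string through `sh -c`, as the logged helper call was."""
    return subprocess.run(["sh", "-c", command], capture_output=True, text=True)


# `echo` stands in for the real helper so nothing is actually launched.
# Unquoted "(null)": the shell sees "(" as syntax and refuses to parse the line.
unquoted = run_sh("echo -f (null) -d localhost:11.0")

# Quoted "(null)": the shell passes it through as an ordinary argument.
quoted = run_sh("echo -f '(null)' -d localhost:11.0")

print("unquoted:", unquoted.returncode, unquoted.stderr.strip())
print("quoted:  ", quoted.returncode, quoted.stdout.strip())
```

The unquoted call fails before the helper ever starts, which would match the plugin then reporting that it could not get a DISPLAY value.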

The major issue is that this sets nodes to the Drain state with the reason "batch job complete failure".
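
Once the underlying cause is fixed, the drained nodes still have to be returned to service. The sketch below picks out nodes drained with this reason and builds the matching `scontrol update ... State=RESUME` command without running it. It assumes the input comes from something like `sinfo -h -R -o "%N|%E"` (one "nodelist|reason" pair per line); the function names are hypothetical.

```python
from typing import List

REASON = "batch job complete failure"  # drain reason reported in this ticket


def nodes_with_reason(sinfo_output: str, reason: str = REASON) -> List[str]:
    """Return node lists whose drain reason contains `reason`.

    Expects lines shaped like "<nodelist>|<reason>" (assumed format, see lead-in).
    """
    matches = []
    for line in sinfo_output.splitlines():
        nodelist, sep, node_reason = line.partition("|")
        if sep and reason.lower() in node_reason.lower():
            matches.append(nodelist.strip())
    return matches


def resume_command(nodelist: str) -> List[str]:
    """Build (but do not run) the scontrol call that resumes the given nodes."""
    return ["scontrol", "update", f"NodeName={nodelist}", "State=RESUME"]


def dry_run(sinfo_output: str) -> None:
    """Print the commands an admin could review and then run by hand."""
    for nodelist in nodes_with_reason(sinfo_output):
        print(" ".join(resume_command(nodelist)))
```

Printing the commands instead of executing them keeps the decision to resume a node with the administrator.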

**slurmd logs on one of the nodes**
[2017-04-21T11:35:55.137] [873337] error: x11: unable to get a DISPLAY value
[2017-04-21T11:35:55.138] [873337] error: spank: required plugin x11.so: user_init() failed with rc=-6
[2017-04-21T11:35:55.139] [873337] error: spank_user failed.
[2017-04-21T11:35:55.140] [873337] error: Unable to return to working directory
[2017-04-21T11:35:55.170] [873337] error: job_manager exiting abnormally, rc = 4020
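
To see how many jobs and nodes are affected, the slurmd logs can be scanned for the SPANK failure line shown above. This is a rough sketch that assumes the log format matches the excerpt exactly; the function name and the returned tuple layout are illustrative only.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as:
# [2017-04-21T11:35:55.138] [873337] error: spank: required plugin x11.so: user_init() failed with rc=-6
SPANK_FAILURE = re.compile(
    r"^\[(?P<time>[^\]]+)\] \[(?P<job>[^\]]+)\] error: spank: "
    r"required plugin (?P<plugin>\S+): (?P<hook>\w+)\(\) failed with rc=(?P<rc>-?\d+)"
)


def spank_failures(lines: Iterable[str]) -> List[Tuple[str, str, str, int]]:
    """Return (job id, plugin, callback, return code) for each SPANK failure line."""
    results = []
    for line in lines:
        match = SPANK_FAILURE.match(line.strip())
        if match:
            results.append(
                (match["job"], match["plugin"], match["hook"], int(match["rc"]))
            )
    return results
```

It can be fed an open log file directly, for example `spank_failures(open(path))`, where `path` is wherever slurmd writes its log on the node.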

We don't know why the x11 plugin broke after upgrading Slurm.
Comment 3 Tim Shaw 2017-04-25 10:18:17 MDT
Hello,

Make sure you recompile the x11 SPANK plugin.  SPANK plugins using the Slurm APIs need to be recompiled when upgrading Slurm to a new major release.

https://slurm.schedmd.com/spank.html

Hope that helps.

Regards.

Tim
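
Following the advice above, every plugin listed in plugstack.conf is a candidate for a rebuild after a major upgrade. The sketch below lists those entries from the file's text. It handles only `required` and `optional` lines and skips `include` lines; the file location (often /etc/slurm/plugstack.conf) is an assumption and varies by site.

```python
from typing import List, NamedTuple


class PlugstackEntry(NamedTuple):
    flag: str        # "required" or "optional"
    path: str        # plugin path exactly as written in the file
    args: List[str]  # any arguments that follow the path


def parse_plugstack(text: str) -> List[PlugstackEntry]:
    """Extract plugin entries from plugstack.conf text (simplified sketch)."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if fields[0] in ("required", "optional") and len(fields) >= 2:
            entries.append(PlugstackEntry(fields[0], fields[1], fields[2:]))
        # "include" lines and anything unrecognised are skipped here
    return entries
```

Each returned path points at a shared object that should be rebuilt against the new Slurm headers before slurmd is restarted.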
Comment 4 Tim Shaw 2017-05-01 16:13:14 MDT
Hello,

I just wanted to verify this recommendation fixed the problem you were seeing.  Is it okay to resolve this bug?

Thanks

Tim
Comment 5 HMS Research Computing 2017-05-02 12:38:00 MDT
Hi Tim,

Yes, recompiling the plugin fixed the problem. You can close this ticket.

Thanks for the help!

Raffaele
Comment 6 Tim Shaw 2017-05-02 12:41:12 MDT
Resolving this bug.