Hi,

we had a user who had issues running a Python script with matplotlib and x11 forwarding. Other x11 applications had no issues, and we could sometimes get his script to work. After some digging, it seems that matplotlib connects to x11 about 3 times before displaying anything to the user. We wrote a small reproducer that works locally and remotely over ssh, but fails when run with srun.

----8<-----------
#include <dlfcn.h>
#include <stdlib.h>
#include <stdio.h>

/* Based on Matplotlib src/_c_internal_utils.cpp */
int
mpl_xdisplay_is_valid(void)
{
    void* libX11;
    // The getenv check is redundant but helps performance as it is much faster
    // than dlopen().
    if (getenv("DISPLAY")
        && (libX11 = dlopen("libX11.so.6", RTLD_LAZY))) {
        typedef struct Display* (*XOpenDisplay_t)(char const*);
        typedef int (*XCloseDisplay_t)(struct Display*);
        struct Display* display = NULL;
        XOpenDisplay_t XOpenDisplay = (XOpenDisplay_t)dlsym(libX11, "XOpenDisplay");
        XCloseDisplay_t XCloseDisplay = (XCloseDisplay_t)dlsym(libX11, "XCloseDisplay");
        if (XOpenDisplay && XCloseDisplay
                && (display = XOpenDisplay(NULL))) {
            XCloseDisplay(display);
        }
        if (dlclose(libX11)) {
            printf("BAD CLOSE\n");
        }
        if (display) {
            return 1;
        }
    }
    return 0;
}

void main(int argc, char **argv) {
    printf("success %d\n", mpl_xdisplay_is_valid());
    printf("success %d\n", mpl_xdisplay_is_valid());
    printf("success %d\n", mpl_xdisplay_is_valid());
    printf("success %d\n", mpl_xdisplay_is_valid());
}
---->8-----------

[jonst@tetra ~]$ gcc -o xtest xtest.c
[jonst@tetra ~]$ ./xtest
success 1
success 1
success 1
success 1
[jonst@tetra ~]$ srun -n1 --x11 ./xtest
srun: job 43071435 queued and waiting for resources
srun: job 43071435 has been allocated resources
X connection to localhost:94.0 broken (explicit kill or server shutdown).
success 1
success 0
success 0
srun: error: n341: task 0: Exited with exit code 1
srun: Terminating StepId=43071435.0

Adding a sleep between the calls, or removing the XCloseDisplay calls, makes it work, but that is not something we can add to the matplotlib code the user has. We are trying some other workarounds, but we were wondering if this is a known issue and if you have any recommendations.
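For reference, the "sleep between the calls" variant mentioned above could look roughly like the sketch below. It reuses the same open-and-close probe as the reproducer; the helper name `probe_with_delay` and its `delay_ms` parameter are hypothetical and only there to illustrate the idea. The delay value needed to avoid the failure under srun is not known from this report, so any number passed in is a guess.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct Display; /* opaque Xlib type, only used through pointers */

/* Same probe as the reproducer: open the display and close it again. */
int xdisplay_is_valid(void)
{
    typedef struct Display *(*XOpenDisplay_t)(char const *);
    typedef int (*XCloseDisplay_t)(struct Display *);
    void *libX11;
    struct Display *display = NULL;

    if (!getenv("DISPLAY") || !(libX11 = dlopen("libX11.so.6", RTLD_LAZY))) {
        return 0;
    }
    XOpenDisplay_t open_fn = (XOpenDisplay_t)dlsym(libX11, "XOpenDisplay");
    XCloseDisplay_t close_fn = (XCloseDisplay_t)dlsym(libX11, "XCloseDisplay");
    if (open_fn && close_fn && (display = open_fn(NULL))) {
        close_fn(display);
    }
    if (dlclose(libX11)) {
        printf("BAD CLOSE\n");
    }
    return display != NULL;
}

/*
 * Hypothetical helper (not part of matplotlib or Slurm): run the probe
 * `count` times, sleeping `delay_ms` milliseconds between consecutive
 * probes. Returns the number of probes that succeeded.
 */
int probe_with_delay(int count, long delay_ms)
{
    int ok = 0;

    assert(count >= 0 && delay_ms >= 0);
    for (int i = 0; i < count; i++) {
        if (i > 0 && delay_ms > 0) {
            struct timespec ts = { delay_ms / 1000,
                                   (delay_ms % 1000) * 1000000L };
            nanosleep(&ts, NULL);
        }
        ok += xdisplay_is_valid();
    }
    return ok;
}
```

Calling `probe_with_delay(4, 500)` from `main` would correspond to the four calls in the reproducer with half a second between them; the 500 ms figure is arbitrary, not a measured threshold. On older glibc versions the program may need to be linked with `-ldl`.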
Created attachment 41254 xtest.c
This does look like behavior similar to an ongoing issue with x11. I will try your example here and confirm it's the same issue. Other users have reported issues with opening graphical MATLAB with srun and x11 forwarding (https://support.schedmd.com/show_bug.cgi?id=21783).

-Connor
Hi, just checking if there has been any progress on this. Is there any patch (even an untested one) that we may have a look at?
Hey Jonas, there has been progress on a patch, but unfortunately it is still going through review. I'll double-check with the team whether we want to share it in its current state.

The other option is reverting a change that was made in the past. Depending on your workflow, that could either be an issue or a non-issue.

https://github.com/SchedMD/slurm/commit/ecfc7f6ff7

That patch was pushed to resolve leaking sockets for jobs that ran interactive sessions with "salloc" and opened and closed several x11 applications. However, it also causes some applications to not open at all, which is the issue that brought about this ticket. If your interactive jobs that use x11 forwarding don't typically open and close several x11 applications in one session, reverting the commit linked above will allow your application to run.

-Connor