Hello,

I enabled job_container/tmpfs on our cluster today during a maintenance. Unfortunately, users let us know there has been an issue since this plugin was enabled: it isn't possible to access an automounted filesystem on a compute node if it wasn't already mounted before the job starts.

Steps to reproduce:

First attempt, the share `/acanas/celen` isn't mounted yet on the compute node:

```
(baobab)-[celen@admin1 ~]$ salloc -n1 -c1 --partition=shared-cpu --nodelist=cpu058
salloc: Pending job allocation 4776909
salloc: job 4776909 queued and waiting for resources
salloc: job 4776909 has been allocated resources
salloc: Granted job allocation 4776909
salloc: Nodes cpu058 are ready for job
(baobab)-[celen@cpu058 ~]$ ls /acanas/celen
ls: cannot open directory '/acanas/celen': Too many levels of symbolic links
(baobab)-[celen@cpu058 ~]$ exit
srun: error: cpu058: task 0: Exited with exit code 2
salloc: Relinquishing job allocation 4776909
salloc: Job allocation 4776909 has been revoked.
```

Second attempt, the share `/acanas/celen` is already mounted by the previous attempt:

```
(baobab)-[celen@admin1 ~]$ salloc -n1 -c1 --partition=shared-cpu --nodelist=cpu058
salloc: Pending job allocation 4776913
salloc: job 4776913 queued and waiting for resources
salloc: job 4776913 has been allocated resources
salloc: Granted job allocation 4776913
salloc: Nodes cpu058 are ready for job
(baobab)-[celen@cpu058 ~]$ ls /acanas/celen
addpaths
[...]
```

This post was talking about the same issue, which is why I had the idea that job_container may be involved.
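For context, a typical job_container/tmpfs configuration follows the pattern from the Slurm documentation; the sketch below is only illustrative (the BasePath value is an example, not necessarily what we deployed):

```
# job_container.conf (illustrative sketch, not our actual file)
# Create the per-job base directory automatically on each node.
AutoBasePath=true
# Node-local path under which per-job private /tmp and /dev/shm live;
# the path shown here is an example value.
BasePath=/var/run/slurm/containers
```

With this plugin active, each job step runs in its own mount namespace, which is where the interaction with automounted shares appears to come from.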
Would you please attach your job_container.conf file? Thanks! --Tim
Hello, Is this still an issue? If so, could you send your job_container.conf? Thanks, --Megan
Yes, it's still an issue, but we reverted the changes. I'll redo the configuration on one node and send you the conf file next week.
Sounds good, thank you for the update. Regards, --Megan
Hello, Were you able to recreate the job_container.conf? Thanks, --Megan
Hello, Is this still an issue? If so, are you able to send the job_container.conf? Thanks, --Megan
Hello, Since there has been no response for a few weeks I’m going to close the ticket. However, don’t hesitate to reopen the bug if needed. Regards, --Megan
Dear team,

I was only able to play with this feature this week, as our cluster is in maintenance. It seems I hadn't seen the "Shared" option when I first tried: https://slurm.schedmd.com/job_container.conf.html#OPT_Shared

It seems to work fine with that option turned on.

Thanks for your help.

Best,
Yann
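For anyone hitting the same problem, a configuration along these lines works for us (a sketch only; the BasePath value below is an example, not our actual path):

```
# job_container.conf (sketch; BasePath is an example value)
AutoBasePath=true
BasePath=/var/run/slurm/containers
# Shared mount propagation: mounts made on the host after the job starts
# (e.g. autofs triggering a mount) become visible inside the job's namespace.
Shared=true
```

After changing the file, the slurmd daemons on the compute nodes need to pick up the new configuration for it to take effect.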
I’m glad your issue was resolved. Thank you for letting us know what the solution was. Regards, --Megan