Ticket 17463

Summary: job_container not working with autoumount
Product: Slurm Reporter: Yann <yann.sagon>
Component: Configuration Assignee: Megan Dahl <megan>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: mcmullan, megan
Version: 23.02.1   
Hardware: Linux   
OS: Linux   
Site: Université de Genève

Description Yann 2023-08-18 07:45:04 MDT
Hello,

I've enabled today job_container/tmpfs on our cluster during a maintenance.

Unfortunately users let us know there is an issue since this plugin is enabled.

The issue is that it isn't possible to access an automounted filesystem on a compute node unless it was already mounted before the job starts.

Steps to reproduce

First attempt: the share `/acanas/celen` isn't mounted yet on the compute node
```
(baobab)-[celen@admin1 ~]$ salloc -n1 -c1 --partition=shared-cpu --nodelist=cpu058
salloc: Pending job allocation 4776909
salloc: job 4776909 queued and waiting for resources
salloc: job 4776909 has been allocated resources
salloc: Granted job allocation 4776909
salloc: Nodes cpu058 are ready for job
(baobab)-[celen@cpu058 ~]$ ls /acanas/celen
ls: cannot open directory '/acanas/celen': Too many levels of symbolic links
(baobab)-[celen@cpu058 ~]$ exit
```
Second attempt: the share `/acanas/celen` is already mounted from the previous attempt
```
srun: error: cpu058: task 0: Exited with exit code 2
salloc: Relinquishing job allocation 4776909
salloc: Job allocation 4776909 has been revoked.
(baobab)-[celen@admin1 ~]$ salloc -n1 -c1 --partition=shared-cpu --nodelist=cpu058
salloc: Pending job allocation 4776913
salloc: job 4776913 queued and waiting for resources
salloc: job 4776913 has been allocated resources
salloc: Granted job allocation 4776913
salloc: Nodes cpu058 are ready for job
(baobab)-[celen@cpu058 ~]$ ls /acanas/celen
 addpaths
[...]
```

This post was talking about the same issue, which is why I suspected job_container might be involved.
Comment 1 Tim McMullan 2023-08-18 07:56:47 MDT
Would you please attach your job_container.conf file?

Thanks!
--Tim
Comment 2 Megan Dahl 2023-08-25 09:08:32 MDT
Hello,

Is this still an issue? If so, could you send your job_container.conf?

Thanks,
--Megan
Comment 3 Yann 2023-08-25 09:34:04 MDT
Yes, it's still an issue, but we reverted the changes. I'll redo the configuration on one node and send you the conf file next week.
Comment 4 Megan Dahl 2023-08-25 09:36:53 MDT
Sounds good, thank you for the update.

Regards,
--Megan
Comment 5 Megan Dahl 2023-09-05 16:56:09 MDT
Hello,

Were you able to recreate the job_container.conf?

Thanks,
--Megan
Comment 6 Megan Dahl 2023-09-11 16:52:35 MDT
Hello,

Is this still an issue? If so, are you able to send the job_container.conf?

Thanks,
--Megan
Comment 7 Megan Dahl 2023-09-15 09:05:47 MDT
Hello,

Since there has been no response for a few weeks I’m going to close the ticket. However, don’t hesitate to reopen the bug if needed.

Regards,
--Megan
Comment 8 Yann 2023-10-12 08:52:54 MDT
Dear team, I was only able to experiment with this feature this week, as our cluster is in maintenance.

It seems I missed the "Shared" option when I first tried: https://slurm.schedmd.com/job_container.conf.html#OPT_Shared

It seems to be working fine with that option turned on.
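For anyone hitting the same symptom, a minimal job_container.conf sketch with Shared enabled might look like the following. The BasePath value here is only a placeholder, not our actual site configuration:

```
# job_container.conf — minimal sketch, not a site-verified config.
# BasePath is a placeholder; point it at a node-local directory.
AutoBasePath=true
BasePath=/var/tmp/slurm
# Shared=true makes the job's mount namespace use shared mount
# propagation, so autofs mounts triggered after the job starts
# (or from outside the namespace) remain visible inside the job.
Shared=true
```

Without Shared, the job's private namespace holds a stale copy of the autofs trigger mount, which is what produces the "Too many levels of symbolic links" error on a not-yet-mounted share.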

Thanks for your help

Best

Yann
Comment 9 Megan Dahl 2023-10-12 09:09:44 MDT
I’m glad your issue was resolved. Thank you for letting us know what your solution was.

Regards,
--Megan