Ticket 17463

Summary: job_container not working with autoumount
Product: Slurm Reporter: Yann <yann.sagon>
Component: Configuration Assignee: Megan Dahl <megan>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: mcmullan, megan
Version: 23.02.1   
Hardware: Linux   
OS: Linux   
Site: Université de Genève

Description Yann 2023-08-18 07:45:04 MDT
Hello,

I've enabled today job_container/tmpfs on our cluster during a maintenance.

Unfortunately users let us know there is an issue since this plugin is enabled.

The issue is that it isn't possible to access an automounted filesystem on a compute node unless it was already mounted before the job starts.

Steps to reproduce

First attempt: the share `/acanas/celen` isn't mounted yet on the compute node
```
(baobab)-[celen@admin1 ~]$ salloc -n1 -c1 --partition=shared-cpu --nodelist=cpu058
salloc: Pending job allocation 4776909
salloc: job 4776909 queued and waiting for resources
salloc: job 4776909 has been allocated resources
salloc: Granted job allocation 4776909
salloc: Nodes cpu058 are ready for job
(baobab)-[celen@cpu058 ~]$ ls /acanas/celen
ls: cannot open directory '/acanas/celen': Too many levels of symbolic links
(baobab)-[celen@cpu058 ~]$ exit
```
Second attempt: the share `/acanas/celen` is already mounted from the previous attempt
```
srun: error: cpu058: task 0: Exited with exit code 2
salloc: Relinquishing job allocation 4776909
salloc: Job allocation 4776909 has been revoked.
(baobab)-[celen@admin1 ~]$ salloc -n1 -c1 --partition=shared-cpu --nodelist=cpu058
salloc: Pending job allocation 4776913
salloc: job 4776913 queued and waiting for resources
salloc: job 4776913 has been allocated resources
salloc: Granted job allocation 4776913
salloc: Nodes cpu058 are ready for job
(baobab)-[celen@cpu058 ~]$ ls /acanas/celen
 addpaths
[...]
```

This post was talking about the same issue, which is why I suspected job_container might be involved.
Comment 1 Tim McMullan 2023-08-18 07:56:47 MDT
Would you please attach your job_container.conf file?

Thanks!
--Tim
Comment 2 Megan Dahl 2023-08-25 09:08:32 MDT
Hello,

Is this still an issue? If so, could you send your job_container.conf?

Thanks,
--Megan
Comment 3 Yann 2023-08-25 09:34:04 MDT
Yes, it's still an issue, but we reverted the changes. I'll redo the configuration on one node and send you the conf file next week.
Comment 4 Megan Dahl 2023-08-25 09:36:53 MDT
Sounds good, thank you for the update.

Regards,
--Megan
Comment 5 Megan Dahl 2023-09-05 16:56:09 MDT
Hello,

Were you able to recreate the job_container.conf?

Thanks,
--Megan
Comment 6 Megan Dahl 2023-09-11 16:52:35 MDT
Hello,

Is this still an issue? If so, are you able to send the job_container.conf?

Thanks,
--Megan
Comment 7 Megan Dahl 2023-09-15 09:05:47 MDT
Hello,

Since there has been no response for a few weeks I’m going to close the ticket. However, don’t hesitate to reopen the bug if needed.

Regards,
--Megan
Comment 8 Yann 2023-10-12 08:52:54 MDT
Dear team, I was only able to experiment with this feature this week, as our cluster is in maintenance.

It seems I missed the "Shared" option when I first tried: https://slurm.schedmd.com/job_container.conf.html#OPT_Shared

It seems to be working fine with that option turned on.
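For anyone hitting the same symptom, a minimal job_container.conf sketch with Shared enabled might look like the following. The BasePath value here is only a placeholder, not our actual site configuration:

```
# job_container.conf — minimal sketch, not a site-verified config.
# BasePath is a placeholder; point it at a node-local directory.
AutoBasePath=true
BasePath=/var/tmp/slurm
# Shared=true makes the job's mount namespace use shared mount
# propagation, so autofs mounts triggered after the job starts
# (or from outside the namespace) remain visible inside the job.
Shared=true
```

Without Shared, the job's private namespace holds a stale copy of the autofs trigger mount, which is what produces the "Too many levels of symbolic links" error on a not-yet-mounted share.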

Thanks for your help

Best

Yann
Comment 9 Megan Dahl 2023-10-12 09:09:44 MDT
I’m glad your issue was resolved. Thank you for letting us know what your solution was.

Regards,
--Megan