Ticket 10318

Summary: user not able to run the job
Product: Slurm    Reporter: Phong Nguyen <phnguyen>
Component: Configuration    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN    QA Contact:
Severity: 2 - High Impact    Priority: ---
CC: felip.moll, nate
Version: 19.05.0
Hardware: Linux    OS: Linux
Site: Raytheon Missile, Space and Airborne
Version Fixed: 19    Target Release: ---
Attachments: job and error
slurm.conf
slurmdbd
submit job script

Description Phong Nguyen 2020-11-30 13:50:17 MST
Created attachment 16870 [details]
job and error

We have a standalone box on which we installed the Slurm server and client - this is so we can deliver it to the customer.

We are having issues running jobs with the sbatch command as the root and grid users.

Please see the attachments for more detail.
Comment 1 Phong Nguyen 2020-11-30 13:51:05 MST
Created attachment 16871 [details]
slurm.conf
Comment 2 Phong Nguyen 2020-11-30 13:51:31 MST
Created attachment 16872 [details]
slurmdbd
Comment 3 Phong Nguyen 2020-11-30 13:52:01 MST
Created attachment 16873 [details]
submit job script
Comment 4 Michael Hinton 2020-11-30 14:35:03 MST
Hi Phong,

I noticed this:

[Guest@HARM-Q6DOF ~]$ sacctmgr show user
      User   Def Acct     Admin 
---------- ---------- --------- 
      grid   hpcusers  Operator 
     guest   hpcusers Administ+ 
      root       root Administ+ 

Your current Linux user is `Guest`, while the user you set up in Slurm is `guest`. I think that might be the problem - I believe user names are case sensitive, so they should probably both be `guest`. Can you confirm whether that fixes it?
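For reference, the mismatch can be checked directly from the shell (a sketch; `Guest`/`guest` are the names from this ticket):

```shell
# Case-sensitive comparison: "Guest" and "guest" are different users to Slurm.
linux_user="Guest"   # what `whoami` reports on the node
slurm_user="guest"   # what `sacctmgr show user` lists
if [ "$linux_user" = "$slurm_user" ]; then
  echo "match"
else
  echo "no match: recreate the Slurm user as $linux_user or rename the Linux account"
fi
```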

Thanks,
-Michael
Comment 5 Phong Nguyen 2020-11-30 14:41:28 MST
My thought was the same as yours - I removed the guest/Guest user and recreated it, and I'm still getting the same error.

I need to be able to run jobs as root/grid using the sbatch command, but the error file always refers to /var/log/slurm/job$id.

Why is it making calls to /var/log/slurm? That directory already exists and its permissions are 755, owned by nobody:nobody.

I even changed it to 777 and it still doesn't work.
Comment 6 Phong Nguyen 2020-11-30 15:05:04 MST
Hi,

I have the customer with me - do you have any update on the sbatch failure referring to /var/log/slurm/job$id?

Thanks
Comment 7 Michael Hinton 2020-11-30 15:15:44 MST
Can you attach the output to the following commands?:

scontrol show assoc
sacctmgr show assoc

-Michael
Comment 8 Michael Hinton 2020-11-30 16:37:37 MST
Besides the Guest/guest user issue, for root, the stepd is failing to run the batch script. I would double check permissions on /var/log/slurm/ and also restart the slurmd if you suspect permissions were changed recently. See https://bugs.schedmd.com/show_bug.cgi?id=2180#c5
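A quick way to sanity-check the directory's mode and ownership (a sketch, demonstrated here on a scratch directory; substitute /var/log/slurm on the real node):

```shell
# Sketch: check the mode and ownership slurmstepd would care about.
# Demonstrated on a scratch directory; use /var/log/slurm on the node.
dir=$(mktemp -d)
chmod 755 "$dir"
stat -c '%a' "$dir"   # prints 755
ls -ld "$dir"         # also shows owner:group; slurmd must be able to traverse it
```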

-Michael
Comment 9 Phong Nguyen 2020-12-01 06:18:31 MST
Hi,

The permissions on /var/log/slurm are 755 and it is owned by nobody:nobody, just like on any other cluster that we set up.

Phong Nguyen
High Performance Computing

O: +1 520.794.0817
P: +1 520-446-1397
C: +1 303.949.0861
phnguyen@rtx.com<mailto:phnguyen@rtx.com>

Raytheon Missile & Defense Systems
Digital Technologies
PO Box 11337
MS TU/M05
Tucson, Az 85734-1337

Comment 10 Phong Nguyen 2020-12-01 06:34:52 MST
Hi,

I'll have the SA run "scontrol show assoc" and "sacctmgr show assoc" when he gets in this morning.

The situation is that the standalone box is located 30m away from me and there is NO remote access to it. It was built so it can be shipped to the customer.

Can we set up a time for phone support so I can drive there? They need to ship the box by Thursday, or next Monday at the latest.

I am off this Friday.
Comment 11 Phong Nguyen 2020-12-01 07:52:52 MST
Here are the results after running those commands:

scontrol show assoc
UserName=grid(1000) DefAccount=hpcusers DefWckey= AdminLevel=Operator 
UserName=root (0) DefAccount=root DefWckey= AdminLevel=Administrator
UserName=Guest(4294967294) DefAccount=hpcusers DefWckey=(null) AdminLevel=Administrator


sacctmgr show assoc
All accounts show as hpc_clus+; the Accounts section lists root, grid, and guest. Share is 1 for root and 100 for grid and guest.
Comment 12 Phong Nguyen 2020-12-01 08:21:33 MST
Hi,

I restarted slurmd and am still getting the same error when submitting a job with the sbatch command:

/var/log/slurm/job00364/slurm_script: Permission denied


Can we escalate this issue? Why is sbatch referring to /var/log/slurm?
Comment 14 Michael Hinton 2020-12-01 09:52:28 MST
Could you do `mount -a` and paste the output?

Is this a diskless installation? What's the NFS setup?

Could you do `cat /etc/exports` in the server exporting the filesystem?
Comment 15 Phong Nguyen 2020-12-01 09:56:44 MST
I will email the SA and hope he can get those for us.

No - this is a standalone box.


Comment 16 Michael Hinton 2020-12-01 10:02:30 MST
Ok, I might know the answer here:

Your NFS setup probably has root_squash set. Your SlurmdSpoolDir is set to /var/log/slurm, which is probably on the NFS mount. Slurmd/slurmstepd uses that folder to run the batch script. But root_squash makes it so that it is owned by user `nobody` instead of `root`.

I think the solution is to change your config to this:

SlurmdSpoolDir=/var/spool/slurmd

Then restart the slurmd.

Make sure /var/spool/slurmd is NOT under NFS but on the local machine, and make sure it has proper execute permissions.
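For context, a root_squash export would look something like this in /etc/exports on the NFS server (an illustrative line, not taken from this site's actual config):

```
# /etc/exports on the NFS server (illustrative)
/var/log  client(rw,sync,root_squash)   # root on the client is mapped to nobody
# no_root_squash would preserve root's identity over the mount, but moving
# SlurmdSpoolDir off NFS entirely is the cleaner fix.
```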
Comment 17 Michael Hinton 2020-12-01 10:12:46 MST
(In reply to Michael Hinton from comment #14)
> Could you do `mount -a` and paste the output?
Sorry, I meant just `mount`. Don't do `mount -a`.
Comment 18 Phong Nguyen 2020-12-01 12:29:06 MST
Hi
The issue has been resolved by changing SlurmdSpoolDir=/var/log/slurm to SlurmdSpoolDir=/var/spool/slurmd.

Comment 19 Michael Hinton 2020-12-01 12:56:58 MST
Ok, great. I'll close this out. Feel free to reopen if needed.

I would also make these recommendations, to avoid more issues in the future:

* Set StateSaveLocation=/var/spool/ctld. This location should be local, not on NFS, so that it's fast for slurmctld. It should be separate from the logging location so you don't accidentally destroy your state when clearing or rotating logs. It should also be different from SlurmdSpoolDir, since slurmctld and slurmd share the same machine.

* Change SlurmUser to slurm. You should avoid using NFS's 'nobody' user. Make sure StateSaveLocation is owned by user slurm. Then run ctld as user slurm and slurmd as user root.

* Change user "Guest" to "guest" to avoid any potential issues with case sensitivity.

* Avoid using NFS if it's a standalone, self-contained machine with no outside connectivity - I don't see the point.
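Taken together, the relevant slurm.conf lines would look something like this (a sketch based on the recommendations above; verify the paths against your site before shipping):

```
# slurm.conf (fragment)
SlurmUser=slurm                      # avoid NFS's nobody user
SlurmdSpoolDir=/var/spool/slurmd     # local disk, never NFS
StateSaveLocation=/var/spool/ctld    # local, separate from logs and from SlurmdSpoolDir
```

StateSaveLocation must be owned by the slurm user, and both slurmctld and slurmd need a restart after changing these.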

Thanks,
-Michael