| Summary: | user not able to run the job | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Phong Nguyen <phnguyen> |
| Component: | Configuration | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | CC: | felip.moll, nate |
| Version: | 19.05.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Raytheon Missile, Space and Airborne | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 19 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | job and error, slurm.conf, slurmdbd, submit job script | | |
Created attachment 16871 [details]
slurm.conf
Created attachment 16872 [details]
slurmdbd
Created attachment 16873 [details]
submit job script
Hi Phong,
I noticed this:
[Guest@HARM-Q6DOF ~]$ sacctmgr show user
User Def Acct Admin
---------- ---------- ---------
grid hpcusers Operator
guest hpcusers Administ+
root root Administ+
Your current Linux user is `Guest`, while the user you set up in Slurm is `guest`. I think that might be the problem, since user names are case sensitive. They should probably both be `guest`. Can you confirm whether that fixes the problem?
Thanks,
-Michael
My thought was the same as yours. I removed the guest/Guest user and recreated it, and I am still getting the same error. I need to be able to run the job as root/grid using the sbatch command, but the error file always refers to /var/log/slurm/job$id. Why is it making a call to /var/log/slurm? That directory already exists with permissions 755 and ownership nobody:nobody. I even changed it to 777 and it is still not working.

Hi, I have the customer with me. Do you have any update on the sbatch failure referring to /var/log/slurm/job$id? Thanks

Can you attach the output of the following commands?
scontrol show assoc
sacctmgr show assoc
-Michael

Besides the Guest/guest user issue, for root, the stepd is failing to run the batch script. I would double-check permissions on /var/log/slurm/ and also restart the slurmd if you suspect permissions were changed recently. See https://bugs.schedmd.com/show_bug.cgi?id=2180#c5
-Michael

Hi,
The permissions on /var/log/slurm are 755 and it is owned by nobody:nobody, just like on any other cluster we set up.
Phong Nguyen
High Performance Computing
O: +1 520.794.0817
P: +1 520-446-1397
C: +1 303.949.0861
phnguyen@rtx.com<mailto:phnguyen@rtx.com>
Raytheon Missile & Defense Systems
Digital Technologies
PO Box 11337
MS TU/M05
Tucson, Az 85734-1337
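The spool-directory checks discussed above can be scripted along these lines. This is only a sketch: `SPOOL_DIR` is a placeholder defaulting to the path from this thread, and the NFS hint assumes GNU coreutils `stat` and `df`.

```shell
# Sketch: inspect the slurmd spool directory as slurmstepd would see it.
# SPOOL_DIR defaults to the path from this ticket; override to check another path.
SPOOL_DIR="${SPOOL_DIR:-/var/log/slurm}"

if [ -d "$SPOOL_DIR" ]; then
    # Mode, owner, and group of the directory.
    stat -c '%a %U %G' "$SPOOL_DIR"

    # Filesystem type: nfs/nfs4 means root_squash may remap root to nobody,
    # which stops slurmstepd from writing the batch script as root.
    df --output=fstype "$SPOOL_DIR" | tail -n 1
else
    echo "$SPOOL_DIR does not exist"
fi
```

An ownership of nobody:nobody on a directory that slurmd (running as root) must write to is itself a hint that the path sits on an NFS export with root squashing.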
Hi, I will have the SA run "scontrol show assoc" and "sacctmgr show assoc" when he gets in this morning. The situation is that the standalone box is located 30 m away from me and there is NO remote access to it; it was built so it can be shipped to the customer. Can we set up a time for phone support so I can drive there? They need to ship the box by Thursday, or next Monday at the latest. I am off this Friday.

Here are the results after running those commands:

scontrol show assoc
UserName=grid(1000) DefAccount=hpcusers DefWckey= AdminLevel=Operator
UserName=root(0) DefAccount=root DefWckey= AdminLevel=Administrator
UserName=Guest(4294967294) DefAccount=hpcusers DefWckey=(null) AdminLevel=Administrator

sacctmgr show assoc
All accounts show as hpc_clus+. The Accounts section has root, grid, and guest. Share is 1 for root; Share is 100 for grid and guest.

Hi, I restarted slurmd and am still getting the same error when submitting a job with the sbatch command:
/var/log/slurm/job00364/slurm_script: Permission denied
Can we escalate this issue? Why is sbatch referring to /var/log/slurm?

Could you do `mount -a` and paste the output? Is this a diskless installation? What's the NFS setup? Could you do `cat /etc/exports` on the server exporting the filesystem?

I will email the SA and hope he can get those for us. No - this is a standalone box.

Ok, I might know the answer here: your NFS setup probably has root_squash set. Your SlurmdSpoolDir is set to /var/log/slurm, which is probably on the NFS mount. Slurmd/slurmstepd uses that folder to run the batch script, but root_squash makes it so that the directory is owned by user `nobody` instead of `root`. I think the solution is to change your config to this:
SlurmdSpoolDir=/var/spool/slurmd
Then restart the slurmd. Make sure /var/spool/slurmd is NOT under NFS but on the local machine, and make sure it has proper execute permissions.

(In reply to Michael Hinton from comment #14)
> Could you do `mount -a` and paste the output?
Sorry, I meant just `mount`. Don't do `mount -a`.

Hi,
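For context on root_squash: it is the NFS server default, and an export must opt out of it explicitly. A hypothetical /etc/exports fragment (illustrative paths, not from this ticket) showing both behaviors:

```
# /etc/exports (illustrative only)
# Default behavior: root on the client is mapped to nobody (root_squash implied).
/export/logs    *(rw,sync)
# Explicit opt-out: root keeps its identity on the client (generally discouraged).
/export/admin   *(rw,sync,no_root_squash)
```

With the default in effect, any directory that root-owned daemons like slurmd need to write to will appear owned by nobody, which matches the symptom in this ticket.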
The issue has been resolved by changing SlurmdSpoolDir=/var/log/slurm to SlurmdSpoolDir=/var/spool/slurmd in slurm.conf.
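For anyone hitting the same symptom, the relevant slurm.conf change looks roughly like this (a sketch; the key point is that the spool directory must be on a local filesystem):

```
# slurm.conf
# Old setting: on an NFS mount with root_squash, so slurmstepd could not
# execute the batch script as root.
#SlurmdSpoolDir=/var/log/slurm
# New setting: the default local location.
SlurmdSpoolDir=/var/spool/slurmd
```

After the change, restart slurmd and confirm the directory exists on local disk with execute permission for root.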
Ok, great. I'll close this out. Feel free to reopen if needed. I would also make these recommendations, to avoid more issues in the future:
* Set StateSaveLocation=/var/spool/ctld. This location should be local, not on NFS, so that it's fast for slurmctld. It should be separate from the logging location so you don't accidentally destroy your state when clearing or rotating logs. It should also be different from SlurmdSpoolDir, since the slurmctld and slurmd share the same machine.
* Change SlurmUser to slurm. You should avoid using NFS's `nobody` user. Make sure StateSaveLocation is owned by user slurm. Then run slurmctld as user slurm and slurmd as user root.
* Change user "Guest" to "guest" to avoid any potential issues with case sensitivity.
* Avoid using NFS if it's a standalone, self-contained machine with no outside connectivity; I don't see the point.
Thanks,
-Michael
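Taken together, the recommendations above would look roughly like this in slurm.conf (a sketch only; all other settings are omitted, and the paths are the ones suggested in this thread):

```
# slurm.conf (sketch of the recommended settings)
SlurmUser=slurm                    # avoid NFS's nobody user; slurmctld runs as this user
StateSaveLocation=/var/spool/ctld  # local disk, owned by user slurm, not under NFS
SlurmdSpoolDir=/var/spool/slurmd   # local disk, separate from StateSaveLocation
```

slurmd itself still runs as root; only slurmctld drops to SlurmUser. Both directories must exist with appropriate ownership before the daemons are restarted.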
Created attachment 16870 [details]
job and error
We have a standalone box on which we installed the Slurm server and client, so that we can deliver it to the customer. We are having issues running jobs using the sbatch command as the root and grid users. Please see the attachments for more detail.