Ticket 8657 - using sbatch to submit a job, got 'NonZeroExitCode' error
Summary: using sbatch to submit a job, got 'NonZeroExitCode' error
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 19.05.5
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jess
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-03-11 01:27 MDT by ethan wong
Modified: 2020-03-11 13:40 MDT

See Also:
Site: HPE
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: SL-T5-IRONNINJA
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description ethan wong 2020-03-11 01:27:34 MDT
Hello,

As the root user, sbatch jobs run fine, but as another user I get an error.


As root:

[root@manager ~]# sacctmgr add user wsb1 Account=root

I created a user wsb1 under the root account, switched to wsb1, and used sbatch to submit a job, which failed with a 'NonZeroExitCode' error.

[wsb1@manager slurm_test]$ sbatch slurm_job.sh

JobId=23 JobName=test
   UserId=wsb1(1002) GroupId=wsb1(1002) MCS_label=N/A
   Priority=4294901741 Nice=0 Account=root QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-03-11T07:03:19 EligibleTime=2020-03-11T07:03:19
   AccrueTime=2020-03-11T07:03:19
   StartTime=2020-03-11T07:03:20 EndTime=2020-03-11T07:03:20 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-11T07:03:20
   Partition=c4m6 AllocNode:Sid=manager:13244
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cm-wsb-c4m8d200-[4,6]
   BatchHost=cm-wsb-c4m8d200-4
   NumNodes=2 NumCPUs=4 NumTasks=0 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,node=2,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/slurm_test/slurm_job.sh
   WorkDir=/home/slurm_test
   StdErr=/home/slurm_test/slurm.sh.out
   StdIn=/dev/null
   StdOut=/home/slurm_test/slurm.sh.out
   Power=
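The ExitCode=1:0 above means the batch script itself returned exit status 1 (the value after the colon is the terminating signal, here none). As a minimal sketch, assuming job id 23 as shown above, the per-step exit codes recorded in accounting can help narrow down where it failed:

[wsb1@manager slurm_test]$ sacct -j 23 --format=JobID,JobName,State,ExitCode,NodeList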

Here is slurm_job.sh.
#!/usr/bin/env bash
 
#SBATCH -J test
#SBATCH -o slurm.sh.out
#SBATCH -N 2
#SBATCH --partition=c4m6
#SBATCH --cpus-per-task=2

echo "In the directory: `pwd`"
echo "As the user: `whoami`"
echo "write this is a file"
sleep 20
echo "finished 100" > analysis.output


But as wsb1, running a job through srun works fine.

[wsb1@manager slurm_test]$ srun -N1 -n4 singularity exec pytorch_cpu.sif ./MNIST_test1.py

[wsb1@manager slurm_test]$ scontrol show job
JobId=22 JobName=singularity
   UserId=wsb1(1002) GroupId=wsb1(1002) MCS_label=N/A
   Priority=4294901742 Nice=0 Account=root QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:08:44 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-03-11T06:43:45 EligibleTime=2020-03-11T06:43:45
   AccrueTime=Unknown
   StartTime=2020-03-11T06:43:45 EndTime=2020-03-11T06:52:29 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-11T06:43:45
   Partition=c8m14 AllocNode:Sid=manager:13244
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cm-wsb-c8m16d200-1
   BatchHost=cm-wsb-c8m16d200-1
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=singularity
   WorkDir=/home/slurm_test
   Power=


Here are some of my settings:
[root@manager slurm_test]# sacctmgr list account
   Account                Descr                  Org
---------- -------------------- --------------------
      root default root account                 root

[root@manager slurm_test]# sacctmgr list user
      User   Def Acct     Admin
---------- ---------- ---------
      root       root Administ+
      wsb1       root      None
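
With AccountingStorageEnforce=associations set (see slurm.conf below), wsb1 needs a valid association in the database to submit jobs. A quick way to confirm it exists (a sketch; the format fields are just a readable subset):

[root@manager slurm_test]# sacctmgr show associations user=wsb1 format=Cluster,Account,User,Partition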


in slurm.conf
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=associations
AccountingStorageHost=manager
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
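
To double-check what the running slurmctld actually has for these accounting options (a minimal sketch, assuming the controller runs on manager as configured):

[root@manager ~]# scontrol show config | grep -i AccountingStorage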

in slurmdbd.conf
# slurmDBD info
DbdAddr=manager
DbdHost=manager
DbdPort=6819
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=verbose
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
StorageHost=manager
StoragePort=3306
StoragePass=!QAZ2wsx3edc
StorageUser=slurm
StorageLoc=slurm_acct_db
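
With slurmdbd on manager as configured above, a quick sanity check that the daemons can reach the accounting database (a sketch using standard commands; the log path is the LogFile set above):

[root@manager ~]# sacctmgr show cluster
[root@manager ~]# tail -n 50 /var/log/slurm/slurmdbd.log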