Ticket 13260

Summary: Upgrading slurm version 20.11.8 to 21.08.05
Product: Slurm Reporter: Praveen SV <vijayap>
Component: slurmctld Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 2 - High Impact    
Priority: ---    
Version: 21.08.5   
Hardware: Linux   
OS: Linux   
Site: Roche/PHCIX
Attachments: slurmconf
lua file

Description Praveen SV 2022-01-25 13:43:28 MST
Created attachment 23123 [details]
slurmconf

Hi Team,

We recently upgraded Slurm from version 20.11.8 to 21.08.05 and are getting errors related to wckey.

See the errors below:

Jan 25 17:53:44 spcdp-usw2-1031 slurmctld[20919]: slurmctld: fatal: Unable to process configuration file
Jan 25 17:53:44 spcdp-usw2-1031 slurmctld[20919]: error: Invalid parameter for AccountingStorageEnforce: wckey
Jan 25 17:53:44 spcdp-usw2-1031 slurmctld[20919]: error: AccountingStorageEnforce invalid: limits,qos,associations,wckey
Jan 25 17:53:44 spcdp-usw2-1031 slurmctld[20919]: fatal: Unable to process configuration file


After referring to the Slurm documentation, we see that we should specify 'wckeys' in AccountingStorageEnforce rather than 'wckey'.

Older versions accepted AccountingStorageEnforce=wckey.
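For reference, the corrected slurm.conf line (using the same parameter list that appears in the error messages above) would be:

```
AccountingStorageEnforce=limits,qos,associations,wckeys
```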


Now that issue is resolved and slurmctld is up.

However, when submitting sbatch jobs, the jobs run and work fine, but we are getting this error in slurmctld.log:
[2022-01-25T18:02:59.675] error: _set_job_field: unrecognized field: wckeys

How do we address this error?

I am attaching our slurm.conf file for your reference.


Thanks in advance
Praveen
Comment 1 Jason Booth 2022-01-25 15:40:42 MST
Would you also attach your lua job submit script? These errors look like they are originating from that plugin script.
Comment 2 Praveen SV 2022-01-26 10:09:20 MST
Created attachment 23133 [details]
lua file

Hi Jason,

Please find the lua file attached.


Thanks 
Praveen
Comment 3 Jason Booth 2022-01-26 11:11:46 MST
Praveen, the error is coming from the incorrect key 'wckeys' in your lua script. Please rename these to 'wckey'.

This error has been around for some time, as has the use of "job_desc.wckey", so I am not sure why you did not notice this sooner.

https://github.com/SchedMD/slurm/blob/master/src/plugins/job_submit/lua/job_submit_lua.c#L688

I am going to resolve this ticket for now since I do not see a bug here, but rather a typo in your job_submit script.
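For illustration only (the actual script is in attachment 23133, and the wckey value below is hypothetical), the fix in a job_submit.lua plugin would look something like this; the job_desc field is named 'wckey' (singular), and assigning to 'wckeys' triggers the "unrecognized field" error:

```lua
-- Hypothetical sketch of the fix, not the site's actual job_submit.lua.
-- This function only runs inside slurmctld via the job_submit/lua plugin.
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- job_desc.wckeys = "c5.4xlarge"  -- wrong: "_set_job_field: unrecognized field: wckeys"
    job_desc.wckey = "c5.4xlarge"      -- correct field name (example value)
    return slurm.SUCCESS
end
```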
Comment 4 Praveen SV 2022-01-27 10:07:56 MST
Hi Jason,

Yes, I did notice that. But after making that change I ran into some issues.

These are the partitions we offer on our system:
C-16Cpu-30GB          up   infinite      1 drain~ spcdp-usw2-0005
C-16Cpu-30GB          up   infinite     19  idle~ spcdp-usw2-[0004,0006-0023]
C-36Cpu-69GB          up   infinite     20  idle~ spcdp-usw2-[0024-0043]
C-72Cpu-139GB         up   infinite     40  idle~ spcdp-usw2-[0044-0083]
M-16Cpu-123GB         up   infinite     20  idle~ spcdp-usw2-[0084-0103]
M-48Cpu-371GB         up   infinite      1   mix# spcdp-usw2-0104
M-48Cpu-371GB         up   infinite     19  idle~ spcdp-usw2-[0105-0123]
M-96Cpu-742GB         up   infinite     40  idle~ spcdp-usw2-[0124-0163]
G-1GPU-8Cpu-58GB      up   infinite      1   mix# spcdp-usw2-0164
G-1GPU-8Cpu-58GB      up   infinite     39  idle~ spcdp-usw2-[0165-0203]
G-4GPU-32Cpu-235GB    up   infinite     40  idle~ spcdp-usw2-[0204-0243]
G-8GPU-64Cpu-471GB    up   infinite     40  idle~ spcdp-usw2-[0244-0283]

User permissions:

      User   Def Acct  Def WCKey     Admin
---------- ---------- ---------- ---------
  stimpelb  palladium                 None
   vijayap  palladium c5.4xlarge Administ+


But during job submission:

User vijayap is able to submit the job:
vijayap@spcdp-usw2-1031:~$ sbatch -p G-1GPU-8Cpu-58GB --gres=gpu:1 test.sh
Submitted batch job 3082

Whereas user stimpelb is not able to submit a job to the same partition:
stimpelb@spcdp-usw2-1031:~$ sbatch -p G-1GPU-8Cpu-58GB --gres=gpu:1 test.sh
sbatch: error: Batch job submission failed: Invalid wckey specification


See the error above. Both users have the same account permissions and are requesting the same partition.

slurmctld logs:

[2022-01-27T17:04:56.476] error: _copy_job_desc_to_job_record: invalid wckey 'p3.2xlarge' for user 2001786.
[2022-01-27T17:04:56.476] _slurm_rpc_submit_batch_job: Invalid wckey specification
[2022-01-27T17:05:09.366] error: _shutdown_bu_thread:send/recv spcdp-usw2-1258: Connection refused
Comment 5 Jason Booth 2022-01-27 11:03:40 MST

> sbatch: error: Batch job submission failed: Invalid wckey specification

This error means that the wckey requested by the job/user is not in the database or the user is not able to access that wckey because they are not associated with it.

Please run the following and send back the output.
> $ sacctmgr show wckey

If this output is too large then please focus on the user stimpelb and verify they have a wckey defined for their credential in slurmdbd.

For example:

> $ sacctmgr show wckey where user=stimpelb

You can also try modifying that user:


> sacctmgr mod user stimpelb set defaultwckey=c5.4xlarge

However, I suspect that the user is not associated with that wckey. In that case, you would need to add them:

> sacctmgr add user stimpelb wckey=c5.4xlarge
Comment 6 Praveen SV 2022-01-27 11:30:12 MST
Hi Jason,

Yes, that worked. But does the Slurm version upgrade alter these settings? Earlier this user had permission to submit to this partition; after the version upgrade he lost access.


Thanks
Praveen
Comment 7 Praveen SV 2022-01-31 02:04:28 MST
Hi Jason,


After running > sacctmgr add user stimpelb wckey=c5.4xlarge, the user was able to submit the job.

But does the Slurm version upgrade alter these settings? Earlier this user had permission to submit this job/partition; after the version upgrade he lost access. Can you please confirm as soon as possible?


Thanks
Praveen
Comment 8 Ben Roberts 2022-01-31 12:32:58 MST
Hi Praveen,

Jason has been tied up in other projects and asked if I would look at what's going on in this ticket.  I think I may see the chain of events that led up to where you are now.  There was a change that went into the code ahead of 21.08 that did better error checking of parameters for AccountingStorageEnforce.  https://github.com/SchedMD/slurm/commit/41135bdd2bd1dacdedaff8f50fe0d9358792c703

The AccountingStorageEnforce parameter didn't change between 20.11 and 21.08, it was always 'wckeys', but this change made it apparent that the parameter was missing an 's' if you just had 'wckey'.  Before this change parameters that weren't correct got silently ignored.  

With that change in mind, I think that you probably had 'wckey' listed in AccountingStorageEnforce and since it wasn't properly listed as 'wckeys' the wckey enforcement was not happening correctly.  With the enforcement not happening, user 'stimpelb' was able to submit jobs with a wckey even though it wasn't correctly configured for his user.  Once the code change caused you to realize that the parameter name wasn't correct and change it, then the enforcement did start happening.  At that point it was apparent that user 'stimpelb' didn't have a wckey correctly associated with his user association.  Once the issue was fixed with sacctmgr then things continued working for him.

Let me know if there's something in the chain of events that I'm missing where this doesn't make sense or if you have any questions about it.

Thanks,
Ben
Comment 9 Praveen SV 2022-02-02 06:50:34 MST
Hi Ben,

There are some additional errors popping up when submitting jobs. Can you please review and advise?

The compute nodes were assigned, but eventually all jobs went to a pending state with the error message below:


root@spcd-usw2-02191:/shared/slurm_SLURM-MASTER-USW2-HPC-UAT/etc# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               616 C-72Cpu-1 test-scr     jhui PD       0:00      1 (launch failed requeued held)
               613 G-1GPU-8C  test.sh  vijayap PD       0:00      1 (launch failed requeued held)
               618 G-8GPU-64  test.sh stimpelb PD       0:00      1 (launch failed requeued held)
               614 G-8GPU-64  test.sh  vijayap PD       0:00      1 (launch failed requeued held)
               617 M-48Cpu-3  test.sh stimpelb PD       0:00      1 (launch failed requeued held)
               615 M-96Cpu-7 test-scr     jhui PD       0:00      1 (launch failed requeued held)


slurmctld.log errors
[2022-02-02T12:46:05.616] WARNING: A line in gres.conf for GRES gpu:V100 has 4 more configured than expected in slurm.conf. Ignoring extra GRES.
[2022-02-02T12:53:11.378] _slurm_rpc_submit_batch_job: JobId=616 InitPrio=4294901756 usec=487
[2022-02-02T12:53:11.713] sched: Allocate JobId=616 NodeList=spcd-usw2-02347 #CPUs=2 Partition=C-72Cpu-139GB
[2022-02-02T13:00:45.250] job_time_limit: Configuration for JobId=616 complete
[2022-02-02T13:00:45.250] Resetting JobId=616 start time for node power up
[2022-02-02T13:00:45.263] Killing non-startable batch JobId=616: Job credential revoked
[2022-02-02T13:00:45.263] _job_complete: JobId=616 WEXITSTATUS 1
[2022-02-02T13:00:45.266] _job_complete: JobId=616 done
[2022-02-02T13:00:45.524] error: job_epilog_complete: JobId=616 epilog error on spcd-usw2-02347, draining the node
[2022-02-02T13:00:45.524] error: _slurm_rpc_epilog_complete: epilog error JobId=616 Node=spcd-usw2-02347 Err=Job epilog failed

Best Regards,
Praveen
Comment 10 Ben Roberts 2022-02-02 09:11:41 MST
Hi Praveen,

The thing that stands out the most from the logs you sent is the entry that says "Job credential revoked".  It sounds like there may be some skew between the time on the controller and the compute node.  Can I have you run the following command from your slurm controller and send the output?
ssh spcd-usw2-02347 munge -n | unmunge

I'm looking at the slurm.conf you attached to the ticket and it looks like you are using at least some AWS nodes.  Are the nodes requested by the jobs you are showing all trying to go to nodes in the cloud?  Do you have local nodes as well?  Is there a difference in behavior between the local nodes and cloud nodes?  

Thanks,
Ben
Comment 11 Praveen SV 2022-02-02 21:54:02 MST
Hi Ben,

All nodes are hosted in AWS; we are not using any local nodes.

Also, the command 'ssh spcd-usw2-02347 munge -n | unmunge' does not return anything, since that compute node has been terminated and no longer exists.

Host spcd-usw2-02191 is our Slurm controller.

I have now submitted a new job. Please find the details below.

job details

vijayap@spcd-usw2-02191:~$ sbatch -p C-16Cpu-30GB test.sh
Submitted batch job 619


root@spcd-usw2-02191:~# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               619 C-16Cpu-3  test.sh  vijayap CF       9:37      1 spcd-usw2-02307

vijayap@spcd-usw2-02191:~$ scontrol show jobid=619
JobId=619 JobName=test.sh
   UserId=vijayap(93343) GroupId=dialout(20) MCS_label=N/A
   Priority=0 Nice=0 Account=platinum QOS=gold WCKey=c5.4xlarge
   JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null)
   Requeue=1 Restarts=2 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2022-02-03T04:27:39 EligibleTime=2022-02-03T04:29:40
   AccrueTime=2022-02-03T04:29:40
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-02-03T04:29:43 Scheduler=Main
   Partition=C-16Cpu-30GB AllocNode:Sid=spcd-usw2-02191:13061
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=spcd-usw2-02307
   BatchHost=spcd-usw2-02307
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1932M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1932M MinTmpDiskNode=0
   Features=us-west-2a DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/vijayap/test.sh
   WorkDir=/home/vijayap
   StdErr=/home/vijayap/slurm-619.out
   StdIn=/dev/null
   StdOut=/home/vijayap/slurm-619.out
   Power=


-----------------------------------------
slurmctldlog

[2022-02-03T04:17:37.081] _slurm_rpc_submit_batch_job: JobId=619 InitPrio=4294901753 usec=16012
[2022-02-03T04:17:37.934] sched: Allocate JobId=619 NodeList=spcd-usw2-02306 #CPUs=2 Partition=C-16Cpu-30GB
[2022-02-03T04:27:39.682] requeue job JobId=619 due to failure of node spcd-usw2-02306
[2022-02-03T04:27:39.683] Requeuing JobId=619
[2022-02-03T04:29:43.983] sched/backfill: _start_job: Started JobId=619 in C-16Cpu-30GB on spcd-usw2-02307
[2022-02-03T04:39:50.175] job_time_limit: Configuration for JobId=619 complete
[2022-02-03T04:39:50.175] Resetting JobId=619 start time for node power up
[2022-02-03T04:39:50.193] error: Prolog launch failure, JobId=619
[2022-02-03T04:39:50.525] error: job_epilog_complete: JobId=619 epilog error on spcd-usw2-02307, draining the node
[2022-02-03T04:39:50.525] error: _slurm_rpc_epilog_complete: epilog error JobId=619 Node=spcd-usw2-02307 Err=Job epilog failed
root@spcd-usw2-02191:~#
------------------------------------------
munge check

root@spcd-usw2-02191:~# su - vijayap
vijayap@spcd-usw2-02191:~$ ssh spcd-usw2-02307 munge -n | unmunge
Warning: Permanently added 'spcd-usw2-02307,10.175.233.5' (ECDSA) to the list of known hosts.
Password:
STATUS:           Success (0)
ENCODE_HOST:      spcd-usw2-02307.aws.science.roche.com (10.175.233.5)
ENCODE_TIME:      2022-02-03 04:50:30 +0000 (1643863830)
DECODE_TIME:      2022-02-03 04:50:30 +0000 (1643863830)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              vijayap (93343)
GID:              dialout (20)
LENGTH:           0

----------------------------------------
munge status

root@spcd-usw2-02191:~# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2022-02-02 11:27:35 UTC; 16h ago
     Docs: man:munged(8)
  Process: 946 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 951 (munged)
    Tasks: 4 (limit: 4915)
   CGroup: /system.slice/munge.service
           └─951 /usr/sbin/munged

Feb 02 11:27:35 spcd-usw2-02191.aws.science.roche.com systemd[1]: Starting MUNGE authentication service...
Feb 02 11:27:35 spcd-usw2-02191.aws.science.roche.com systemd[1]: Started MUNGE authentication service.
---------------------------------


Thanks 
Praveen
Comment 12 Ben Roberts 2022-02-03 09:15:15 MST
Hi Praveen,

The thing that stands out to me is the SlurmdTimeout.  I see that you have it set to 60 seconds right now, less than the default 300 seconds.  If slurmd doesn't respond in the configured timeout then it will mark the node as down, which would also affect jobs trying to start on that node.  You can read more about this parameter in the slurm.conf documentation and see the reference to it in the ResumeProgram section in the Elastic Computing documentation:
https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
https://slurm.schedmd.com/elastic_computing.html#config

If increasing that parameter doesn't resolve the issue then I would like to ask if you could move this to a new ticket.  This ticket has gone from an upgrade question to a (related) wckey question and now moving towards a cloud node/job starting question.  Keeping different issues in separate tickets helps our reporting and makes it easier for you to go back and find answers to questions you've previously asked without having to go through a bunch of tickets with unrelated titles.

Thanks,
Ben
Comment 13 Praveen SV 2022-02-03 13:25:09 MST
Hi Ben,

I have increased the timeout to 300 seconds, but I still get the same error.

slurm.conf
SlurmdTimeout=300


squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               620 C-16Cpu-3  test.sh  vijayap PD       0:00      1 (launch failed requeued held)


log


[2022-02-03T20:01:07.491] _slurm_rpc_submit_batch_job: JobId=620 InitPrio=4294901759 usec=634
[2022-02-03T20:01:07.586] sched/backfill: _start_job: Started JobId=620 in C-16Cpu-30GB on spcd-usw2-02306
[2022-02-03T20:01:30.512] requeue job JobId=620 due to failure of node spcd-usw2-02306
[2022-02-03T20:01:30.513] Requeuing JobId=620
[2022-02-03T20:01:48.401] _slurm_rpc_requeue: Requeue of JobId=620 returned an error: Job is pending execution
[2022-02-03T20:01:48.436] sched: _update_job: setting features to us-west-2b for JobId=620
[2022-02-03T20:01:48.436] _slurm_rpc_update_job: complete JobId=620 uid=0 usec=825
[2022-02-03T20:03:37.587] sched/backfill: _start_job: Started JobId=620 in C-16Cpu-30GB on spcd-usw2-02586
[2022-02-03T20:03:59.937] requeue job JobId=620 due to failure of node spcd-usw2-02586
[2022-02-03T20:03:59.938] Requeuing JobId=620
[2022-02-03T20:04:18.088] _slurm_rpc_requeue: Requeue of JobId=620 returned an error: Job is pending execution
[2022-02-03T20:04:18.103] JobId=620 has invalid feature list: @DEFAULT_AZ@
[2022-02-03T20:04:18.103] sched: _update_job: invalid features(@DEFAULT_AZ@) for JobId=620
[2022-02-03T20:04:18.103] _slurm_rpc_update_job: JobId=620 uid=0: Invalid feature specification
[2022-02-03T20:06:07.588] sched/backfill: _start_job: Started JobId=620 in C-16Cpu-30GB on spcd-usw2-02587
[2022-02-03T20:06:29.961] requeue job JobId=620 due to failure of node spcd-usw2-02587
[2022-02-03T20:06:29.971] Requeuing JobId=620
[2022-02-03T20:06:48.136] _slurm_rpc_requeue: Requeue of JobId=620 returned an error: Job is pending execution
[2022-02-03T20:06:48.152] sched: _update_job: setting features to us-west-2a for JobId=620
[2022-02-03T20:06:48.152] _slurm_rpc_update_job: complete JobId=620 uid=0 usec=810
[2022-02-03T20:08:37.589] sched/backfill: _start_job: Started JobId=620 in C-16Cpu-30GB on spcd-usw2-02307
[2022-02-03T20:15:48.858] job_time_limit: Configuration for JobId=620 complete
[2022-02-03T20:15:48.858] Resetting JobId=620 start time for node power up
[2022-02-03T20:15:48.871] error: Prolog launch failure, JobId=620
[2022-02-03T20:15:49.056] error: job_epilog_complete: JobId=620 epilog error on spcd-usw2-02307, draining the node
[2022-02-03T20:15:49.056] error: _slurm_rpc_epilog_complete: epilog error JobId=620 Node=spcd-usw2-02307 Err=Job epilog failed


Best Regards,
Praveen
Comment 14 Praveen SV 2022-02-03 13:26:23 MST
Do we need to run scontrol reconfigure after updating this timeout in the conf file?


Thanks
Praveen
Comment 15 Ben Roberts 2022-02-03 13:38:55 MST
Yes, you would need to do a reconfigure to make it pick up that change.  You can see whether a value has been recognized with the scontrol command.  Here's an example of what it should look like.

$ scontrol show config | grep SlurmdTimeout
SlurmdTimeout           = 60 sec

$ scontrol reconfigure 

$ scontrol show config | grep SlurmdTimeout
SlurmdTimeout           = 300 sec

Thanks,
Ben
Comment 16 Praveen SV 2022-02-04 00:42:16 MST
Hi Ben,


root@spcd-usw2-02191:~# scontrol show config | grep SlurmdTimeout
SlurmdTimeout           = 300 sec
root@spcd-usw2-02191:~# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               620 C-16Cpu-3  test.sh  vijayap PD       0:00      1 (launch failed requeued held)


Log

[2022-02-03T20:01:07.491] _slurm_rpc_submit_batch_job: JobId=620 InitPrio=4294901759 usec=634
[2022-02-03T20:01:07.586] sched/backfill: _start_job: Started JobId=620 in C-16Cpu-30GB on spcd-usw2-02306
[2022-02-03T20:01:30.512] requeue job JobId=620 due to failure of node spcd-usw2-02306
[2022-02-03T20:01:30.513] Requeuing JobId=620
[2022-02-03T20:01:48.401] _slurm_rpc_requeue: Requeue of JobId=620 returned an error: Job is pending execution
[2022-02-03T20:01:48.436] sched: _update_job: setting features to us-west-2b for JobId=620
[2022-02-03T20:01:48.436] _slurm_rpc_update_job: complete JobId=620 uid=0 usec=825
[2022-02-03T20:03:37.587] sched/backfill: _start_job: Started JobId=620 in C-16Cpu-30GB on spcd-usw2-02586
[2022-02-03T20:03:59.937] requeue job JobId=620 due to failure of node spcd-usw2-02586
[2022-02-03T20:03:59.938] Requeuing JobId=620
[2022-02-03T20:04:18.088] _slurm_rpc_requeue: Requeue of JobId=620 returned an error: Job is pending execution
[2022-02-03T20:04:18.103] JobId=620 has invalid feature list: @DEFAULT_AZ@
[2022-02-03T20:04:18.103] sched: _update_job: invalid features(@DEFAULT_AZ@) for JobId=620
[2022-02-03T20:04:18.103] _slurm_rpc_update_job: JobId=620 uid=0: Invalid feature specification
[2022-02-03T20:06:07.588] sched/backfill: _start_job: Started JobId=620 in C-16Cpu-30GB on spcd-usw2-02587
[2022-02-03T20:06:29.961] requeue job JobId=620 due to failure of node spcd-usw2-02587
[2022-02-03T20:06:29.971] Requeuing JobId=620
[2022-02-03T20:06:48.136] _slurm_rpc_requeue: Requeue of JobId=620 returned an error: Job is pending execution
[2022-02-03T20:06:48.152] sched: _update_job: setting features to us-west-2a for JobId=620
[2022-02-03T20:06:48.152] _slurm_rpc_update_job: complete JobId=620 uid=0 usec=810
[2022-02-03T20:08:37.589] sched/backfill: _start_job: Started JobId=620 in C-16Cpu-30GB on spcd-usw2-02307
[2022-02-03T20:15:48.858] job_time_limit: Configuration for JobId=620 complete
[2022-02-03T20:15:48.858] Resetting JobId=620 start time for node power up
[2022-02-03T20:15:48.871] error: Prolog launch failure, JobId=620
[2022-02-03T20:15:49.056] error: job_epilog_complete: JobId=620 epilog error on spcd-usw2-02307, draining the node
[2022-02-03T20:15:49.056] error: _slurm_rpc_epilog_complete: epilog error JobId=620 Node=spcd-usw2-02307 Err=Job epilog failed
[2022-02-03T22:27:30.620] error: _shutdown_bu_thread:send/recv spcd-usw2-02133: Connection refused
[2022-02-04T06:41:10.620] error: _shutdown_bu_thread:send/recv spcd-usw2-02133: Connection refused
root@spcd-usw2-02191:~#


Best Regards,
Praveen
Comment 17 Ben Roberts 2022-02-04 08:30:23 MST
One more thing I see is that the logs show it's taking more than 300 seconds to spin up the node.  The log entries have a gap of a little more than 7 minutes:

[2022-02-03T20:08:37.589] sched/backfill: _start_job: Started JobId=620 in C-16Cpu-30GB on spcd-usw2-02307
[2022-02-03T20:15:48.858] job_time_limit: Configuration for JobId=620 complete
[2022-02-03T20:15:48.858] Resetting JobId=620 start time for node power up
[2022-02-03T20:15:48.871] error: Prolog launch failure, JobId=620

Can you try setting the SlurmdTimeout to 600 seconds and see if that extra time allows the jobs to start?  

Thanks,
Ben
Comment 18 Praveen SV 2022-02-07 06:51:05 MST
Hi Ben,

I increased it to 1500 seconds. Still no luck.

vijayap@spcd-usw2-02191:~$ scontrol show config | grep SlurmdTimeout
SlurmdTimeout           = 1500 sec


vijayap@spcd-usw2-02191:~$ sbatch -p C-16Cpu-30GB test.sh
Submitted batch job 622
vijayap@spcd-usw2-02191:~$ scontrol show jobid=622
JobId=622 JobName=test.sh
   UserId=vijayap(93343) GroupId=dialout(20) MCS_label=N/A
   Priority=4294901757 Nice=0 Account=platinum QOS=gold WCKey=c5.4xlarge
   JobState=CONFIGURING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:08:08 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2022-02-07T10:24:08 EligibleTime=2022-02-07T10:24:08
   AccrueTime=2022-02-07T10:24:08
   StartTime=2022-02-07T10:24:09 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-02-07T10:24:09 
   Scheduler=Main
   Partition=C-16Cpu-30GB AllocNode:Sid=spcd-usw2-02191:28294
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=spcd-usw2-02306
   BatchHost=spcd-usw2-02306
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=3864M,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1932M MinTmpDiskNode=0
   Features=us-west-2a DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/vijayap/test.sh
   WorkDir=/home/vijayap
   StdErr=/home/vijayap/slurm-622.out
   StdIn=/dev/null
   StdOut=/home/vijayap/slurm-622.out
   Power=


slurmctld.log

[2022-02-07T10:24:08.282] _slurm_rpc_submit_batch_job: JobId=622 InitPrio=4294901757 usec=14416
[2022-02-07T10:24:09.114] sched: Allocate JobId=622 NodeList=spcd-usw2-02306 #CPUs=2 Partition=C-16Cpu-30GB
[2022-02-07T10:34:10.689] requeue job JobId=622 due to failure of node spcd-usw2-02306
[2022-02-07T10:34:10.707] Requeuing JobId=622
[2022-02-07T10:36:24.388] sched/backfill: _start_job: Started JobId=622 in C-16Cpu-30GB on spcd-usw2-02306
[2022-02-07T10:46:37.469] job_time_limit: Configuration for JobId=622 complete
[2022-02-07T10:46:37.469] Resetting JobId=622 start time for node power up
[2022-02-07T10:46:37.484] error: Prolog launch failure, JobId=622
[2022-02-07T10:46:37.713] error: job_epilog_complete: JobId=622 epilog error on spcd-usw2-02306, draining the node
[2022-02-07T10:46:37.713] error: _slurm_rpc_epilog_complete: epilog error JobId=622 Node=spcd-usw2-02306 Err=Job epilog failed


Best Regards,
Praveen
Comment 19 Ben Roberts 2022-02-07 10:37:25 MST
Hi Praveen,

Thanks for trying the increased timeout value.  Since the issue isn't as simple as a timeout, I do think it's time to move it to a separate ticket, as I mentioned in comment 12.  You can reference this ticket in the new one.  One additional thing worth including that we haven't gotten to in this ticket yet is the log file used by your suspend and resume programs.

When I see that you've logged a new ticket for this I'll go ahead and close this one.

Thanks,
Ben
Comment 20 Praveen SV 2022-02-07 11:53:06 MST
Created a new ticket 13350
Comment 21 Ben Roberts 2022-02-07 13:11:31 MST
Thank you, it looks like that ticket has had a response sent and been assigned to an engineer.  I'll go ahead and close this ticket.  

Thanks,
Ben
Comment 22 Jason Booth 2022-02-09 09:26:35 MST
We wanted to let you know that we will be resolving bug#13350 and re-opening this issue, bug#13260. We want to keep the two issues separate for our sanity as well as our ticket bookkeeping.

bug#13350 comment#21
Comment 23 Praveen SV 2022-02-09 09:48:14 MST
Hi Ben,

How does scontrol reconfigure change our way of enforcing WCKey?

It was working fine even after the version upgrade, but after running scontrol reconfigure we are running into these issues.

Also, we did not change any wckey settings between the previous version and the current one. In the previous version, even if we ran scontrol reconfigure, it still allowed users to submit to the required partition.

Best Regards,
Praveen
Comment 24 Ben Roberts 2022-02-09 09:57:51 MST
Hi Praveen,

It looks like you are running into another case where the user didn't have access to a wckey that they were requesting in a job script (r5.24xlarge).  Once you added the wckey for that user they were able to submit this job.

It sounds like you may want the previous behavior, where you didn't have to specifically add each user to a wckey for them to use it.  If that is the case, you can remove 'wckeys' from the AccountingStorageEnforce parameter in your slurm.conf.  Be aware that by removing this, the enforcement of wckeys won't happen properly anymore.

If you do still want to enforce wckeys then it sounds like there will be a period of time where you find the cases where users are missing the wckeys and you have to add them (or you can proactively add wckeys for the appropriate users), but once this time is up and the wckeys are added properly then you shouldn't have to continue to add keys all the time.  The exception being when new users are added to the system.

I was typing up that response initially, but I see your recent update now too.  It sounds like you're saying that this user was able to submit a job requesting the r5.24xlarge wckey after the upgrade until you did a 'scontrol reconfigure'.  Is that right?  Can I have you send sacct information about that job if that's the case?  If you have the job id you can get that information like this, replacing <job_id> with the correct job id:
sacct -j <job_id> --format=jobid,submit,end,wckey

Thanks,
Ben
Comment 25 Praveen SV 2022-02-09 13:11:37 MST
Hi Ben,

In the earlier version, 20.11.8, we had wckey enforcement configured like this in slurm.conf:

AccountingStorageEnforce=limits,qos,associations,wckey

Since it should be 'wckeys' and not 'wckey', and the older version silently ignored the invalid entry, users had no wckey restriction.


Now, after the version upgrade, we had the same enforcement in slurm.conf:
AccountingStorageEnforce=limits,qos,associations,wckey

But after restarting the slurmctld service, it threw errors like the ones below:
slurmctld[20919]: error: AccountingStorageEnforce invalid: limits,qos,associations,wckey
slurmctld[20919]: fatal: Unable to process configuration file

So in the latest version this error is not ignored.


After updating the setting to AccountingStorageEnforce=limits,qos,associations,wckeys, the service came up, and only now is the enforcement really working.

So my understanding is that in the older version the wckey was ignored and users had privileges to access all partitions.

Now, in the new version, it recognizes 'wckeys' and works properly, meaning we need to map a wckey to each user based on their requests.

Is my understanding correct?


Regards,
praveen
Comment 26 Ben Roberts 2022-02-09 13:47:25 MST
Yes, your understanding is correct with one minor correction to your statement.  You probably mean the right thing and just typed 'partitions' instead of 'wckeys'.  You say:
  So my understanding is in the older version the wckey was
  ignored and users had privileges to access all partitions.

It isn't that they had access to all partitions, but the issue was that they could request any WCKey without having been explicitly authorized to use that WCKey.  So, the requested WCKeys would still have been added to the jobs, but users could just request any available WCKey.  Now that it is being enforced, Slurm is checking that they have access to the WCKey they are requesting at submit time.  If they don't then they get an error.  Once they are given access to this WCKey then they can submit using that key without an error.

Thanks,
Ben
Comment 27 Ben Roberts 2022-02-16 11:14:04 MST
Hi Praveen,

We've gone over what happened with WCKeys in your environment with the upgrade and it sounds from your last comment like you understand.  I haven't heard any follow up questions so I'll go ahead and close this ticket.

Thanks,
Ben