Ticket 17670

Summary: error: job_manager: exiting abnormally: Slurmd could not execve job
Product: Slurm Reporter: Ahmed Fathy <ahmed.moustafa>
Component: slurmd Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: ramy.ghattas
Version: 23.02.5   
Hardware: Linux   
OS: Linux   
Site: KAUST Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: conf files
slurmctld log
slurmd log

Description Ahmed Fathy 2023-09-12 10:35:13 MDT
Dears,

We recently upgraded to Slurm 23.02.5, and since then, we've been encountering an issue with most of the jobs submitted to CPU nodes. The nodes are getting drained with the reason "batch job complete failure," and the jobs are getting requeued with the error "launch failed, requeued, held."


Slurmctld Logs:

[2023-09-12T17:31:57.303] error: slurmd error running JobId=27283748 on node(s)=cn605-24-r: Slurmd could not execve job
[2023-09-12T17:31:57.303] drain_nodes: node cn605-24-r state set to DRAIN

Slurmd Log:

[2023-09-12T17:41:04.204] [27283868.batch] private-tmpdir: removed /local/tmp/27283868.batch.0 (6 files) in 0.000321 seconds
[2023-09-12T17:41:04.205] [27283868.batch] error: job_manager: exiting abnormally: Slurmd could not execve job
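
For context, this is roughly how we are surveying the affected nodes and jobs (a sketch; the output formats may need adjusting):

# list drained nodes with the recorded reason
sinfo -R -o "%N %E"
# list pending jobs that were requeued and are now held after the launch failure
squeue -t PD -o "%i %j %r" | grep -i "launch failed"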


Please inform me if you require any additional logs or information to assist in diagnosing and resolving this issue.

Thanks,
Ahmed
Comment 1 Ben Roberts 2023-09-12 10:54:14 MDT
Hi Ahmed,

Can you send a current copy of your slurm.conf and any other *.conf files you have?  It would also be good to have a copy of the full slurmctld.log and slurmd.log files that show the error you're talking about.  Is this happening when jobs are starting or ending?  You mention that this is happening to jobs submitted to CPU nodes.  Does this mean jobs going to a particular partition, or is there something else that makes these jobs unique?  Is it every job of this type?
Comment 2 Ben Roberts 2023-09-12 12:16:04 MDT
I'm just following up to see if you have had a chance to collect the information requested.  For tickets marked as Severity 1 we commit to actively work on them until the issue is no longer blocking workflow.  Is this something that is preventing you from using the system or is it an issue that affects a more specific portion of the cluster?  This may be more appropriate as a severity 2 or 3 issue.  Please see our descriptions of severity levels here:
https://www.schedmd.com/support.php#severity

Thanks,
Ben
Comment 3 Ben Roberts 2023-09-12 13:29:29 MDT
I haven't heard a response yet, so I'm going to lower the severity of the ticket to 3.  I understand that it's late where you are and you may have ended your work day.  When you are ready to work on this again please send the information requested in comment 1.

Thanks,
Ben
Comment 4 Ramy Adly 2023-09-12 16:48:15 MDT
Created attachment 32220 [details]
conf files
Comment 5 Greg Wickham 2023-09-12 16:48:28 MDT
I’m on leave until Tuesday 26th of September.

For urgent requests:

   - send a request to the Ibex slack channel #general
     (sign up at https://kaust-ibex.slack.com/)

   - open a ticket by sending an email to ibex@kaust.edu.sa

 -Greg
Comment 6 Ramy Adly 2023-09-12 16:48:42 MDT
Created attachment 32221 [details]
slurmctld log
Comment 7 Ramy Adly 2023-09-12 16:48:58 MDT
Created attachment 32222 [details]
slurmd log
Comment 8 Ramy Adly 2023-09-12 16:49:50 MDT
Dear Ben,

Thank you for your reply.

I have attached the requested logs and conf files.
Let me try to answer some of the questions until Ahmed is back.

Is this happening when jobs are starting or ending?  
>>> while the jobs are starting

You mention that this is happening to jobs submitted to CPU nodes.  Does this mean jobs going to a particular partition, or is there something else that makes these jobs unique? 
>>> The CPU partition is the batch partition

Is it every job of this type?
>>> No, not every job. What is clear so far is that it happens when we allocate all of the CPUs of a node (a minimal reproduction sketch is below).
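
A sketch of the kind of submission that triggers it (hypothetical script name; assumes our 40-core batch nodes):

# request every core of one batch-partition node (hypothetical job script)
sbatch --partition=batch --nodes=1 --ntasks=40 --exclusive full_node_test.sh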


Regards,
Ramy
Comment 9 Ahmed Fathy 2023-09-12 23:07:35 MDT
Dear Ben,

Sorry for the delay in replying, and thank you for the prompt response. We have identified the cause of the issue: we removed CpuSpecList from the configuration of the CPU nodes, but the cgroup cpuset was still preventing Slurm from using all of the CPU cores.

root@cn509-17-r: /sys/fs/cgroup/cpuset # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
0,5-39

root@cn509-17-r: ~ # scontrol show node cn509-17-r
NodeName=cn509-17-r Arch=x86_64 CoresPerSocket=20 
   CPUAlloc=0 CPUEfctv=40 CPUTot=40 CPULoad=6.12
   AvailableFeatures=cascadelake,cpu_intel_gold_6248,el7,ibex2019,intel,local_200G,local_400G,local_500G,local_950G,nogpu,nolmem
   ActiveFeatures=cascadelake,cpu_intel_gold_6248,el7,ibex2019,intel,local_200G,local_400G,local_500G,local_950G,nogpu,nolmem
   Gres=(null)
   NodeAddr=cn509-17-r NodeHostName=cn509-17-r Version=23.02.5
   OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 
   RealMemory=382976 AllocMem=0 FreeMem=353327 Sockets=2 Boards=1
   State=IDLE+DRAIN+MAINTENANCE+RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=5860 Owner=N/A MCS_label=N/A
   Partitions=ALL,batch 
   BootTime=2023-09-12T11:32:53 SlurmdStartTime=2023-09-13T00:25:16
   LastBusyTime=2023-09-13T01:50:02 ResumeAfterTime=None
   CfgTRES=cpu=40,mem=374G,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=batch job complete failure [root@2023-09-13T00:04:31]
   ReservationName=MAINT202309
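
To check which other nodes are in the same state, a rough per-node check could look like this (a sketch only; the path assumes cgroup v1 with the slurm cpuset shown above):

# compare the cgroup cpuset with the CPU count slurmctld expects (run on the node)
cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
scontrol show node $(hostname -s) | grep -o 'CPUTot=[0-9]*'
# if the cpuset range does not cover 0 through CPUTot-1, the node is affected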


Verbose slurmd logs when submitting a job that will utilize all the cores of a node:

slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 2 c 20; hw s 2 c 20 t 1
slurmd: debug3: task/affinity: _get_avail_map: StepId=27285811.batch core mask from slurmctld: 0xFFFFFFFFFF
slurmd: debug3: task/affinity: _get_avail_map: StepId=27285811.batch CPU final mask for local node: 0xFFFFFFFFFF
slurmd: task/affinity: batch_bind: job 27285811 CPU input mask for node: 0xFFFFFFFFFF
slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks
slurmd: task/affinity: batch_bind: job 27285811 CPU final HW mask for node: 0xFFFFFFFFFF
slurmd: debug:  Waiting for job 27285811's prolog to complete
slurmd: debug2: prep/script: _run_subpath_command: prolog success rc:0 output:
slurmd: debug3: _spawn_prolog_stepd: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (cn512-23-l), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process
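
(For reference, the debug3 output above was collected by raising slurmd verbosity on the affected node; a sketch of the usual approach:)

# stop the service and run slurmd in the foreground with enough -v flags to reach debug3
systemctl stop slurmd
slurmd -D -vvvv
# alternatively, set SlurmdDebug=debug3 in slurm.conf and run: scontrol reconfigure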


This issue can be resolved by rebooting the node.
After rebooting:

root@cn509-17-r: /sys/fs/cgroup/cpuset # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
0-39
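
Rather than rebooting by hand, the reboot could also be driven through Slurm (a sketch; the reason string and node name are just examples):

# ask Slurm to reboot the affected node and return it to service afterwards
scontrol reboot ASAP nextstate=RESUME reason="clear stale cpuset after CpuSpecList removal" cn509-17-r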


Is this the normal behavior? Do we have to reboot a node after removing CpuSpecList?

Thanks,
Ahmed
Comment 10 Ben Roberts 2023-09-13 08:27:48 MDT
Hi Ahmed,

I'm glad you were able to find the source of this error.  The scenario you describe makes sense: the controller thinks all of the cores are available to be scheduled, but the cgroup hierarchy still thinks it should reserve some of the cores for system tasks.  There are ways you could manually clean up the cgroups, but they are error prone, so rebooting the nodes is what I would recommend.  Let me know how things look after rebooting the nodes.
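
(For completeness, the manual cleanup would amount to something like the following on each affected node; this is only a sketch under the cgroup v1 layout shown above and carries the risks just mentioned, so the reboot remains the safer path:)

# widen the slurm cpuset back to all cores (0-39 on these 40-core nodes), then restart slurmd
echo 0-39 > /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
systemctl restart slurmd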

Thanks,
Ben
Comment 12 Ben Roberts 2023-09-29 11:10:30 MDT
Hi Ahmed,

I haven't heard any follow-up questions about this, so I assume that things are going well after rebooting the nodes.  I'll close this ticket, but let us know if there's anything else we can do to help.

Thanks,
Ben