| Summary: | error: job_manager: exiting abnormally: Slurmd could not execve job | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ahmed Fathy <ahmed.moustafa> |
| Component: | slurmd | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | ramy.ghattas |
| Version: | 23.02.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | KAUST | | |
| Attachments: | conf files, slurmctld log, slurmd log | | |
Description
Ahmed Fathy
2023-09-12 10:35:13 MDT
Hi Ahmed,

Can you send a current copy of your slurm.conf and any other *.conf files you have? It would also be good to have a copy of the full slurmctld.log and slurmd.log files that show the error you're talking about. Is this happening when jobs are starting or ending? You mention that this is happening to jobs submitted to CPU nodes. Does this mean jobs going to a particular partition, or is there something else that makes these jobs unique? Is it every job of this type?

I'm just following up to see if you have had a chance to collect the information requested. For tickets marked as Severity 1 we commit to actively work on them until the issue is no longer blocking workflow. Is this something that is preventing you from using the system, or is it an issue that affects a more specific portion of the cluster? This may be more appropriate as a severity 2 or 3 issue. Please see our descriptions of severity levels here: https://www.schedmd.com/support.php#severity

Thanks,
Ben

I haven't heard a response yet, so I'm going to lower the severity of the ticket to 3. I understand that it's late where you are and you may have ended your work day. When you are ready to work on this again, please send the information requested in comment 1.

Thanks,
Ben

Created attachment 32220 [details]
conf files
I’m on leave until Tuesday 26th of September.
For urgent requests:
- send a request to the Ibex slack channel #general
(sign up at https://kaust-ibex.slack.com/)
- open a ticket by sending an email to ibex@kaust.edu.sa
-Greg
Created attachment 32221 [details]
slurmctld log
Created attachment 32222 [details]
slurmd log
Dear Ben,

Thank you for your reply. I have attached the requested logs/files. Let me try to answer some of the questions until Ahmed is back.

Is this happening when jobs are starting or ending?
>>> While the jobs are starting.

You mention that this is happening to jobs submitted to CPU nodes. Does this mean jobs going to a particular partition, or is there something else that makes these jobs unique?
>>> The CPU partition is the batch partition.

Is it every job of this type?
>>> No, not for every job. What is clear so far is that it happens when we allocate all CPUs of the node.

Regards,
Ramy

Dear Ben,

Sorry for the delay in replying, and thank you for the prompt response. We have identified the cause of these issues. What happened is that we removed CpuSpecList from the configuration of the CPU nodes, but the cgroup configuration was still preventing Slurm from using all the CPU cores.

    root@cn509-17-r: /sys/fs/cgroup/cpuset # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
    0,5-39

    root@cn509-17-r: ~ # scontrol show node cn509-17-r
    NodeName=cn509-17-r Arch=x86_64 CoresPerSocket=20
       CPUAlloc=0 CPUEfctv=40 CPUTot=40 CPULoad=6.12
       AvailableFeatures=cascadelake,cpu_intel_gold_6248,el7,ibex2019,intel,local_200G,local_400G,local_500G,local_950G,nogpu,nolmem
       ActiveFeatures=cascadelake,cpu_intel_gold_6248,el7,ibex2019,intel,local_200G,local_400G,local_500G,local_950G,nogpu,nolmem
       Gres=(null)
       NodeAddr=cn509-17-r NodeHostName=cn509-17-r Version=23.02.5
       OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
       RealMemory=382976 AllocMem=0 FreeMem=353327 Sockets=2 Boards=1
       State=IDLE+DRAIN+MAINTENANCE+RESERVED ThreadsPerCore=1 TmpDisk=0 Weight=5860 Owner=N/A MCS_label=N/A
       Partitions=ALL,batch
       BootTime=2023-09-12T11:32:53 SlurmdStartTime=2023-09-13T00:25:16
       LastBusyTime=2023-09-13T01:50:02 ResumeAfterTime=None
       CfgTRES=cpu=40,mem=374G,billing=40
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=batch job complete failure [root@2023-09-13T00:04:31]
       ReservationName=MAINT202309

Verbose slurmd logs when submitting a job that will utilize all the cores of a node:

    slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 2 c 20; hw s 2 c 20 t 1
    slurmd: debug3: task/affinity: _get_avail_map: StepId=27285811.batch core mask from slurmctld: 0xFFFFFFFFFF
    slurmd: debug3: task/affinity: _get_avail_map: StepId=27285811.batch CPU final mask for local node: 0xFFFFFFFFFF
    slurmd: task/affinity: batch_bind: job 27285811 CPU input mask for node: 0xFFFFFFFFFF
    slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks
    slurmd: task/affinity: batch_bind: job 27285811 CPU final HW mask for node: 0xFFFFFFFFFF
    slurmd: debug: Waiting for job 27285811's prolog to complete
    slurmd: debug2: prep/script: _run_subpath_command: prolog success rc:0 output:
    slurmd: debug3: _spawn_prolog_stepd: call to _forkexec_slurmstepd
    slurmd: debug3: slurmstepd rank 0 (cn512-23-l), parent rank -1 (NONE), children 0, depth 0, max_depth 0
    slurmd: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process

This issue can be resolved by rebooting the node. After rebooting:

    root@cn509-17-r: /sys/fs/cgroup/cpuset # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
    0-39

Is this the normal behavior? Do we have to reboot a node after removing CpuSpecList?

Thanks,
Ahmed
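For reference, core specialization is configured per node in slurm.conf via CpuSpecList (or CoreSpecCount). The following is only a hypothetical sketch of the change described above; the actual Ibex node definition is not included in this ticket, and the value 1-4 is merely inferred from the old cpuset mask 0,5-39:

    # Before (hypothetical): CPUs 1-4 reserved for system tasks on the CPU nodes
    NodeName=cn509-17-r Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=382976 CpuSpecList=1-4

    # After: CpuSpecList removed, so slurmctld may schedule all 40 CPUs
    NodeName=cn509-17-r Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=382976

After such a change slurmctld immediately advertises CPUTot=40/CPUEfctv=40, but, as the output above shows, the /sys/fs/cgroup/cpuset/slurm cpuset created under the old configuration apparently keeps its narrower mask until the node is rebooted, which explains the mismatch.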
Hi Ahmed,

I'm glad you were able to find the source of this error. The scenario you describe makes sense: the controller thinks all the cores are available to be scheduled, but cgroups still thinks it should be reserving some of the cores for system tasks. There are ways you could manually clean up the cgroups, but they are error prone, so rebooting the nodes is the way I would recommend. Let me know how things look after rebooting the nodes.

Thanks,
Ben

Hi Ahmed,

I haven't heard any follow-up questions about this, so I assume that things are going well after rebooting the nodes. I'll close this ticket, but let us know if there's anything else we can do to help.

Thanks,
Ben
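For completeness, the manual cleanup Ben alludes to would amount to widening the Slurm parent cpuset by hand. A rough sketch under a cgroup v1 layout, assuming the same paths shown earlier in this ticket (paths differ under cgroup v2); this is exactly the error-prone route Ben advises against, since per-job child cpusets and other controllers may also need attention, so rebooting the node remains the recommended fix:

    # As root on the affected node (cgroup v1 layout assumed).
    # Check the full CPU range the node actually has available:
    cat /sys/fs/cgroup/cpuset/cpuset.cpus            # e.g. 0-39
    # Widen the Slurm parent cpuset to that range:
    echo 0-39 > /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
    # Verify the change took effect:
    cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus      # should now read 0-39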