Ticket 10275 - Job runs but never completes after 20.11 upgrade
Summary: Job runs but never completes after 20.11 upgrade
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.0
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Duplicates: 10259 10344
Depends on:
Blocks:
 
Reported: 2020-11-23 17:01 MST by lhuang
Modified: 2020-12-17 04:43 MST

See Also:
Site: NY Genome
Version Fixed: 20.11.1


Attachments
slurm.conf (5.51 KB, text/plain)
2020-11-23 17:01 MST, lhuang
Details
slurmd log from hpc node (291.88 KB, application/x-gzip)
2020-11-25 13:12 MST, Chris Black
Details
slurmdbd log (14.80 KB, application/x-gzip)
2020-11-25 13:13 MST, Chris Black
Details
slurmctld log (237 bytes, application/x-gzip)
2020-11-25 13:13 MST, Chris Black
Details

Description lhuang 2020-11-23 17:01:13 MST
Created attachment 16794 [details]
slurm.conf

After upgrading our slurmctld to 20.11, submitted jobs complete on the nodes, but their completion is not registered on slurmctld.

These are the errors shown in the slurmctld log:
[2020-11-23T18:51:07.992] error: step_partial_comp: batch step received for JobId=8786771. This should never happen.
[2020-11-23T18:52:24.017] error: step_partial_comp: batch step received for JobId=8786772. This should never happen.
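
To gauge how many jobs are affected, the controller log can simply be grepped for this error; for example (hostname and log path here are illustrative, adjust to your SlurmctldLogFile):

[root@slurmctld ~]# grep -c 'step_partial_comp: batch step received' /var/log/slurmctld.log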


This is what we see on the slurmd clients:

[root@pe2cc3-003 ~]# grep 8786772 /var/log/slurmd.log
[2020-11-23T18:52:22.780] _run_prolog: prolog with lock for job 8786772 ran for 0 seconds
[2020-11-23T18:52:22.829] [8786772.extern] task/cgroup: /slurm/uid_20116/job_8786772: alloc=125MB mem.limit=125MB memsw.limit=unlimited
[2020-11-23T18:52:22.836] [8786772.extern] task/cgroup: /slurm/uid_20116/job_8786772/step_extern: alloc=125MB mem.limit=125MB memsw.limit=unlimited
[2020-11-23T18:52:23.875] task_p_slurmd_batch_request: 8786772
[2020-11-23T18:52:23.875] task/affinity: job 8786772 CPU input mask for node: 0x00000000003000
[2020-11-23T18:52:23.875] task/affinity: job 8786772 CPU final HW mask for node: 0x00000400000040
[2020-11-23T18:52:23.876] Launching batch job 8786772 for UID 20116
[2020-11-23T18:52:23.901] [8786772.4294967291] task/cgroup: /slurm/uid_20116/job_8786772: alloc=125MB mem.limit=125MB memsw.limit=unlimited
[2020-11-23T18:52:23.910] [8786772.4294967291] task/cgroup: /slurm/uid_20116/job_8786772/step_4294967291: alloc=125MB mem.limit=125MB memsw.limit=unlimited
[2020-11-23T18:52:23.983] [8786772.4294967291] task_p_pre_launch: Using sched_affinity for tasks
[2020-11-23T18:52:24.019] [8786772.4294967291] done with job
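
For reference, the step ID 4294967291 in these lines is 0xfffffffb, Slurm's internal SLURM_BATCH_SCRIPT marker for the batch step (newer releases print it as .batch):

$ printf '0x%x\n' 4294967291   # -> 0xfffffffb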
Comment 1 lhuang 2020-11-23 17:02:15 MST
Just to clarify: we only upgraded slurmctld and slurmdbd to 20.11. The compute nodes are still on 19.05.3-2.
Comment 5 Jason Booth 2020-11-24 08:48:28 MST
Just giving you an update: we are looking into this and will let you know what we find.
Comment 8 Marshall Garey 2020-11-24 09:20:39 MST
*** Ticket 10259 has been marked as a duplicate of this ticket. ***
Comment 9 Dominik Bartkiewicz 2020-11-24 09:41:56 MST
Hi

This patch should resolve the issue:
https://github.com/SchedMD/slurm/commit/aaa219f75

It will be included in 20.11.1.
Could you apply this patch and check whether it helps?

Dominik
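
For anyone applying the fix by hand: GitHub serves a plain diff for any commit if .patch is appended to its URL, so patching a 20.11.0 source tree looks roughly like this (a sketch only; build and install steps vary by site):

$ cd slurm-20.11.0
$ curl -sL https://github.com/SchedMD/slurm/commit/aaa219f75.patch | patch -p1
$ make && make install   # rebuild, then restart slurmctld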
Comment 10 lhuang 2020-11-24 10:33:41 MST
Can confirm the patch resolved the issue.
Comment 11 lhuang 2020-11-24 10:58:52 MST
Just noticed it didn't fix all of the jobs. Some jobs are still not reporting that they have completed.


Here is an example job

[lhuang@pe2-login01 slurm]$ scontrol show job 8787106_9
JobId=8787113 ArrayJobId=8787106 ArrayTaskId=9 JobName=shift_positions_back.sh
   UserId=mbyrska-bishop(20158) GroupId=compbio(9043) MCS_label=N/A
   Priority=408276 Nice=0 Account=compbio QOS=compbio
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=02:21:09 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2020-11-24T10:33:58 EligibleTime=2020-11-24T10:33:58
   AccrueTime=Unknown
   StartTime=2020-11-24T10:33:58 EndTime=2020-12-01T10:33:58 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-24T10:33:58
   Partition=pe2 AllocNode:Sid=mbyrska-vm.nygenome.org:116650
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=pe2cc3-014
   BatchHost=pe2cc3-014
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=20G,node=1,billing=6
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=6 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs/commons/groups/compbio/projects/1KGP_3202_SNV_INDEL_phasing/shapeit_no_duohmm/shift_positions_back.sh shift_positions_back.manifest
   WorkDir=/gpfs/commons/groups/compbio/projects/1KGP_3202_SNV_INDEL_phasing/shapeit_no_duohmm
   StdErr=/gpfs/commons/groups/compbio/projects/1KGP_3202_SNV_INDEL_phasing/shapeit_no_duohmm/logs/shift_positions_back.sh.8787106.9
   StdIn=/dev/null
   StdOut=/gpfs/commons/groups/compbio/projects/1KGP_3202_SNV_INDEL_phasing/shapeit_no_duohmm/logs/shift_positions_back.sh.8787106.9
   Power=



[root@pe2cc3-014 ~]# zgrep 8787113 /var/log/slurmd.log
[2020-11-24T10:33:59.797] _run_prolog: prolog with lock for job 8787113 ran for 0 seconds
[2020-11-24T10:33:59.835] [8787113.extern] task/cgroup: /slurm/uid_20158/job_8787113: alloc=20480MB mem.limit=20480MB memsw.limit=unlimited
[2020-11-24T10:33:59.844] [8787113.extern] task/cgroup: /slurm/uid_20158/job_8787113/step_extern: alloc=20480MB mem.limit=20480MB memsw.limit=unlimited
[2020-11-24T10:33:59.942] task_p_slurmd_batch_request: 8787113
[2020-11-24T10:33:59.942] task/affinity: job 8787113 CPU input mask for node: 0x0000000000FC00
[2020-11-24T10:33:59.942] task/affinity: job 8787113 CPU final HW mask for node: 0x00000E000000E0
[2020-11-24T10:33:59.942] Launching batch job 8787113 for UID 20158
[2020-11-24T10:33:59.967] [8787113.4294967291] task/cgroup: /slurm/uid_20158/job_8787113: alloc=20480MB mem.limit=20480MB memsw.limit=unlimited
[2020-11-24T10:33:59.972] [8787113.4294967291] task/cgroup: /slurm/uid_20158/job_8787113/step_4294967291: alloc=20480MB mem.limit=20480MB memsw.limit=unlimited
[2020-11-24T10:34:00.038] [8787113.4294967291] task_p_pre_launch: Using sched_affinity for tasks
[2020-11-24T12:53:49.949] [8787113.4294967291] done with job


From the slurmctld logs
[2020-11-24T12:53:49.941] error: step_partial_comp: batch step received for JobId=8787106_9(8787113). This should never happen.
[2020-11-24T12:57:22.979] error: step_partial_comp: batch step received for JobId=8787106_10(8787114). This should never happen.
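
Cross-checking one of these jobs against accounting can confirm whether the completion was ever recorded; for example (output omitted):

[lhuang@pe2-login01 slurm]$ sacct -j 8787113 --format=JobID,State,Start,End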
Comment 12 Dominik Bartkiewicz 2020-11-25 03:02:05 MST
Hi

Could you send me the full slurmd.log?

Dominik
Comment 13 Chris Black 2020-11-25 13:12:36 MST
Created attachment 16836 [details]
slurmd log from hpc node
Comment 14 Chris Black 2020-11-25 13:13:18 MST
Created attachment 16837 [details]
slurmdbd log
Comment 15 Chris Black 2020-11-25 13:13:42 MST
Created attachment 16838 [details]
slurmctld log
Comment 16 Kevin Buckley 2020-12-03 00:24:43 MST
Just noted that these lines appear in the slurmd log:

[2020-12-03T10:22:52.410] debug3: Trying to load plugin /opt/slurm/20.11.0/lib64/slurm/gres_craynetwork.so
[2020-12-03T10:22:52.410] debug4: /opt/slurm/20.11.0/lib64/slurm/gres_craynetwork.so: Does not exist or not a regular file.
[2020-12-03T10:22:52.410] debug:  gres: Couldn't find the specified plugin name for gres/craynetwork looking at all files
[2020-12-03T10:22:52.411] debug:  Cannot find plugin of type gres/craynetwork, just track gres counts
[2020-12-03T10:22:52.411] debug:  Plugin of type gres/craynetwork only tracks gres counts

Did something go away?
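
Per the messages themselves, slurmd just falls back to count-only tracking when a configured GRES type has no matching plugin. Checking whether craynetwork is actually listed in GresTypes narrows it down:

$ scontrol show config | grep -i gres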
Comment 17 lhuang 2020-12-03 08:15:00 MST
We don’t use Cray here.

This isn’t in the logs that we sent.

From: "bugs@schedmd.com" <bugs@schedmd.com>
Date: Wednesday, December 2, 2020 at 11:24 PM
To: Luis Huang <lhuang@NYGENOME.ORG>
Subject: [Bug 10275] Job runs but never complete after 20.11 upgrade

Kevin Buckley<mailto:kevin.buckley@pawsey.org.au> changed bug 10275<https://urldefense.com/v3/__https:/bugs.schedmd.com/show_bug.cgi?id=10275__;!!C6sPl7C9qQ!HfZKFkG7AarRJIp8j3ryUrqUkRgCxQJ8fnUoBkkC1pN7xCr-IVV6zkNtJMe7qys$>
What

Removed

Added

CC



kevin.buckley@pawsey.org.au

Comment # 16<https://urldefense.com/v3/__https:/bugs.schedmd.com/show_bug.cgi?id=10275*c16__;Iw!!C6sPl7C9qQ!HfZKFkG7AarRJIp8j3ryUrqUkRgCxQJ8fnUoBkkC1pN7xCr-IVV6zkNtXQkmbfI$> on bug 10275<https://urldefense.com/v3/__https:/bugs.schedmd.com/show_bug.cgi?id=10275__;!!C6sPl7C9qQ!HfZKFkG7AarRJIp8j3ryUrqUkRgCxQJ8fnUoBkkC1pN7xCr-IVV6zkNtJMe7qys$> from Kevin Buckley<mailto:kevin.buckley@pawsey.org.au>

Just noted that these lines appear in the slurmd log:



[2020-12-03T10:22:52.410] debug3: Trying to load plugin

/opt/slurm/20.11.0/lib64/slurm/gres_craynetwork.so

[2020-12-03T10:22:52.410] debug4:

/opt/slurm/20.11.0/lib64/slurm/gres_craynetwork.so: Does not exist or not a

regular file.

[2020-12-03T10:22:52.410] debug:  gres: Couldn't find the specified plugin name

for gres/craynetwork looking at all files

[2020-12-03T10:22:52.411] debug:  Cannot find plugin of type gres/craynetwork,

just track gres counts

[2020-12-03T10:22:52.411] debug:  Plugin of type gres/craynetwork only tracks

gres counts



Did something go away?

________________________________
You are receiving this mail because:

  *   You reported the bug.

________________________________
This message is for the recipient’s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email.
Comment 18 Jason Booth 2020-12-03 10:21:16 MST
*** Ticket 10344 has been marked as a duplicate of this ticket. ***
Comment 19 Dominik Bartkiewicz 2020-12-07 07:25:23 MST
Hi

Are you still running slurmd at version 19.05 anywhere?
We still can't find any code path that would produce this behavior with this configuration:
slurmctld -- 20.11 + patch
slurmd -- 19.05
slurmstepd -- 19.05

Could you check on the nodes whether the slurmd and slurmstepd versions match?

Dominik
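
For reference, one way to check: the slurmd binary reports its version directly, and a long-running slurmstepd keeps running the binary it was launched from, so inspecting its /proc entry shows whether that binary has since been replaced (a sketch; pgrep and /proc are assumed available on the nodes):

[root@pe2cc3-014 ~]# slurmd -V
[root@pe2cc3-014 ~]# ls -l /proc/$(pgrep -o slurmstepd)/exe   # shows '(deleted)' if the binary was replaced after the stepd started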
Comment 20 lhuang 2020-12-07 07:25:30 MST
I'm out of office and will have limited access to the internet. Please email linuxhelp@nygenome.org for any urgent issues.

Comment 21 lhuang 2020-12-07 09:41:08 MST
We no longer have any slurmd running an older version. Everything is up to date now and we no longer see this issue.
Comment 27 Dominik Bartkiewicz 2020-12-17 04:43:23 MST
Hi

We still have some transition problems in 20.11, and we are working on them in internal bug 10467.
If this is no longer a problem at your site, I'll go ahead and close this ticket. Feel free to reopen if needed.

Dominik