Created attachment 16794 [details]
slurm.conf

After upgrading our slurmctld to 20.11, submitted jobs complete on the compute nodes, but the completion is not registering on slurmctld.

This is the error shown on slurmctld:

[2020-11-23T18:51:07.992] error: step_partial_comp: batch step received for JobId=8786771. This should never happen.
[2020-11-23T18:52:24.017] error: step_partial_comp: batch step received for JobId=8786772. This should never happen.

This is on the slurmd clients:

[root@pe2cc3-003 ~]# grep 8786772 /var/log/slurmd.log
[2020-11-23T18:52:22.780] _run_prolog: prolog with lock for job 8786772 ran for 0 seconds
[2020-11-23T18:52:22.829] [8786772.extern] task/cgroup: /slurm/uid_20116/job_8786772: alloc=125MB mem.limit=125MB memsw.limit=unlimited
[2020-11-23T18:52:22.836] [8786772.extern] task/cgroup: /slurm/uid_20116/job_8786772/step_extern: alloc=125MB mem.limit=125MB memsw.limit=unlimited
[2020-11-23T18:52:23.875] task_p_slurmd_batch_request: 8786772
[2020-11-23T18:52:23.875] task/affinity: job 8786772 CPU input mask for node: 0x00000000003000
[2020-11-23T18:52:23.875] task/affinity: job 8786772 CPU final HW mask for node: 0x00000400000040
[2020-11-23T18:52:23.876] Launching batch job 8786772 for UID 20116
[2020-11-23T18:52:23.901] [8786772.4294967291] task/cgroup: /slurm/uid_20116/job_8786772: alloc=125MB mem.limit=125MB memsw.limit=unlimited
[2020-11-23T18:52:23.910] [8786772.4294967291] task/cgroup: /slurm/uid_20116/job_8786772/step_4294967291: alloc=125MB mem.limit=125MB memsw.limit=unlimited
[2020-11-23T18:52:23.983] [8786772.4294967291] task_p_pre_launch: Using sched_affinity for tasks
[2020-11-23T18:52:24.019] [8786772.4294967291] done with job
Just to clarify: we only upgraded slurmctld and slurmdbd to 20.11. The compute nodes are still on 19.05.3-2.
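For reference, a quick way to confirm the version mix on each side (standard -V flags; the expected outputs are just what this thread describes):

# On the controller host:
slurmctld -V        # should report 20.11.0 after the upgrade

# On a compute node, e.g. pe2cc3-003:
slurmd -V           # should still report 19.05.3-2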
Just giving you an update. We are looking into this and will let you know what we find.
*** Ticket 10259 has been marked as a duplicate of this ticket. ***
Hi,

This patch should resolve the issue:

https://github.com/SchedMD/slurm/commit/aaa219f75

It will be included in 20.11.1. Could you apply this patch and check whether it helps?

Dominik
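A minimal sketch of applying the commit to a 20.11.0 source build, in case it helps someone else. The install prefix is an assumption based on the plugin paths in the logs, and the GitHub ".patch" URL suffix is a GitHub feature, not something from this ticket:

# Fetch the commit as a patch file from GitHub.
curl -LO https://github.com/SchedMD/slurm/commit/aaa219f75.patch

# Apply it to the source tree and rebuild/reinstall (prefix assumed).
cd slurm-20.11.0
patch -p1 < ../aaa219f75.patch
./configure --prefix=/opt/slurm/20.11.0
make -j && make install

# Restart the controller so it picks up the patched code.
systemctl restart slurmctld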
Can confirm the patch resolved the issue.
Just noticed it didn't fix all of the jobs. Some jobs are still not reporting that they have completed. Here is an example job:

[lhuang@pe2-login01 slurm]$ scontrol show job 8787106_9
JobId=8787113 ArrayJobId=8787106 ArrayTaskId=9 JobName=shift_positions_back.sh
   UserId=mbyrska-bishop(20158) GroupId=compbio(9043) MCS_label=N/A
   Priority=408276 Nice=0 Account=compbio QOS=compbio
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=02:21:09 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2020-11-24T10:33:58 EligibleTime=2020-11-24T10:33:58
   AccrueTime=Unknown
   StartTime=2020-11-24T10:33:58 EndTime=2020-12-01T10:33:58 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-24T10:33:58
   Partition=pe2 AllocNode:Sid=mbyrska-vm.nygenome.org:116650
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=pe2cc3-014
   BatchHost=pe2cc3-014
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=20G,node=1,billing=6
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=6 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs/commons/groups/compbio/projects/1KGP_3202_SNV_INDEL_phasing/shapeit_no_duohmm/shift_positions_back.sh shift_positions_back.manifest
   WorkDir=/gpfs/commons/groups/compbio/projects/1KGP_3202_SNV_INDEL_phasing/shapeit_no_duohmm
   StdErr=/gpfs/commons/groups/compbio/projects/1KGP_3202_SNV_INDEL_phasing/shapeit_no_duohmm/logs/shift_positions_back.sh.8787106.9
   StdIn=/dev/null
   StdOut=/gpfs/commons/groups/compbio/projects/1KGP_3202_SNV_INDEL_phasing/shapeit_no_duohmm/logs/shift_positions_back.sh.8787106.9
   Power=

[root@pe2cc3-014 ~]# zgrep 8787113 /var/log/slurmd.log
[2020-11-24T10:33:59.797] _run_prolog: prolog with lock for job 8787113 ran for 0 seconds
[2020-11-24T10:33:59.835] [8787113.extern] task/cgroup: /slurm/uid_20158/job_8787113: alloc=20480MB mem.limit=20480MB memsw.limit=unlimited
[2020-11-24T10:33:59.844] [8787113.extern] task/cgroup: /slurm/uid_20158/job_8787113/step_extern: alloc=20480MB mem.limit=20480MB memsw.limit=unlimited
[2020-11-24T10:33:59.942] task_p_slurmd_batch_request: 8787113
[2020-11-24T10:33:59.942] task/affinity: job 8787113 CPU input mask for node: 0x0000000000FC00
[2020-11-24T10:33:59.942] task/affinity: job 8787113 CPU final HW mask for node: 0x00000E000000E0
[2020-11-24T10:33:59.942] Launching batch job 8787113 for UID 20158
[2020-11-24T10:33:59.967] [8787113.4294967291] task/cgroup: /slurm/uid_20158/job_8787113: alloc=20480MB mem.limit=20480MB memsw.limit=unlimited
[2020-11-24T10:33:59.972] [8787113.4294967291] task/cgroup: /slurm/uid_20158/job_8787113/step_4294967291: alloc=20480MB mem.limit=20480MB memsw.limit=unlimited
[2020-11-24T10:34:00.038] [8787113.4294967291] task_p_pre_launch: Using sched_affinity for tasks
[2020-11-24T12:53:49.949] [8787113.4294967291] done with job

From the slurmctld logs:

[2020-11-24T12:53:49.941] error: step_partial_comp: batch step received for JobId=8787106_9(8787113). This should never happen.
[2020-11-24T12:57:22.979] error: step_partial_comp: batch step received for JobId=8787106_10(8787114). This should never happen.
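For what it's worth, a quick way to cross-check what accounting believes against the controller's live view for a stuck job (job IDs taken from the example above; sacct and scontrol flags are standard):

# What does the accounting database think the job's state is?
sacct -j 8787113 --format=JobID,State,ExitCode,End

# Compare with the controller's live view of the same array task.
scontrol show job 8787106_9 | grep -E 'JobState|RunTime'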
Hi,

Could you send me the full slurmd.log?

Dominik
Created attachment 16836 [details]
slurmd log from hpc node
Created attachment 16837 [details]
slurmdbd log
Created attachment 16838 [details]
slurmctld log
Just noted that these lines appear in the slurmd log:

[2020-12-03T10:22:52.410] debug3: Trying to load plugin /opt/slurm/20.11.0/lib64/slurm/gres_craynetwork.so
[2020-12-03T10:22:52.410] debug4: /opt/slurm/20.11.0/lib64/slurm/gres_craynetwork.so: Does not exist or not a regular file.
[2020-12-03T10:22:52.410] debug: gres: Couldn't find the specified plugin name for gres/craynetwork looking at all files
[2020-12-03T10:22:52.411] debug: Cannot find plugin of type gres/craynetwork, just track gres counts
[2020-12-03T10:22:52.411] debug: Plugin of type gres/craynetwork only tracks gres counts

Did something go away?
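A quick check of where a gres/craynetwork reference would be coming from, assuming the usual config locations under /etc/slurm (those paths are an assumption; adjust for your install):

# Is craynetwork declared anywhere in the config?
grep -ri craynetwork /etc/slurm/slurm.conf /etc/slurm/gres.conf

# Which gres plugins actually shipped with this build?
ls /opt/slurm/20.11.0/lib64/slurm/gres_*.so

Per the debug lines themselves, slurmd falls back to tracking the GRES as a plain count when no matching plugin is found, so these messages are informational rather than a failure.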
We don't use cray here. This isn't in the logs that we sent.
*** Ticket 10344 has been marked as a duplicate of this ticket. ***
Hi,

Do you still use slurmd in version 19.05? I still can't find any code path that produces this behavior with the configuration:

slurmctld  -- 20.11 + patch
slurmd     -- 19.05
slurmstepd -- 19.05

Could you check on the nodes whether the slurmd and slurmstepd versions match?

Dominik
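A sketch of one way to check this on a node. The /proc inspection and the pdsh fan-out are assumptions about the environment, not something from this ticket:

# Version of the installed slurmd binary on this node:
slurmd -V

# slurmstepd processes for running jobs keep executing whatever binary
# they were launched from, so list them and see where each one came from:
ps -eo pid,args | grep '[s]lurmstepd'
ls -l /proc/<stepd_pid>/exe      # <stepd_pid> is a placeholder for a PID from above

# Fan the same check out across all compute nodes (assuming pdsh is available):
pdsh -a 'slurmd -V'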
We no longer have any slurmd running on an older version. Everything is up to date now, and we no longer see this issue.
Hi,

We still have some transition problems in 20.11, and we are working on them in internal bug 10467. Since this is no longer a problem on your site, I'll go ahead and close this. Feel free to reopen if needed.

Dominik