Dear Slurm team, This ticket is probably a spinoff of #6769. I just want to mention here that this particular version is causing us some trouble. After migrating to version 19.05.0 we get a lot of the following errors: > [2019-07-17T07:24:37.268] error: select/cons_res: node gwdd034 memory is under-allocated (25600-32000) for JobId=751905 > [2019-07-17T07:24:37.270] error: select/cons_res: node gwdd029 memory is under-allocated (38400-51200) for JobId=751223 > [2019-07-17T07:24:37.273] error: select/cons_res: node gwdd034 memory is under-allocated (25600-32000) for JobId=751905 > [2019-07-17T07:24:37.275] error: select/cons_res: node gwdd029 memory is under-allocated (38400-51200) for JobId=751223 This leaves the jobs pending with the launch_failed_requeued_held error message, and using "scontrol release" often doesn't help because the jobs run into the same error again. For users whose jobs require days to be scheduled this error causes a lot of trouble. These messages also increase the log volume dramatically (the error rate is dozens of such messages per millisecond while jobs are starting). Is there any workaround to avoid the "memory is under-allocated" errors, or should we wait for a patch?
Azat, Can you please provide the whole output of 'scontrol show job $JOBID' for one of these failed jobs? (You can remove user ids if required.) The error "launch failed requeued held" should only be set on a requeued job that failed to launch. Is it possible to get the slurmd logs of the head node of the job too? Thanks, --Nate
Please also attach an updated slurm.conf and the slurmctld log.
(In reply to Nate Rini from comment #1) > Azat, > > Can you please provide whole output of 'scontrol show job $JOBID' of one of > these failed jobs? (You can remove user ids if required). Hi Nate, Here is the job info of such job of user YYY with the user ID XXX gwdu105:0 09:17:06 ~ # scontrol show job 739867 JobId=739867 JobName=rasrec UserId=YYY(XXX) GroupId=MBPC(9100) MCS_label=N/A Priority=0 Nice=0 Account=all QOS=normal JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null) Requeue=1 Restarts=4 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A SubmitTime=2019-07-16T16:05:37 EligibleTime=2019-07-16T16:07:38 AccrueTime=Unknown StartTime=Unknown EndTime=Unknown Deadline=N/A SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-07-16T16:04:24 Partition=medium-fmz,medium-fas AllocNode:Sid=gwdu103.global.gwdg.cluster:16538 ReqNodeList=(null) ExcNodeList=(null) NodeList=(null) BatchHost=gwdd002 NumNodes=66 NumCPUs=256 NumTasks=256 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=256,mem=800G,node=1,billing=256 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryCPU=3200M MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=/home/mpg08/YYY/rosetta/rasrec_runs/Syrb1_CS_BMRB/01_rasrec_w_abmem/production.slurm.job WorkDir=/home/mpg08/YYY/rosetta/rasrec_runs/Syrb1_CS_BMRB/01_rasrec_w_abmem StdErr=/home/mpg08/YYY/rosetta/rasrec_runs/Syrb1_CS_BMRB/01_rasrec_w_abmem/%J.err StdIn=/dev/null StdOut=/home/mpg08/YYY/rosetta/rasrec_runs/Syrb1_CS_BMRB/01_rasrec_w_abmem/%J.out Power=
Created attachment 10944 [details] The relevant part of slurmd logs for the job > Is it possible to get the slurmd logs of the head node of job too? Here is the relevant part of slurmd logs from the head node of the job.
Azat, Could you please also attach the slurmctld log and slurm.conf? Do you mind sharing the whole slurmd log? Could you please also find the last occurrence of "under-allocated" in the logs, note the NODENAME where it happened, and then execute the command below and share the results with us: >squeue -w NODENAME -o %i -h | xargs scontrol show job -d cheers, Marcin
Created attachment 10946 [details] slurm.conf > Please also attach an updated slurm.conf and the slurmctld log. I have attached slurm.conf file with several included configs for nodes and partitions. Since our slurmctld log file still has immense size, I cannot attach it here. Are you looking for some particular errors? Here are some errors with that JOBID: There are tons of: > [2019-07-16T16:05:10.236] error: select/cons_res: node gwdd070 memory is under-allocated (999-12800) for JobId=739867 > [2019-07-16T16:05:10.236] error: select/cons_res: node gwdd100 memory is under-allocated (999-12800) for JobId=739867 > [2019-07-16T16:05:10.236] error: select/cons_res: node gwdd106 memory is under-allocated (999-12800) for JobId=739867 > [2019-07-16T16:05:10.236] error: select/cons_res: node gwdd108 memory is under-allocated (999-12800) for JobId=739867 > [2019-07-16T16:05:10.236] error: select/cons_res: node gwdd113 memory is under-allocated (999-12800) for JobId=739867 more precisely 37174 errors! Here is the output of the logs where the JobID is mentioned but without under-allocated messages: > gwdu105:8 10:12:48 ~ # cat /var/log/slurm/slurmctld.log | grep "739867" | grep -v "under-allocated" > /tmp/slurmctld.739867.log > [2019-07-11T14:28:34.537] _slurm_rpc_submit_batch_job: JobId=739867 InitPrio=768 usec=30201 > [2019-07-11T21:11:42.655] Recovered JobId=739867 Assoc=1169 > [2019-07-11T23:31:44.940] backfill: Started JobId=739867 in medium-fmz on gwdd[002,004-005,010,020,050-051,053-055,057-058,060-062,064-066,071-078,080-085,091,094,096-097] > [2019-07-11T23:31:45.350] prolog_running_decr: Configuration for JobId=739867 is complete > [2019-07-11T23:31:45.614] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-11T23:31:45.614] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-11T23:31:45.617] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-11T23:31:45.617] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-11T23:31:45.617] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-11T23:31:45.617] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-11T23:31:45.618] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-11T23:31:45.619] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-11T23:31:45.620] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-11T23:31:45.622] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-11T23:32:56.244] cleanup_completing: JobId=739867 completion process took 71 seconds > [2019-07-11T23:32:56.244] Requeuing JobId=739867 > [2019-07-12T14:12:38.622] sched: _release_job_rec: release hold on JobId=739867 by uid 0 > [2019-07-12T14:12:38.623] _slurm_rpc_update_job: complete JobId=739867 uid=0 usec=706 > [2019-07-12T14:14:32.679] backfill: Started JobId=739867 in medium-fmz on gwdd[027-028,030-031,033-035,037,039,042-043,045-048,076,078,082-083,085,087-094,097,166] > [2019-07-12T14:14:34.225] prolog_running_decr: Configuration for JobId=739867 is complete > [2019-07-12T14:14:34.466] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:14:34.468] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:14:34.469] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:14:34.516] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:14:34.517] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:14:34.517] _slurm_rpc_requeue: 739867: Job is pending execution > 
[2019-07-12T14:14:34.519] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:15:45.147] cleanup_completing: JobId=739867 completion process took 71 seconds > [2019-07-12T14:15:45.147] Requeuing JobId=739867 > [2019-07-12T14:33:02.667] sched: _release_job_rec: release hold on JobId=739867 by uid 0 > [2019-07-12T14:33:02.667] _slurm_rpc_update_job: complete JobId=739867 uid=0 usec=561 > [2019-07-12T14:34:24.022] sched: Allocate JobId=739867 NodeList=gwdd[028,030-031,033-035,037,042-043,045-048,076,078,082-083,085,087-091,093-094,097,104] #CPUs=256 Partition=medium-fmz > [2019-07-12T14:34:24.255] prolog_running_decr: Configuration for JobId=739867 is complete > [2019-07-12T14:34:24.417] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:34:24.418] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:34:24.423] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:34:24.424] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:34:24.428] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-12T14:35:35.224] cleanup_completing: JobId=739867 completion process took 71 seconds > [2019-07-12T14:35:35.225] Requeuing JobId=739867 > [2019-07-16T14:03:46.317] sched: _release_job_rec: release hold on JobId=739867 by uid 0 > [2019-07-16T14:03:46.318] _slurm_rpc_update_job: complete JobId=739867 uid=0 usec=660 > [2019-07-16T15:44:34.701] Recovered JobId=739867 Assoc=1169 > [2019-07-16T16:04:24.164] backfill: Started JobId=739867 in medium-fmz on gwdd> [002-003,005-006,008,013-014,018-021,023-024,051-052,054-057,060,062-066,068-070,072-083,085,087-093,095,097,099-102,104-106,108,111-115,117-118,170] > [2019-07-16T16:04:26.014] prolog_running_decr: Configuration for JobId=739867 is complete > [2019-07-16T16:04:26.649] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.650] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.650] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.651] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.651] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.651] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.652] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.653] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.655] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.657] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.659] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.660] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.663] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.663] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.666] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.669] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.703] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.704] slurmd error running JobId=739867 from node(s)=gwdd002: Job credential revoked > [2019-07-16T16:04:26.704] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.705] _slurm_rpc_requeue: 739867: Job is pending execution > [2019-07-16T16:04:26.707] _slurm_rpc_requeue: 739867: Job is pending execution > 
[2019-07-16T16:05:37.482] cleanup_completing: JobId=739867 completion process took 71 seconds > [2019-07-16T16:05:37.482] Requeuing JobId=739867 > [2019-07-17T13:42:52.614] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=739867 uid ZZZ > [2019-07-17T13:42:52.616] error: Security violation, REQUEST_KILL_JOB RPC for JobId=739867 from uid ZZZ > [2019-07-17T13:42:52.616] _slurm_rpc_kill_job: job_str_signal() JobId=739867 sig 9 returned Access/permission denied
Created attachment 10947 [details] slurmd.log full (In reply to Marcin Stolarek from comment #5) > Azat, > > Do you mind sharing whole slurmd log? Hi Marcin, I attached the full version of slurmd log (masked UIDs and usernames). > Could you please find also last occurence of "under-allocated" in logs and > check NODENAME where it happened and then execute and share results of the > command below with us: > >squeue -w NODENAME -o %i -h | xargs scontrol show job -d Yes, here is the output: > gwdu105:0 10:46:58 ~ # squeue -w gwdd016 -o %i -h | xargs scontrol show job -d > JobId=754890 JobName=I10-117-1 > UserId=YYY(XXX) GroupId=UCPC(13400) MCS_label=N/A > Priority=925 Nice=0 Account=all QOS=normal > JobState=RUNNING Reason=None Dependency=(null) > Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 > DerivedExitCode=0:0 > RunTime=00:21:08 TimeLimit=05:00:00 TimeMin=N/A > SubmitTime=2019-07-16T09:34:37 EligibleTime=2019-07-16T09:34:37 > AccrueTime=2019-07-16T09:34:37 > StartTime=2019-07-18T10:26:08 EndTime=2019-07-18T15:26:09 Deadline=N/A > SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-07-18T10:26:08 > Partition=medium-fmz AllocNode:Sid=gwdu102.global.gwdg.cluster:28785 > ReqNodeList=(null) ExcNodeList=(null) > NodeList=gwdd016 > BatchHost=gwdd016 > NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:* > TRES=cpu=20,mem=62.50G,node=1,billing=20 > Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* > Nodes=gwdd016 CPU_IDs=0-19 Mem=64000 GRES= > MinCPUsNode=1 MinMemoryCPU=3200M MinTmpDiskNode=0 > Features=(null) DelayBoot=00:00:00 > OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) > Command=(null) > WorkDir=/home/uni08/YYY/Promotion-MH-gwdg/PBE0-structures/IRMOF-10-250K/117/1-pbe0 > StdErr=/usr/users/YYY/stdout+stderr/out.%J > StdIn=/dev/null > StdOut=/usr/users/YYY/stdout+stderr/out.%J > Power=
Azat, Could you please try to compress the slurmctld log? Could you please also attach the results of the following commands from the node: > free > dmesg From the attached log it looks like you're running version 18.08.06, is that correct? (In the bug report form you mentioned version 19.05.0. Just to be sure: did you attempt to upgrade, or is the version in the form just a mistake?) cheers, Marcin
(In reply to Marcin Stolarek from comment #8) Hi Marcin > From the attached log it looks like you're running version 18.08.06 is that > correct? (In the bug report form you mentioned version 19.05.0. Just to be > sure did you attempt to upgrade or version in the form is just a mistake?) The version is 19.05, but we did not set up logrotate, so the old entries are still in the same file. Scroll down a bit to 2019-06-25; there's the restart with 19.05. (I'll set up logrotate now so that this doesn't happen again...) Cheers, Marcus
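For reference, the logrotate drop-in I have in mind is roughly the following; the paths and retention here are assumptions for illustration, not our final config:
> # /etc/logrotate.d/slurm (sketch)
> /var/log/slurm/slurmctld.log /var/log/slurm/slurmd.log {
>     weekly
>     rotate 8
>     missingok
>     compress
>     delaycompress
>     copytruncate
> }
copytruncate avoids having to signal the daemons; alternatively a postrotate script can send SIGUSR2, which makes slurmctld/slurmd reopen their log files.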
Created attachment 10949 [details] slurmctld.log (In reply to Marcin Stolarek from comment #8) > Could you please try to compress slurmctld log? Yes, I have attached last 10M lines of it > Could you please attach results of the following commands from the node: > > free > > dmesg Sure: > gwdd016:0 15:02:13 ~ # free > total used free shared buff/cache available > Mem: 65941536 32026920 25638872 744976 8275744 31685136 > Swap: 12582908 615504 11967404 and > gwdd016:0 15:02:14 ~ # dmesg -T > ... > [Thu Jul 11 15:43:50 2019] CVFS: Lost connection to Disk Proxy Server at 10.108.96.167 port 55334 > [Thu Jul 11 15:43:50 2019] Connected to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334 > [Thu Jul 11 15:44:00 2019] Reconnecting to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334 > [Thu Jul 11 15:44:00 2019] Connected to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334 > [Thu Jul 11 15:44:58 2019] SNFS (info) proxy read timeout changed:576 -> 864 > [Thu Jul 11 15:49:56 2019] Reconnecting to FSS 'UNI08' > [Thu Jul 11 15:49:56 2019] StorNext mounted file system UNI08: No FSS registered with PortMapper on host 10.108.96.161, retrying... > [Thu Jul 11 15:50:13 2019] Reconnecting to FSS 'UNI08' > [Thu Jul 11 15:50:24 2019] Reconnecting to FSS 'UNI08' > [Thu Jul 11 15:50:35 2019] Reconnecting to FSS 'UNI08' > [Thu Jul 11 15:50:46 2019] Reconnecting to FSS 'UNI08' > [Thu Jul 11 15:50:57 2019] Reconnecting to FSS 'UNI08' > [Thu Jul 11 15:50:57 2019] Reconnect successful to FSS 'UNI08' on host '10.108.96.160'. > [Thu Jul 11 15:50:57 2019] Using v3 readdir for 'UNI08' > [Thu Jul 11 15:51:10 2019] All partition groups for filesystem 'UNI08' online. > [Thu Jul 11 17:20:08 2019] SNFS (info) proxy write timeout changed:768 -> 1153 > [Fri Jul 12 10:03:26 2019] Reconnecting to FSS 'MPG08' > [Fri Jul 12 10:03:26 2019] StorNext mounted file system MPG08: No FSS registered with PortMapper on host 10.108.96.161, retrying... > [Fri Jul 12 10:03:41 2019] fs 'MPG08': FSS on host 10.108.96.160 not currently accessible (possibly stopped administratively). Thread 0xffff88085b906eb0 pid 15061, still trying. > [Fri Jul 12 10:03:51 2019] Reconnecting to FSS 'MPG08' > [Fri Jul 12 10:04:05 2019] Reconnect successful to FSS 'MPG08' on host '10.108.96.160'. > [Fri Jul 12 10:04:05 2019] Using v3 readdir for 'MPG08' > [Fri Jul 12 22:47:52 2019] ib0: Budget exhausted after napi rescheduled > [Sun Jul 14 04:43:46 2019] SNFS (info) proxy read timeout changed:15 -> 22 > [Sun Jul 14 14:38:55 2019] SNFS (info) proxy read timeout changed:22 -> 33 > [Sun Jul 14 20:01:17 2019] orted[29853]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb4c0 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000] > [Mon Jul 15 16:23:20 2019] Reconnecting to FSS 'UNI08' > [Mon Jul 15 16:23:20 2019] StorNext mounted file system UNI08: No FSS registered with PortMapper on host 10.108.96.160, retrying... > [Mon Jul 15 16:23:34 2019] fs 'UNI08': FSS on host 10.108.96.161 not currently accessible (possibly stopped administratively). Thread 0xffff88105cf24f10 pid 14767, still trying. > [Mon Jul 15 16:23:40 2019] Reconnect successful to FSS 'UNI08' on host '10.108.96.161'. 
> [Mon Jul 15 16:23:40 2019] Using v3 readdir for 'UNI08' > [Mon Jul 15 17:06:19 2019] orted[1204]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb480 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000] > [Tue Jul 16 09:45:12 2019] Reconnecting to FSS 'UNI05' > [Tue Jul 16 09:45:12 2019] StorNext mounted file system UNI05: No FSS registered with PortMapper on host 10.108.96.161, retrying... > [Tue Jul 16 09:45:24 2019] fs 'UNI05': FSS on host 10.108.96.160 not currently accessible (possibly stopped administratively). Thread 0xffff8808583d8000 pid 14615, still trying. > [Tue Jul 16 09:45:34 2019] Reconnecting to FSS 'UNI05' > [Tue Jul 16 09:45:54 2019] Reconnecting to FSS 'UNI05' > [Tue Jul 16 09:45:54 2019] Reconnect successful to FSS 'UNI05' on host '10.108.96.160'. > [Tue Jul 16 09:45:54 2019] Using v3 readdir for 'UNI05' > [Tue Jul 16 09:50:32 2019] Reconnecting to FSS 'UNI05' > [Tue Jul 16 09:50:32 2019] StorNext mounted file system UNI05: No FSS registered with PortMapper on host 10.108.96.160, retrying... > [Tue Jul 16 09:50:43 2019] Reconnect successful to FSS 'UNI05' on host '10.108.96.161'. > [Tue Jul 16 09:50:43 2019] Using v3 readdir for 'UNI05' > [Tue Jul 16 15:48:23 2019] nfs: server fs-base.fs.gwdg.de not responding, still trying > [Tue Jul 16 15:50:59 2019] nfs: server fs-base.fs.gwdg.de not responding, still trying > [Tue Jul 16 15:54:19 2019] nfs: server fs-base.fs.gwdg.de OK > [Tue Jul 16 15:54:19 2019] nfs: server fs-base.fs.gwdg.de OK > [Tue Jul 16 16:02:21 2019] Reconnecting to FSS 'UNI05' > [Tue Jul 16 16:02:21 2019] No connections found for any Disk Proxy Server > [Tue Jul 16 16:02:21 2019] Stripe group '__VACANT_SG__12' is VACANT on filesystem UNI05 (was DOWN) > [Tue Jul 16 16:02:21 2019] Reconnect successful to FSS 'UNI05' on host '10.108.96.161'. > [Tue Jul 16 16:02:21 2019] Using v3 readdir for 'UNI05' > [Tue Jul 16 16:02:25 2019] All partition groups for filesystem 'UNI05' online. > [Tue Jul 16 17:59:36 2019] slurmstepd[25388]: segfault at 42 ip 0000000000413c54 sp 00007fffffffea20 error 4 in slurmstepd[400000+37000] > [Tue Jul 16 19:37:48 2019] Reconnecting to FSS 'UNI05' > [Tue Jul 16 19:37:48 2019] No connections found for any Disk Proxy Server > [Tue Jul 16 19:37:48 2019] Stripe group '__VACANT_SG__15' is VACANT on filesystem UNI05 (was UP) > [Tue Jul 16 19:37:48 2019] Reconnect successful to FSS 'UNI05' on host '10.108.96.161'. > [Tue Jul 16 19:37:48 2019] Using v3 readdir for 'UNI05' > [Tue Jul 16 19:38:45 2019] All partition groups for filesystem 'UNI05' online. > [Wed Jul 17 18:21:28 2019] orted[19560]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb500 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000]
The error 'memory is under-allocated' is a scheduler error. It shouldn't cause a job failure at launch with this error 'launch failed requeued held' but instead should only delay job launch until the memory allocations are corrected (which happens when you see the under-allocated error). (In reply to Azat from comment #10) > > [Thu Jul 11 15:43:50 2019] CVFS: Lost connection to Disk Proxy Server at 10.108.96.167 port 55334 > > [Thu Jul 11 15:43:50 2019] Connected to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334 > > [Thu Jul 11 15:44:00 2019] Reconnecting to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334 > > [Thu Jul 11 15:44:00 2019] Connected to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334 > > [Thu Jul 11 15:44:58 2019] SNFS (info) proxy read timeout changed:576 -> 864 Is Slurm (or any component of Slurm) on this filesystem that appears to be reconnecting quite a bit? (In reply to Azat from comment #10) > > [Mon Jul 15 17:06:19 2019] orted[1204]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb480 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000] > > [Tue Jul 16 17:59:36 2019] slurmstepd[25388]: segfault at 42 ip 0000000000413c54 sp 00007fffffffea20 error 4 in slurmstepd[400000+37000] > > [Wed Jul 17 18:21:28 2019] orted[19560]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb500 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000] Are there any cores from orted and/or slurmstepd segfaulting, can you please provide 'thread apply all bt full' from them? This is more in line with what I would expect to cause 'launch failed requeued held' error. If there are no cores, is it possible to run gdb against slurmd with 'set follow-fork-mode child' to grab the backtrace when they SEGFAULT? Can you please call the following on gwdd016 as root: > cat /proc/mounts > ldd $(which slurmd) > gdb $(which slurmd) >> b main >> r >> info files >> info symbol 0x0000000000413c54 >> info symbol 0x00002aaaac3330b8 >> info symbol 0x00007fffffffb500 >> info symbol 0x00007fffffffb500 >> info symbol 0x2aaaac324000 (I don't expect all the info symbols to succeed if you have ASLR active).
(In reply to Azat from comment #4) > Created attachment 10944 [details] > The relevant part of slurmd logs for the job > [2019-07-11T23:31:45.358] error: Credential signature check: Credential data size mismatch > [2019-07-11T23:31:45.358] error: _convert_job_mem: slurm_cred_verify failed: Invalid job credential This is an unexpected error (ESIG_BUF_SIZE_MISMATCH) with your packets that will need to be addressed. Do you have the same munge version installed on all nodes? Is it possible that your packets are getting corrupted but not being caught by the normal TCP CRC checks? Can you please verify that slurm is linking (via ldd) to the same munge on all nodes? Is this error seen on other nodes or is it isolated to this node (possibly due to bad memory)? > [739542.extern] _oom_event_monitor: oom-kill event count: 1 This error may explain why orted failed; there should be additional information in the node's syslog/journalctl logs. Can you please provide those logs?
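For the munge check, something along these lines would cover every node in one pass (just a sketch; it assumes passwordless ssh from an admin host and that slurmd is in the default PATH, otherwise use the full path to the binary):
> for host in $(sinfo -N -h -o '%N' | sort -u); do
>     echo "== $host =="
>     ssh "$host" 'munge --version; ldd "$(which slurmd)" | grep -i munge'
> done
Any node reporting a different munge version or a different libmunge path would be a prime suspect.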
(In reply to Azat from comment #7) > Created attachment 10947 [details] > slurmd.log full > [2019-04-12T09:42:40.654] TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology Is this intentional? Jobs cannot cross unconnected nodes in a topology; please see the doc changes: https://github.com/SchedMD/slurm/commit/acea46d8402c7d99916d9b8121ba5c38fa613ae1 > [60983.extern] _oom_event_monitor: oom-kill event count: 1 There are a good number of oom events on this node. I assume these are job-triggered since the job cgroup appears to be correctly catching the oom events. Is it possible to see why this job keeps causing oom events?
(In reply to Azat from comment #7) > I attached the full version of slurmd log (masked UIDs and usernames). Did you use a script to do this or was it manually done? I also marked all of your attachments as private just in case something was missed.
(In reply to Azat from comment #10) > Created attachment 10949 [details] > slurmctld.log Looking at this log file: > [2019-06-25T11:29:54.898] error: Job allocate response msg send failure, killing JobId=616782 > [2019-06-25T11:31:35.326] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased > [2019-07-15T14:23:38.870] error: _find_node_record(756): lookup failure for dge1 > [2019-07-15T14:23:38.870] error: node_name2bitmap: invalid node specified dge1 Have there been changes to the node lineup in Slurm or possibly issues with your DNS server? > [2019-06-25T12:41:24.950] error: slurm_persist_conn_open_without_init: failed to open persistent connection to gwdu104:6819: Connection refused What is the status of your backup controller currently? > [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Invalid Protocol Version 8704 from uid=-1 at 10.108.96.104:45391 > [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Incompatible versions of client and server code Can you please check the node @10.108.96.104 to make sure it's Slurm version is within 2 releases of your controller. > [2019-06-25T13:57:43.130] error: _slurm_rpc_node_registration node=dmp008: Invalid argument > [2019-06-25T13:57:44.399] error: Node dmp008 has low real_memory size (112699 < 128000) > [2019-06-25T13:58:43.016] error: Node dmp008 appears to have a different slurm.conf than the slurmctld. Looks like this node lost a DIMM, please make sure to update its slurm.conf or turn off the slurm daemons on it to avoid the log spam. > error: slurm_jobcomp plugin context not initialized Was there a different jobcomp configuration to cause this error? It should never happen with your current none plugin: >JobCompType=jobcomp/none > error: job_str_signal: Security violation JOB_CANCEL RPC for JobId=759301_* from uid XXX This user appears to have tried to kill all jobs, maybe a conversation with them is warranted. > [2019-07-16T13:47:50.447] error: job_submit/lua: /opt/slurm/etc/job_submit.lua: [string "slurm.user_msg (string.format(unpack({...})))"]:1: bad argument #1 to 'format' (string expected, got no value) Has this script bug been resolved (doubt it is causing the current issues though). > [2019-07-11T20:17:39.703] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable > [2019-07-11T20:17:39.703] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable Looks like the slurmctld node ran out of memory. Has this issue been resolved? > [2019-07-11T21:13:33.928] error: Registered JobId=714588 StepId=Batch on wrong node gwdd084 > [2019-07-11T21:13:33.928] error: Registered JobId=714588 StepId=4294967295 on wrong node gwdd084 > [2019-07-11T21:13:34.869] error: Registered JobId=714589 StepId=Batch on wrong node gwdd057 > [2019-07-11T21:13:34.869] error: Registered JobId=714589 StepId=4294967295 on wrong node gwdd057 > [2019-07-11T21:13:34.897] error: Registered JobId=714590 StepId=Batch on wrong node gwdd016 > [2019-07-11T21:13:34.897] error: Registered JobId=714590 StepId=4294967295 on wrong node gwdd016 I see a lot of errors along these lines. Can you please verify that there isn't something odd going on with the DNS or node hostnames where there are duplicate nodes getting the same hostname or IP?
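A quick sanity check for a few of the nodes from those log lines could look like this (a sketch only; extend the host list as needed):
> for host in gwdd084 gwdd057 gwdd016; do
>     ip=$(getent hosts "$host" | awk '{print $1}')
>     rev=$(getent hosts "$ip" | awk '{print $2}')
>     echo "$host -> $ip -> $rev"
> done
Forward and reverse lookups should round-trip to the same name, and no two nodes should resolve to the same address.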
(In reply to Nate Rini from comment #11) > The error 'memory is under-allocated' is a scheduler error. It shouldn't > cause a job failure at launch with this error 'launch failed requeued held' > but instead should only delay job launch until the memory allocations are > corrected (which happens when you see the under-allocated error). Sounds reasonable. Probably 2 separate issues. > > > [Thu Jul 11 15:43:50 2019] CVFS: Lost connection to Disk Proxy Server at 10.108.96.167 port 55334 > > > ... > Is Slurm (or any component of Slurm) on this filesystem that appears to be > reconnecting quite a bit? No, Slurm is located on NFS, where we don't get such connection problems. > > > ... > > > [Tue Jul 16 17:59:36 2019] slurmstepd[25388]: segfault at 42 ip 0000000000413c54 sp 00007fffffffea20 error 4 in slurmstepd[400000+37000] > > > ... > Are there any cores from orted and/or slurmstepd segfaulting, can you please > provide 'thread apply all bt full' from them? This is more in line with what > I would expect to cause 'launch failed requeued held' error. Agree, probably it is a cause of failed launch. I just hope that we can debug the cause of the segfault. And there are 2 identical cores from the slurmstepd. Here is the one from the logs (25388): > gwdd016:16 09:55:41 /tmp # gdb /opt/slurm/sbin/slurmstepd core.25388 > GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7 > Copyright (C) 2013 Free Software Foundation, Inc. > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> > This is free software: you are free to change and redistribute it. > There is NO WARRANTY, to the extent permitted by law. Type "show copying" > and "show warranty" for details. > This GDB was configured as "x86_64-redhat-linux-gnu". > For bug reporting instructions, please see: > <http://www.gnu.org/software/gdb/bugs/>... > Reading symbols from /opt/lsf/slurm/slurm/19.05.0/install/sbin/slurmstepd...done. > [New LWP 25388] > [New LWP 25407] > [New LWP 25408] > [Thread debugging using libthread_db enabled] > Using host libthread_db library "/lib64/libthread_db.so.1". > Core was generated by `slurmstepd: [756929.extern] '. > Program terminated with signal 11, Segmentation fault. > #0 0x0000000000413c54 in _job_init_task_info (efname=0x648590 "/dev/null", ofname=0x646ec0 "/dev/null", ifname=0x658bc0 "/dev/null", gtid=0x64b990, job=0x645bc0) > at ../../../../source/src/slurmd/slurmstepd/slurmstepd_job.c:164 > 164 gtid[node_id][i] + pack_offset, > Missing separate debuginfos, use: debuginfo-install audit-libs-2.6.5-3.el7.x86_64 glibc-2.17-260.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 pam-1.1.8-18.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64 > (gdb) thread apply all bt full > > Thread 3 (Thread 0x2aaaae740700 (LWP 25408)): > #0 0x00002aaaabb2cd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 > No symbol table info available. > #1 0x00002aaaaadaa570 in _timer_thread (args=<optimized out>) at ../../../source/src/common/slurm_acct_gather_profile.c:205 > err = <optimized out> > i = 4 > now = 1563293060 > diff = <optimized out> > tvnow = {tv_sec = 1563293060, tv_usec = 83769} > abs = {tv_sec = 1563293061, tv_nsec = 83769000} > __func__ = "_timer_thread" > #2 0x00002aaaabb28dd5 in start_thread () from /lib64/libpthread.so.0 > No symbol table info available. > #3 0x00002aaaabe3aead in clone () from /lib64/libc.so.6 > No symbol table info available. 
> > Thread 2 (Thread 0x2aaaaabf4700 (LWP 25407)): > #0 0x00002aaaabb2c965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 > No symbol table info available. > #1 0x00002aaaaadb0be3 in _watch_tasks (arg=<optimized out>) at ../../../source/src/common/slurm_jobacct_gather.c:366 > err = <optimized out> > __func__ = "_watch_tasks" > #2 0x00002aaaabb28dd5 in start_thread () from /lib64/libpthread.so.0 > No symbol table info available. > #3 0x00002aaaabe3aead in clone () from /lib64/libc.so.6 > No symbol table info available. > > Thread 1 (Thread 0x2aaaaaaf2480 (LWP 25388)): > #0 0x0000000000413c54 in _job_init_task_info (efname=0x648590 "/dev/null", ofname=0x646ec0 "/dev/null", ifname=0x658bc0 "/dev/null", gtid=0x64b990, job=0x645bc0) > at ../../../../source/src/slurmd/slurmstepd/slurmstepd_job.c:164 > i = 0 > err = <optimized out> > pack_offset = 0 > node_id = 8 > in = <optimized out> > out = <optimized out> > #1 stepd_step_rec_create (msg=msg@entry=0x64f260, protocol_version=<optimized out>) at ../../../../source/src/slurmd/slurmstepd/slurmstepd_job.c:498 > job = 0x645bc0 > srun = <optimized out> > resp_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"} > io_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"} > i = <optimized out> > nodeid = <optimized out> > __func__ = "stepd_step_rec_create" > #2 0x000000000040f134 in mgr_launch_tasks_setup (msg=0x64f260, cli=cli@entry=0x645530, self=self@entry=0x64b9b0, protocol_version=<optimized out>) at ../../../../source/src/slurmd/slurmstepd/mgr.c:206 > job = 0x0 > #3 0x000000000040cca8 in _step_setup (msg=0x64cb90, self=0x64b9b0, cli=0x645530) at ../../../../source/src/slurmd/slurmstepd/slurmstepd.c:677 > job = 0x0 > #4 main (argc=1, argv=0x7fffffffed78) at ../../../../source/src/slurmd/slurmstepd/slurmstepd.c:132 > cli = 0x645530 > self = 0x64b9b0 > msg = 0x64cb90 > rc = 0 > launch_params = 0x42b30d <__libc_csu_init+77> "H\203\303\001H9\353u\352H\203\304\b[]A\\A]A^A_Ðf.\017\037\204" > __func__ = "main" > (gdb) > > Thread 3 (Thread 0x2aaaae740700 (LWP 25408)): > #0 0x00002aaaabb2cd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 > No symbol table info available. > #1 0x00002aaaaadaa570 in _timer_thread (args=<optimized out>) at ../../../source/src/common/slurm_acct_gather_profile.c:205 > err = <optimized out> > i = 4 > now = 1563293060 > diff = <optimized out> > tvnow = {tv_sec = 1563293060, tv_usec = 83769} > abs = {tv_sec = 1563293061, tv_nsec = 83769000} > __func__ = "_timer_thread" > #2 0x00002aaaabb28dd5 in start_thread () from /lib64/libpthread.so.0 > No symbol table info available. > #3 0x00002aaaabe3aead in clone () from /lib64/libc.so.6 > No symbol table info available. > > Thread 2 (Thread 0x2aaaaabf4700 (LWP 25407)): > #0 0x00002aaaabb2c965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 > No symbol table info available. > #1 0x00002aaaaadb0be3 in _watch_tasks (arg=<optimized out>) at ../../../source/src/common/slurm_jobacct_gather.c:366 > err = <optimized out> > __func__ = "_watch_tasks" > #2 0x00002aaaabb28dd5 in start_thread () from /lib64/libpthread.so.0 > No symbol table info available. > #3 0x00002aaaabe3aead in clone () from /lib64/libc.so.6 > No symbol table info available. 
> > Thread 1 (Thread 0x2aaaaaaf2480 (LWP 25388)): > #0 0x0000000000413c54 in _job_init_task_info (efname=0x648590 "/dev/null", ofname=0x646ec0 "/dev/null", ifname=0x658bc0 "/dev/null", gtid=0x64b990, job=0x645bc0) > at ../../../../source/src/slurmd/slurmstepd/slurmstepd_job.c:164 > i = 0 > err = <optimized out> > pack_offset = 0 > node_id = 8 > in = <optimized out> > out = <optimized out> > #1 stepd_step_rec_create (msg=msg@entry=0x64f260, protocol_version=<optimized out>) at ../../../../source/src/slurmd/slurmstepd/slurmstepd_job.c:498 > job = 0x645bc0 > srun = <optimized out> > resp_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"} > io_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"} > i = <optimized out> > nodeid = <optimized out> > __func__ = "stepd_step_rec_create" > #2 0x000000000040f134 in mgr_launch_tasks_setup (msg=0x64f260, cli=cli@entry=0x645530, self=self@entry=0x64b9b0, protocol_version=<optimized out>) at ../../../../source/src/slurmd/slurmstepd/mgr.c:206 > job = 0x0 > #3 0x000000000040cca8 in _step_setup (msg=0x64cb90, self=0x64b9b0, cli=0x645530) at ../../../../source/src/slurmd/slurmstepd/slurmstepd.c:677 > job = 0x0 > #4 main (argc=1, argv=0x7fffffffed78) at ../../../../source/src/slurmd/slurmstepd/slurmstepd.c:132 > cli = 0x645530 > self = 0x64b9b0 > msg = 0x64cb90 > rc = 0 > launch_params = 0x42b30d <__libc_csu_init+77> "H\203\303\001H9\353u\352H\203\304\b[]A\\A]A^A_Ðf.\017\037\204" > __func__ = "main"
Azat, Could you please load the core dump one more time and execute the following commands in gdb: (gdb) set print pretty on (gdb) print *job (gdb) f 1 (gdb) print *msg (gdb) print *conf It looks like the issue may be coming from node name resolution. Are you sure that you don't have some DNS (/etc/hosts) inconsistency? cheers, Marcin
Created attachment 10959 [details] gdb long output (In reply to Marcin Stolarek from comment #17) > Could you please load the core dump one more time and execute the following > commands in gdb: > (gdb) set print pretty on > (gdb) print *job > (gdb) f 1 > (gdb) print *msg > (gdb) print *conf Yes, see the attachment. > It looks like the issue may be coming from node name resolution. Are you > sure that you don't have some DNS (/etc/hosts) inconsistency? No problems that we are aware of... DNS seems to be fine.
(In reply to Nate Rini from comment #12) > > [2019-07-11T23:31:45.358] error: Credential signature check: Credential data size mismatch > > [2019-07-11T23:31:45.358] error: _convert_job_mem: slurm_cred_verify failed: Invalid job credential > This is an unexpected error (ESIG_BUF_SIZE_MISMATCH) with your packets that > will need to be addressed. Do you have the same munge version installed on > all nodes? Yes, all of them have the same version. > Is it possible that your packets are getting corrupted but not > being caught by the normal TCP CRC checks? I think it is unlikely. > Can you please verify that slurm > is linking (via ldd) to the same munge on all nodes. I didn't find a Slurm binary that links to libmunge, but Slurm is installed on a shared FS, so wherever they link, it is the same on all nodes :) > Is this error seen on > other nodes or is it isolated to this node (possibly due to bad memory)? We haven't inspected that error yet. Also, at around 21:00 that day the master control daemon crashed and the secondary took over (that is related to one of your comments above, and there is also bug #7360 open for it). Could the error be because of that? > > [739542.extern] _oom_event_monitor: oom-kill event count: 1 > This error may explain why orted failed; there should be additional > information in the node's syslog/journalctl logs. Can you please provide > those logs? Not much there: > Jul 15 17:11:00 gwdd016 kernel: orted[1204]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb480 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000]
Azat, Could you please check if the time on the host where slurmctld runs is in sync with the nodes? cheers, Marcin
(In reply to Nate Rini from comment #13) > > [2019-04-12T09:42:40.654] TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology > Is intentional? Jobs cannot cross unconnected nodes in a topology Yes, it is intentional; we have 2 sites where the nodes are located. Instead of making 2 clusters, for convenience we created 2 partitions, one per site, and a corresponding switch config in Slurm. Jobs should not run on nodes from both sites simultaneously. > > [60983.extern] _oom_event_monitor: oom-kill event count: 1 > There are good number of oom events on this node. I assume this is job > triggered since the job cgroup appears to be correctly catching the oom > events. Is it possible to see why this job keeps causing oom events? It seems that the oom events are indeed what cause the orted segfaults. But I have no idea how to find out why the jobs cause oom events; it might also be a misconfiguration of the job by the user.
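(Regarding the topology: the split described above corresponds to a topology.conf of this general shape, with the two sites as two root switches that share no common parent - the switch and node names below are only illustrative, not our actual config:
> SwitchName=site_a Nodes=gwdd[001-200]
> SwitchName=site_b Nodes=dge[001-100]
so the "no switch can reach all nodes" warning is expected on our side.)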
(In reply to Marcin Stolarek from comment #20) > Could you please check if the time on the host where slurmctld runs is in > sync with the nodes? Yes, > gwdd016:1 16:04:51 ~ # ntpq -p > remote refid st t when poll reach delay offset jitter > ============================================================================== > *gwdu105.global. 134.76.249.102 3 u 351 1024 377 0.119 -0.299 0.242 > +gwdu104.global. 134.76.249.102 3 u 13 1024 377 0.183 -0.381 0.250 Looks good. gwdu105 is the node for master slurmctld.
(In reply to Nate Rini from comment #14) > (In reply to Azat from comment #7) > > I attached the full version of slurmd log (masked UIDs and usernames). > Did you use a script to do this or was it manually done? That one was done manually, but afterwards I made a script (for slurmctld logs ;) ) > I also marked all > of your attachments as private just in case something was missed. Thanks! I had indeed missed something!
(In reply to Nate Rini from comment #15) > > [2019-06-25T11:29:54.898] error: Job allocate response msg send failure, killing JobId=616782 > > [2019-06-25T11:31:35.326] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased > > [2019-07-15T14:23:38.870] error: _find_node_record(756): lookup failure for dge1 > > [2019-07-15T14:23:38.870] error: node_name2bitmap: invalid node specified dge1 > Have there been changes to the node lineup in Slurm or possibly issues with > your DNS server? I have no clue where dge1 came from... I couldn't find it in any of the config files. The dge nodes all have 3 digits in their names, so that one should probably be dge001. Can you please tell me where node_name2bitmap gets it? > > [2019-06-25T12:41:24.950] error: slurm_persist_conn_open_without_init: failed to open persistent connection to gwdu104:6819: Connection refused > What is the status of your backup controller currently? That was the time when we updated Slurm, so it is probably because of that. > > [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Invalid Protocol Version 8704 from uid=-1 at 10.108.96.104:45391 > > [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Incompatible versions of client and server code > Can you please check the node @10.108.96.104 to make sure it's Slurm version > is within 2 releases of your controller. Our installation of Slurm is on a shared FS, and 2019-06-25T13 is the time when we updated Slurm, so this is probably because of that; however, it was still within 2 releases. We updated from 18.08. > > [2019-06-25T13:57:43.130] error: _slurm_rpc_node_registration node=dmp008: Invalid argument > > [2019-06-25T13:57:44.399] error: Node dmp008 has low real_memory size (112699 < 128000) > > [2019-06-25T13:58:43.016] error: Node dmp008 appears to have a different slurm.conf than the slurmctld. > Looks like this node lost a DIMM, please make sure to update its slurm.conf > or turn off the slurm daemons on it to avoid the log spam. Thanks, we did it. > > error: slurm_jobcomp plugin context not initialized > Was there a different jobcomp configuration to cause this error? It should > never happen with your current none plugin: > >JobCompType=jobcomp/none I believe not. I will clarify it with my colleagues later. > > error: job_str_signal: Security violation JOB_CANCEL RPC for JobId=759301_* from uid XXX > This user appears to have tried to kill all jobs, maybe a conversation with > them is warranted. As long as they are not able to do it, we are good :) > > [2019-07-16T13:47:50.447] error: job_submit/lua: /opt/slurm/etc/job_submit.lua: [string "slurm.user_msg (string.format(unpack({...})))"]:1: bad argument #1 to 'format' (string expected, got no value) > Has this script bug been resolved (doubt it is causing the current issues > though). Thank you! We will look into that too. > > [2019-07-11T20:17:39.703] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable > > [2019-07-11T20:17:39.703] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable > Looks like the slurmctld node ran out of memory. Has this issue been > resolved? 
Not yet, that one has its own thread at bug #7360 > > [2019-07-11T21:13:33.928] error: Registered JobId=714588 StepId=Batch on wrong node gwdd084 > > [2019-07-11T21:13:33.928] error: Registered JobId=714588 StepId=4294967295 on wrong node gwdd084 > > [2019-07-11T21:13:34.869] error: Registered JobId=714589 StepId=Batch on wrong node gwdd057 > > [2019-07-11T21:13:34.869] error: Registered JobId=714589 StepId=4294967295 on wrong node gwdd057 > > [2019-07-11T21:13:34.897] error: Registered JobId=714590 StepId=Batch on wrong node gwdd016 > > [2019-07-11T21:13:34.897] error: Registered JobId=714590 StepId=4294967295 on wrong node gwdd016 > I see a lot of errors along these lines. Can you please verify that there > isn't something odd going on with the DNS or node hostnames where there are > duplicate nodes getting the same hostname or IP? That is handled by the cluster manager software, and so far we haven't had issues with it. Both IPs and hostnames currently look fine to me.
Azat, Could you please add this configuration line to your slurm.conf: >AuthInfo=cred_expire=600 It looks like the issue is related to credentials expiring while the job is being started. We're looking into whether we can handle that case better, but according to man slurm.conf: > AuthInfo > Additional information to be used for authentication of communications between the Slurm daemons (slurmctld and slurmd) > and the Slurm clients. The interpretation of this option is specific to the configured AuthType. Multiple options may be > specified in a comma delimited list. If not specified, the default authentication information will be used. > cred_expire Default job step credential lifetime, in seconds (e.g. "cred_expire=1200"). It must be sufficiently long > enough to load user environment, run prolog, deal with the slurmd getting paged out of memory, etc. This > also controls how long a requeued job must wait before starting again. The default value is 120 seconds. cheers, Marcin
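Once the daemons have picked up the change, a quick way to confirm the value in use is, for example:
> scontrol show config | grep -i AuthInfo
which should report cred_expire=600 from the controller's point of view.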
(In reply to Azat from comment #25) > (In reply to Nate Rini from comment #15) > > > [2019-06-25T11:29:54.898] error: Job allocate response msg send failure, killing JobId=616782 > > > [2019-06-25T11:31:35.326] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased > > > [2019-07-15T14:23:38.870] error: _find_node_record(756): lookup failure for dge1 > > > [2019-07-15T14:23:38.870] error: node_name2bitmap: invalid node specified dge1 > > Have there been changes to the node lineup in Slurm or possibly issues with > > your DNS server? > I have no clue where dge1 came from... I couldn't find it in any of the > config files. dge nodes are all with 3 digits in the name. And that one > should probably be dge001. Can you tell me please where node_name2bitmap > gets it? node_name2bitmap() matches hosts based on a REGEX, it's possible that a user requested this incorrectly. node_name2bitmap() is called from all over the codebase, so I can't pinpoint the cause without higher level of debug (debug2+) logs to catch the requesting RPC. > > > [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Invalid Protocol Version 8704 from uid=-1 at 10.108.96.104:45391 > > > [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Incompatible versions of client and server code > > Can you please check the node @10.108.96.104 to make sure it's Slurm version > > is within 2 releases of your controller. > Our installation of Slurm is on shared FS. 2019-06-25T13 is the time when we > updated the Slurm. Probably because of that, however it was still within 2 > release versions. We updated from 18.08. Version 8704 is SLURM_19_05_PROTOCOL_VERSION, so maybe a 17.11 (or older) node may have still been around when this log was taken. > > > error: slurm_jobcomp plugin context not initialized > > Was there a different jobcomp configuration to cause this error? It should > > never happen with your current none plugin: > > >JobCompType=jobcomp/none > I believe no. I should clarify it with my colleagues later. If you can find out what caused this error, please open a ticket about it. (In reply to Azat from comment #22) > > > [60983.extern] _oom_event_monitor: oom-kill event count: 1 > > There are good number of oom events on this node. I assume this is job > > triggered since the job cgroup appears to be correctly catching the oom > > events. Is it possible to see why this job keeps causing oom events? > It seems that indeed it causes orted segfault. But I have no idea how to > find out why the jobs cause oom events, might be misconfiguration of the job > by the user too. We are mainly concerned with the slurmstepd crashing, but an OOM event in the cgroup may take out the slurmstepd (that runs as the user) for that job although it should be less likely. Based on your backtrace, that is not happening here.
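If you want to try to catch the RPC that requests "dge1" without restarting anything, the controller's log level can be raised temporarily, e.g.:
> scontrol setdebug debug2
and lowered again with something like 'scontrol setdebug info' once the lookup failure has reappeared in the log. Be aware that debug2 is fairly chatty on a busy controller.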
(In reply to Marcin Stolarek from comment #26) Marcin, > Could you please add this configuration line to your slurm.conf: > >AuthInfo=cred_expire=600 > It looks like the issue is related to credentials expiring while the job is > being started. Thank you! Can you please tell me how you calculated the credential expiration value of 600 sec. (probably from some difference between credential checks in the logs)? Just in case 600 is not enough and we need to increase it in the future. And is "scontrol reconfigure" enough to activate that config line, or do the slurmd daemons also need to be restarted (I read about that in an old ticket here, #2125)? Cheers, Azat
Azat, It is not based on any calculation - it was just something much larger than the default, which is 120s. It looks like you're affected by the situation where the credential expires while it's still being processed. In this case, part of the job information is "guessed" (we're having an internal discussion on how to handle the case better). In many environments the whole "processing" happens within a second or a few seconds; depending on your system - the way the user environment is constructed, the contents of your prolog, and potential swapping of slurmd - it may take much more time. A full answer will require some investigation on your side - for instance, do you measure the time spent in the prolog? You'll have to restart slurmd for this configuration to be effective. cheers, Marcin
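If you don't measure it yet, a simple wrapper around the prolog is usually enough to get that number; a rough sketch (the path to your real prolog script is an assumption here, adjust it to your setup):
> #!/bin/bash
> # hypothetical wrapper configured as Prolog in slurm.conf: run the real prolog and log how long it took
> start=$(date +%s)
> /opt/slurm/etc/prolog.real "$@"
> rc=$?
> logger -t slurm-prolog "JobId=${SLURM_JOB_ID:-unknown} prolog took $(( $(date +%s) - start ))s rc=$rc"
> exit $rc
Comparing the logged durations (plus the time needed to build the user environment) against the cred_expire window should show how much margin you have.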
Azat, Could you please confirm that the configuration change was enough to eliminate the issue from the cluster? If it was, can we lower the bug report severity to 4? cheers, Marcin
I'm currently out of the office until 27.07.2019. If you need support with the GWDG scientific compute cluster, please contact hpc@gwdg.de. In all other cases, including HLRN related topics please contact hpc-team@gwdg.de. Best regards, Azat Khuziyakhmetov
Hi, (In reply to Marcin Stolarek from comment #32) > Could you please confirm that configuration change was enough to eliminate > the issue from the cluster? > If this is true, can we lower the bug report severity to 4? Sadly, no. The slurmctld just crashed again with: > [2019-07-26T09:32:39.238] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable > [2019-07-26T09:32:39.238] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable Best, Marcus
It looks like a different issue; I searched through the bugs and it appears you already reported this as Bug 7360. Do you still see jobs failing to start and then remaining pending with the "launch_failed_requeued_held" reason? cheers, Marcin
(In reply to Marcin Stolarek from comment #32) Marcin, > Could you please confirm that the configuration change was enough to eliminate > the issue from the cluster? Yes, no new cases have been detected according to the logs (I searched for "slurmd error running"). > If it was, can we lower the bug report severity to 4? Sure. Thank you once again for the help, cheers, Azat
Lowering the bug report severity to 4. The root cause was job credentials expiring while the job was starting.
Azat, The fix has landed for 20.11 (9fd790a83..7ca9c753fd2f6). Closing ticket, please respond if you have any questions or issues. Thanks, --Nate