Ticket 7423 - Invalid job cred causes slurmstepd to segfault
Summary: Invalid job cred causes slurmstepd to segfault
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 19.05.0
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-07-17 01:58 MDT by Azat
Modified: 2020-05-04 10:39 MDT
5 users

See Also:
Site: GWDG
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 20.11pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Azat 2019-07-17 01:58:05 MDT
Dear Slurm team,

This ticket is probably a spinoff of #6769. I just want to note here that it causes some trouble on this particular version.

After migrating to version 19.05.0 we get a lot of the following errors:
> [2019-07-17T07:24:37.268] error: select/cons_res: node gwdd034 memory is under-allocated (25600-32000) for JobId=751905
> [2019-07-17T07:24:37.270] error: select/cons_res: node gwdd029 memory is under-allocated (38400-51200) for JobId=751223
> [2019-07-17T07:24:37.273] error: select/cons_res: node gwdd034 memory is under-allocated (25600-32000) for JobId=751905
> [2019-07-17T07:24:37.275] error: select/cons_res: node gwdd029 memory is under-allocated (38400-51200) for JobId=751223

This leaves the jobs pending with the launch_failed_requeued_held reason, and using "scontrol release" often doesn't help because the jobs run into the same error again.

For users whose jobs take days to be scheduled, this error causes a lot of trouble.
 
These messages also increase the log volume dramatically (the error rate is dozens of such messages per millisecond while jobs are starting).

Is there any workaround to avoid the "memory is under-allocated" errors, or should we wait for a patch?
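
For reference, a rough sketch of how we release such jobs in bulk (the reason string is the one squeue prints for %r):
> squeue -t PD -h -o "%i %r" | awk '$2 == "launch_failed_requeued_held" {print $1}' | xargs -r -n1 scontrol release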
Comment 1 Nate Rini 2019-07-17 09:14:46 MDT
Azat,

Can you please provide the whole output of 'scontrol show job $JOBID' for one of these failed jobs? (You can remove user IDs if required.)

The "launch failed requeued held" reason should only be set on a requeued job that failed to launch. Is it possible to get the slurmd logs from the head node of the job too?

Thanks,
--Nate
Comment 2 Nate Rini 2019-07-17 09:20:04 MDT
Please also attach an updated slurm.conf and the slurmctld log.
Comment 3 Azat 2019-07-18 01:24:33 MDT
(In reply to Nate Rini from comment #1)
> Azat,
> 
> Can you please provide whole output of 'scontrol show job $JOBID' of one of
> these failed jobs? (You can remove user ids if required).

Hi Nate,
Here is the job info for one such job of user YYY with user ID XXX:

gwdu105:0 09:17:06 ~ # scontrol show job 739867
JobId=739867 JobName=rasrec
   UserId=YYY(XXX) GroupId=MBPC(9100) MCS_label=N/A
   Priority=0 Nice=0 Account=all QOS=normal
   JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null)
   Requeue=1 Restarts=4 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-07-16T16:05:37 EligibleTime=2019-07-16T16:07:38
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-07-16T16:04:24
   Partition=medium-fmz,medium-fas AllocNode:Sid=gwdu103.global.gwdg.cluster:16538
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=gwdd002
   NumNodes=66 NumCPUs=256 NumTasks=256 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=256,mem=800G,node=1,billing=256
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3200M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/mpg08/YYY/rosetta/rasrec_runs/Syrb1_CS_BMRB/01_rasrec_w_abmem/production.slurm.job
   WorkDir=/home/mpg08/YYY/rosetta/rasrec_runs/Syrb1_CS_BMRB/01_rasrec_w_abmem
   StdErr=/home/mpg08/YYY/rosetta/rasrec_runs/Syrb1_CS_BMRB/01_rasrec_w_abmem/%J.err
   StdIn=/dev/null
   StdOut=/home/mpg08/YYY/rosetta/rasrec_runs/Syrb1_CS_BMRB/01_rasrec_w_abmem/%J.out
   Power=
Comment 4 Azat 2019-07-18 01:41:29 MDT
Created attachment 10944 [details]
The relevant part of slurmd logs for the job

> Is it possible to get the slurmd logs of the head node of job too?
Here is the relevant part of slurmd logs from the head node of the job.
Comment 5 Marcin Stolarek 2019-07-18 02:12:35 MDT
Azat, 

Could you please also attach the slurmctld log and slurm.conf? Do you mind sharing the whole slurmd log?


Could you please also find the last occurrence of "under-allocated" in the logs, note the NODENAME where it happened, and then execute and share the results of the command below with us:
>squeue  -w NODENAME -o %i -h | xargs  scontrol show job -d

cheers,
Marcin
Comment 6 Azat 2019-07-18 02:21:46 MDT
Created attachment 10946 [details]
slurm.conf

> Please also attach an updated slurm.conf and the slurmctld log.
I have attached the slurm.conf file along with several included configs for nodes and partitions.

Since our slurmctld log file is still immense, I cannot attach it here. Are you looking for any particular errors?
Here are some errors with that JOBID:

There are tons of:
> [2019-07-16T16:05:10.236] error: select/cons_res: node gwdd070 memory is under-allocated (999-12800) for JobId=739867
> [2019-07-16T16:05:10.236] error: select/cons_res: node gwdd100 memory is under-allocated (999-12800) for JobId=739867
> [2019-07-16T16:05:10.236] error: select/cons_res: node gwdd106 memory is under-allocated (999-12800) for JobId=739867
> [2019-07-16T16:05:10.236] error: select/cons_res: node gwdd108 memory is under-allocated (999-12800) for JobId=739867
> [2019-07-16T16:05:10.236] error: select/cons_res: node gwdd113 memory is under-allocated (999-12800) for JobId=739867
more precisely 37174 errors!

Here is the output of the logs where the JobID is mentioned but without under-allocated messages:
> gwdu105:8 10:12:48 ~ # cat /var/log/slurm/slurmctld.log | grep "739867" | grep -v "under-allocated" > /tmp/slurmctld.739867.log
> [2019-07-11T14:28:34.537] _slurm_rpc_submit_batch_job: JobId=739867 InitPrio=768 usec=30201
> [2019-07-11T21:11:42.655] Recovered JobId=739867 Assoc=1169
> [2019-07-11T23:31:44.940] backfill: Started JobId=739867 in medium-fmz on gwdd[002,004-005,010,020,050-051,053-055,057-058,060-062,064-066,071-078,080-085,091,094,096-097]
> [2019-07-11T23:31:45.350] prolog_running_decr: Configuration for JobId=739867 is complete
> [2019-07-11T23:31:45.614] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-11T23:31:45.614] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-11T23:31:45.617] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-11T23:31:45.617] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-11T23:31:45.617] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-11T23:31:45.617] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-11T23:31:45.618] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-11T23:31:45.619] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-11T23:31:45.620] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-11T23:31:45.622] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-11T23:32:56.244] cleanup_completing: JobId=739867 completion process took 71 seconds
> [2019-07-11T23:32:56.244] Requeuing JobId=739867
> [2019-07-12T14:12:38.622] sched: _release_job_rec: release hold on JobId=739867 by uid 0
> [2019-07-12T14:12:38.623] _slurm_rpc_update_job: complete JobId=739867 uid=0 usec=706
> [2019-07-12T14:14:32.679] backfill: Started JobId=739867 in medium-fmz on gwdd[027-028,030-031,033-035,037,039,042-043,045-048,076,078,082-083,085,087-094,097,166]
> [2019-07-12T14:14:34.225] prolog_running_decr: Configuration for JobId=739867 is complete
> [2019-07-12T14:14:34.466] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:14:34.468] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:14:34.469] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:14:34.516] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:14:34.517] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:14:34.517] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:14:34.519] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:15:45.147] cleanup_completing: JobId=739867 completion process took 71 seconds
> [2019-07-12T14:15:45.147] Requeuing JobId=739867
> [2019-07-12T14:33:02.667] sched: _release_job_rec: release hold on JobId=739867 by uid 0
> [2019-07-12T14:33:02.667] _slurm_rpc_update_job: complete JobId=739867 uid=0 usec=561
> [2019-07-12T14:34:24.022] sched: Allocate JobId=739867 NodeList=gwdd[028,030-031,033-035,037,042-043,045-048,076,078,082-083,085,087-091,093-094,097,104] #CPUs=256 Partition=medium-fmz
> [2019-07-12T14:34:24.255] prolog_running_decr: Configuration for JobId=739867 is complete
> [2019-07-12T14:34:24.417] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:34:24.418] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:34:24.423] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:34:24.424] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:34:24.428] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-12T14:35:35.224] cleanup_completing: JobId=739867 completion process took 71 seconds
> [2019-07-12T14:35:35.225] Requeuing JobId=739867
> [2019-07-16T14:03:46.317] sched: _release_job_rec: release hold on JobId=739867 by uid 0
> [2019-07-16T14:03:46.318] _slurm_rpc_update_job: complete JobId=739867 uid=0 usec=660
> [2019-07-16T15:44:34.701] Recovered JobId=739867 Assoc=1169
> [2019-07-16T16:04:24.164] backfill: Started JobId=739867 in medium-fmz on gwdd[002-003,005-006,008,013-014,018-021,023-024,051-052,054-057,060,062-066,068-070,072-083,085,087-093,095,097,099-102,104-106,108,111-115,117-118,170]
> [2019-07-16T16:04:26.014] prolog_running_decr: Configuration for JobId=739867 is complete
> [2019-07-16T16:04:26.649] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.650] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.650] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.651] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.651] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.651] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.652] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.653] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.655] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.657] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.659] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.660] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.663] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.663] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.666] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.669] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.703] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.704] slurmd error running JobId=739867 from node(s)=gwdd002: Job credential revoked
> [2019-07-16T16:04:26.704] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.705] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:04:26.707] _slurm_rpc_requeue: 739867: Job is pending execution
> [2019-07-16T16:05:37.482] cleanup_completing: JobId=739867 completion process took 71 seconds
> [2019-07-16T16:05:37.482] Requeuing JobId=739867
> [2019-07-17T13:42:52.614] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=739867 uid ZZZ
> [2019-07-17T13:42:52.616] error: Security violation, REQUEST_KILL_JOB RPC for JobId=739867 from uid ZZZ
> [2019-07-17T13:42:52.616] _slurm_rpc_kill_job: job_str_signal() JobId=739867 sig 9 returned Access/permission denied
Comment 7 Azat 2019-07-18 02:49:06 MDT
Created attachment 10947 [details]
slurmd.log full

(In reply to Marcin Stolarek from comment #5)
> Azat, 
> 
> Do you mind sharing whole slurmd log?
Hi Marcin,

I attached the full version of slurmd log (masked UIDs and usernames). 

> Could you please find also last occurence of "under-allocated" in logs and
> check NODENAME where it happened and then execute and share results of the
> command below with us: 
> >squeue  -w NODENAME -o %i -h | xargs  scontrol show job -d
Yes, here is the output:
> gwdu105:0 10:46:58 ~ # squeue  -w gwdd016 -o %i -h | xargs  scontrol show job -d
> JobId=754890 JobName=I10-117-1
>    UserId=YYY(XXX) GroupId=UCPC(13400) MCS_label=N/A
>    Priority=925 Nice=0 Account=all QOS=normal
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:0
>    RunTime=00:21:08 TimeLimit=05:00:00 TimeMin=N/A
>    SubmitTime=2019-07-16T09:34:37 EligibleTime=2019-07-16T09:34:37
>    AccrueTime=2019-07-16T09:34:37
>    StartTime=2019-07-18T10:26:08 EndTime=2019-07-18T15:26:09 Deadline=N/A
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-07-18T10:26:08
>    Partition=medium-fmz AllocNode:Sid=gwdu102.global.gwdg.cluster:28785
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=gwdd016
>    BatchHost=gwdd016
>    NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=20,mem=62.50G,node=1,billing=20
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>      Nodes=gwdd016 CPU_IDs=0-19 Mem=64000 GRES=
>    MinCPUsNode=1 MinMemoryCPU=3200M MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=(null)
>    WorkDir=/home/uni08/YYY/Promotion-MH-gwdg/PBE0-structures/IRMOF-10-250K/117/1-pbe0
>    StdErr=/usr/users/YYY/stdout+stderr/out.%J
>    StdIn=/dev/null
>    StdOut=/usr/users/YYY/stdout+stderr/out.%J
>    Power=
Comment 8 Marcin Stolarek 2019-07-18 04:34:23 MDT
Azat, 

Could you please try to compress the slurmctld log?
Could you please attach the results of the following commands from the node:
> free
> dmesg


From the attached log it looks like you're running version 18.08.06; is that correct? (In the bug report form you mentioned version 19.05.0. Just to be sure: did you attempt an upgrade, or is the version in the form just a mistake?)

cheers,
Marcin
Comment 9 Marcus Boden 2019-07-18 05:21:30 MDT
(In reply to Marcin Stolarek from comment #8)

Hi Marcin 
 
> From the attached log it looks like you're running version 18.08.06 is that
> correct? (In the bug report form you mentioned version 19.05.0. Just to be
> sure did you attempt to upgrade or version in the form is just a mistake?)

The version is 19.05, but we did not set up logrotate. Scroll down a bit to 2019-06-25; there's the restart with 19.05.

(I'll set up logrotate now so that this doesn't happen again...)

Cheers,
Marcus
Comment 10 Azat 2019-07-18 07:06:43 MDT
Created attachment 10949 [details]
slurmctld.log

(In reply to Marcin Stolarek from comment #8)
> Could you please try to compress slurmctld log?  
Yes, I have attached the last 10M lines of it.
> Could you please attach results of the following commands from the node:
> > free
> > dmesg
Sure:
> gwdd016:0 15:02:13 ~ # free
>               total        used        free      shared  buff/cache   available
> Mem:       65941536    32026920    25638872      744976     8275744    31685136
> Swap:      12582908      615504    11967404
and
> gwdd016:0 15:02:14 ~ # dmesg -T
> ...
> [Thu Jul 11 15:43:50 2019] CVFS: Lost connection to Disk Proxy Server at 10.108.96.167 port 55334
> [Thu Jul 11 15:43:50 2019] Connected to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334
> [Thu Jul 11 15:44:00 2019] Reconnecting to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334
> [Thu Jul 11 15:44:00 2019] Connected to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334
> [Thu Jul 11 15:44:58 2019] SNFS (info) proxy read timeout changed:576 -> 864
> [Thu Jul 11 15:49:56 2019] Reconnecting to FSS 'UNI08'
> [Thu Jul 11 15:49:56 2019] StorNext mounted file system UNI08: No FSS registered with PortMapper on host 10.108.96.161, retrying...
> [Thu Jul 11 15:50:13 2019] Reconnecting to FSS 'UNI08'
> [Thu Jul 11 15:50:24 2019] Reconnecting to FSS 'UNI08'
> [Thu Jul 11 15:50:35 2019] Reconnecting to FSS 'UNI08'
> [Thu Jul 11 15:50:46 2019] Reconnecting to FSS 'UNI08'
> [Thu Jul 11 15:50:57 2019] Reconnecting to FSS 'UNI08'
> [Thu Jul 11 15:50:57 2019] Reconnect successful to FSS 'UNI08' on host '10.108.96.160'.
> [Thu Jul 11 15:50:57 2019] Using v3 readdir for 'UNI08'
> [Thu Jul 11 15:51:10 2019] All partition groups for filesystem 'UNI08' online.
> [Thu Jul 11 17:20:08 2019] SNFS (info) proxy write timeout changed:768 -> 1153
> [Fri Jul 12 10:03:26 2019] Reconnecting to FSS 'MPG08'
> [Fri Jul 12 10:03:26 2019] StorNext mounted file system MPG08: No FSS registered with PortMapper on host 10.108.96.161, retrying...
> [Fri Jul 12 10:03:41 2019] fs 'MPG08': FSS on host 10.108.96.160 not currently accessible (possibly stopped administratively). Thread 0xffff88085b906eb0 pid 15061, still trying.
> [Fri Jul 12 10:03:51 2019] Reconnecting to FSS 'MPG08'
> [Fri Jul 12 10:04:05 2019] Reconnect successful to FSS 'MPG08' on host '10.108.96.160'.
> [Fri Jul 12 10:04:05 2019] Using v3 readdir for 'MPG08'
> [Fri Jul 12 22:47:52 2019] ib0: Budget exhausted after napi rescheduled
> [Sun Jul 14 04:43:46 2019] SNFS (info) proxy read timeout changed:15 -> 22
> [Sun Jul 14 14:38:55 2019] SNFS (info) proxy read timeout changed:22 -> 33
> [Sun Jul 14 20:01:17 2019] orted[29853]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb4c0 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000]
> [Mon Jul 15 16:23:20 2019] Reconnecting to FSS 'UNI08'
> [Mon Jul 15 16:23:20 2019] StorNext mounted file system UNI08: No FSS registered with PortMapper on host 10.108.96.160, retrying...
> [Mon Jul 15 16:23:34 2019] fs 'UNI08': FSS on host 10.108.96.161 not currently accessible (possibly stopped administratively). Thread 0xffff88105cf24f10 pid 14767, still trying.
> [Mon Jul 15 16:23:40 2019] Reconnect successful to FSS 'UNI08' on host '10.108.96.161'.
> [Mon Jul 15 16:23:40 2019] Using v3 readdir for 'UNI08'
> [Mon Jul 15 17:06:19 2019] orted[1204]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb480 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000]
> [Tue Jul 16 09:45:12 2019] Reconnecting to FSS 'UNI05'
> [Tue Jul 16 09:45:12 2019] StorNext mounted file system UNI05: No FSS registered with PortMapper on host 10.108.96.161, retrying...
> [Tue Jul 16 09:45:24 2019] fs 'UNI05': FSS on host 10.108.96.160 not currently accessible (possibly stopped administratively). Thread 0xffff8808583d8000 pid 14615, still trying.
> [Tue Jul 16 09:45:34 2019] Reconnecting to FSS 'UNI05'
> [Tue Jul 16 09:45:54 2019] Reconnecting to FSS 'UNI05'
> [Tue Jul 16 09:45:54 2019] Reconnect successful to FSS 'UNI05' on host '10.108.96.160'.
> [Tue Jul 16 09:45:54 2019] Using v3 readdir for 'UNI05'
> [Tue Jul 16 09:50:32 2019] Reconnecting to FSS 'UNI05'
> [Tue Jul 16 09:50:32 2019] StorNext mounted file system UNI05: No FSS registered with PortMapper on host 10.108.96.160, retrying...
> [Tue Jul 16 09:50:43 2019] Reconnect successful to FSS 'UNI05' on host '10.108.96.161'.
> [Tue Jul 16 09:50:43 2019] Using v3 readdir for 'UNI05'
> [Tue Jul 16 15:48:23 2019] nfs: server fs-base.fs.gwdg.de not responding, still trying
> [Tue Jul 16 15:50:59 2019] nfs: server fs-base.fs.gwdg.de not responding, still trying
> [Tue Jul 16 15:54:19 2019] nfs: server fs-base.fs.gwdg.de OK
> [Tue Jul 16 15:54:19 2019] nfs: server fs-base.fs.gwdg.de OK
> [Tue Jul 16 16:02:21 2019] Reconnecting to FSS 'UNI05'
> [Tue Jul 16 16:02:21 2019] No connections found for any Disk Proxy Server
> [Tue Jul 16 16:02:21 2019] Stripe group '__VACANT_SG__12' is VACANT on filesystem UNI05 (was DOWN)
> [Tue Jul 16 16:02:21 2019] Reconnect successful to FSS 'UNI05' on host '10.108.96.161'.
> [Tue Jul 16 16:02:21 2019] Using v3 readdir for 'UNI05'
> [Tue Jul 16 16:02:25 2019] All partition groups for filesystem 'UNI05' online.
> [Tue Jul 16 17:59:36 2019] slurmstepd[25388]: segfault at 42 ip 0000000000413c54 sp 00007fffffffea20 error 4 in slurmstepd[400000+37000]
> [Tue Jul 16 19:37:48 2019] Reconnecting to FSS 'UNI05'
> [Tue Jul 16 19:37:48 2019] No connections found for any Disk Proxy Server
> [Tue Jul 16 19:37:48 2019] Stripe group '__VACANT_SG__15' is VACANT on filesystem UNI05 (was UP)
> [Tue Jul 16 19:37:48 2019] Reconnect successful to FSS 'UNI05' on host '10.108.96.161'.
> [Tue Jul 16 19:37:48 2019] Using v3 readdir for 'UNI05'
> [Tue Jul 16 19:38:45 2019] All partition groups for filesystem 'UNI05' online.
> [Wed Jul 17 18:21:28 2019] orted[19560]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb500 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000]
Comment 11 Nate Rini 2019-07-18 11:26:11 MDT
The 'memory is under-allocated' error is a scheduler error. It shouldn't cause a job to fail at launch with the 'launch failed requeued held' reason; it should only delay job launch until the memory allocations are corrected (which happens when you see the under-allocated error).

(In reply to Azat from comment #10)
> > [Thu Jul 11 15:43:50 2019] CVFS: Lost connection to Disk Proxy Server at 10.108.96.167 port 55334
> > [Thu Jul 11 15:43:50 2019] Connected to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334
> > [Thu Jul 11 15:44:00 2019] Reconnecting to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334
> > [Thu Jul 11 15:44:00 2019] Connected to Disk Proxy Server 10.108.96.167 via 10.108.96.167 port 55334
> > [Thu Jul 11 15:44:58 2019] SNFS (info) proxy read timeout changed:576 -> 864
Is Slurm (or any component of Slurm) installed on this filesystem that appears to be reconnecting quite a bit?

(In reply to Azat from comment #10)
> > [Mon Jul 15 17:06:19 2019] orted[1204]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb480 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000]
> > [Tue Jul 16 17:59:36 2019] slurmstepd[25388]: segfault at 42 ip 0000000000413c54 sp 00007fffffffea20 error 4 in slurmstepd[400000+37000]
> > [Wed Jul 17 18:21:28 2019] orted[19560]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb500 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000]
Are there any core files from the orted and/or slurmstepd segfaults? If so, can you please provide 'thread apply all bt full' from them? This is more in line with what I would expect to cause the 'launch failed requeued held' error.

If there are no cores, is it possible to run gdb against slurmd with 'set follow-fork-mode child' to grab the backtrace when they SEGFAULT?

Can you please call the following on gwdd016 as root:
> cat /proc/mounts
> ldd $(which slurmd)
> gdb $(which slurmd)
>> b main
>> r
>> info files
>> info symbol 0x0000000000413c54
>> info symbol 0x00002aaaac3330b8
>> info symbol 0x00007fffffffb500
>> info symbol 0x2aaaac324000
(I don't expect all the info symbols to succeed if you have ASLR active).
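For the follow-fork approach, a minimal session could look like the following (just a sketch; slurmd also forks for other work, so it may take a few attempts before the followed child is the slurmstepd that crashes):
> gdb -p $(pidof slurmd)
> (gdb) set follow-fork-mode child
> (gdb) continue
> # when gdb reports the SIGSEGV in the followed child:
> (gdb) thread apply all bt full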
Comment 12 Nate Rini 2019-07-18 11:38:00 MDT
(In reply to Azat from comment #4)
> Created attachment 10944 [details]
> The relevant part of slurmd logs for the job

> [2019-07-11T23:31:45.358] error: Credential signature check: Credential data size mismatch
> [2019-07-11T23:31:45.358] error: _convert_job_mem: slurm_cred_verify failed: Invalid job credential
This is an unexpected error (ESIG_BUF_SIZE_MISMATCH) with your packets that will need to be addressed. Do you have the same munge version installed on all nodes? Is it possible that your packets are getting corrupted but not being caught by the normal TCP CRC checks? Can you please verify that Slurm is linking (via ldd) against the same munge on all nodes? Is this error seen on other nodes, or is it isolated to this node (possibly due to bad memory)?
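A sketch of what to compare across a few nodes (the plugin paths are an assumption based on your /opt/slurm prefix; libmunge is linked by Slurm's munge plugins rather than by the slurmd binary itself):
> rpm -q munge munge-libs
> # the munge library is pulled in by the plugins, not slurmd itself:
> ldd /opt/slurm/lib/slurm/auth_munge.so /opt/slurm/lib/slurm/crypto_munge.so | grep libmunge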

> [739542.extern] _oom_event_monitor: oom-kill event count: 1
This error may explain why orted failed; there should be additional information in the node's syslog/journalctl logs. Can you please provide those logs?
Comment 13 Nate Rini 2019-07-18 11:44:25 MDT
(In reply to Azat from comment #7)
> Created attachment 10947 [details]
> slurmd.log full

> [2019-04-12T09:42:40.654] TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology
Is this intentional? Jobs cannot cross unconnected nodes in a topology; please see the doc changes: https://github.com/SchedMD/slurm/commit/acea46d8402c7d99916d9b8121ba5c38fa613ae1

> [60983.extern] _oom_event_monitor: oom-kill event count: 1
There are a good number of OOM events on this node. I assume these are job triggered, since the job cgroup appears to be correctly catching the OOM events. Is it possible to see why this job keeps causing OOM events?
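A sketch of where to look on the node (the cgroup path assumes cgroup v1 with Slurm's default memory hierarchy; the uid/job IDs are placeholders):
> # the kernel log shows which process the OOM killer hit and the cgroup limit involved
> dmesg -T | grep -iE 'oom|killed process'
> # peak usage recorded for a running job's memory cgroup
> cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.max_usage_in_bytes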
Comment 14 Nate Rini 2019-07-18 11:48:23 MDT
(In reply to Azat from comment #7)
> I attached the full version of slurmd log (masked UIDs and usernames). 
Did you use a script to do this or was it manually done? I also marked all of your attachments as private just in case something was missed.
Comment 15 Nate Rini 2019-07-18 12:12:08 MDT
(In reply to Azat from comment #10)
> Created attachment 10949 [details]
> slurmctld.log
Looking at this log file:

> [2019-06-25T11:29:54.898] error: Job allocate response msg send failure, killing JobId=616782
> [2019-06-25T11:31:35.326] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased
> [2019-07-15T14:23:38.870] error: _find_node_record(756): lookup failure for dge1
> [2019-07-15T14:23:38.870] error: node_name2bitmap: invalid node specified dge1
Have there been changes to the node lineup in Slurm or possibly issues with your DNS server?

> [2019-06-25T12:41:24.950] error: slurm_persist_conn_open_without_init: failed to open persistent connection to gwdu104:6819: Connection refused
What is the status of your backup controller currently?

> [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Invalid Protocol Version 8704 from uid=-1 at 10.108.96.104:45391
> [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Incompatible versions of client and server code
Can you please check the node @10.108.96.104 to make sure its Slurm version is within 2 releases of your controller?

> [2019-06-25T13:57:43.130] error: _slurm_rpc_node_registration node=dmp008: Invalid argument
> [2019-06-25T13:57:44.399] error: Node dmp008 has low real_memory size (112699 < 128000)
> [2019-06-25T13:58:43.016] error: Node dmp008 appears to have a different slurm.conf than the slurmctld.
Looks like this node lost a DIMM; please make sure to update its slurm.conf or turn off the Slurm daemons on it to avoid the log spam.
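To compare the two values involved (a sketch):
> # what the node actually detects (run on dmp008):
> slurmd -C
> # what the controller currently has configured/registered for it:
> scontrol show node dmp008 | grep -i RealMemory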

> error: slurm_jobcomp plugin context not initialized
Was there a different jobcomp configuration to cause this error? It should never happen with your current none plugin:
>JobCompType=jobcomp/none

> error: job_str_signal: Security violation JOB_CANCEL RPC for JobId=759301_* from uid XXX
This user appears to have tried to kill all jobs; maybe a conversation with them is warranted.

> [2019-07-16T13:47:50.447] error: job_submit/lua: /opt/slurm/etc/job_submit.lua: [string "slurm.user_msg (string.format(unpack({...})))"]:1: bad argument #1 to 'format' (string expected, got no value)
Has this script bug been resolved? (I doubt it is causing the current issues, though.)

> [2019-07-11T20:17:39.703] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
> [2019-07-11T20:17:39.703] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
Looks like the slurmctld node ran out of memory. Has this issue been resolved?

> [2019-07-11T21:13:33.928] error: Registered JobId=714588 StepId=Batch on wrong node gwdd084
> [2019-07-11T21:13:33.928] error: Registered JobId=714588 StepId=4294967295 on wrong node gwdd084
> [2019-07-11T21:13:34.869] error: Registered JobId=714589 StepId=Batch on wrong node gwdd057
> [2019-07-11T21:13:34.869] error: Registered JobId=714589 StepId=4294967295 on wrong node gwdd057
> [2019-07-11T21:13:34.897] error: Registered JobId=714590 StepId=Batch on wrong node gwdd016
> [2019-07-11T21:13:34.897] error: Registered JobId=714590 StepId=4294967295 on wrong node gwdd016
I see a lot of errors along these lines. Can you please verify that there isn't something odd going on with the DNS or node hostnames where there are duplicate nodes getting the same hostname or IP?
Comment 16 Azat 2019-07-19 02:12:13 MDT
(In reply to Nate Rini from comment #11)
> The error 'memory is under-allocated' is a scheduler error. It shouldn't
> cause a job failure at launch with this error 'launch failed requeued held'
> but instead should only delay job launch until the memory allocations are
> corrected (which happens when you see the under-allocated error).
Sounds reasonable. Probably 2 separate issues. 

> > > [Thu Jul 11 15:43:50 2019] CVFS: Lost connection to Disk Proxy Server at 10.108.96.167 port 55334
> > > ...
> Is Slurm (or any component of Slurm) on this filesystem that appears to be
> reconnecting quite a bit?
No, Slurm is located on NFS, where we don't get such connection problems. 

> > > ...
> > > [Tue Jul 16 17:59:36 2019] slurmstepd[25388]: segfault at 42 ip 0000000000413c54 sp 00007fffffffea20 error 4 in slurmstepd[400000+37000]
> > > ...
> Are there any cores from orted and/or slurmstepd segfaulting, can you please
> provide 'thread apply all bt full' from them? This is more in line with what
> I would expect to cause 'launch failed requeued held' error. 
Agreed, that is probably the cause of the failed launch. I just hope we can debug the cause of the segfault. There are 2 identical cores from slurmstepd. Here is the one from the logs (25388):

> gwdd016:16 09:55:41 /tmp # gdb /opt/slurm/sbin/slurmstepd core.25388 
> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /opt/lsf/slurm/slurm/19.05.0/install/sbin/slurmstepd...done.
> [New LWP 25388]
> [New LWP 25407]
> [New LWP 25408]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `slurmstepd: [756929.extern]                     '.
> Program terminated with signal 11, Segmentation fault.
> #0  0x0000000000413c54 in _job_init_task_info (efname=0x648590 "/dev/null", ofname=0x646ec0 "/dev/null", ifname=0x658bc0 "/dev/null", gtid=0x64b990, job=0x645bc0)
>     at ../../../../source/src/slurmd/slurmstepd/slurmstepd_job.c:164
> 164						     gtid[node_id][i] + pack_offset,
> Missing separate debuginfos, use: debuginfo-install audit-libs-2.6.5-3.el7.x86_64 glibc-2.17-260.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 pam-1.1.8-18.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
> (gdb) thread apply all bt full
> 
> Thread 3 (Thread 0x2aaaae740700 (LWP 25408)):
> #0  0x00002aaaabb2cd12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> No symbol table info available.
> #1  0x00002aaaaadaa570 in _timer_thread (args=<optimized out>) at ../../../source/src/common/slurm_acct_gather_profile.c:205
>         err = <optimized out>
>         i = 4
>         now = 1563293060
>         diff = <optimized out>
>         tvnow = {tv_sec = 1563293060, tv_usec = 83769}
>         abs = {tv_sec = 1563293061, tv_nsec = 83769000}
>         __func__ = "_timer_thread"
> #2  0x00002aaaabb28dd5 in start_thread () from /lib64/libpthread.so.0
> No symbol table info available.
> #3  0x00002aaaabe3aead in clone () from /lib64/libc.so.6
> No symbol table info available.
> 
> Thread 2 (Thread 0x2aaaaabf4700 (LWP 25407)):
> #0  0x00002aaaabb2c965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> No symbol table info available.
> #1  0x00002aaaaadb0be3 in _watch_tasks (arg=<optimized out>) at ../../../source/src/common/slurm_jobacct_gather.c:366
>         err = <optimized out>
>         __func__ = "_watch_tasks"
> #2  0x00002aaaabb28dd5 in start_thread () from /lib64/libpthread.so.0
> No symbol table info available.
> #3  0x00002aaaabe3aead in clone () from /lib64/libc.so.6
> No symbol table info available.
> 
> Thread 1 (Thread 0x2aaaaaaf2480 (LWP 25388)):
> #0  0x0000000000413c54 in _job_init_task_info (efname=0x648590 "/dev/null", ofname=0x646ec0 "/dev/null", ifname=0x658bc0 "/dev/null", gtid=0x64b990, job=0x645bc0)
>     at ../../../../source/src/slurmd/slurmstepd/slurmstepd_job.c:164
>         i = 0
>         err = <optimized out>
>         pack_offset = 0
>         node_id = 8
>         in = <optimized out>
>         out = <optimized out>
> #1  stepd_step_rec_create (msg=msg@entry=0x64f260, protocol_version=<optimized out>) at ../../../../source/src/slurmd/slurmstepd/slurmstepd_job.c:498
>         job = 0x645bc0
>         srun = <optimized out>
>         resp_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}
>         io_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}
>         i = <optimized out>
>         nodeid = <optimized out>
>         __func__ = "stepd_step_rec_create"
> #2  0x000000000040f134 in mgr_launch_tasks_setup (msg=0x64f260, cli=cli@entry=0x645530, self=self@entry=0x64b9b0, protocol_version=<optimized out>) at ../../../../source/src/slurmd/slurmstepd/mgr.c:206
>         job = 0x0
> #3  0x000000000040cca8 in _step_setup (msg=0x64cb90, self=0x64b9b0, cli=0x645530) at ../../../../source/src/slurmd/slurmstepd/slurmstepd.c:677
>         job = 0x0
> #4  main (argc=1, argv=0x7fffffffed78) at ../../../../source/src/slurmd/slurmstepd/slurmstepd.c:132
>         cli = 0x645530
>         self = 0x64b9b0
>         msg = 0x64cb90
>         rc = 0
>         launch_params = 0x42b30d <__libc_csu_init+77> "H\203\303\001H9\353u\352H\203\304\b[]A\\A]A^A_Ðf.\017\037\204"
>         __func__ = "main"
> (gdb) 
Comment 17 Marcin Stolarek 2019-07-19 06:20:19 MDT
Azat,

Could you please load the core dump one more time and execute the following commands in gdb:
(gdb) set print pretty on
(gdb) print *job
(gdb) f 1
(gdb) print *msg
(gdb) print *conf


It looks like the issue may be coming from node name resolution. Are you sure that you don't have some DNS (/etc/hosts) inconsistency?

cheers,
Marcin
Comment 18 Azat 2019-07-19 06:59:57 MDT
Created attachment 10959 [details]
gdb long output

(In reply to Marcin Stolarek from comment #17)
> Could you please load the core dump one more time and execute following
> commands in gdb:
> (gdb) set print pretty on
> (gdb) print *job
> (gdb) f 1
> (gdb) print *msg
> (gdb) print *conf
Yes, see the attachment.
 
> It looks like the issue may be coming from node name resolution. Are you
> sure that you don't have some DNS (/etc/hosts) inconsistency?
No problems that we are aware of... DNS seems to be fine.
Comment 19 Azat 2019-07-19 07:38:08 MDT
(In reply to Nate Rini from comment #12)
> > [2019-07-11T23:31:45.358] error: Credential signature check: Credential data size mismatch
> > [2019-07-11T23:31:45.358] error: _convert_job_mem: slurm_cred_verify failed: Invalid job credential
> This is an unexpected error (ESIG_BUF_SIZE_MISMATCH) with your packets that
> will need to be addressed. Do you have the same munge version installed on
> all nodes? 
Yes, all of them have the same version.

> Is is possible that your packets are getting corrupted but not
> being caught by the normal TCP crc checks?
I think it is unlikely. 

> Can you please verify that slurm
> is linking (via ldd) to the same munge on all nodes. 
I didn't find a binary from Slurm that links to libmunge, but Slurm is installed on a shared FS, so whatever they link against, it is the same on all nodes :)

> Is this error seen on
> other nodes or is it isolated to this node (possibly due to bad memory)?
We didn't inspect that error. 
 
Also, at around 21:00 that day the master control daemon crashed and the secondary took over (that relates to one of your comments above, and there is also bug #7360 open for it). Might the error be because of that?

> > [739542.extern] _oom_event_monitor: oom-kill event count: 1
> This error may explain why orted failed, there should be addition
> information in the nodes syslogs/journalctl logs. Can you please provide
> those logs?
not much there: 
> Jul 15 17:11:00 gwdd016 kernel: orted[1204]: segfault at 2aaaaeda2970 ip 00002aaaac3330b8 sp 00007fffffffb480 error 4 in libgcc_s-4.8.5-20150702.so.1[2aaaac324000+15000]
Comment 20 Marcin Stolarek 2019-07-19 07:43:45 MDT
Azat, 

Could you please check if the time on the host where slurmctld runs is in sync with the nodes?

cheers,
Marcin
Comment 22 Azat 2019-07-19 08:06:19 MDT
(In reply to Nate Rini from comment #13)
> > [2019-04-12T09:42:40.654] TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology
> Is intentional? Jobs cannot cross unconnected nodes in a topology
Yes, it is intentional; we have 2 sites where the nodes are located. Instead of making 2 clusters, for convenience we created 2 partitions, one per site, and the corresponding switch config in Slurm. Jobs should not run on nodes from the 2 sites simultaneously.

> > [60983.extern] _oom_event_monitor: oom-kill event count: 1
> There are good number of oom events on this node. I assume this is job
> triggered since the job cgroup appears to be correctly catching the oom
> events. Is it possible to see why this job keeps causing oom events?
It seems that the OOM event does indeed cause the orted segfault. But I have no idea how to find out why the jobs cause OOM events; it might also be a misconfiguration of the job by the user.
Comment 23 Azat 2019-07-19 08:08:36 MDT
(In reply to Marcin Stolarek from comment #20)
> Could you please check if the time on the host where slurmctld runs is in
> sync with the nodes?

Yes, 
> gwdd016:1 16:04:51 ~ # ntpq -p
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> *gwdu105.global. 134.76.249.102   3 u  351 1024  377    0.119   -0.299   0.242
> +gwdu104.global. 134.76.249.102   3 u   13 1024  377    0.183   -0.381   0.250
Looks good. gwdu105 is the node for master slurmctld.
Comment 24 Azat 2019-07-19 08:11:41 MDT
(In reply to Nate Rini from comment #14)
> (In reply to Azat from comment #7)
> > I attached the full version of slurmd log (masked UIDs and usernames). 
> Did you use a script to do this or was it manually done? 
That one was done manually, but afterwards I made a script (for the slurmctld logs ;) )

> I also marked all
> of your attachments as private just in case something was missed.
Thanks! I have missed something, indeed!
Comment 25 Azat 2019-07-19 08:30:00 MDT
(In reply to Nate Rini from comment #15)
> > [2019-06-25T11:29:54.898] error: Job allocate response msg send failure, killing JobId=616782
> > [2019-06-25T11:31:35.326] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased
> > [2019-07-15T14:23:38.870] error: _find_node_record(756): lookup failure for dge1
> > [2019-07-15T14:23:38.870] error: node_name2bitmap: invalid node specified dge1
> Have there been changes to the node lineup in Slurm or possibly issues with
> your DNS server?
I have no clue where dge1 came from... I couldn't find it in any of the config files. The dge nodes all have 3 digits in their names, so that one should probably be dge001. Can you please tell me where node_name2bitmap gets it?

> > [2019-06-25T12:41:24.950] error: slurm_persist_conn_open_without_init: failed to open persistent connection to gwdu104:6819: Connection refused
> What is the status of your backup controller currently?
That was the time when we updated Slurm, so it was probably because of that.

> > [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Invalid Protocol Version 8704 from uid=-1 at 10.108.96.104:45391
> > [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Incompatible versions of client and server code
> Can you please check the node @10.108.96.104 to make sure it's Slurm version
> is within 2 releases of your controller.
Our installation of Slurm is on a shared FS. 2019-06-25T13 is when we updated Slurm, so it is probably because of that; however, we were still within 2 release versions. We updated from 18.08.

> > [2019-06-25T13:57:43.130] error: _slurm_rpc_node_registration node=dmp008: Invalid argument
> > [2019-06-25T13:57:44.399] error: Node dmp008 has low real_memory size (112699 < 128000)
> > [2019-06-25T13:58:43.016] error: Node dmp008 appears to have a different slurm.conf than the slurmctld.
> Looks like this node lost a DIMM, please make sure to update its slurm.conf
> or turn off the slurm daemons on it to avoid the log spam.
Thanks, we did it. 

> > error: slurm_jobcomp plugin context not initialized
> Was there a different jobcomp configuration to cause this error? It should
> never happen with your current none plugin:
> >JobCompType=jobcomp/none
I believe not; I will clarify it with my colleagues later.
 
> > error: job_str_signal: Security violation JOB_CANCEL RPC for JobId=759301_* from uid XXX
> This user appears to have tried to kill all jobs, maybe a conversation with
> them is warranted.
As long as they are not able to do it, we are good :)

> > [2019-07-16T13:47:50.447] error: job_submit/lua: /opt/slurm/etc/job_submit.lua: [string "slurm.user_msg (string.format(unpack({...})))"]:1: bad argument #1 to 'format' (string expected, got no value)
> Has this script bug been resolved (doubt it is causing the current issues
> though).
Thank you! We will look into that too. 

> > [2019-07-11T20:17:39.703] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
> > [2019-07-11T20:17:39.703] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
> Looks like the slurmctld node ran out of memory. Has this issue been
> resolved?
Not yet; that one has its own thread in bug #7360.

> > [2019-07-11T21:13:33.928] error: Registered JobId=714588 StepId=Batch on wrong node gwdd084
> > [2019-07-11T21:13:33.928] error: Registered JobId=714588 StepId=4294967295 on wrong node gwdd084
> > [2019-07-11T21:13:34.869] error: Registered JobId=714589 StepId=Batch on wrong node gwdd057
> > [2019-07-11T21:13:34.869] error: Registered JobId=714589 StepId=4294967295 on wrong node gwdd057
> > [2019-07-11T21:13:34.897] error: Registered JobId=714590 StepId=Batch on wrong node gwdd016
> > [2019-07-11T21:13:34.897] error: Registered JobId=714590 StepId=4294967295 on wrong node gwdd016
> I see a lot of errors along these lines. Can you please verify that there
> isn't something odd going on with the DNS or node hostnames where there are
> duplicate nodes getting the same hostname or IP?
That is handled by the cluster management software, and so far we haven't had issues with it. Both IPs and hostnames currently look fine to me.
Comment 26 Marcin Stolarek 2019-07-19 08:37:09 MDT
Azat, 

Could you please add this configuration line to your slurm.conf:
>AuthInfo=cred_expire=600

It looks like the issue is related to the credential expiring while the job is being started. We're looking into whether we can handle this case better, but according to man slurm.conf:
>       AuthInfo
>              Additional  information  to  be used for authentication of communications between the Slurm daemons (slurmctld and slurmd)
>              and the Slurm clients.  The interpretation of this option is specific to the configured AuthType.  Multiple options may be
>              specified in a comma delimited list.  If not specified, the default authentication information will be used.
>              cred_expire   Default  job  step  credential lifetime, in seconds (e.g. "cred_expire=1200").  It must be sufficiently long
>                            enough to load user environment, run prolog, deal with the slurmd getting paged out of  memory,  etc.   This
>                            also controls how long a requeued job must wait before starting again.  The default value is 120 seconds.
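
A sketch of applying and verifying this (how you propagate slurm.conf to the nodes depends on your shared-FS setup):
> # add to slurm.conf:
> AuthInfo=cred_expire=600
> # once the daemons have picked up the change, confirm the running value:
> scontrol show config | grep AuthInfo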



cheers,
Marcin
Comment 27 Nate Rini 2019-07-19 11:09:17 MDT
(In reply to Azat from comment #25)
> (In reply to Nate Rini from comment #15)
> > > [2019-06-25T11:29:54.898] error: Job allocate response msg send failure, killing JobId=616782
> > > [2019-06-25T11:31:35.326] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased
> > > [2019-07-15T14:23:38.870] error: _find_node_record(756): lookup failure for dge1
> > > [2019-07-15T14:23:38.870] error: node_name2bitmap: invalid node specified dge1
> > Have there been changes to the node lineup in Slurm or possibly issues with
> > your DNS server?
> I have no clue where dge1 came from... I couldn't find it in any of the
> config files. dge nodes are all with 3 digits in the name. And that one
> should probably be dge001. Can you tell me please where node_name2bitmap
> gets it? 
node_name2bitmap() matches hosts based on a regex; it's possible that a user requested this incorrectly. node_name2bitmap() is called from all over the codebase, so I can't pinpoint the cause without a higher debug level (debug2+) in the logs to catch the requesting RPC.
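If you want to try to catch the requesting RPC, a sketch of temporarily raising the controller's log level (debug2 is verbose, so drop it back afterwards):
> scontrol setdebug debug2
> # ...wait for the error to reappear, then restore the usual level, e.g.:
> scontrol setdebug info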

> > > [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Invalid Protocol Version 8704 from uid=-1 at 10.108.96.104:45391
> > > [2019-06-25T13:05:07.086] error: slurm_unpack_received_msg: Incompatible versions of client and server code
> > Can you please check the node @10.108.96.104 to make sure it's Slurm version
> > is within 2 releases of your controller.
> Our installation of Slurm is on shared FS. 2019-06-25T13 is the time when we
> updated the Slurm. Probably because of that, however it was still within 2
> release versions. We updated from 18.08. 
Version 8704 is SLURM_19_05_PROTOCOL_VERSION, so a 17.11 (or older) node may still have been around when this log was taken.

> > > error: slurm_jobcomp plugin context not initialized
> > Was there a different jobcomp configuration to cause this error? It should
> > never happen with your current none plugin:
> > >JobCompType=jobcomp/none
> I believe no. I should clarify it with my colleagues later. 
If you can find out what caused this error, please open a ticket about it.

(In reply to Azat from comment #22)
> > > [60983.extern] _oom_event_monitor: oom-kill event count: 1
> > There are good number of oom events on this node. I assume this is job
> > triggered since the job cgroup appears to be correctly catching the oom
> > events. Is it possible to see why this job keeps causing oom events?
> It seems that indeed it causes orted segfault. But I have no idea how to
> find out why the jobs cause oom events, might be misconfiguration of the job
> by the user too.
We are mainly concerned with slurmstepd crashing, but an OOM event in the cgroup may take out the slurmstepd (which runs as the user) for that job, although that should be less likely. Based on your backtrace, that is not happening here.
Comment 30 Azat 2019-07-22 01:47:55 MDT
(In reply to Marcin Stolarek from comment #26)
Marcin,

> Could you please add this configuration line to your slurm.conf:
> >AuthInfo=cred_expire=600
> It looks like the issue is related to credentials being expired in the
> meantime of job being started.
Thank you! Can you please tell me how you calculated the credential expiration value of 600 seconds (probably from some difference between credential checks in the logs)? Just in case 600 is not enough and we need to increase it in the future.

And is "scontrol reconfigure" enough to activate that config line, or do the slurmd daemons also need to be restarted (I read about this in an old ticket here, #2125)?

Cheers,
Azat
Comment 31 Marcin Stolarek 2019-07-22 04:19:39 MDT
Azat, 

It is not based on any calculation; it was just something much larger than the default, which is 120s.

It looks like you're affected by the situation where the credential expires while it's being processed. In this case, part of the job information is "guessed" (we're having an internal discussion on how to handle this case better). In many environments the whole "processing" happens within a second or a few seconds; depending on your system - the way the user environment is constructed, the contents of your prolog, and potential swapping of slurmd - it may take much more time. A full answer will require some investigation on your side; for instance, do you measure the time spent in the prolog?
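One rough way to measure that (a sketch; the wrapper and the path of the "real" prolog script are hypothetical - you would point the Prolog parameter in slurm.conf at the wrapper):
> #!/bin/bash
> # hypothetical prolog wrapper that logs how long the real prolog takes
> start=$(date +%s)
> /opt/slurm/etc/prolog.real "$@"
> rc=$?
> logger -t slurm-prolog "JobId=${SLURM_JOB_ID:-unknown} prolog took $(( $(date +%s) - start ))s rc=$rc"
> exit $rc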

You'll have to restart slurmd for this configuration to be effective. 

cheers,
Marcin
Comment 32 Marcin Stolarek 2019-07-25 07:16:53 MDT
Azat, 

Could you please confirm that the configuration change was enough to eliminate the issue from the cluster?
If this is true, can we lower the bug report severity to 4?

cheers,
Marcin
Comment 33 Azat 2019-07-25 07:17:08 MDT
I'm currently out of the office until 27.07.2019.


If you need support with the GWDG scientific compute cluster, please contact hpc@gwdg.de.


In all other cases, including HLRN related topics please contact hpc-team@gwdg.de.


Best regards,

Azat Khuziyakhmetov
Comment 34 Marcus Boden 2019-07-26 02:07:50 MDT
Hi,

(In reply to Marcin Stolarek from comment #32)

> Could you please confirm that configuration change was enough to eliminate
> the issue from the cluster?
> If this is true, can we lower the bug report severity to 4?

Sadly, no. The slurmctld just crashed again with:

> [2019-07-26T09:32:39.238] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
> [2019-07-26T09:32:39.238] fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable

Best,
Marcus
Comment 35 Marcin Stolarek 2019-07-26 03:15:43 MDT
It looks like a different issue. I just searched through the bugs and it looks like you already reported this as Bug 7360. 

Do you still see jobs failing to start and then remaining pending with "launch_failed_requeued_held" reason?

cheers,
Marcin
Comment 36 Azat 2019-07-29 01:36:19 MDT
(In reply to Marcin Stolarek from comment #32)
Marcin, 

> Could you please confirm that configuration change was enough to eliminate
> the issue from the cluster?
Yes, no new cases have been detected according to the logs (I searched for "slurmd error running").
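(For reference, the check was roughly the following, against the same log path as earlier in this ticket:)
> grep -c "slurmd error running" /var/log/slurm/slurmctld.log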

> If this is true, can we lower the bug report severity to 4?
Sure.

Thank you once again for help,

cheers,
Azat
Comment 37 Marcin Stolarek 2019-07-29 01:41:47 MDT
Lowering the ticket severity to 4. The root cause was the job credential expiring while the job was starting.
Comment 60 Nate Rini 2020-05-04 10:39:41 MDT
Azat,

The fix has landed for 20.11 (9fd790a83..7ca9c753fd2f6).

Closing ticket, please respond if you have any questions or issues.

Thanks,
--Nate