Ticket 3257

Summary: Credential data size mismatch error
Product: Slurm Reporter: David Gloe <david.gloe>
Component: slurmd Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: alex, bart, david.gloe, dmjacobsen
Version: 16.05.6   
Hardware: Linux   
OS: Linux   
Site: NERSC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: Other
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: Cori CLE Version:
Version Fixed: 17.02-pre5 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: dumped data from _pack_cred from _slurm_cred_sign()
dumped data from decrypted munge data (from slurmctld)
increase tmp_string size to avoid truncation in pack_bit_str

Description David Gloe 2016-11-11 16:31:09 MST
A few days ago, a customer system had a job fail with this error:
srun: error: Task launch for 3056968.0 failed on node nid04850: Invalid job credential

Looking at the slurmd logs shows a credential data size mismatch error.

[2016-11-08T14:44:01.729] error: Credential signature check: Credential data size mismatch
[2016-11-08T14:44:01.729] error: _convert_job_mem: slurm_cred_verify failed: Invalid job credential
[2016-11-08T14:44:05.514] _run_prolog: run job script took usec=3777919
[2016-11-08T14:44:05.514] _run_prolog: prolog with lock for job 3056968 ran for 4 seconds
[2016-11-08T14:44:08.554] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2016-11-08T14:45:53.581] sbcast req_uid=42034 job_id=3056968 fname=/tmp/su3_rmd254719 block_no=1
[2016-11-08T14:46:05.675] launch task 3056968.0 request from 42034.42034@10.128.9.19 (port 39868)
[2016-11-08T14:46:05.868] error: Credential signature check: Credential data size mismatch
[2016-11-08T14:46:05.868] error: Invalid job credential from 42034@10.128.9.19: Invalid job credential
[2016-11-08T14:47:06.546] Message aggregation disabled
[2016-11-08T14:47:06.648] Resource spec: Reserved system memory limit not configured for this node
[2016-11-08T14:49:17.050] error: Credential signature check: Credential replayed
[2016-11-08T14:49:17.050] error: _convert_job_mem: slurm_cred_verify failed: Invalid job credential
[2016-11-08T14:49:17.398] _run_prolog: run job script took usec=341413
[2016-11-08T14:49:17.398] _run_prolog: prolog with lock for job 3056979 ran for 0 seconds
[2016-11-08T14:49:19.176] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2016-11-08T14:50:47.730] sbcast req_uid=42034 job_id=3056979 fname=/tmp/su3_rmd255450 block_no=1
[2016-11-08T14:52:01.732] launch task 3056979.0 request from 42034.42034@10.128.9.19 (port 36821)
[2016-11-08T14:52:01.924] error: Credential signature check: Credential data size mismatch
[2016-11-08T14:52:01.925] error: Invalid job credential from 42034@10.128.9.19: Invalid job credential

Do you have any ideas about what could cause this?
Comment 1 David Gloe 2016-11-11 16:41:37 MST
Unchecking "Cray Bugs" so site staff can access this bug.
Comment 2 Tim Wickberg 2016-11-11 17:06:14 MST
Is there a difference between the slurm.conf on that node versus the slurmctld?

Credential mismatches are most commonly caused by different plugins available on either side of the communication path.

Has this recurred, or occurred on more than one node?
Comment 3 Doug Jacobsen 2016-11-11 17:16:59 MST
Slurm.conf is in a shared network dir and plugins are not changed between boots.  Slurm.conf is reread after every job (slurmd is SIGHUPed).

Comment 4 Doug Jacobsen 2016-11-13 17:41:48 MST
Hello,

I have more information.  This is happening on jobs of particularly large scale.  Here is a debug-level slurmd log of the event from the node running the sbatch step:

[2016-11-13T16:07:50.289] task_p_slurmd_batch_request: 3087366
[2016-11-13T16:07:50.290] task/affinity: job 3087366 CPU input mask for node: 0x0FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2016-11-13T16:07:50.290] task/affinity: job 3087366 CPU final HW mask for node: 0x7FFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFFF
[2016-11-13T16:07:50.290] debug:  task_p_slurmd_batch_request: 3087366
[2016-11-13T16:07:50.306] debug:  Calling /usr/sbin/slurmstepd spank prolog
[2016-11-13T16:07:50.312] debug:  Process tracking via Cray job module loaded
[2016-11-13T16:07:50.316] debug:  proctrack_p_create: created jid 0x00000040 thread 0x2aaaecc0a700
[2016-11-13T16:07:50.327] debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
[2016-11-13T16:07:50.354] debug:  Running spank/prolog for jobid [3087366] uid [42034]
[2016-11-13T16:07:50.354] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2016-11-13T16:07:50.358] debug:  spank: /etc/slurm/plugstack.conf:1: Loaded plugin shifter_slurm.so
[2016-11-13T16:07:50.358] debug:  spank: /etc/slurm/plugstack.conf:2: Loaded plugin vtune.so
[2016-11-13T16:07:50.362] debug:  shifter prolog, id after looking at args: (null):(null)
[2016-11-13T16:07:50.362] debug:  vtune_prolog: no vtune requested, done
[2016-11-13T16:07:50.465] debug:  [job 3087366] attempting to run prolog [/etc/slurm/nodeprolog.sh]
[2016-11-13T16:07:50.473] debug:  Process tracking via Cray job module loaded
[2016-11-13T16:07:50.477] debug:  proctrack_p_create: created jid 0x00000041 thread 0x2aaaecc0a700
[2016-11-13T16:07:50.576] _run_prolog: run job script took usec=273110
[2016-11-13T16:07:50.576] _run_prolog: prolog with lock for job 3087366 ran for 0 seconds
[2016-11-13T16:07:50.576] Launching batch job 3087366 for UID 42034
[2016-11-13T16:07:50.628] debug:  switch CRAY plugin loaded.
[2016-11-13T16:07:50.628] debug:  gres: Couldn't find the specified plugin name for gres/craynetwork looking at all files
[2016-11-13T16:07:50.629] debug:  Cannot find plugin of type gres/craynetwork, just track gres counts
[2016-11-13T16:07:50.630] debug:  gres: Couldn't find the specified plugin name for gres/hbm looking at all files
[2016-11-13T16:07:50.630] debug:  Cannot find plugin of type gres/hbm, just track gres counts
[2016-11-13T16:07:50.779] Consumable Resources (CR) Node Selection plugin loaded with argument 1266
[2016-11-13T16:07:50.780] debug:  AcctGatherProfile NONE plugin loaded
[2016-11-13T16:07:50.780] debug:  AcctGatherInfiniband NONE plugin loaded
[2016-11-13T16:07:50.780] debug:  AcctGatherFilesystem NONE plugin loaded
[2016-11-13T16:07:50.782] debug:  AcctGatherEnergy CRAY plugin loaded
[2016-11-13T16:07:50.784] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2016-11-13T16:07:51.312] Considering each NUMA node as a socket
[2016-11-13T16:07:51.312] debug:  CPUs:272 Boards:1 Sockets:1 CoresPerSocket:68 ThreadsPerCore:4
[2016-11-13T16:07:51.316] debug:  Job accounting gather cgroup plugin loaded
[2016-11-13T16:07:51.321] [3087366] debug:  laying out the 1646592 tasks on 6144 hosts nid[04672-04683,04688-04747,04756-04771,04776-04777,04779-04811,04816-04836,04992-05007,05012-05023,05028-05067,05072-05131,05140-05195,05200-05251,05256-05263,05268-05323,05328-05375,06144-06159,06164-06219,06224-06283,06292-06347,06352-06415,06420-06475,06480-06487,06489-06523,06528-06543,06548-06603,06608-06667,06676-06731,06736-06799,06804-06859,06864-06927,06932-06987,06992-07051,07060-07115,07120-07183,07188-07243,07248-07311,07316-07371,07376-07435,07444-07499,07504-07567,07572-07627,07632-07695,07704-07755,07760-07823,07828-07883,07888-07899,07904-07935,07940-07951,07956-08011,08016-08079,08084-08127,08132-08139,08144-08207,08212-08267,08272-08335,08340-08343,08348-08395,08400-08463,08468-08475,08480-08523,08528-08571,08576-08591,08596-08651,08656-08719,08724-08779,08784-08847,08852-08895,08900-08907,08912-08975,08980-09035,09040-09047,09052-09071,09076-09103,09108-09163,09168-09207,09212-09231,09236-09291,09296-09359,09364-09419,09424-09475,09480-09487,09492-09543,09552-09615,09620-09659,09661-09675,09680-09743,09748-09799,09808-09871,09876-09931,09936-09959,09964-09999,10004-10059,10064-10127,10132-10187,10192-10255,10260-10315,10320-10383,10388-10443,10448-10511,10516-10571,10576-10635,10644-10699,10704-10767,10772-10827,10832-10895,10900-10955,10960-10987,10992-11023,11028-11083,11088-11151,11156-11211,11216-11279,11284-11313,11315-11339,11344-11407,11412-11439,11444-11467,11472-11535,11540-11567,11572-11595,11600-11663,11668-11723,11728-11791,11796-11851,11856-11883,11888-11919,11924-11979,11984-12047,12052-12094,12096-12107,12112-12175,12180-12235,12240-12255,12260-12287] dist 2
[2016-11-13T16:07:51.422] [3087366] debug:  Message thread started pid = 101353
[2016-11-13T16:07:51.423] [3087366] core_spec/cray: init
[2016-11-13T16:07:51.424] [3087366] task affinity plugin loaded with CPU mask 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
[2016-11-13T16:07:51.425] [3087366] debug:  task CRAY plugin loaded.
[2016-11-13T16:07:51.426] [3087366] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2016-11-13T16:07:51.431] [3087366] debug:  task/cgroup: now constraining jobs allocated cores
[2016-11-13T16:07:51.431] [3087366] debug:  task/cgroup: loaded
[2016-11-13T16:07:51.431] [3087366] debug:  Process tracking via Cray job module loaded
[2016-11-13T16:07:51.431] [3087366] debug:  Checkpoint plugin loaded: checkpoint/none
[2016-11-13T16:07:51.431] [3087366] Munge cryptographic signature plugin loaded
[2016-11-13T16:07:51.432] [3087366] debug:  job_container cncu plugin loaded
[2016-11-13T16:07:51.432] [3087366] debug:  mpi type = (null)
[2016-11-13T16:07:51.432] [3087366] debug:  Process tracking via Cray job module loaded
[2016-11-13T16:07:51.437] [3087366] debug:  proctrack_p_create: created jid 0x00000042 thread 0x2aaab2f7b700
[2016-11-13T16:07:51.437] [3087366] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2016-11-13T16:07:51.441] [3087366] debug:  spank: /etc/slurm/plugstack.conf:1: Loaded plugin shifter_slurm.so
[2016-11-13T16:07:51.441] [3087366] debug:  spank: /etc/slurm/plugstack.conf:2: Loaded plugin vtune.so
[2016-11-13T16:07:51.442] [3087366] debug:  spank: /etc/slurm/plugstack.conf:3: Loaded plugin libAtpSLaunch.so
[2016-11-13T16:07:51.442] [3087366] debug:  spank: /etc/slurm/plugstack.conf:4: Loaded plugin zonesort.so
[2016-11-13T16:07:51.444] [3087366] debug:  SPANK: appending plugin option "image"
[2016-11-13T16:07:51.444] [3087366] debug:  SPANK: appending plugin option "volume"
[2016-11-13T16:07:51.444] [3087366] debug:  SPANK: appending plugin option "vtune"
[2016-11-13T16:07:51.444] [3087366] debug:  mpi type = (null)
[2016-11-13T16:07:51.444] [3087366] debug:  mpi/openmpi: slurmstepd prefork
[2016-11-13T16:07:51.444] [3087366] debug:  task_p_pre_setuid: 3087366.4294967294
[2016-11-13T16:07:51.445] [3087366] debug:  task/cgroup: job abstract cores are '0-66'
[2016-11-13T16:07:51.445] [3087366] debug:  task/cgroup: step abstract cores are '0-66'
[2016-11-13T16:07:51.445] [3087366] debug:  task/cgroup: job physical cores are '0-66,68-134,136-202,204-270'
[2016-11-13T16:07:51.445] [3087366] debug:  task/cgroup: step physical cores are '0-66,68-134,136-202,204-270'
[2016-11-13T16:07:51.450] [3087366] debug level = 2
[2016-11-13T16:07:51.451] [3087366] starting 1 tasks
[2016-11-13T16:07:51.452] [3087366] task 0 (101357) started 2016-11-13T16:07:51
[2016-11-13T16:07:51.452] [3087366] debug:  task_p_pre_launch_priv: 3087366.4294967294
[2016-11-13T16:07:51.453] [3087366] Created file /var/opt/cray/alps/spool/status6056184792583984134
[2016-11-13T16:07:51.454] [3087366] debug:  jobacct_gather_cgroup_cpuacct_attach_task: jobid 3087366 stepid 4294967294 taskid 0 max_task_id 0
[2016-11-13T16:07:51.456] [3087366] debug:  jobacct_gather_cgroup_memory_attach_task: jobid 3087366 stepid 4294967294 taskid 0 max_task_id 0
[2016-11-13T16:07:51.460] [3087366] debug:  jag_common_poll_data: Task average frequency = 1401000 pid 101357 mem size 0 212372 time 0.000000(0+0)
[2016-11-13T16:07:51.460] [3087366] debug:  Job 3087366 memory used:0 limit:94371840 KB
[2016-11-13T16:07:51.464] [3087366] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2016-11-13T16:07:51.470] [3087366] debug:  task_p_pre_launch: affinity jobid 3087366.4294967294, task:0 bind:256
[2016-11-13T16:07:51.470] [3087366] task_p_pre_launch: Using sched_affinity for tasks
[2016-11-13T16:07:51.483] [3087366] debug:  task_p_pre_launch: 3087366.4294967294, apid 6056184792583984134, task 0
[2016-11-13T16:07:52.298] debug:  switch CRAY plugin loaded.
[2016-11-13T16:07:52.298] debug:  gres: Couldn't find the specified plugin name for gres/craynetwork looking at all files
[2016-11-13T16:07:52.300] debug:  Cannot find plugin of type gres/craynetwork, just track gres counts
[2016-11-13T16:07:52.300] debug:  gres: Couldn't find the specified plugin name for gres/hbm looking at all files
[2016-11-13T16:07:52.301] debug:  Cannot find plugin of type gres/hbm, just track gres counts
[2016-11-13T16:07:52.459] Consumable Resources (CR) Node Selection plugin loaded with argument 1266
[2016-11-13T16:07:52.461] debug:  AcctGatherProfile NONE plugin loaded
[2016-11-13T16:07:52.462] debug:  AcctGatherInfiniband NONE plugin loaded
[2016-11-13T16:07:52.462] debug:  AcctGatherFilesystem NONE plugin loaded
[2016-11-13T16:07:52.467] debug:  AcctGatherEnergy CRAY plugin loaded
[2016-11-13T16:07:52.473] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2016-11-13T16:07:53.039] Considering each NUMA node as a socket
[2016-11-13T16:07:53.039] debug:  CPUs:272 Boards:1 Sockets:1 CoresPerSocket:68 ThreadsPerCore:4
[2016-11-13T16:07:53.044] debug:  Job accounting gather cgroup plugin loaded
[2016-11-13T16:07:53.046] error: User requested launch of zero tasks!
[2016-11-13T16:07:53.047] [3087366.4294967295] debug:  Message thread started pid = 101435
[2016-11-13T16:07:53.047] [3087366.4294967295] core_spec/cray: init
[2016-11-13T16:07:53.048] [3087366.4294967295] task affinity plugin loaded with CPU mask 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
[2016-11-13T16:07:53.049] [3087366.4294967295] debug:  task CRAY plugin loaded.
[2016-11-13T16:07:53.051] [3087366.4294967295] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2016-11-13T16:07:53.056] [3087366.4294967295] debug:  task/cgroup: now constraining jobs allocated cores
[2016-11-13T16:07:53.056] [3087366.4294967295] debug:  task/cgroup: loaded
[2016-11-13T16:07:53.056] [3087366.4294967295] debug:  Process tracking via Cray job module loaded
[2016-11-13T16:07:53.056] [3087366.4294967295] debug:  Checkpoint plugin loaded: checkpoint/none
[2016-11-13T16:07:53.056] [3087366.4294967295] Munge cryptographic signature plugin loaded
[2016-11-13T16:07:53.057] [3087366.4294967295] debug:  job_container cncu plugin loaded
[2016-11-13T16:07:53.057] [3087366.4294967295] debug:  mpi type = (null)
[2016-11-13T16:07:53.057] [3087366.4294967295] debug:  Process tracking via Cray job module loaded
[2016-11-13T16:07:53.073] [3087366.4294967295] debug:  proctrack_p_create: created jid 0x00000043 thread 0x2aaab3183700
[2016-11-13T16:07:53.073] [3087366.4294967295] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2016-11-13T16:07:53.077] [3087366.4294967295] debug:  spank: /etc/slurm/plugstack.conf:1: Loaded plugin shifter_slurm.so
[2016-11-13T16:07:53.077] [3087366.4294967295] debug:  spank: /etc/slurm/plugstack.conf:2: Loaded plugin vtune.so
[2016-11-13T16:07:53.078] [3087366.4294967295] debug:  spank: /etc/slurm/plugstack.conf:3: Loaded plugin libAtpSLaunch.so
[2016-11-13T16:07:53.078] [3087366.4294967295] debug:  spank: /etc/slurm/plugstack.conf:4: Loaded plugin zonesort.so
[2016-11-13T16:07:53.079] [3087366.4294967295] debug:  SPANK: appending plugin option "image"
[2016-11-13T16:07:53.079] [3087366.4294967295] debug:  SPANK: appending plugin option "volume"
[2016-11-13T16:07:53.079] [3087366.4294967295] debug:  SPANK: appending plugin option "vtune"
[2016-11-13T16:07:53.079] [3087366.4294967295] error: cpu_freq_cpuset_validate: cpu_bind string is null
[2016-11-13T16:07:53.080] [3087366.4294967295] debug:  task_p_pre_setuid: 3087366.4294967295
[2016-11-13T16:07:53.080] [3087366.4294967295] debug:  task/cgroup: job abstract cores are '0-66'
[2016-11-13T16:07:53.080] [3087366.4294967295] debug:  task/cgroup: step abstract cores are '0-66'
[2016-11-13T16:07:53.080] [3087366.4294967295] debug:  task/cgroup: job physical cores are '0-66,68-134,136-202,204-270'
[2016-11-13T16:07:53.080] [3087366.4294967295] debug:  task/cgroup: step physical cores are '0-66,68-134,136-202,204-270'
[2016-11-13T16:07:53.084] [3087366.4294967295] debug:  jobacct_gather_cgroup_cpuacct_attach_task: jobid 3087366 stepid 4294967295 taskid 0 max_task_id 0
[2016-11-13T16:07:53.085] [3087366.4294967295] debug:  jobacct_gather_cgroup_memory_attach_task: jobid 3087366 stepid 4294967295 taskid 0 max_task_id 0
[2016-11-13T16:07:53.088] [3087366.4294967295] debug:  jag_common_poll_data: Task average frequency = 1401000 pid 101444 mem size 0 4204 time 0.000000(0+0)
[2016-11-13T16:07:53.089] [3087366.4294967295] debug:  Step 3087366.4294967295 memory used:0 limit:94371840 KB
[2016-11-13T16:07:53.089] [3087366.4294967295] error: Couldn't find shifterConfig.json, cannot do extern step processing.
[2016-11-13T16:07:55.368] sbcast req_uid=42034 job_id=3087366 fname=/tmp/su3_rmd101357 block_no=1
[2016-11-13T16:07:55.408] debug:  Process tracking via Cray job module loaded
[2016-11-13T16:07:55.412] debug:  proctrack_p_create: created jid 0x00000044 thread 0x2aaaecc0a700
[2016-11-13T16:08:07.498] launch task 3087366.0 request from 42034.42034@10.128.18.101 (port 60071)
[2016-11-13T16:08:07.498] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T16:08:07.695] error: Credential signature check: Credential replayed
[2016-11-13T16:08:07.695] error: Invalid job credential from 42034@10.128.18.101: Invalid job credential
[2016-11-13T16:08:57.163] [3087366] task 0 (101357) exited with exit code 0.
[2016-11-13T16:08:57.171] [3087366] debug:  task_p_post_term: affinity 3087366.4294967294, task 0
[2016-11-13T16:08:57.171] [3087366] debug:  task_p_post_term: 3087366.4294967294, task 0
[2016-11-13T16:08:57.171] [3087366] debug:  task_p_post_term: 3087366.4294967294, task 0
[2016-11-13T16:08:57.173] [3087366] debug:  step_terminate_monitor_stop signalling condition
[2016-11-13T16:08:57.174] [3087366] _monitor is running
[2016-11-13T16:08:57.174] [3087366] _monitor is stopping
[2016-11-13T16:08:57.174] [3087366] Unlinked /var/opt/cray/alps/spool/status6056184792583984134
[2016-11-13T16:08:57.485] [3087366] job 3087366 completed with slurm_rc = 0, job_rc = 0
[2016-11-13T16:08:57.485] [3087366] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0

I've truncated all the job completion stuff but can add it if you like.

In the second node of the job:

[2016-11-13T16:07:53.697] [3087366.4294967295] error: Couldn't find shifterConfig.json, cannot do extern step processing.
[2016-11-13T16:07:55.436] sbcast req_uid=42034 job_id=3087366 fname=/tmp/su3_rmd101357 block_no=1
[2016-11-13T16:07:55.453] debug:  Process tracking via Cray job module loaded
[2016-11-13T16:07:55.466] debug:  proctrack_p_create: created jid 0x00000038 thread 0x2aaaecc0a700
[2016-11-13T16:08:07.523] launch task 3087366.0 request from 42034.42034@10.128.18.101 (port 60071)
[2016-11-13T16:08:07.523] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T16:08:07.723] error: Credential signature check: Credential replayed
[2016-11-13T16:08:07.723] error: Invalid job credential from 42034@10.128.18.101: Invalid job credential
[2016-11-13T16:08:57.516] debug:  _rpc_terminate_job, uid = 0
[2016-11-13T16:08:57.516] debug:  task_p_slurmd_release_resources: affinity jobid 3087366
[2016-11-13T16:08:57.516] debug:  task_p_slurmd_release_resources: 3087366
[2016-11-13T16:08:57.517] debug:  credential for job 3087366 revoked



(just the logs around the error).

As far as I understand things, we're only seeing this at large scale; in this case a 6144-node job.  And some jobs of even larger scale work fine.

There is an sbcast immediately before the srun in this job.  Is it possible that the credential used for the sbcast (if there is one) is somehow colliding with the srun?  Perhaps the sbcast hasn't fully completed or something?

For some reason this user is not using srun --bcast...

-Doug
Comment 5 Doug Jacobsen 2016-11-13 18:05:45 MST
Removing the sbcast and inlining it with --bcast gives a similar error:

[2016-11-13T17:03:06.287] [3087396.4294967295] error: Couldn't find shifterConfig.json, cannot do extern step processing.
[2016-11-13T17:03:08.552] sbcast req_uid=42034 job_id=3087396 fname=/tmp/su3_rmd105886 block_no=1
[2016-11-13T17:03:08.569] debug:  Process tracking via Cray job module loaded
[2016-11-13T17:03:08.578] debug:  proctrack_p_create: created jid 0x0000005e thread 0x2aaaecc0a700
[2016-11-13T17:03:09.796] launch task 3087396.0 request from 42034.42034@10.128.18.101 (port 23493)
[2016-11-13T17:03:09.796] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T17:03:09.988] error: Credential signature check: Credential data size mismatch
[2016-11-13T17:03:09.989] error: Invalid job credential from 42034@10.128.18.101: Invalid job credential
nid04672:~ #

Instead of a replay we're seeing "Credential data size mismatch".  Is there a chance this is some kind of integer boundary issue? (Seems unlikely, since we can run at larger scale.)
Comment 6 Doug Jacobsen 2016-11-13 22:05:39 MST
So one thing that stands out with this bug is that the jobs that fail seem to have HUGE credential buffers:

nid04672:~ # egrep 'Checking credential with|data size mismatch' /var/spool/slurmd/nid04672.log
[2016-11-13T15:46:55.768] error: Credential signature check: Credential data size mismatch
[2016-11-13T15:52:41.166] error: Credential signature check: Credential data size mismatch
[2016-11-13T16:08:07.498] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T16:20:33.595] debug:  Checking credential with 37396 bytes of sig data
[2016-11-13T16:20:36.677] debug:  Checking credential with 35244 bytes of sig data
[2016-11-13T16:21:55.062] debug:  Checking credential with 39488 bytes of sig data
[2016-11-13T16:22:55.812] debug:  Checking credential with 39488 bytes of sig data
[2016-11-13T16:23:44.370] debug:  Checking credential with 39488 bytes of sig data
[2016-11-13T16:24:50.747] debug:  Checking credential with 39488 bytes of sig data
[2016-11-13T16:25:17.632] debug:  Checking credential with 39488 bytes of sig data
[2016-11-13T16:29:23.569] debug:  Checking credential with 37396 bytes of sig data
[2016-11-13T16:29:27.892] debug:  Checking credential with 35244 bytes of sig data
[2016-11-13T17:03:09.796] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T17:03:09.988] error: Credential signature check: Credential data size mismatch
[2016-11-13T20:32:39.567] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T20:32:39.759] error: Credential signature check: Credential data size mismatch
[2016-11-13T20:41:34.348] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T20:41:34.545] error: Credential signature check: Credential data size mismatch
[2016-11-13T20:54:23.279] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T20:54:23.472] error: Credential signature check: Credential data size mismatch
[2016-11-13T20:59:28.932] debug:  Checking credential with 340 bytes of sig data
nid04672:~ #

Does that just scale with the number of nodes in the calculation?
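(Rough arithmetic, as a sanity check on the scaling question: 214208 bytes over a 6144-node job is about 35 bytes of signed credential data per node, so if the per-node core list is the dominant component of the credential, the signature payload would indeed grow roughly linearly with node count.)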
Comment 7 Tim Wickberg 2016-11-13 22:09:54 MST
This only happens with jobs using sbcast? There may be some bug in the handling there; I'll try to take a look tomorrow morning if Alex or Dominik don't get to it first.

Comment 8 Doug Jacobsen 2016-11-13 22:16:10 MST
I've moved the sbcast into the srun.  That's one variable. Another is that this job is using some more advanced features of Cray's mpich library, in particular file-based rank reordering. I'm wondering if that file is somehow getting jammed into the credential for distribution, because the file is:


My last update was incomplete:


nid04672:~ # egrep 'Checking credential with|data size mismatch|Credential signature check' /var/spool/slurmd/nid04672.log
[2016-11-13T15:46:55.768] error: Credential signature check: Credential data size mismatch
[2016-11-13T15:52:41.166] error: Credential signature check: Credential data size mismatch
[2016-11-13T15:57:53.272] error: Credential signature check: Credential replayed
[2016-11-13T16:02:55.070] error: Credential signature check: Credential replayed
[2016-11-13T16:08:07.498] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T16:08:07.695] error: Credential signature check: Credential replayed
[2016-11-13T16:20:33.595] debug:  Checking credential with 37396 bytes of sig data
[2016-11-13T16:20:36.677] debug:  Checking credential with 35244 bytes of sig data
[2016-11-13T16:21:55.062] debug:  Checking credential with 39488 bytes of sig data
[2016-11-13T16:22:55.812] debug:  Checking credential with 39488 bytes of sig data
[2016-11-13T16:23:44.370] debug:  Checking credential with 39488 bytes of sig data
[2016-11-13T16:24:50.747] debug:  Checking credential with 39488 bytes of sig data
[2016-11-13T16:25:17.632] debug:  Checking credential with 39488 bytes of sig data
[2016-11-13T16:29:23.569] debug:  Checking credential with 37396 bytes of sig data
[2016-11-13T16:29:27.892] debug:  Checking credential with 35244 bytes of sig data
[2016-11-13T17:03:09.796] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T17:03:09.988] error: Credential signature check: Credential data size mismatch
[2016-11-13T20:32:39.567] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T20:32:39.759] error: Credential signature check: Credential data size mismatch
[2016-11-13T20:41:34.348] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T20:41:34.545] error: Credential signature check: Credential data size mismatch
[2016-11-13T20:54:23.279] debug:  Checking credential with 214208 bytes of sig data
[2016-11-13T20:54:23.472] error: Credential signature check: Credential data size mismatch
[2016-11-13T20:59:28.932] debug:  Checking credential with 340 bytes of sig data
[2016-11-13T21:07:02.433] debug:  Checking credential with 39552 bytes of sig data
[2016-11-13T21:10:47.147] debug:  Checking credential with 214272 bytes of sig data
[2016-11-13T21:10:47.345] error: Credential signature check: Credential replayed
nid04672:~ #



I think what I'm seeing is that very large munge credentials seem to be correlated with a variety of failures.  What is srun/slurmstepd packing into this?  Maybe munge is misbehaving?

-Doug
Comment 9 Doug Jacobsen 2016-11-13 22:19:03 MST
I fully removed the sbcast, even the --bcast option to srun.  It still crashes with a bad signature verification.
Comment 10 Doug Jacobsen 2016-11-13 23:00:52 MST
(gdb) break slurm_cred.c:1716
Breakpoint 1 at 0x4a7fef: file slurm_cred.c, line 1716.
(gdb) continue
Continuing.
[New Thread 0x2aaab3bc0700 (LWP 120035)]
Detaching after fork from child process 120037.
Detaching after fork from child process 120040.
Detaching after fork from child process 120068.
[New Thread 0x2aaab3dc2700 (LWP 120034)]
Detaching after fork from child process 120070.
[Thread 0x2aaab3bc0700 (LWP 120035) exited]
[Thread 0x2aaab3dc2700 (LWP 120034) exited]
[New Thread 0x2aaab3dc2700 (LWP 120207)]
[New Thread 0x2aaaeca09700 (LWP 120229)]
[New Thread 0x2aaab3cc1700 (LWP 120228)]
[New Thread 0x2aaaec403700 (LWP 120227)]
[New Thread 0x2aaab3fc4700 (LWP 120226)]
[New Thread 0x2aaab29ac700 (LWP 120225)]
[New Thread 0x2aaab33b8700 (LWP 120224)]
[New Thread 0x2aaab3abf700 (LWP 120223)]
[New Thread 0x2aaab3ec3700 (LWP 120222)]
[New Thread 0x2aaaec504700 (LWP 120221)]
[New Thread 0x2aaaec201700 (LWP 120220)]
[New Thread 0x2aaaec807700 (LWP 120219)]
[New Thread 0x2aaab35ba700 (LWP 120218)]
[New Thread 0x2aaaec706700 (LWP 120217)]
[New Thread 0x2aaaec100700 (LWP 120216)]
[New Thread 0x2aaab39be700 (LWP 120215)]
[New Thread 0x2aaaec908700 (LWP 120214)]
[New Thread 0x2aaaec302700 (LWP 120213)]
[New Thread 0x2aaab37bc700 (LWP 120212)]
[New Thread 0x2aaaec605700 (LWP 120211)]
[New Thread 0x2aaab34b9700 (LWP 120210)]
[New Thread 0x2aaab32b7700 (LWP 120209)]
[New Thread 0x2aaab3bc0700 (LWP 120208)]
[Switching to Thread 0x2aaab3dc2700 (LWP 120207)]

Breakpoint 1, _slurm_cred_verify_signature (ctx=0x113f990, cred=0x2aaac0002290, protocol_version=7680) at slurm_cred.c:1716
1716	slurm_cred.c: No such file or directory.
(gdb) next
[Thread 0x2aaaec504700 (LWP 120221) exited]
1717	in slurm_cred.c
(gdb) next
[Thread 0x2aaab3bc0700 (LWP 120208) exited]
1718	in slurm_cred.c
(gdb) next
[Thread 0x2aaab3abf700 (LWP 120223) exited]
[Thread 0x2aaab34b9700 (LWP 120210) exited]
[Thread 0x2aaaec403700 (LWP 120227) exited]
[Thread 0x2aaaeca09700 (LWP 120229) exited]
[Thread 0x2aaab3ec3700 (LWP 120222) exited]
[Thread 0x2aaab3cc1700 (LWP 120228) exited]
[Thread 0x2aaab29ac700 (LWP 120225) exited]
[Thread 0x2aaab33b8700 (LWP 120224) exited]
[Thread 0x2aaaec201700 (LWP 120220) exited]
[Thread 0x2aaab35ba700 (LWP 120218) exited]
[Thread 0x2aaab3fc4700 (LWP 120226) exited]
[Thread 0x2aaaec706700 (LWP 120217) exited]
[Thread 0x2aaaec100700 (LWP 120216) exited]
[Thread 0x2aaaec807700 (LWP 120219) exited]
[Thread 0x2aaab39be700 (LWP 120215) exited]
[Thread 0x2aaaec908700 (LWP 120214) exited]
[Thread 0x2aaaec302700 (LWP 120213) exited]
[Thread 0x2aaab37bc700 (LWP 120212) exited]
[Thread 0x2aaaec605700 (LWP 120211) exited]
[Thread 0x2aaab32b7700 (LWP 120209) exited]
1720	in slurm_cred.c
(gdb) print ctx->key
$1 = (void *) 0x8d3170
(gdb) print *buffer
$2 = {magic = 1112884549, head = 0x2aaac00048c0 "", size = 200686, processed = 160597}
(gdb) print cred->siglen
$3 = 214272
(gdb) print cred->signature
$4 = 0x2aaac0422970 "MUNGE:AwQDAABDcaQSGAteG2d71AHRezU4XsRMhs0UNTl0QbJigw3A+kA7BHL3mA8/CFfG1xlBFfvDS56FYSx8Qsgf/6on7K18SWkepwS6gW6aBW5zwOfYrD11yogyh9ROqxFUH2bm0CUfz8b1aSTxPDPmS79fU/sJienfCoUJAtDQeeCFtL3zAJYiVL/Mr3Sdo95Snu"...
(gdb) step
[New Thread 0x2aaab32b7700 (LWP 120298)]
crypto_verify_sign (key=0x8d3170, buffer=0x2aaac00048c0 "", buf_size=160597,
    signature=0x2aaac0422970 "MUNGE:AwQDAABDcaQSGAteG2d71AHRezU4XsRMhs0UNTl0QbJigw3A+kA7BHL3mA8/CFfG1xlBFfvDS56FYSx8Qsgf/6on7K18SWkepwS6gW6aBW5zwOfYrD11yogyh9ROqxFUH2bm0CUfz8b1aSTxPDPmS79fU/sJienfCoUJAtDQeeCFtL3zAJYiVL/Mr3Sdo95Snu"..., sig_size=214272)
    at crypto_munge.c:294
294	crypto_munge.c: No such file or directory.
(gdb) print sig_size
$5 = 214272
(gdb) next
297	in crypto_munge.c
(gdb) next
299	in crypto_munge.c
(gdb) next
301	in crypto_munge.c
(gdb) next
304	in crypto_munge.c
(gdb) next
[New Thread 0x2aaab36bb700 (LWP 120351)]
[New Thread 0x2aaaec504700 (LWP 120350)]
[New Thread 0x2aaab3bc0700 (LWP 120349)]
[New Thread 0x2aaab3abf700 (LWP 120348)]
[New Thread 0x2aaab34b9700 (LWP 120347)]
[New Thread 0x2aaaec403700 (LWP 120346)]
[New Thread 0x2aaab3ec3700 (LWP 120345)]
[New Thread 0x2aaaeca09700 (LWP 120344)]
[New Thread 0x2aaab3cc1700 (LWP 120343)]
[New Thread 0x2aaab29ac700 (LWP 120342)]
[New Thread 0x2aaaec201700 (LWP 120341)]
[New Thread 0x2aaab33b8700 (LWP 120340)]
[New Thread 0x2aaab35ba700 (LWP 120339)]
[New Thread 0x2aaaec908700 (LWP 120338)]
[New Thread 0x2aaab3fc4700 (LWP 120337)]
[New Thread 0x2aaaec605700 (LWP 120336)]
[New Thread 0x2aaaec706700 (LWP 120335)]
[New Thread 0x2aaaec100700 (LWP 120334)]
[New Thread 0x2aaaec302700 (LWP 120333)]
[New Thread 0x2aaaec807700 (LWP 120332)]
[New Thread 0x2aaab39be700 (LWP 120331)]
[New Thread 0x2aaab37bc700 (LWP 120330)]
307	in crypto_munge.c
(gdb) print buf_out_size
$6 = 160609
(gdb) print buf_size
$7 = 160597
(gdb)
Comment 11 Doug Jacobsen 2016-11-13 23:03:25 MST
So it looks like the call to _pack_cred() is working out a different buffer size than the munge message decodes to: 160597 bytes from _pack_cred() vs. 160609 bytes from unmunging the message.  I manually decoded the message and a cursory glance seems fine.
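For context on where that comparison happens: the "Credential data size mismatch" error comes from comparing the locally repacked credential buffer against the payload recovered by munge_decode(). Below is a minimal sketch of that check; munge_decode() and munge_strerror() are the real libmunge calls, but the surrounding code is illustrative and not the actual crypto_munge.c source.

/* Sketch only: compares a locally packed credential buffer against the
 * payload recovered from a MUNGE credential, which is how the "data size
 * mismatch" error arises. */
#include <munge.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int verify_sizes(const char *munge_cred,   /* "MUNGE:..." string       */
                        const void *local_buf,    /* output of local repack   */
                        int local_len)            /* e.g. 160597 in the trace */
{
    void *payload = NULL;
    int payload_len = 0;                          /* e.g. 160609 in the trace */
    uid_t uid;
    gid_t gid;

    munge_err_t err = munge_decode(munge_cred, NULL, &payload, &payload_len,
                                   &uid, &gid);
    if (err != EMUNGE_SUCCESS) {
        fprintf(stderr, "munge_decode: %s\n", munge_strerror(err));
        return -1;
    }

    /* Any difference between the decoded payload and the locally packed
     * buffer is reported as a credential data size mismatch. */
    int rc = 0;
    if (payload_len != local_len ||
        memcmp(payload, local_buf, (size_t)local_len) != 0) {
        fprintf(stderr, "Credential data size mismatch (%d vs %d)\n",
                local_len, payload_len);
        rc = -1;
    }
    free(payload);
    return rc;
}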
Comment 12 Doug Jacobsen 2016-11-14 00:57:32 MST
Created attachment 3709 [details]
dumped data from _pack_cred from _slurm_cred_sign()

this data was collected by capturing the buffer output from _pack_cred() in _slurm_cred_sign().  I then passed it through xxd to produce this file.
Comment 13 Doug Jacobsen 2016-11-14 01:00:05 MST
Created attachment 3710 [details]
dumped data from decrypted munge data (from slurmctld)

This data was dumped from slurmd in crypto_verify_sign() of the crypto/munge plugin, called from _slurm_cred_verify_signature().  It was then passed through xxd to produce this output.
Comment 14 Doug Jacobsen 2016-11-14 02:45:44 MST
Hello, for the attached memory dumps I traced slurmd while submitting the job in question, breaking on entry into _slurm_cred_verify_signature() and eventually stepping into crypto_verify_sign(), similar to the output posted in comment 10.

The first significant difference I see starts at position 0016ce0 in each file.  It appears the packed bit string for cred->job_core_bitmap and cred->step_core_bitmap differs between the slurmd version and the slurmctld version (the munge output).

It looks like both strings have much in common, but the slurmctld version has a trailing ",33415", which in context appears to be continued output from slurmctld that was cut off.

i.e., the string appears to be:
<early portions>,333404-333467,333472-333535,333540-333603,333608-333671,333676-333739,333744-333807,333812-333875,333880-333943,333948-334011,334016-334079,334084-334147,33415   <- truncated

I think the length of the bit string is overflowing a buffer somehow, but slurmctld does not log any errors from packmem().
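To see how a fixed-size formatting buffer can produce exactly this kind of trailing partial number without any error being raised, here is a small self-contained demonstration; the 64 KiB buffer size and the range generation are made up for illustration, and this is not Slurm code.

/* Formats an ever-growing comma-separated range list into a fixed 64 KiB
 * buffer.  When snprintf() runs out of room it silently truncates, leaving
 * a trailing partial number much like the ",33415" seen in the dumps. */
#include <stdio.h>
#include <string.h>

#define TMP_STR_SIZE 0x10000   /* hypothetical 64 KiB formatting buffer */

int main(void)
{
    static char tmp[TMP_STR_SIZE];
    size_t off = 0;

    for (long start = 0; ; start += 68) {
        int n = snprintf(tmp + off, sizeof(tmp) - off, "%s%ld-%ld",
                         off ? "," : "", start, start + 63);
        if (n < 0 || (size_t)n >= sizeof(tmp) - off)
            break;   /* truncated: partial text stays, no error is reported */
        off += (size_t)n;
    }

    printf("buffer ends with: ...%s\n", tmp + strlen(tmp) - 40);
    return 0;
}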
Comment 15 Doug Jacobsen 2016-11-14 02:46:38 MST
After a lot of trial and error I think this error occurs if the sbatch has a core specialization request, i.e.,

#SBATCH -S 1

instead of on the srun only.
Comment 16 Doug Jacobsen 2016-11-14 03:00:57 MST
Yes, confirmed.  If I get rid of #SBATCH -S 1 from the batch script and move it to the srun statement, the job can run.  Note that 3000 node jobs work fine with #SBATCH -S 1.

My observation is that when sbatch is given the -S 1 argument and the srun inherits it, the credential message seems to be an order of magnitude larger than when -S 1 is given only on the srun.

[2016-11-14T01:28:48.489] debug:  Checking credential with 39572 bytes of sig data
[2016-11-14T01:31:10.868] debug:  credential for job 3089877 revoked
[2016-11-14T01:36:32.965] debug:  Checking credential with 39616 bytes of sig data
[2016-11-14T01:38:51.657] debug:  credential for job 3089883 revoked
[2016-11-14T01:44:06.566] debug:  Checking credential with 214316 bytes of sig data
[2016-11-14T01:44:06.759] error: Credential signature check: Credential data size mismatch
[2016-11-14T01:44:06.759] error: Invalid job credential from 56094@10.128.18.101: Invalid job credential
[2016-11-14T01:44:58.491] debug:  credential for job 3089886 revoked
[2016-11-14T01:50:56.030] debug:  Checking credential with 39572 bytes of sig data


My suspicion is that in this case the message buffer is somehow getting over-large and truncated, causing the message to be rejected later.  I haven't tracked it down the rest of the way.

For now we'll tell our users to avoid putting core specialization in the sbatch line.

-Doug
Comment 17 Doug Jacobsen 2016-11-14 08:38:02 MST
Now that I've given folks the advice not to use core specialization above 3.3k KNL nodes (unsure if that is the hard limit), large-scale jobs seem to be a bit better off (not getting this error).

However, the users have pointed out to me that core specialization must be specified as part of the sbatch/salloc or it does not function:

+ date
Mon Nov 14 01:50:53 PST 2016
+ srun -S 1 --bcast=/tmp/su3_rmd146906 --compress=lz4 --hint=nomultithread --ntasks=24576 --nodes=6144 --ntasks-per-node=4 --cpus-per-task=16 --mem_bind=map_mem:1 ../../su3_rmd-intel-qphix
srun: error: Ignoring --core-spec value for a job step within an existing job. Set specialized cores at job allocation time.


So this indicates a larger issue: we cannot use core specialization for large jobs.

Please let me know if I can provide any additional information to help pull this issue apart.

Thanks,
Doug
Comment 20 Tim Wickberg 2016-11-14 14:30:14 MST
I've been talking with Doug on this a bit, but just to keep everyone in the loop:

I currently believe the issue is that the textual form of the job_core_bitmap is over 64 kilobytes, which leads to truncation given the current RPC implementation. For jobs of ~4k nodes, that is only 16 bytes available for each node's layout string on average, and I could see this being easily exceeded when using core specialization. This would also likely account for some of the variability in reproducing the issue: fragmentation of the node layout changes the size of each node's portion of the string, and can be enough to push it over the limit.

Unfortunately, as an RPC issue this won't be easily fixed within the 16.05 release. I'm looking at how to best fix this, or if we should even be sending the underlying bitstring in the textual representation, but this will take some time.
Comment 21 Tim Wickberg 2016-11-14 23:11:43 MST
Created attachment 3712 [details]
increase tmp_string size to avoid truncation in pack_bit_str

I made a mistake with 16bit vs 32bit field sizes before.

The attached patch should solve the problem, and does not constitute an RPC change.
Comment 24 Doug Jacobsen 2016-11-17 09:26:51 MST
Hello,
The temporary patch to work around this issue did not work: slurmctld would segfault as soon as it attempted to dump partitions, with dump_part unable to initialize (the first function to hit one of these macros).

slurmctld was started with an unlimited stack size, but that still didn't help.  I've backed out the patch.

I was hesitant to simply add a 16 MB xmalloc/xfree to put this on the heap instead, since it seems this macro is used frequently.

-Doug
Comment 25 Dominik Bartkiewicz 2016-11-17 09:52:25 MST
How big is the dump_part file?
Are there any errors in the slurmctld log?

Dominik
Comment 26 Doug Jacobsen 2016-11-17 10:27:00 MST
The system is down for maintenance now.  Once it returns I will check.
Reverting to the 64k buffer allowed it to start up.

Slurmctld segfaulted with no logs that would appear relevant.

Doug
Comment 28 Tim Wickberg 2016-11-21 16:11:25 MST
Hey Doug -

Hope your travel back went smoothly.

I should have realized that 0xfffffe wasn't going to fit on the stack... we have a trivial patch that switches to a malloc/free-based fix and thus avoids the stack allocation problems, but as you've noted this may carry a substantial performance penalty, as every job/step placement uses this macro.

The fix for 17.02 and on involves using pack_bit_str_hex() instead of pack_bit_fmt(), although that's not quite ready yet (the corresponding unpack() calls are obfuscated somewhat) and would break the existing RPC layer.

In the meantime, assuming you haven't tried this already, I suspect that 0x20000 for the buffer size would be sufficient, while not taking up the entire stack and causing the segfault you've seen.
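For reference, the malloc/free-based variant mentioned above amounts to moving the formatting buffer off the stack. Below is a rough sketch of that shape, with hypothetical helper names (format_core_ranges, pack_string) standing in for Slurm's real pack_bit_fmt()/packstr() internals.

/* Sketch of the heap-based approach: a 16 MiB automatic array would blow the
 * thread stack (the segfault seen above), so allocate the formatting buffer
 * on the heap and free it after packing.  Helper names are hypothetical. */
#include <stdlib.h>

#define FMT_BUF_SIZE 0x1000000   /* worst-case size discussed above */

int pack_bitmap_as_text(const void *bitmap, void *rpc_buffer,
                        int (*format_core_ranges)(const void *, char *, size_t),
                        void (*pack_string)(const char *, void *))
{
    char *tmp = malloc(FMT_BUF_SIZE);
    if (!tmp)
        return -1;

    /* One large malloc/free per packed bitmap; this is the performance
     * concern noted above, since every job/step placement goes through
     * this path. */
    if (format_core_ranges(bitmap, tmp, FMT_BUF_SIZE) < 0) {
        free(tmp);
        return -1;
    }
    pack_string(tmp, rpc_buffer);
    free(tmp);
    return 0;
}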
Comment 29 Doug Jacobsen 2016-11-30 14:12:53 MST
Hello,

I rebuilt Slurm with the 0x20000 buffer size, and got slurmctld and the KNL compute nodes using it (the Haswells still have the older build).

Anyway, we are now able to run jobs with 6400 nodes using core specialization.  I'll try to run a full scale job once the system becomes more available.

Are the needed changes committed to 17.02?

-Doug
Comment 30 Tim Wickberg 2016-11-30 14:19:35 MST
(In reply to Doug Jacobsen from comment #29)
> Hello,
> 
> I rebuilt slurm with the 0x20000 buffer size, got the slurmctld and the knl
> compute nodes using it (the haswells still have the older stuff).
> 
> Anyway, we are now able to run jobs with 6400 nodes using core
> specialization.  I'll try to run a full scale job once the system becomes
> more available.

Okay, glad to hear that's working so far.

> Are the needed changes committed to 17.02?

Still working on it; the change to that one specific function is straightforward, but excising other uses of pack_bit_fmt() is taking a bit more investigation: there's no corresponding unpack call, so tracking down where the receive side kicks in takes a bit more care.

I'll close this out when those fixes are all in, and provide the commit id when I have it done.
Comment 31 Tim Wickberg 2017-01-13 12:07:12 MST
Commit 409ab31f605 (plus the next five) switches all 17.02 RPCs over to pack_bit_str_hex() instead of pack_bit_fmt().
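For reference, the appeal of a hex representation is that its size is bounded by the bitmap length (two characters per byte) regardless of how fragmented the allocation is, whereas a decimal range list can grow far larger for fragmented layouts. A minimal sketch of the idea follows; it is not the actual pack_bit_str_hex() implementation.

/* Sketch: encode a bitmap as hex (always 2 characters per byte), in contrast
 * to a comma-separated range list whose length depends on fragmentation.
 * The node/core counts below are taken from the job discussed in this ticket. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *bitmap_to_hex(const unsigned char *bits, size_t nbytes)
{
    char *hex = malloc(2 * nbytes + 1);
    if (!hex)
        return NULL;
    for (size_t i = 0; i < nbytes; i++)
        sprintf(hex + 2 * i, "%02x", bits[i]);
    return hex;
}

int main(void)
{
    /* 6144 nodes x 68 cores = 417792 bits = 52224 bytes of bitmap. */
    size_t nbits = 6144UL * 68;
    size_t nbytes = (nbits + 7) / 8;
    unsigned char *bits = calloc(nbytes, 1);
    if (!bits)
        return 1;
    memset(bits, 0xff, nbytes);   /* pretend every core is allocated */

    char *hex = bitmap_to_hex(bits, nbytes);
    if (hex)
        printf("hex form: %zu chars, fixed at 2 * %zu bytes\n",
               strlen(hex), nbytes);
    free(hex);
    free(bits);
    return 0;
}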