Ticket 6859 - elasticsearch jobcomplete plugin
Summary: elasticsearch jobcomplete plugin
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 17.11.13
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-04-15 07:57 MDT by Andrea
Modified: 2019-07-26 09:59 MDT
CC: 2 users

See Also:
Site: Atos/Eviden Sites
Alineos Sites: ---
Atos/Eviden Sites: DKRZ Germany
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.11.14
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurmctld logs (9.89 MB, text/plain)
2019-04-16 02:07 MDT, Andrea
Details
elasticsearcg (179.20 KB, application/x-gzip)
2019-04-17 05:46 MDT, Andrea
Details
6859_1711_v1 (2.89 KB, patch)
2019-05-06 08:44 MDT, Alejandro Sanchez
Details | Diff
Elastic search SLURM's file (179.20 KB, application/gzip)
2019-06-17 05:20 MDT, Andrea
Details
elasticsearch logs (15.34 MB, text/x-log)
2019-06-17 05:20 MDT, Andrea
Details
Outlook-Atos logo.png (12.25 KB, image/png)
2019-06-21 04:14 MDT, Andrea
Details
Outlook-LinkedIn i.png (1.37 KB, image/png)
2019-06-21 04:14 MDT, Andrea
Details
Outlook-Twitter ic.png (1.22 KB, image/png)
2019-06-21 04:14 MDT, Andrea
Details
Outlook-Facebook i.png (909 bytes, image/png)
2019-06-21 04:14 MDT, Andrea
Details
Outlook-Youtube ic.png (1.26 KB, image/png)
2019-06-21 04:14 MDT, Andrea
Details
image006.png (13.02 KB, image/png)
2019-06-24 04:46 MDT, Andrea
Details
image007.png (1.38 KB, image/png)
2019-06-24 04:46 MDT, Andrea
Details
image008.png (1.38 KB, image/png)
2019-06-24 04:46 MDT, Andrea
Details
image009.png (960 bytes, image/png)
2019-06-24 04:46 MDT, Andrea
Details
image010.png (1.41 KB, image/png)
2019-06-24 04:46 MDT, Andrea
Details

Description Andrea 2019-04-15 07:57:55 MDT
Hi Team,

Here at DKRZ we have noticed some issues with the elasticsearch job completion plugin.
Elasticsearch is complaining about the size of the messages, and our logs are being spammed with:
[2019-04-15T15:55:26,928][DEBUG][o.e.a.b.TransportShardBulkAction] [ma3] [slurm][1] failed to execute bulk item (index) BulkShardRequest [[slurm][1]] containing [index {[slurm][jobcomp][AWohR_xPSKL63ncfLTBZ], source[n/a, actual length: [10.8kb], max length: 2kb]}]
[2019-04-15T15:55:26,942][DEBUG][o.e.a.b.TransportShardBulkAction] [ma3] [slurm][3] failed to execute bulk item (index) BulkShardRequest [[slurm][3]] containing [index {[slurm][jobcomp][AWohR_xes-p2C7Ge78QZ], source[n/a, actual length: [10.8kb], max length: 2kb]}]
[2019-04-15T15:55:26,949][DEBUG][o.e.a.b.TransportShardBulkAction] [ma3] [slurm][1] failed to execute bulk item (index) BulkShardRequest [[slurm][1]] containing [index {[slurm][jobcomp][AWohR_xkAlBr5lb5kHIQ], source[n/a, actual length: [10.8kb], max length: 2kb]}]

We're not sure when this started, since all the logs we have show these errors.

Configuration:
[root@ma1 slurm]# scontrol show config | grep -iE 'comp'
CompleteWait            = 40 sec
JobCompHost             = localhost
JobCompLoc              = http://10.50.64.90:9201
JobCompPort             = 0
JobCompType             = jobcomp/elasticsearch
JobCompUser             = root

Could you please advise how we should go about resolving this issue?

Regards,
Andrea
Comment 1 Alejandro Sanchez 2019-04-15 09:31:25 MDT
Hi,

Will you temporarily increase the debug level to at least 'debug' and enable the Elasticsearch debug flag, let some jobs finish, then reset the debug level to its original value, remove the flag, and attach slurmctld.log?

$ scontrol setdebug debug
$ scontrol setdebugflags +elasticsearch
Let some jobs finish
$ scontrol setdebugflags -elasticsearch
$ scontrol setdebug info (or whatever you had as original)

I'd like to know if the error is logged due to a single job index retry or for every job and why.

Thanks.
Comment 2 Andrea 2019-04-16 02:07:41 MDT
Created attachment 9917 [details]
Slurmctld logs
Comment 3 Andrea 2019-04-16 02:08:11 MDT
Hi Alejandro,

Thank you for your answer.
Kindly find attached the slurmctld logs.

Regards,
Andrea
Comment 4 Alejandro Sanchez 2019-04-16 08:53:01 MDT
Will you stop slurmctld, copy the StateSaveLocation/elasticsearch_state file to a temporary directory, then start slurmctld and attach the file here?

What version of elasticsearch are you using?

I'm wondering if there's a problem with some fields being malformed, or an issue with the mappings Elasticsearch expects, such that the server's response status code is always 400. I'm interested in loading those jobs locally to see whether I can index them or find a common pattern among the failing ones. The logs show that jobs are still being indexed after the failures.

Thanks.
Comment 5 Andrea 2019-04-17 04:18:02 MDT
Hi Alejandro,

This is a production system and we can't afford to stop slurmctld for this - is there any alternative?

Regards,
Andrea
Comment 6 Andrea 2019-04-17 04:19:11 MDT
Regarding the elasticsearch version: elasticsearch-5.5.1-1.noarch
Comment 7 Alejandro Sanchez 2019-04-17 05:15:11 MDT
Then just attach a tar file with the two elasticsearch_state[.old] files. I just wanted to minimize the likelihood of copying the files while slurmctld is manipulating them during a restart.
Comment 8 Andrea 2019-04-17 05:46:21 MDT
Created attachment 9929 [details]
elasticsearcg

Hi Alejandro,

Kindly find attached the required files.

Regards,
Andrea
Comment 9 Andrea 2019-04-23 01:41:09 MDT
Hi Alejandro,

Have you had any chance to look into this?

Regards,
Andrea
Comment 10 Alejandro Sanchez 2019-04-23 05:37:15 MDT
I can reproduce the issue with the provided state. I'm looking into it.
Comment 11 Alejandro Sanchez 2019-04-23 06:05:09 MDT
Do you happen to have handy any job script associated with any of the jobs that failed to be indexed? Some of the jobids are 6980257 and 7000918, but they can all be found by inspecting the attached state file.
Comment 12 Andrea 2019-04-23 06:06:41 MDT
Hi Alejandro,

Unfortunately no. I would have to ask around.

Regards,
Andrea


Andrea Del Monaco
HPC Consultant – Big Data & Security [GER]
M: +31 612031174
Burgemeester Rijnderslaan 30 – 1185 MC Amstelveen – The Netherlands
atos.net




Comment 13 Alejandro Sanchez 2019-04-23 07:51:42 MDT
It seems there are invalid UTF-8 characters being sent as part of the serialized JSON representation of jobs:

alex@polaris:~$ sudo grep "Invalid UTF" /var/log/elasticsearch/elasticsearch.log | sort -u
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x63
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x6e
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x74
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xa1
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xb0
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc
alex@polaris:~$

Either that, or there's a bug in the FasterXML Jackson JSON parser[1]. I'm considering how best to proceed at this point.

[1] https://github.com/FasterXML/jackson
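For illustration, the failure mode above can be reproduced outside Slurm in a few lines (a Python sketch, not the plugin's C code): bytes that are valid ISO-8859-1, such as 0xFC for 'ü', are rejected by a strict UTF-8 decoder with exactly the kind of error Jackson reports, and transcoding repairs them.

```python
# Sketch: why Latin-1 bytes (e.g. 0xFC for 'ü') trip a strict UTF-8
# parser, and how transcoding to UTF-8 repairs them.
latin1 = "Grüße".encode("iso-8859-1")          # b'Gr\xfc\xdfe'

try:
    latin1.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)                          # "invalid start byte", as in the Jackson errors

fixed = latin1.decode("iso-8859-1").encode("utf-8")
print(fixed)                                   # b'Gr\xc3\xbc\xc3\x9fe' -- valid UTF-8
```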
Comment 17 Alejandro Sanchez 2019-05-03 12:37:45 MDT
Hi,

Your attached elasticsearch_state contains 337 jobs that could not be indexed. I've managed to break down the reasons:

329 due to Invalid UTF-8. 

6 due to script string containing control character ACK.

failed to parse field [script] of type [text] ... Illegal unquoted character ((CTRL-CHAR, code 6)): has to be escaped using backslash.

1 due to a job name containing a quotation mark.

Unexpected character ('"' (code 34)): was expecting comma to separate Object entries.

... "time_limit":1800,"job_name":"icon"","account":"bm0974", ...
                                      ^
1 due to alloc_node containing a vertical tab.

failed to parse field [alloc_node] of type [text]. Illegal unquoted character ((CTRL-CHAR, code 11)): has to be escaped using backslash.

... "partition":"prepost","alloc_node":"^K", ...
                                        ^^
That said, I have good news and bad news.

The good news is that I've created a very specific patch that solves all of these problems for your site's attached state: with the patch applied, all the jobs loaded from the attached state are successfully indexed. To do so, I wrote a function that converts strings from the ISO-8859-1 character encoding (I suspected some contained German-specific characters) to UTF-8 using the iconv() family of functions. I also wrote a function to strip out control characters, and finally used a string substitution for the job named 'icon"'.

The bad news is that the scope of the problem isn't just this plugin; the whole Slurm codebase is affected today. Bug 2265 is a known issue: Slurm today doesn't aggressively sanitize user-provided strings, which means problems can currently surface in many places.

I admit that the plugin could be improved in several ways (e.g. using the json-c library instead of manually building a JSON object from each finished job record), but if the job record reaches the plugin with unsanitized strings in the job name, the job script, and so on, edge cases can still be triggered.

On Monday I will run some more tests on the workaround patch for this specific attached state. If you want, I will attach it so that you still have the chance to index these finished jobs. But if users continue submitting scripts with non-UTF-8 characters, control characters and the like, I foresee more problems. I'll give you more details on the patch next week.

Longer term (and not earlier than the future 20.02 release) we would like to tackle the problem of unsanitized strings, as well as improve how the plugin is implemented today. Until then this is a known unresolved issue, and the most I can give you is this short-term, site-specific workaround.
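The sanitization steps described above can be sketched as follows (Python for brevity; the actual patch does this in the plugin's C code using iconv(3)). The function name soft_sanitize and the exact control-character handling here are illustrative assumptions; this covers the UTF-8 and control-character failure classes, but not the quotation-mark case:

```python
import re

def soft_sanitize(serialized: bytes) -> bytes:
    """Illustrative sketch of the sanitization described above.

    The real workaround is C code inside the jobcomp/elasticsearch
    plugin; this only mirrors the idea for two of the failure classes.
    """
    try:
        text = serialized.decode("utf-8")       # already valid UTF-8
    except UnicodeDecodeError:
        # Assume the record was ISO-8859-1 (German characters etc.)
        text = serialized.decode("iso-8859-1")
    # Strip raw control characters (e.g. ACK 0x06, VT 0x0B) that must
    # not appear unescaped inside JSON string values.
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)
    return text.encode("utf-8")
```

With this, a record like `"alloc_node":"<VT>"` becomes `"alloc_node":""`, and Latin-1 job names re-encode as valid UTF-8.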
Comment 18 Alejandro Sanchez 2019-05-06 08:44:25 MDT
Created attachment 10124 [details]
6859_1711_v1

Hi,

Please see attached a patch to sanitize your 337 unindexed serialized jobs.

What this patch does: in the thread that processes the list of jobs pending indexing, once a job has failed, it calls a function _soft_sanitize() to address all the problems specific to these 337 jobs in the attached state. Once the serialized strings are amended, the thread retries indexing them after INDEX_RETRY_INTERVAL (30 seconds), and this should then succeed (it does for me).

After that, I would remove the patch. Otherwise the code will try to _soft_sanitize() future jobs that fail to be indexed, including those that fail for other reasons (e.g. the Elasticsearch server being down), and we don't want that. It is intended as a one-time thing to index this state.

The patch only changes this file

src/plugins/jobcomp/elasticsearch/jobcomp_elasticsearch.c

so applying/removing the patch will require recompiling using 'make' && 'make install' against the Makefile in

/path/to/slurm/build/src/plugins/jobcomp/elasticsearch/Makefile

Please, let me know how it goes.

Thanks.
Comment 19 Alejandro Sanchez 2019-05-06 08:46:04 MDT
(In reply to Alejandro Sanchez from comment #18)
> so applying/removing the patch will require recompiling

... (and restart slurmctld after doing so).
Comment 20 Alejandro Sanchez 2019-05-09 05:16:45 MDT
Hi. Any updates on this? thanks.
Comment 21 Andrea 2019-05-09 05:26:37 MDT
Hi Alejandro,

We have recompiled the Slurm packages with your patch and will be installing them during a downtime window happening in June.

Thank you for your help with this case.
Will this patch be included in the master branch?

Regards,
Andrea
Comment 22 Alejandro Sanchez 2019-05-09 05:32:31 MDT
(In reply to Andrea from comment #21)
> Hi Alejandro,
> 
> We have recompiled the Slurm packages with your patch and will be installing
> them during a downtime window happening in June.
> 
> Thank you for your help with this case.
> Will this patch be included in the master branch?

No. At most I'd refactor the plugin in the master branch, but this patch is very specific to your site/state and wouldn't make sense for other sites. It doesn't even make sense for future finished jobs at your site. It should be a one-time workaround, applied to get your state indexed and then removed.
 
> Regards,
> Andrea
Comment 23 Andrea 2019-06-17 03:25:25 MDT
Hi Alejandro,

Thank you for your patience.
Last week we performed the upgrade of the Slurm packages, but we keep seeing the following messages being spammed:
[2019-06-17T11:23:34,025][DEBUG][o.e.a.b.TransportShardBulkAction] [ma3] [slurm][3] failed to execute bulk item (index) BulkShardRequest [[slurm][3]] containing [index {[slurm][jobcomp][AWtkv7YJs-p2C7GevghS], source[{"jobid":60526,"username":"k202068","user_id":20586,"groupname":"k20200","group_id":1567,"@start":"2016-06-16T08:19:38","@end":"2016-06-16T08:19:38","elapsed":0,"partition":"compute","alloc_node":"btlogin1","nodes":"(null)","total_cpus":0,"total_nodes":0,"derived_exitcode":0,"exitcode":0,"state":"CANCELLED","cpu_hours":0.000000,"@submit":"2016-06-16T08:19:09","eligible_time":18446744073709551345,"work_dir":"/mnt/lustre/pf/k/k202068/mpi_programme","std_err":"/mnt/lustre/scratch/k/k202068/dev-shm.e%j","std_in":"/dev/null","std_out":"/mnt/lustre/scratch/k/k202068/dev-shm.o%j","cluster":"bullp","qos":"normal","ntasks":48,"ntasks_per_node":48,"cpus_per_task":1,"time_limit":3600,"account":"k20200","script":"#!/bin/bash\n#SBATCH --job-name=dev-shm \n#SBATCH --partition=temp\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=48\n#SBATCH --time=00:60:00\n#SBATCH --mail-type=FAIL\n#SBATCH --mail-user=beyer@dkrz.de\n#SBATCH --account=k20200\n#SBATCH --output=/mnt/lustre/scratch/k/k202068/dev-shm.o%j\n#SBATCH --error=/mnt/lustre/scratch/k/k202068/dev-shm.e%j\n#SBATCH --begin=now+300\n####SBATCH --reservation=k202068_59\n####SBATCH --constraint=haswell\n\nexport SESSION_TMP_DIR=\"/dev/shm/${SLURM_JOB_USER}_${SLURM_JOB_ID}\"\n\nenv | grep -i SESSION_TMP_DIR\n\nfor i in `seq 0 9`; do\n   dd if=/dev/zero of=${SESSION_TMP_DIR}/dd_testfile_100M_${i} bs=1M count=100\ndone\n\nsleep 1\n\nls -la ${SESSION_TMP_DIR}\n\nsleep 1\n\nmodule load bullxmpi/bullxmpi-1.2.8.4\nmodule load intel/16.0.3\n\nsrun -l --propagate=STACK --cpu_bind=threads --distribution=block:cyclic ./heat-mpi-slow-big_5120-intel-16.03-bullxmpi-1.2.8.4\n\n\n","parent_accounts":"/root/dkrz/k20200"}]}]
org.elasticsearch.index.mapper.MapperParsingException: failed to

Any idea?
Comment 24 Alejandro Sanchez 2019-06-17 03:50:16 MDT
Hi Andrea,

Can you attach your current elasticsearch_state[.old] files and elasticsearch.log?

Thanks
Comment 25 Andrea 2019-06-17 05:20:10 MDT
Created attachment 10618 [details]
Elastic search SLURM's file
Comment 26 Andrea 2019-06-17 05:20:46 MDT
Created attachment 10619 [details]
elasticsearch logs
Comment 27 Andrea 2019-06-17 05:21:07 MDT
Hi,

Kindly find attached the required files.

Regards,
Andrea
Comment 28 Alejandro Sanchez 2019-06-20 09:57:30 MDT
Hi Andrea,

Looking at the logs

alex@polaris:~/Downloads/new$ grep "Caused by" elasticsearch.log  | sort -u
Caused by: com.fasterxml.jackson.core.JsonParseException: Numeric value (18446744073709548016) out of range of long (-9223372036854775808 - 9223372036854775807)
Caused by: com.fasterxml.jackson.core.JsonParseException: Numeric value (18446744073709551345) out of range of long (-9223372036854775808 - 9223372036854775807)
Caused by: java.lang.NullPointerException
alex@polaris:~/Downloads/new$

Elasticsearch specifies[1] for the long number type:

long - A signed 64-bit integer with a minimum value of -2^63 and a maximum value of 2^63-1.

This range matches the one mentioned in the logs, where there are jobs containing

... "eligible_time":18446744073709551345 ...

These jobs were probably serialized before the following commit, i.e. using a version prior to slurm-16-05-10-1:

https://github.com/SchedMD/slurm/commit/0a4d5770e9e

That commit changed eligible_time from being formatted as an unsigned long to @eligible formatted as a date type, and added queue_wait as a long (signed 64-bit integer).

With this command you can inspect the type of each field (mappings) for your slurm index:

alex@polaris:~/t$ curl -XGET http://localhost:9200/slurm/_mapping?pretty
{
  "slurm" : {
    "mappings" : {
      "jobcomp" : {
        "properties" : {
          "@eligible" : {
            "type" : "date"
          },
...
          "queue_wait" : {
            "type" : "long"
          },
...

I noticed something strange, though. The jobs that the Elasticsearch logs complain about, like

source[{"jobid":60526,"username":"k202068",..."eligible_time":18446744073709551345,..."cluster":"bullp"...

don't appear in the elasticsearch_state you attached. The state you provided seems to contain jobs from cluster mistral, running a newer version of Slurm, whereas the logs show cluster bullp with the old eligible_time formatting explained above. This reasoning is also backed up by the fact that the state you attached is properly indexed with the patch I provided in this ticket.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html
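For what it's worth, both out-of-range values from the logs decode as small negative numbers when reinterpreted as signed 64-bit integers, which is consistent with a negative time value having been formatted through an unsigned conversion (a Python sketch; the helper name and the interpretation are assumptions, not confirmed from the state file):

```python
# Reinterpret the out-of-range values from the logs as signed 64-bit
# integers: both come out as small negatives, the signature of a
# negative number printed through an unsigned format specifier.
def as_signed64(u: int) -> int:
    """Map an unsigned 64-bit value to its two's-complement signed value."""
    return u - 2**64 if u >= 2**63 else u

print(as_signed64(18446744073709551345))   # -271
print(as_signed64(18446744073709548016))   # -3600
```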
Comment 29 Andrea 2019-06-21 04:14:25 MDT
Created attachment 10666 [details]
Outlook-Atos logo.png

Hi Alejandro,

Thank you for the analysis - but then how do we move on from here?
It's weird that you see the bullp cluster as well as mistral - it has always been mistral, AFAIK.

Comment 30 Andrea 2019-06-21 04:14:25 MDT
Created attachment 10667 [details]
Outlook-LinkedIn i.png
Comment 31 Andrea 2019-06-21 04:14:25 MDT
Created attachment 10668 [details]
Outlook-Twitter ic.png
Comment 32 Andrea 2019-06-21 04:14:25 MDT
Created attachment 10669 [details]
Outlook-Facebook i.png
Comment 33 Andrea 2019-06-21 04:14:25 MDT
Created attachment 10670 [details]
Outlook-Youtube ic.png
Comment 34 Alejandro Sanchez 2019-06-24 04:43:13 MDT
The problem is that the last state you attached doesn't contain the serialized jobs the logs complain about. For example, the log complains about job 60526, but the state doesn't contain information about this job.

Do you have (or have you had) more than one cluster trying to index to the same Elasticsearch server?

Can you attach the state containing the jobs that, according to the logs, can't be indexed? Then, with that state, I can prepare a modification of the previous patch to amend the out-of-range long issue.
Comment 35 Andrea 2019-06-24 04:46:11 MDT
Created attachment 10682 [details]
image006.png

Hi Alejandro,

I have attached the files you asked for, so I am not sure what else I should attach…
Are there any other files that I might attach?

Comment 36 Andrea 2019-06-24 04:46:12 MDT
Created attachment 10683 [details]
image007.png
Comment 37 Andrea 2019-06-24 04:46:12 MDT
Created attachment 10684 [details]
image008.png
Comment 38 Andrea 2019-06-24 04:46:12 MDT
Created attachment 10685 [details]
image009.png
Comment 39 Andrea 2019-06-24 04:46:12 MDT
Created attachment 10686 [details]
image010.png
Comment 40 Alejandro Sanchez 2019-06-24 05:10:18 MDT
The plugin keeps an internal list (pool) of finished jobs to be indexed, of size MAX_JOBS (1000000). Jobs that can't be indexed for any reason are retried every INDEX_RETRY_INTERVAL (30 seconds). When the plugin finishes execution due to slurmctld termination, this list is saved to the elasticsearch_state file. Complaints about jobs like 60526 appear in elasticsearch.log, but no such job is in the elasticsearch_state you attached:

alex@polaris:~/Downloads/enew$ grep -c 60526 elasticsearch.log
737
alex@polaris:~/Downloads/enew$ grep -c 60526 elasticsearch_state
0
alex@polaris:~/Downloads/enew$

This means that either:

a) Job 60526 was already indexed (unlikely but could you double-check?)

b) You have more than one cluster and attached an elasticsearch_state and elasticsearch.log from different clusters, or the ClusterName for this system changed at some point in time:

alex@polaris:~/Downloads/enew$ grep -c bullp elasticsearch.log 
1494
alex@polaris:~/Downloads/enew$ grep -c mistral elasticsearch.log 
0
alex@polaris:~/Downloads/enew$ grep -c bullp elasticsearch_state
0
alex@polaris:~/Downloads/enew$ grep -c mistral elasticsearch_state
337
alex@polaris:~/Downloads/enew$

c) Job 60526 wasn't saved in the elasticsearch_state for some reason (unlikely).
Comment 41 Alejandro Sanchez 2019-07-15 04:19:34 MDT
Hi,

Is there anything else you need here? Thanks.
Comment 42 Alejandro Sanchez 2019-07-26 09:59:34 MDT
Hi, I'm going to close this out, given the standalone workaround patch provided. I'm planning to refactor the plugin to address a couple of issues I've identified over time, but that won't happen until 20.02 at the soonest.