Ticket 9122

Summary: slurmrestd sends back invalid json payload
Product: Slurm Reporter: Andrew Bruno <aebruno2>
Component: slurmrestd Assignee: Nate Rini <nate>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: nate, tim
Version: 20.02.2   
Hardware: Linux   
OS: Linux   
Site: University of Buffalo (SUNY)
Version Fixed: 20.02.4, 20.11
Attachments: system info for openstack instance
system info for bare metal machine
slurm.conf
Patch to fix corrupt json

Description Andrew Bruno 2020-05-28 14:10:13 MDT
We're interested in consuming jobs data using the slurmrestd /slurm/v0.0.35/jobs endpoint. We're finding that this endpoint consistently returns invalid JSON data. Here's an example:

     "cpu_frequency_maximum": null,
     "cpu_frequency_governor": null,
     "cpus_per_tres": "",
     "deadline": 0,
     "delay_boot": 0,
     "dependency": "",
     "derived_exit_code": 0,
   439,
     "user_name": "",
     "wckey": "",


The corruption appears at a random spot in the output each time. For example:

     "burst_buffer": "",
     "burst_buffer_state": "",
     "cluster": "ub-hpc",
     "cluster_features": "",
     "command": "\/projects\/academic\/user\/script",
     "commealse,
     "resize_time": 0,
     "restart_cnt": 0,
     "resv_name": "",
     "shared": null,
     "show_flags": [
       "SHOW_ALL",
       "SHOW_LOCAL"
     ],
     "sockets_per_board": 0,
     "sockets_per_node": null,

We will also sometimes (though not as frequently) get a valid JSON response. We have about 1.5K jobs returned from the endpoint, and the typical response body size is:

< Content-Length: 4078929

Did some testing and it doesn't appear to be related to JSON serialization, but perhaps to conmgr.c _handle_write (buffer/memory issue?).

We compiled slurm with the following:

./configure --enable-slurmrestd

And we're running using a unix socket:

slurmrestd unix:/path/to/slurm.sock

Then we test with curl and python to validate that correct json is returned:

curl -vvv -H 'Accept: application/json' --unix-socket /path/to/slurm.sock http:/slurm/v0.0.35/jobs | python -mjson.tool

Invalid control character at: line 7897 column 17 (char 219192)

Let us know if we can provide any other information.
Comment 1 Nate Rini 2020-05-28 14:32:10 MDT
(In reply to Andrew Bruno from comment #0)
> We're interested in consuming jobs data using the slurmrestd
> /slurm/v0.0.35/jobs endpoint operation. We're finding this endpoint will
> consistently return invalid json data. Here's an example:
Can you please attach the raw response from slurmrestd that contains the invalid JSON? Can you also verify how slurmrestd is being called?
Comment 2 Nate Rini 2020-05-28 14:33:49 MDT
(In reply to Nate Rini from comment #1)
> (In reply to Andrew Bruno from comment #0)
> Can you also verify how slurmrestd is being called.

Just to be more precise: is slurmrestd being called by root or a slurm user?
Comment 3 Andrew Bruno 2020-05-28 15:21:00 MDT
> Just to be more precise: is slurmrestd being called by root or a slurm user?

slurmrestd is being run by a normal user account.

$ id
uid=1000(centos) gid=1000(centos) groups=1000(centos),4(adm),10(wheel),190(systemd-journal)

$ slurmrestd unix:/home/centos/slurm.sock
lt-slurmrestd: _add_connection: [unix:/home/centos/slurm.sock] new connection input_fd=8 output_fd=8
lt-slurmrestd: _signal_change: sending
lt-slurmrestd: _watch: starting connections=0 listen=1
lt-slurmrestd: _watch: detected 1 events from event fd
lt-slurmrestd: _handle_connection: [unix:/home/centos/slurm.sock] waiting to read
lt-slurmrestd: _watch: queuing up listen
lt-slurmrestd: _listen: listeners=1
lt-slurmrestd: _listen: [unix:/home/centos/slurm.sock] listening
lt-slurmrestd: _listen: polling 3/3 file descriptors

Then from another terminal we run:

curl -vvv -H 'Accept: application/json' --unix-socket /home/centos/slurm.sock http:/slurm/v0.0.35/jobs
Comment 4 Nate Rini 2020-05-28 15:25:31 MDT
I'm going to try to replicate this locally first based on your description.

--Nate
Comment 5 Andrew Bruno 2020-05-28 15:41:03 MDT
(In reply to Nate Rini from comment #4)
> I'm going to try to replicate this locally first based on your description.
> 

Sounds good. It appears it requires a large enough job list to replicate. We tried in our dev environment with only a few jobs and couldn't reproduce.  

I'll work on getting you some sanitized raw response output. But in the meantime here's some more details.

We captured a good valid json response (good.json) and one that is invalid json (bad.json):

$ wc -l good.json bad.json 
 128863 good.json
 128153 bad.json

$ cat bad.json | grep '"job_id"' | wc -l
1043

$ cat bad.json | grep '"job_id"' | sort -u | wc -l
128


$ cat good.json | grep '"job_id"' | wc -l
1049

$ cat good.json | grep '"job_id"' | sort -u | wc -l
1049
Comment 6 Nate Rini 2020-05-28 15:57:38 MDT
(In reply to Andrew Bruno from comment #5)
> (In reply to Nate Rini from comment #4)
> > I'm going to try to replicate this locally first based on your description.
> Sounds good. It appears it requires a large enough job list to replicate. We
> tried in our dev environment with only a few jobs and couldn't reproduce.  
So far the naive replication with 1.5k jobs pending doesn't show it either.

> I'll work on getting you some sanitized raw response output. But in the
> meantime here's some more details.
Can you please provide an updated slurm.conf & friends from the cluster?
Comment 8 Nate Rini 2020-05-28 16:03:17 MDT
Can you please also provide your version of json-c? Version '0.13.1+dfsg-4ubuntu0.1' has known data corruption issues.
Comment 9 Andrew Bruno 2020-05-28 16:13:38 MDT
(In reply to Nate Rini from comment #8)
> Can you please also provide your version of json-c. Version
> '0.13.1+dfsg-4ubuntu0.1' has known data corruption issues.

We're running on centos 7.8:

Name        : json-c
Arch        : x86_64
Version     : 0.11
Release     : 4.el7_0
Size        : 64 k

However, I thought this might have been related to json-c and compiled slurm against the latest release of json-c (json-c-0.14-20200419) and still saw the same invalid json responses.
Comment 10 Nate Rini 2020-05-28 16:14:53 MDT
(In reply to Andrew Bruno from comment #0)
>    439,
>      "commealse,
should be
>      "comment": "",

Is the corruption always 3-4 characters long?
Comment 13 Andrew Bruno 2020-05-28 17:30:19 MDT
(In reply to Nate Rini from comment #10)
> (In reply to Andrew Bruno from comment #0)
> >    439,
> >      "commealse,
> should be
> >      "comment": "",
> 
> Is the corruption always 3-4 characters long?

Here's a few more examples of the corruption:

.....
     "core_spec": null,
     "thread_spec": null,
     "cores_per_socket":ray_task_id": null,
     "array_max_tasks": 0,
     "array_task_string": "",
     "association_id": 735,

.....
     "contiguous": false,
  spend_time": 0,
     "system_comment": "",
     "time_limit": 1440,


....
     "wckey": "",
     "current_working_directory": "\/gpfs\/scratch\/hpcc"
  "",
     "tres_req_str": "cpu=1,mem=2800M,node=1,billing=1",
Comment 14 Nate Rini 2020-05-28 17:37:27 MDT
Can you please provide your slurm.conf & friends and the following output:
> lscpu
> lsmem
> cat /proc/meminfo
> numactl -a -s

I'm wondering if your cpu configuration is more likely to cause a race condition. Have you seen this corruption on any other machines?
Comment 15 Andrew Bruno 2020-05-28 17:59:57 MDT
(In reply to Nate Rini from comment #14)
> Can you please provide your slurm.conf & friends and the following output:
> > lscpu
> > lsmem
> > cat /proc/meminfo
> > numactl -a -s
> 
> I'm wondering if your cpu configuration is more likely to cause a race
> condition. Have you seen this corruption on any other machines?

Yes, I can reproduce on both a cloud instance (running in OpenStack) and a bare metal machine. Attached are the system info files (sysinfo-openstack.txt and sysinfo-baremetal.txt) along with slurm.conf.

Let us know if you need anything else.
Comment 16 Andrew Bruno 2020-05-28 18:00:44 MDT
Created attachment 14434 [details]
system info for openstack instance
Comment 17 Andrew Bruno 2020-05-28 18:01:08 MDT
Created attachment 14435 [details]
system info for bare metal machine
Comment 18 Andrew Bruno 2020-05-28 18:01:39 MDT
Created attachment 14436 [details]
slurm.conf
Comment 19 Andrew Bruno 2020-05-28 20:09:18 MDT
Created attachment 14437 [details]
Patch to fix corrupt json
Comment 20 Andrew Bruno 2020-05-28 20:14:50 MDT
I was able to fix the issue (patch is attached). Narrowed it down to this line:

        memcpy(get_buf_data(con->out), (get_buf_data(con->out) + wrote),
               (get_buf_offset(con->out) - wrote));

The memory areas must not overlap when using memcpy. There may be a better approach, but switching to memmove fixed all the corrupted data I was seeing.
Comment 24 Nate Rini 2020-06-01 10:44:19 MDT
Andrew

This patch has been sent for review and inclusion upstream.

Thanks,
--Nate
Comment 26 Nate Rini 2020-06-05 13:41:09 MDT
This patch is upstream: https://github.com/SchedMD/slurm/commit/37aba351deb13f436665ca7e4cfeb80211d1fbb6

Thanks,
--Nate