Ticket 12365 - OpenAPI Generated Python Code does not return TRES Values
Summary: OpenAPI Generated Python Code does not return TRES Values
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmrestd
Version: 20.11.8
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks: 12648
Reported: 2021-08-26 18:15 MDT by Bill Britt
Modified: 2021-10-26 16:18 MDT
CC: 5 users

See Also:
Site: U WA Health Metrics
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08.3, 22.05pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
- API output called outside of openapi generated library. (4.67 KB, text/plain) - 2021-08-26 18:15 MDT, Bill Britt
- sacct output with MaxRSS Value (19.95 KB, image/png) - 2021-08-26 18:15 MDT, Bill Britt
- sacct_output (256 bytes, text/plain) - 2021-08-27 10:22 MDT, Sam Hu
- slurm.conf (3.64 KB, text/plain) - 2021-08-30 10:22 MDT, Bill Britt
- Output of sacct -p -o all -j 1523 -D (5.74 KB, text/plain) - 2021-08-30 14:12 MDT, Bill Britt
- slurmdbd.conf with password removed (359 bytes, text/plain) - 2021-09-10 09:39 MDT, Bill Britt
- slurmdbd_get_job_result_j1523_0910 (5.71 KB, text/plain) - 2021-09-10 15:39 MDT, Sam Hu
- curl_output_0910 (1008 bytes, text/plain) - 2021-09-10 17:17 MDT, Sam Hu
- sacct_output_0910 (234 bytes, text/plain) - 2021-09-10 17:18 MDT, Sam Hu
- sacct_output0914 (254 bytes, text/plain) - 2021-09-14 11:02 MDT, Sam Hu
- curl_output_0914 (1.32 MB, text/plain) - 2021-09-14 11:03 MDT, Sam Hu
- slurmdbd_get_job_result_0914 (6.17 KB, text/plain) - 2021-09-14 11:04 MDT, Sam Hu
- Another set of test (with job created today) (752 bytes, text/plain) - 2021-09-14 12:19 MDT, Sam Hu
- curl_output_0914_1 (1.32 MB, text/plain) - 2021-09-14 12:21 MDT, Sam Hu
- slurmdbd_get_job_result_0914_1 (6.16 KB, text/plain) - 2021-09-14 12:21 MDT, Sam Hu
- slurmdbd logs (26.02 KB, text/plain) - 2021-09-14 12:43 MDT, Bill Britt
- slurmctld logs (4.05 KB, text/plain) - 2021-09-14 12:43 MDT, Bill Britt
- sacct_p_all_d_j_1523 (5.74 KB, text/plain) - 2021-09-15 10:22 MDT, Sam Hu
- sacct_p_all_d_j_16445 (5.76 KB, text/plain) - 2021-09-15 10:23 MDT, Sam Hu
- curl_output_0916_j1523 updated (22.61 KB, text/plain) - 2021-09-16 16:17 MDT, Sam Hu
- curl_output_0916_j16445 (22.86 KB, text/plain) - 2021-09-16 16:17 MDT, Sam Hu
- slurmdbd_get_job_result_j1523_j16445_0917 (11.92 KB, text/plain) - 2021-09-17 17:17 MDT, Sam Hu
- python source file (2.33 KB, text/x-python) - 2021-09-17 17:25 MDT, Sam Hu
- simple script to look for step TRES (1.12 KB, text/x-python) - 2021-09-22 11:42 MDT, Nate Rini
- showjob script (1.05 KB, text/x-python3) - 2021-09-22 18:03 MDT, Nate Rini
- patch for 20.11 (1.10 KB, patch) - 2021-09-22 18:03 MDT, Nate Rini
- New set of test 0927 (2.68 KB, text/x-python) - 2021-09-27 16:59 MDT, Sam Hu
- sacct_output_j16632_0927 (255 bytes, text/plain) - 2021-09-27 17:00 MDT, Sam Hu
- test_slurm_worker_node_for_schedmd_4 (2.41 KB, text/x-python) - 2021-09-27 17:01 MDT, Sam Hu
- slurmdbd_get_job_result_j16632_0927 (6.45 KB, text/plain) - 2021-09-27 17:02 MDT, Sam Hu
- conda_slurm_reference in my testing (89.58 KB, image/png) - 2021-09-29 10:08 MDT, Sam Hu
- OpenAPI file from after patching (86.63 KB, application/json) - 2021-09-29 14:53 MDT, Bill Britt
- slurmrestd logs (167.65 KB, text/plain) - 2021-09-30 09:21 MDT, Bill Britt
- backtrace (73.38 KB, text/plain) - 2021-09-30 14:13 MDT, Bill Britt
- slurmdbd log (26.35 KB, text/plain) - 2021-09-30 16:21 MDT, Bill Britt
- slurmrestd log curl error (2.61 KB, text/plain) - 2021-10-01 13:35 MDT, Ali Nikkhah
- slurmdbd log around curl time (1.50 KB, text/plain) - 2021-10-01 23:00 MDT, Ali Nikkhah
- slurmdbd debug3 and debugflags log (603 bytes, text/plain) - 2021-10-04 16:10 MDT, Ali Nikkhah
- slurmrestd debug3 log (7.19 KB, text/plain) - 2021-10-04 16:14 MDT, Ali Nikkhah
- slurmrestd strace (5.24 KB, text/plain) - 2021-10-05 14:07 MDT, Ali Nikkhah
- j18261_job_output_of_slurmdbd_get_job (28.00 KB, text/plain) - 2021-10-08 17:23 MDT, Sam Hu
- j18261_job_output_of_sacct (208 bytes, text/plain) - 2021-10-08 17:24 MDT, Sam Hu

Description Bill Britt 2021-08-26 18:15:15 MDT
Created attachment 21055 [details]
API output called outside of openapi generated library.

Hello, we are unable to obtain TRES values using Python code generated by the following openapi-generator call on Slurm 20.11.8:
  docker run --rm \
    --volume ${PWD}:/local \
    --workdir=/local \
    openapitools/openapi-generator-cli:v4.3.1 \
    generate \
    --input-spec /local/slurm-api.json \
    --output /local/py_api_client \
    --generator-name python \
    --package-name slurm_urllib3

The attached sacct.png shows the MaxRSS value returned by sacct, and the slurmdbd_get_job_result.txt file shows that the value is returned by the API when called directly with cURL to the /slurmdb/v0.0.36/job/{job_id} route; however we cannot find the result from "slurm_api.slurmdbd_get_job".


Version info:
# /opt/slurm/sbin/slurmrestd -V
slurm 20.11.8

# sbatch -V
slurm 20.11.8
Comment 1 Bill Britt 2021-08-26 18:15:47 MDT
Created attachment 21056 [details]
sacct output with MaxRSS Value
Comment 2 Nate Rini 2021-08-27 09:12:45 MDT
(In reply to Bill Britt from comment #1)
> Created attachment 21056 [details]
> sacct output with MaxRSS Value

For future reference, we prefer the 'sacct -p' formatted output that can be attached as a file to the ticket instead of an image.
Comment 8 Nate Rini 2021-08-27 09:34:28 MDT
(In reply to Bill Britt from comment #0)
> the value is returned by the API
> when called directly with cURL to the /slurmdb/v0.0.36/job/{job_id} route;
> however we cannot find the result from  "slurm_api.slurmdbd_get_job".

Please also attach this curl output.
Comment 9 Sam Hu 2021-08-27 10:22:52 MDT
Created attachment 21070 [details]
sacct_output
Comment 10 Sam Hu 2021-08-27 10:23:19 MDT
Created attachment 21071 [details]
curl_output
Comment 11 Nate Rini 2021-08-27 12:22:23 MDT
(In reply to Sam Hu from comment #10)
> Created attachment 21071 [details]
> curl_output

The auth token was included in the curl output. Please make sure to cycle the JWT key on the cluster unless this is a trivial test cluster.
Comment 12 Nate Rini 2021-08-27 12:30:14 MDT
Please also attach your slurm.conf
Comment 15 Bill Britt 2021-08-30 10:22:21 MDT
Created attachment 21083 [details]
slurm.conf
Comment 16 Nate Rini 2021-08-30 14:06:26 MDT
Please also call and attach output (as a file):
> sacct -p -o all -j 1523 -D
Comment 17 Bill Britt 2021-08-30 14:12:20 MDT
Created attachment 21088 [details]
Output of sacct -p -o all -j 1523 -D
Comment 23 Nate Rini 2021-08-31 10:01:53 MDT
Bill

Still working on this issue. The raw values reported by slurmrestd appear to be incorrect; I am still debugging the cause.

Thanks,
--Nate
Comment 25 Nate Rini 2021-09-10 09:35:17 MDT
Please attach your slurmdbd.conf (sans passwords).
Comment 26 Bill Britt 2021-09-10 09:39:18 MDT
Created attachment 21221 [details]
slurmdbd.conf with password removed
Comment 27 Bill Britt 2021-09-10 09:41:07 MDT
Hi Nate, added the slurmdbd.conf file.

If it will help I would be happy to schedule a screen sharing session to troubleshoot on the system live.
Comment 28 Nate Rini 2021-09-10 09:48:43 MDT
Please call the following in mysql against the Slurm db and attach the output:
> use slurm_acct_db;
> select * from tres_table;
> select * from general_step_table where job_db_inx = 1523\G
Comment 29 Bill Britt 2021-09-10 09:52:42 MDT
MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select * from tres_table;
+---------------+---------+------+----------------+------+
| creation_time | deleted | id   | type           | name |
+---------------+---------+------+----------------+------+
|    1623869945 |       0 |    1 | cpu            |      |
|    1623869945 |       0 |    2 | mem            |      |
|    1623869945 |       0 |    3 | energy         |      |
|    1623869945 |       0 |    4 | node           |      |
|    1623869945 |       0 |    5 | billing        |      |
|    1623869945 |       0 |    6 | fs             | disk |
|    1623869945 |       0 |    7 | vmem           |      |
|    1623869945 |       0 |    8 | pages          |      |
|    1623869945 |       1 | 1000 | dynamic_offset |      |
+---------------+---------+------+----------------+------+
9 rows in set (0.000 sec)

MariaDB [slurm_acct_db]> select * from general_step_table where job_db_inx = 1523\G
Empty set (0.000 sec)
Comment 30 Nate Rini 2021-09-10 10:02:37 MDT
(In reply to Bill Britt from comment #27)
> If it will help I would be happy to schedule a screen sharing session to
> troubleshoot on the system live.

We usually avoid live sessions outside of purchased consulting sessions. It's also usually not helpful to have a customer wait while we write and test our requests.

(In reply to Bill Britt from comment #29)
> MariaDB [slurm_acct_db]> select * from general_step_table where job_db_inx =
> 1523\G
> Empty set (0.000 sec)

Looks like the job numbers have already wrapped on this cluster. Please call this instead:
> use slurm_acct_db;
> select * from general_step_table where job_db_inx in (select job_db_inx from general_job_table where id_job = 1523)\G
Comment 31 Bill Britt 2021-09-10 10:06:15 MDT
MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select * from general_step_table where job_db_inx in (select job_db_inx from general_job_table where id_job = 1523)\G
*************************** 1. row ***************************
               job_db_inx: 2643
                  deleted: 0
                exit_code: 0
                  id_step: -5
            step_het_comp: 4294967294
              kill_requid: -1
                 nodelist: gen-slurm-sarchive-s01.cluster.ihme.washington.edu
              nodes_alloc: 1
                 node_inx: 0
                    state: 3
                step_name: batch
                 task_cnt: 1
                task_dist: 0
               time_start: 1630003657
                 time_end: 1630003659
           time_suspended: 0
                 user_sec: 0
                user_usec: 6676
                  sys_sec: 0
                 sys_usec: 3509
              act_cpufreq: 1197
          consumed_energy: 0
          req_cpufreq_min: 0
              req_cpufreq: 0
          req_cpufreq_gov: 0
               tres_alloc: 1=3,2=178,4=1
        tres_usage_in_ave: 1=0,2=32768,3=0,6=0,7=82403328,8=0
        tres_usage_in_max: 1=0,2=32768,3=0,6=0,7=82403328,8=0
 tres_usage_in_max_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_max_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_min: 1=0,2=32768,3=0,6=0,7=82403328,8=0
 tres_usage_in_min_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_min_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_tot: 1=0,2=32768,3=0,6=0,7=82403328,8=0
       tres_usage_out_ave: 3=0,6=0
       tres_usage_out_max: 3=0,6=0
tres_usage_out_max_taskid: 6=0
tres_usage_out_max_nodeid: 3=0,6=0
       tres_usage_out_min: 3=0,6=0
tres_usage_out_min_taskid: 6=0
tres_usage_out_min_nodeid: 3=0,6=0
       tres_usage_out_tot: 3=0,6=0
*************************** 2. row ***************************
               job_db_inx: 2643
                  deleted: 0
                exit_code: 0
                  id_step: -4
            step_het_comp: 4294967294
              kill_requid: -1
                 nodelist: gen-slurm-sarchive-s01.cluster.ihme.washington.edu
              nodes_alloc: 1
                 node_inx: 0
                    state: 3
                step_name: extern
                 task_cnt: 1
                task_dist: 0
               time_start: 1630003657
                 time_end: 1630003659
           time_suspended: 0
                 user_sec: 0
                user_usec: 2093
                  sys_sec: 0
                 sys_usec: 0
              act_cpufreq: 1197
          consumed_energy: 0
          req_cpufreq_min: 0
              req_cpufreq: 0
          req_cpufreq_gov: 0
               tres_alloc: 1=3,2=178,3=18446744073709551614,4=1,5=3
        tres_usage_in_ave: 1=0,2=0,3=0,6=4292,7=8269824,8=0
        tres_usage_in_max: 1=0,2=0,3=0,6=4292,7=8269824,8=0
 tres_usage_in_max_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_max_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_min: 1=0,2=0,3=0,6=4292,7=8269824,8=0
 tres_usage_in_min_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_min_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_tot: 1=0,2=0,3=0,6=4292,7=8269824,8=0
       tres_usage_out_ave: 3=0,6=0
       tres_usage_out_max: 3=0,6=0
tres_usage_out_max_taskid: 6=0
tres_usage_out_max_nodeid: 3=0,6=0
       tres_usage_out_min: 3=0,6=0
tres_usage_out_min_taskid: 6=0
tres_usage_out_min_nodeid: 3=0,6=0
       tres_usage_out_tot: 3=0,6=0
2 rows in set (0.001 sec)
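As an aside for readers: the raw tres_* strings above are comma-separated id=count pairs, where each id maps to a type via the tres_table dump in comment 29. A minimal illustrative parser (the function and variable names here are mine, not part of Slurm):

```python
def parse_tres(tres_str, id_to_type):
    """Parse a slurmdbd TRES string like '1=3,2=178,4=1' into {type: count}."""
    result = {}
    for pair in tres_str.split(","):
        tres_id, count = pair.split("=")
        result[id_to_type[int(tres_id)]] = int(count)
    return result

# id -> type mapping taken from the tres_table output in comment 29
TRES_TYPES = {1: "cpu", 2: "mem", 3: "energy", 4: "node",
              5: "billing", 6: "fs/disk", 7: "vmem", 8: "pages"}

print(parse_tres("1=3,2=178,4=1", TRES_TYPES))
# {'cpu': 3, 'mem': 178, 'node': 1}
```

Applied to the batch step's tres_usage_in_max above, this yields mem=32768, the 32K MaxRSS that sacct reports.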
Comment 37 Nate Rini 2021-09-10 14:22:44 MDT
(In reply to Nate Rini from comment #8)
> (In reply to Bill Britt from comment #0)
> > the value is returned by the API
> > when called directly with cURL to the /slurmdb/v0.0.36/job/{job_id} route;
> > however we cannot find the result from  "slurm_api.slurmdbd_get_job".
> 
> Please also attach this curl output.

Please attach an updated query of this output.
Comment 38 Sam Hu 2021-09-10 15:39:18 MDT
Created attachment 21225 [details]
slurmdbd_get_job_result_j1523_0910

Please see attached for the result of slurmdbd_get_job for j1523.
Comment 39 Nate Rini 2021-09-10 15:50:13 MDT
(In reply to Sam Hu from comment #38)
> Created attachment 21225 [details]
> slurmdbd_get_job_result_j1523_0910
> 
> Please see attached for the result of slurmdbd_get_job for j1523.

Extracting the mem request from the sacct output in comment#17:
> ReqMem	AllocTRES
> 178Mn	billing=3,cpu=3,mem=178M,node=1
> 178Mn	cpu=3,mem=178M,node=1
> 178Mn	billing=3,cpu=3,mem=178M,node=1

Matches the value provided in attachment in comment#38:
> jobtres = {'allocated': [...
>                {'count': 178, 'id': 2, 'name': None, 'type': 'mem'},

Both are 178MiB per node.

Please also provide a new copy of the curl output from doing the query directly, same command as comment#10.
Comment 40 Sam Hu 2021-09-10 17:17:54 MDT
Created attachment 21231 [details]
curl_output_0910

The latest curl_output no longer has the info we were seeing before. Not sure what happened. But the point remains: sacct can produce 32K MaxRSS, and the OpenAPI version does not provide that.
Comment 41 Sam Hu 2021-09-10 17:18:42 MDT
Created attachment 21232 [details]
sacct_output_0910
Comment 42 Nate Rini 2021-09-10 17:25:23 MDT
(In reply to Sam Hu from comment #40)
> Created attachment 21231 [details]
> curl_output_0910

This looks like a different bug. Somehow there is a "job_id": 0 with some values but not all that should be present. Please attach the slurmrestd logs when this was generated.
 
> The latest curl_output no longer has the info we were seeing before. Not
> sure what happened.
Does this cluster purge jobs? If not, a job should never get lost by slurmdbd, which is concerning.

Please also attach the slurmctld and slurmdbd logs from around the time of the query.

> But the point remains. The sacct can produce 32K MaxRSS,
> and the OpenAPI version does not provide that.
I have not been able to reproduce the issue locally. The RSS used in my tests always matches the value in sacct and slurmrestd's output. When I pulled the raw data from MySQL, it didn't match the original output, which suggests something else is going on, as slurmrestd shouldn't manipulate the TRES values it reports.
Comment 43 Bill Britt 2021-09-13 12:55:17 MDT
The log level for both slurmctld and slurmdbd was set to info; what log level would you like us to use when we reproduce this issue?
Comment 44 Bill Britt 2021-09-13 12:57:02 MDT
Regarding the purging jobs question: I do not believe our cluster is configured to do this.
Comment 45 Nate Rini 2021-09-13 13:02:05 MDT
(In reply to Bill Britt from comment #44)
> Regarding the purging jobs question: I do not believe our cluster is
> configured to do this.

It was not present in the slurmdbd.conf but your site may use direct sacctmgr dump commands to do it. I just want to be sure.
Comment 46 Nate Rini 2021-09-13 13:14:25 MDT
(In reply to Bill Britt from comment #43)
> The log level for both slurmctld and slurmdbd was set to info, what log
> level would you like for us to use when we reproduce this issue?

This is specific to slurmdbd, so please set this in slurmdbd.conf:
> DebugFlags=DB_TRES
> DebugLevel=debug3

Normal logs for slurmctld are sufficient. I just want to make sure it's not having issues sending updates to slurmdbd. Also the output of 'sdiag' would be helpful.
Comment 47 Sam Hu 2021-09-14 11:02:47 MDT
Created attachment 21272 [details]
sacct_output0914
Comment 48 Sam Hu 2021-09-14 11:03:34 MDT
Created attachment 21273 [details]
curl_output_0914
Comment 49 Sam Hu 2021-09-14 11:04:30 MDT
Created attachment 21274 [details]
slurmdbd_get_job_result_0914
Comment 50 Nate Rini 2021-09-14 11:20:39 MDT
Reviewing logs
Comment 51 Sam Hu 2021-09-14 12:19:55 MDT
Created attachment 21278 [details]
Another set of test (with job created today)
Comment 52 Sam Hu 2021-09-14 12:21:00 MDT
Created attachment 21279 [details]
curl_output_0914_1
Comment 53 Sam Hu 2021-09-14 12:21:39 MDT
Created attachment 21280 [details]
slurmdbd_get_job_result_0914_1
Comment 54 Bill Britt 2021-09-14 12:43:16 MDT
Created attachment 21282 [details]
slurmdbd logs
Comment 55 Bill Britt 2021-09-14 12:43:40 MDT
Created attachment 21283 [details]
slurmctld logs
Comment 56 Nate Rini 2021-09-14 16:00:45 MDT
Please also provide:
> sacct -p -o all -D -j 1523
> sacct -p -o all -D -j 16445
Comment 57 Nate Rini 2021-09-14 16:05:46 MDT
(In reply to Bill Britt from comment #55)
> Created attachment 21283 [details]
> slurmctld logs
> [2021-09-14T15:06:49.360] error: High latency for 1000 calls to gettimeofday(): 754 microseconds

This is probably unrelated to this ticket but this is a known issue with certain clocks. Please see bug#11492 comment#3.
Comment 58 Nate Rini 2021-09-14 16:07:55 MDT
(In reply to Nate Rini from comment #46)
> Also the output of 'sdiag' would be helpful.

Please don't forget the sdiag output too.
Comment 59 Bill Britt 2021-09-14 16:50:47 MDT
root@gen-slurm-sctl-s01:~# sdiag
*******************************************************
sdiag output at Tue Sep 14 22:50:23 2021 (1631659823)
Data since      Tue Sep 14 15:06:53 2021 (1631632013)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 72
Jobs started:   72
Jobs completed: 71
Jobs canceled:  0
Jobs failed:    0

Job states ts:  Tue Sep 14 22:50:02 2021 (1631659802)
Jobs pending:   0
Jobs running:   1

Main schedule statistics (microseconds):
	Last cycle:   33
	Max cycle:    214949
	Total cycles: 535
	Mean cycle:   685
	Mean depth cycle:  0
	Cycles per minute: 1
	Last queue length: 0

Backfilling stats
	Total backfilled jobs (since last slurm start): 0
	Total backfilled jobs (since last stats cycle start): 0
	Total backfilled heterogeneous job components: 0
	Total cycles: 0
	Last cycle when: N/A
	Last cycle: 0
	Max cycle:  0
	Last depth cycle: 0
	Last depth cycle (try sched): 0
	Last queue length: 0
	Last table size: 0

Latency for 1000 calls to gettimeofday(): 754 microseconds

Remote Procedure Call statistics by message type
	MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:1120   ave_time:1306   total_time:1463634
	REQUEST_PARTITION_INFO                  ( 2009) count:980    ave_time:455    total_time:446508
	REQUEST_JOB_INFO                        ( 2003) count:546    ave_time:3745   total_time:2045247
	REQUEST_NODE_INFO                       ( 2007) count:516    ave_time:541    total_time:279520
	ACCOUNTING_UPDATE_MSG                   (10001) count:110    ave_time:20582650 total_time:2264091594
	REQUEST_STEP_COMPLETE                   ( 5016) count:77     ave_time:115817 total_time:8917960
	MESSAGE_EPILOG_COMPLETE                 ( 6012) count:72     ave_time:292    total_time:21062
	REQUEST_COMPLETE_PROLOG                 ( 6018) count:72     ave_time:425    total_time:30623
	REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:67     ave_time:2219   total_time:148724
	REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:67     ave_time:1143   total_time:76635
	REQUEST_AUTH_TOKEN                      ( 5039) count:14     ave_time:671    total_time:9394
	REQUEST_JOB_READY                       ( 4019) count:10     ave_time:458    total_time:4584
	REQUEST_SHARE_INFO                      ( 2022) count:6      ave_time:5251   total_time:31509
	REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:5      ave_time:894    total_time:4473
	REQUEST_JOB_STEP_CREATE                 ( 5001) count:5      ave_time:1030   total_time:5152
	REQUEST_RESOURCE_ALLOCATION             ( 4001) count:5      ave_time:42775  total_time:213876
	ACCOUNTING_REGISTER_CTLD                (10003) count:1      ave_time:11462  total_time:11462
	REQUEST_CANCEL_JOB_STEP                 ( 5005) count:1      ave_time:8916282 total_time:8916282

Remote Procedure Call statistics by user
	root            (       0) count:3368   ave_time:3935   total_time:13253543
	slurm           (   64030) count:111    ave_time:20397324 total_time:2264103056
	samhu           (  701264) count:103    ave_time:972    total_time:100211
	sadm_bbritt     (  700713) count:60     ave_time:2243   total_time:134598
	dhs2018         (  700848) count:26     ave_time:349820 total_time:9095322
	sadm_alin4      (  701322) count:6      ave_time:5251   total_time:31509

Pending RPC statistics
	No pending RPCs
Comment 60 Nate Rini 2021-09-15 10:17:14 MDT
(In reply to Nate Rini from comment #56)
> > sacct -p -o all -D -j 1523
> > sacct -p -o all -D -j 16445

Please also provide the above
Comment 61 Sam Hu 2021-09-15 10:22:31 MDT
Created attachment 21288 [details]
sacct_p_all_d_j_1523
Comment 62 Sam Hu 2021-09-15 10:23:08 MDT
Created attachment 21289 [details]
sacct_p_all_d_j_16445
Comment 63 Nate Rini 2021-09-16 14:18:51 MDT
(In reply to Sam Hu from comment #48)
> Created attachment 21273 [details]
> curl_output_0914

The log looks odd:
> 'https://api-stage.cluster.ihme.washington.edu/slurmdb/v0.0.36/job/.1523'

Please verify that the query had '.1523' as the JobId.

Looks like adding the '.' caused slurmrestd to query the wrong job:
> "job_id": 16153
Comment 64 Sam Hu 2021-09-16 16:17:31 MDT
Created attachment 21317 [details]
curl_output_0916_j1523 updated

Interesting. New files attached.
Comment 65 Sam Hu 2021-09-16 16:17:57 MDT
Created attachment 21318 [details]
curl_output_0916_j16445
Comment 66 Nate Rini 2021-09-16 16:43:57 MDT
(In reply to Sam Hu from comment #64)
> Created attachment 21317 [details]
> curl_output_0916_j1523 updated
> 
> Interesting. New files attached.

I have opened a new child ticket to fix the filter issue: bug#12507
Comment 69 Nate Rini 2021-09-16 17:19:40 MDT
(In reply to Sam Hu from comment #64)
> Created attachment 21317 [details]
> curl_output_0916_j1523 updated
> 
> Interesting. New files attached.

Comparing to the attachment in comment#61:

slurmrestd:
> $ jq -C '.jobs[0].steps[0].tres.allocated[]|select(.type=="mem").count'  curl_output_0916_j1523.json
> 178

sacct:
AllocTres: mem=178M

That looks correct for the job memory allocation.

slurmrestd:
> $ jq -C '.jobs[0].steps[0].tres.requested.average[]|select(.type=="mem").count'  curl_output_0916_j1523.json
> 32768
> $ jq -C '.jobs[0].steps[0].tres.requested.max[]|select(.type=="mem").count'  curl_output_0916_j1523.json
> 32768

converting units: 32768B -> 32KiB

sacct:
TRESUsageInMin: mem=32K

The RSS usage is also correctly reported in the curl output.

Can you please provide another dump from the openapi client, making sure that the requested JobID is correct.
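The same extraction can be sketched in Python against the parsed JSON (the trimmed sample dict below mirrors the jq paths above; it is a stand-in, not the full curl output):

```python
def step_mem_requested_max(job_json):
    """Follow .jobs[0].steps[0].tres.requested.max[] and return the mem count."""
    tres_max = job_json["jobs"][0]["steps"][0]["tres"]["requested"]["max"]
    return next(t["count"] for t in tres_max if t["type"] == "mem")

# Trimmed stand-in for curl_output_0916_j1523
sample = {"jobs": [{"steps": [{"tres": {"requested": {
    "max": [{"type": "mem", "count": 32768}]}}}]}]}

count = step_mem_requested_max(sample)
print(f"{count} B -> {count // 1024} KiB")  # 32768 B -> 32 KiB
```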
Comment 70 Sam Hu 2021-09-17 17:17:40 MDT
Created attachment 21348 [details]
slurmdbd_get_job_result_j1523_j16445_0917
Comment 71 Sam Hu 2021-09-17 17:25:54 MDT
Created attachment 21349 [details]
python source file

This Python source file generated the slurmdbd_get_job_result_j1523_j16445_0917 output file just uploaded. The script uses slurmdbd_get_job, but I have not seen the RSS info produced anywhere in the output set. E.g., if you search for 32 (for 32768 or 32KiB) in the output file, nothing can be found; in the curl output, however, you can find it. That's the basic question we have: where can we find the RSS (32) from using slurmdbd_get_job?
Comment 74 Nate Rini 2021-09-22 09:26:54 MDT
(In reply to Sam Hu from comment #71)
> where can we find the RSS (32) from using slurmdbd_get_job?

The output provided in comment #70 does not have it yet the output from slurmrestd (comment #64) does. I'm looking to see why but this may be a bug with the OpenAPI generator's python parser.
Comment 76 Sam Hu 2021-09-22 10:53:47 MDT
Right: The output provided in comment #70 (which is generated from using slurmdbd_get_job) does not have it yet the output from slurmrestd (comment #64) does.

Our need is to use slurmdbd_get_job to gather usage data. Hence we need to have this function working. Please let me know if there is anything else I can help with.
Comment 77 Nate Rini 2021-09-22 10:56:42 MDT
(In reply to Sam Hu from comment #76)
> Our need is to use slurmdbd_get_job to gather usage data. Hence we need to
> have this function working.

For reasons currently unclear, the steps array is not being fully parsed.

I suggest opening a mirror ticket with [https://github.com/OpenAPITools/openapi-generator]. If this is in fact a bug on their part, that is something they will have to fix as we have no control/oversight over their project.
Comment 78 Nate Rini 2021-09-22 11:42:59 MDT
Created attachment 21389 [details]
simple script to look for step TRES

> [fred@login ~]$ python3 showjob.py 5
> no tres for batch
> no tres for extern
Comment 79 Nate Rini 2021-09-22 13:59:30 MDT
(In reply to Nate Rini from comment #78)
> Created attachment 21389 [details]
> simple script to look for step TRES
> 
> > [fred@login ~]$ python3 showjob.py 5
> > no tres for batch
> > no tres for extern

Modified one of the example scripts and found that the steps are populated, but the TRES values in each step are not. The Python code looks like it should recursively parse the data, but the tres member is always None despite being present in the slurmrestd response.
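The symptom can be reduced to a few lines (simplified dicts stand in for the generated client's objects; the names here are illustrative, the attached script is the authoritative version):

```python
def steps_missing_tres(steps):
    """Return the names of steps whose 'tres' member is missing or None."""
    return [s.get("name") for s in steps if not s.get("tres")]

# What the generated client effectively returned for the job in comment 78
steps = [{"name": "batch", "tres": None}, {"name": "extern", "tres": None}]
for name in steps_missing_tres(steps):
    print(f"no tres for {name}")
```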
Comment 83 Nate Rini 2021-09-22 17:21:14 MDT
(In reply to Nate Rini from comment #79)
> (In reply to Nate Rini from comment #78)
> > Created attachment 21389 [details]
> > simple script to look for step TRES
> > 
> > > [fred@login ~]$ python3 showjob.py 5
> > > no tres for batch
> > > no tres for extern
> 
> Modified one of the example scripts and found that the steps are populated.
> The TRES values in each step are not. The python code looks like it should
> recursively parse the data but the tres member is always None despite it
> being present in the slurmrestd response.

I believe I found the issue. The TRES object in the OpenAPI specification is incorrectly placed in the step (id) object instead of the step object. Working on a patch now.
Comment 85 Nate Rini 2021-09-22 18:03:12 MDT
Created attachment 21396 [details]
showjob script
Comment 86 Nate Rini 2021-09-22 18:03:59 MDT
Created attachment 21397 [details]
patch for 20.11
Comment 87 Nate Rini 2021-09-22 18:09:14 MDT
(In reply to Nate Rini from comment #86)
> Created attachment 21397 [details]
> patch for 20.11

This patch should correct the OAS for dbv0.0.36 in 20.11. Please verify if possible that it corrects your issue.

(In reply to Nate Rini from comment #85)
> Created attachment 21396 [details]
> showjob script

This script can be used to verify.

Here is an example output:
> [fred@login ~]$ sbatch --wrap 'memhog 5G' --mem=10G

> [fred@login ~]$ sacct -o MaxRSS,jobid -j 2
>     MaxRSS        JobID 
> ---------- ------------ 
>            2            
>       132K 2.batch      
>          0 2.extern 

 
> [fred@login ~]$ python3 showjob.py 2
> {'count': 135168,
>  'id': 2,
>  'name': None,
>  'node': 'node00',
>  'task': 0,
>  'type': 'mem'}
> {'count': 0, 'id': 2, 'name': None, 'node': 'node00', 'task': 0, 'type': 'mem'}
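Note the units line up: the API reports the count in bytes while sacct's MaxRSS column is shown in KiB, so the two outputs above agree:

```python
count = 135168  # bytes, from the showjob.py output above
print(f"{count // 1024}K")  # 132K, matching the sacct MaxRSS column
```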
Comment 88 Sam Hu 2021-09-27 16:59:25 MDT
Created attachment 21475 [details]
New set of test 0927

Still an issue after the patch.

A new set of tests was conducted 09/27. Basically, we are looking for the 4K (shown in sacct_output_j16632_0927) in the output of slurmdbd_get_job_result_j16632_0927.txt.
Comment 89 Sam Hu 2021-09-27 17:00:46 MDT
Created attachment 21476 [details]
sacct_output_j16632_0927
Comment 90 Sam Hu 2021-09-27 17:01:25 MDT
Created attachment 21477 [details]
test_slurm_worker_node_for_schedmd_4
Comment 91 Sam Hu 2021-09-27 17:02:39 MDT
Created attachment 21478 [details]
slurmdbd_get_job_result_j16632_0927
Comment 92 Nate Rini 2021-09-28 09:16:34 MDT
(In reply to Sam Hu from comment #88)
> Still an issue after the patch.
> 
> New set of test conducted 09/27. Basically we are looking for 4K(shown in
> the sacct_output_j16632_0927) in the output of
> slurmdbd_get_job_result_j16632_0927.txt.

Was the installed openapi client rebuilt and reinstalled after applying the patch?

> AttributeError: 'Dbv0036JobStep' object has no attribute 'tres'

This error should have been corrected by the patch. An explicit uninstall may need to be done with the python installer on the pre-patched version.
Comment 93 Sam Hu 2021-09-29 10:08:23 MDT
Created attachment 21514 [details]
conda_slurm_reference in my testing

I have the conda environment, and using "pip freeze", I can see that I am using the newly generated slurm_rest.
Comment 94 Bill Britt 2021-09-29 10:26:03 MDT
I also created a new python environment and used the script supplied here: https://bugs.schedmd.com/attachment.cgi?id=21396
with a newly generated client.

Is there a way we can verify the patch is applied correctly?
Comment 95 Nate Rini 2021-09-29 10:41:59 MDT
(In reply to Bill Britt from comment #94)
> I also created a new python environment and used the script supplied here:
> https://bugs.schedmd.com/attachment.cgi?id=21396
> with a newly generated client.
> 
> Is there a way we can verify the patch is applied correctly?

The structure returned by slurmdbd_get_job should have job.steps[].tres.requested.max defined. If it doesn't, then the patch wasn't applied.

I suggest removing the installed egg, re-generating the client after the patch, and reinstalling. Also verify that the openapi.json is different.
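That verification can be scripted; a sketch using stand-in objects (the real client returns generated model instances, and pre-patch clients raise the AttributeError seen in comment 92):

```python
from types import SimpleNamespace as NS

def patch_applied(job):
    """True if every step exposes tres.requested.max, per the check above."""
    try:
        return all(s.tres.requested.max is not None for s in job.steps)
    except AttributeError:
        return False

# Stand-ins: a pre-patch step lacks the 'tres' attribute entirely
old = NS(steps=[NS(name="batch")])
new = NS(steps=[NS(tres=NS(requested=NS(max=[{"type": "mem", "count": 32768}])))])
print(patch_applied(old), patch_applied(new))  # False True
```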
Comment 96 Bill Britt 2021-09-29 14:52:29 MDT
Hi Nate,
We figured out why we were getting the same results; we had made a copy of the openapi.json file and were re-downloading the old version. However, we have a new issue: the newly generated client appears to be missing model files.

I receive the following error after generating the python client with the new openapi.json

python3 showjob.py3
Traceback (most recent call last):
  File "/Users/sadm_bbritt/Downloads/openapi-test/py_api_client/showjob.py3", line 17, in <module>
    from openapi_client.models import V0036JobSubmission as jobSubmission
ImportError: cannot import name 'V0036JobSubmission' from 'openapi_client.models' (/Users/sadm_bbritt/Downloads/openapi-test/py_api_client/openapi_client/models/__init__.py).

It appears the "V0036JobSubmission" model file is missing; only the Dbv0036* models were generated. Below is a tree of the generated client. This was generated with:
```
docker run --rm \
    --volume ${PWD}:/local \
    --workdir=/local \
    openapitools/openapi-generator-cli:v4.3.1 \
    generate \
    --input-spec /local/openapi.json \
    --output /local/py_api_client \
    --generator-name python \
    --package-name openapi_client
```

And the attached "patch_openapi.json" file

├── README.md
├── docs
│   ├── Dbv0036Account.md
│   ├── Dbv0036AccountInfo.md
│   ├── Dbv0036AccountResponse.md
│   ├── Dbv0036Association.md
│   ├── Dbv0036AssociationDefault.md
│   ├── Dbv0036AssociationMax.md
│   ├── Dbv0036AssociationMaxJobs.md
│   ├── Dbv0036AssociationMaxJobsPer.md
│   ├── Dbv0036AssociationMaxPer.md
│   ├── Dbv0036AssociationMaxPerAccount.md
│   ├── Dbv0036AssociationMaxTres.md
│   ├── Dbv0036AssociationMaxTresMinutes.md
│   ├── Dbv0036AssociationMaxTresMinutesPer.md
│   ├── Dbv0036AssociationMaxTresPer.md
│   ├── Dbv0036AssociationMin.md
│   ├── Dbv0036AssociationShortInfo.md
│   ├── Dbv0036AssociationUsage.md
│   ├── Dbv0036AssociationsInfo.md
│   ├── Dbv0036ClusterInfo.md
│   ├── Dbv0036ClusterInfoAssociations.md
│   ├── Dbv0036ClusterInfoController.md
│   ├── Dbv0036ConfigInfo.md
│   ├── Dbv0036ConfigResponse.md
│   ├── Dbv0036CoordinatorInfo.md
│   ├── Dbv0036Diag.md
│   ├── Dbv0036DiagRPCs.md
│   ├── Dbv0036DiagRollups.md
│   ├── Dbv0036DiagTime.md
│   ├── Dbv0036DiagTime1.md
│   ├── Dbv0036DiagUsers.md
│   ├── Dbv0036Error.md
│   ├── Dbv0036Job.md
│   ├── Dbv0036JobArray.md
│   ├── Dbv0036JobArrayLimits.md
│   ├── Dbv0036JobArrayLimitsMax.md
│   ├── Dbv0036JobArrayLimitsMaxRunning.md
│   ├── Dbv0036JobComment.md
│   ├── Dbv0036JobExitCode.md
│   ├── Dbv0036JobExitCodeSignal.md
│   ├── Dbv0036JobHet.md
│   ├── Dbv0036JobInfo.md
│   ├── Dbv0036JobMcs.md
│   ├── Dbv0036JobRequired.md
│   ├── Dbv0036JobReservation.md
│   ├── Dbv0036JobState.md
│   ├── Dbv0036JobStep.md
│   ├── Dbv0036JobStepCPU.md
│   ├── Dbv0036JobStepCPURequestedFrequency.md
│   ├── Dbv0036JobStepNodes.md
│   ├── Dbv0036JobStepStatistics.md
│   ├── Dbv0036JobStepStatisticsCPU.md
│   ├── Dbv0036JobStepStatisticsEnergy.md
│   ├── Dbv0036JobStepStep.md
│   ├── Dbv0036JobStepStepHet.md
│   ├── Dbv0036JobStepTask.md
│   ├── Dbv0036JobStepTasks.md
│   ├── Dbv0036JobStepTime.md
│   ├── Dbv0036JobStepTres.md
│   ├── Dbv0036JobStepTresRequested.md
│   ├── Dbv0036JobTime.md
│   ├── Dbv0036JobTimeSystem.md
│   ├── Dbv0036JobTimeTotal.md
│   ├── Dbv0036JobTimeUser.md
│   ├── Dbv0036JobTres.md
│   ├── Dbv0036JobWckey.md
│   ├── Dbv0036Qos.md
│   ├── Dbv0036QosInfo.md
│   ├── Dbv0036QosLimits.md
│   ├── Dbv0036QosLimitsMax.md
│   ├── Dbv0036QosLimitsMaxAccruing.md
│   ├── Dbv0036QosLimitsMaxAccruingPer.md
│   ├── Dbv0036QosLimitsMaxJobs.md
│   ├── Dbv0036QosLimitsMaxJobsPer.md
│   ├── Dbv0036QosLimitsMaxTres.md
│   ├── Dbv0036QosLimitsMaxTresMinutes.md
│   ├── Dbv0036QosLimitsMaxTresMinutesPer.md
│   ├── Dbv0036QosLimitsMaxTresPer.md
│   ├── Dbv0036QosLimitsMaxWallClock.md
│   ├── Dbv0036QosLimitsMaxWallClockPer.md
│   ├── Dbv0036QosLimitsMin.md
│   ├── Dbv0036QosLimitsMinTres.md
│   ├── Dbv0036QosLimitsMinTresPer.md
│   ├── Dbv0036QosPreempt.md
│   ├── Dbv0036ResponseAccountDelete.md
│   ├── Dbv0036ResponseAssociationDelete.md
│   ├── Dbv0036ResponseClusterAdd.md
│   ├── Dbv0036ResponseClusterDelete.md
│   ├── Dbv0036ResponseQosDelete.md
│   ├── Dbv0036ResponseTres.md
│   ├── Dbv0036ResponseUserDelete.md
│   ├── Dbv0036ResponseUserUpdate.md
│   ├── Dbv0036ResponseWckeyAdd.md
│   ├── Dbv0036ResponseWckeyDelete.md
│   ├── Dbv0036TresInfo.md
│   ├── Dbv0036User.md
│   ├── Dbv0036UserAssociations.md
│   ├── Dbv0036UserDefault.md
│   ├── Dbv0036UserInfo.md
│   ├── Dbv0036Wckey.md
│   ├── Dbv0036WckeyInfo.md
│   ├── OpenapiApi.md
│   └── SlurmApi.md
├── git_push.sh
├── openapi_client
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── __pycache__
│   │   ├── __init__.cpython-39.pyc
│   │   ├── api_client.cpython-39.pyc
│   │   ├── configuration.cpython-39.pyc
│   │   ├── exceptions.cpython-39.pyc
│   │   └── rest.cpython-39.pyc
│   ├── api
│   │   ├── __init__.py
│   │   ├── __init__.pyc
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-39.pyc
│   │   │   ├── openapi_api.cpython-39.pyc
│   │   │   └── slurm_api.cpython-39.pyc
│   │   ├── openapi_api.py
│   │   ├── openapi_api.pyc
│   │   └── slurm_api.py
│   ├── api_client.py
│   ├── api_client.pyc
│   ├── configuration.py
│   ├── configuration.pyc
│   ├── exceptions.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-39.pyc
│   │   │   ├── dbv0036_account.cpython-39.pyc
│   │   │   ├── dbv0036_account_info.cpython-39.pyc
│   │   │   ├── dbv0036_account_response.cpython-39.pyc
│   │   │   ├── dbv0036_association.cpython-39.pyc
│   │   │   ├── dbv0036_association_default.cpython-39.pyc
│   │   │   ├── dbv0036_association_max.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_jobs.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_jobs_per.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_per.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_per_account.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_tres.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_tres_minutes.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_tres_minutes_per.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_tres_per.cpython-39.pyc
│   │   │   ├── dbv0036_association_min.cpython-39.pyc
│   │   │   ├── dbv0036_association_short_info.cpython-39.pyc
│   │   │   ├── dbv0036_association_usage.cpython-39.pyc
│   │   │   ├── dbv0036_associations_info.cpython-39.pyc
│   │   │   ├── dbv0036_cluster_info.cpython-39.pyc
│   │   │   ├── dbv0036_cluster_info_associations.cpython-39.pyc
│   │   │   ├── dbv0036_cluster_info_controller.cpython-39.pyc
│   │   │   ├── dbv0036_config_info.cpython-39.pyc
│   │   │   ├── dbv0036_config_response.cpython-39.pyc
│   │   │   ├── dbv0036_coordinator_info.cpython-39.pyc
│   │   │   ├── dbv0036_diag.cpython-39.pyc
│   │   │   ├── dbv0036_diag_rollups.cpython-39.pyc
│   │   │   ├── dbv0036_diag_rp_cs.cpython-39.pyc
│   │   │   ├── dbv0036_diag_time.cpython-39.pyc
│   │   │   ├── dbv0036_diag_time1.cpython-39.pyc
│   │   │   ├── dbv0036_diag_users.cpython-39.pyc
│   │   │   ├── dbv0036_error.cpython-39.pyc
│   │   │   ├── dbv0036_job.cpython-39.pyc
│   │   │   ├── dbv0036_job_array.cpython-39.pyc
│   │   │   ├── dbv0036_job_array_limits.cpython-39.pyc
│   │   │   ├── dbv0036_job_array_limits_max.cpython-39.pyc
│   │   │   ├── dbv0036_job_array_limits_max_running.cpython-39.pyc
│   │   │   ├── dbv0036_job_comment.cpython-39.pyc
│   │   │   ├── dbv0036_job_exit_code.cpython-39.pyc
│   │   │   ├── dbv0036_job_exit_code_signal.cpython-39.pyc
│   │   │   ├── dbv0036_job_het.cpython-39.pyc
│   │   │   ├── dbv0036_job_info.cpython-39.pyc
│   │   │   ├── dbv0036_job_mcs.cpython-39.pyc
│   │   │   ├── dbv0036_job_required.cpython-39.pyc
│   │   │   ├── dbv0036_job_reservation.cpython-39.pyc
│   │   │   ├── dbv0036_job_state.cpython-39.pyc
│   │   │   ├── dbv0036_job_step.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_cpu.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_cpu_requested_frequency.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_nodes.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_statistics.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_statistics_cpu.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_statistics_energy.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_step.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_step_het.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_task.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_tasks.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_time.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_tres.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_tres_requested.cpython-39.pyc
│   │   │   ├── dbv0036_job_time.cpython-39.pyc
│   │   │   ├── dbv0036_job_time_system.cpython-39.pyc
│   │   │   ├── dbv0036_job_time_total.cpython-39.pyc
│   │   │   ├── dbv0036_job_time_user.cpython-39.pyc
│   │   │   ├── dbv0036_job_tres.cpython-39.pyc
│   │   │   ├── dbv0036_job_wckey.cpython-39.pyc
│   │   │   ├── dbv0036_qos.cpython-39.pyc
│   │   │   ├── dbv0036_qos_info.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_accruing.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_accruing_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_jobs.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_jobs_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_tres.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_tres_minutes.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_tres_minutes_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_tres_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_wall_clock.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_wall_clock_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_min.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_min_tres.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_min_tres_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_preempt.cpython-39.pyc
│   │   │   ├── dbv0036_response_account_delete.cpython-39.pyc
│   │   │   ├── dbv0036_response_association_delete.cpython-39.pyc
│   │   │   ├── dbv0036_response_cluster_add.cpython-39.pyc
│   │   │   ├── dbv0036_response_cluster_delete.cpython-39.pyc
│   │   │   ├── dbv0036_response_qos_delete.cpython-39.pyc
│   │   │   ├── dbv0036_response_tres.cpython-39.pyc
│   │   │   ├── dbv0036_response_user_delete.cpython-39.pyc
│   │   │   ├── dbv0036_response_user_update.cpython-39.pyc
│   │   │   ├── dbv0036_response_wckey_add.cpython-39.pyc
│   │   │   ├── dbv0036_response_wckey_delete.cpython-39.pyc
│   │   │   ├── dbv0036_tres_info.cpython-39.pyc
│   │   │   ├── dbv0036_user.cpython-39.pyc
│   │   │   ├── dbv0036_user_associations.cpython-39.pyc
│   │   │   ├── dbv0036_user_default.cpython-39.pyc
│   │   │   ├── dbv0036_user_info.cpython-39.pyc
│   │   │   ├── dbv0036_wckey.cpython-39.pyc
│   │   │   └── dbv0036_wckey_info.cpython-39.pyc
│   │   ├── dbv0036_account.py
│   │   ├── dbv0036_account_info.py
│   │   ├── dbv0036_account_response.py
│   │   ├── dbv0036_association.py
│   │   ├── dbv0036_association_default.py
│   │   ├── dbv0036_association_max.py
│   │   ├── dbv0036_association_max_jobs.py
│   │   ├── dbv0036_association_max_jobs_per.py
│   │   ├── dbv0036_association_max_per.py
│   │   ├── dbv0036_association_max_per_account.py
│   │   ├── dbv0036_association_max_tres.py
│   │   ├── dbv0036_association_max_tres_minutes.py
│   │   ├── dbv0036_association_max_tres_minutes_per.py
│   │   ├── dbv0036_association_max_tres_per.py
│   │   ├── dbv0036_association_min.py
│   │   ├── dbv0036_association_short_info.py
│   │   ├── dbv0036_association_usage.py
│   │   ├── dbv0036_associations_info.py
│   │   ├── dbv0036_cluster_info.py
│   │   ├── dbv0036_cluster_info_associations.py
│   │   ├── dbv0036_cluster_info_controller.py
│   │   ├── dbv0036_config_info.py
│   │   ├── dbv0036_config_response.py
│   │   ├── dbv0036_coordinator_info.py
│   │   ├── dbv0036_diag.py
│   │   ├── dbv0036_diag_rollups.py
│   │   ├── dbv0036_diag_rp_cs.py
│   │   ├── dbv0036_diag_time.py
│   │   ├── dbv0036_diag_time1.py
│   │   ├── dbv0036_diag_users.py
│   │   ├── dbv0036_error.py
│   │   ├── dbv0036_job.py
│   │   ├── dbv0036_job_array.py
│   │   ├── dbv0036_job_array_limits.py
│   │   ├── dbv0036_job_array_limits_max.py
│   │   ├── dbv0036_job_array_limits_max_running.py
│   │   ├── dbv0036_job_comment.py
│   │   ├── dbv0036_job_exit_code.py
│   │   ├── dbv0036_job_exit_code_signal.py
│   │   ├── dbv0036_job_het.py
│   │   ├── dbv0036_job_info.py
│   │   ├── dbv0036_job_mcs.py
│   │   ├── dbv0036_job_required.py
│   │   ├── dbv0036_job_reservation.py
│   │   ├── dbv0036_job_state.py
│   │   ├── dbv0036_job_step.py
│   │   ├── dbv0036_job_step_cpu.py
│   │   ├── dbv0036_job_step_cpu_requested_frequency.py
│   │   ├── dbv0036_job_step_nodes.py
│   │   ├── dbv0036_job_step_statistics.py
│   │   ├── dbv0036_job_step_statistics_cpu.py
│   │   ├── dbv0036_job_step_statistics_energy.py
│   │   ├── dbv0036_job_step_step.py
│   │   ├── dbv0036_job_step_step_het.py
│   │   ├── dbv0036_job_step_task.py
│   │   ├── dbv0036_job_step_tasks.py
│   │   ├── dbv0036_job_step_time.py
│   │   ├── dbv0036_job_step_tres.py
│   │   ├── dbv0036_job_step_tres_requested.py
│   │   ├── dbv0036_job_time.py
│   │   ├── dbv0036_job_time_system.py
│   │   ├── dbv0036_job_time_total.py
│   │   ├── dbv0036_job_time_user.py
│   │   ├── dbv0036_job_tres.py
│   │   ├── dbv0036_job_wckey.py
│   │   ├── dbv0036_qos.py
│   │   ├── dbv0036_qos_info.py
│   │   ├── dbv0036_qos_limits.py
│   │   ├── dbv0036_qos_limits_max.py
│   │   ├── dbv0036_qos_limits_max_accruing.py
│   │   ├── dbv0036_qos_limits_max_accruing_per.py
│   │   ├── dbv0036_qos_limits_max_jobs.py
│   │   ├── dbv0036_qos_limits_max_jobs_per.py
│   │   ├── dbv0036_qos_limits_max_tres.py
│   │   ├── dbv0036_qos_limits_max_tres_minutes.py
│   │   ├── dbv0036_qos_limits_max_tres_minutes_per.py
│   │   ├── dbv0036_qos_limits_max_tres_per.py
│   │   ├── dbv0036_qos_limits_max_wall_clock.py
│   │   ├── dbv0036_qos_limits_max_wall_clock_per.py
│   │   ├── dbv0036_qos_limits_min.py
│   │   ├── dbv0036_qos_limits_min_tres.py
│   │   ├── dbv0036_qos_limits_min_tres_per.py
│   │   ├── dbv0036_qos_preempt.py
│   │   ├── dbv0036_response_account_delete.py
│   │   ├── dbv0036_response_association_delete.py
│   │   ├── dbv0036_response_cluster_add.py
│   │   ├── dbv0036_response_cluster_delete.py
│   │   ├── dbv0036_response_qos_delete.py
│   │   ├── dbv0036_response_tres.py
│   │   ├── dbv0036_response_user_delete.py
│   │   ├── dbv0036_response_user_update.py
│   │   ├── dbv0036_response_wckey_add.py
│   │   ├── dbv0036_response_wckey_delete.py
│   │   ├── dbv0036_tres_info.py
│   │   ├── dbv0036_user.py
│   │   ├── dbv0036_user_associations.py
│   │   ├── dbv0036_user_default.py
│   │   ├── dbv0036_user_info.py
│   │   ├── dbv0036_wckey.py
│   │   └── dbv0036_wckey_info.py
│   └── rest.py
├── requirements.txt
├── setup.cfg
├── setup.py
├── showjob.py3
├── test
│   ├── __init__.py
│   ├── test_dbv0036_account.py
│   ├── test_dbv0036_account_info.py
│   ├── test_dbv0036_account_response.py
│   ├── test_dbv0036_association.py
│   ├── test_dbv0036_association_default.py
│   ├── test_dbv0036_association_max.py
│   ├── test_dbv0036_association_max_jobs.py
│   ├── test_dbv0036_association_max_jobs_per.py
│   ├── test_dbv0036_association_max_per.py
│   ├── test_dbv0036_association_max_per_account.py
│   ├── test_dbv0036_association_max_tres.py
│   ├── test_dbv0036_association_max_tres_minutes.py
│   ├── test_dbv0036_association_max_tres_minutes_per.py
│   ├── test_dbv0036_association_max_tres_per.py
│   ├── test_dbv0036_association_min.py
│   ├── test_dbv0036_association_short_info.py
│   ├── test_dbv0036_association_usage.py
│   ├── test_dbv0036_associations_info.py
│   ├── test_dbv0036_cluster_info.py
│   ├── test_dbv0036_cluster_info_associations.py
│   ├── test_dbv0036_cluster_info_controller.py
│   ├── test_dbv0036_config_info.py
│   ├── test_dbv0036_config_response.py
│   ├── test_dbv0036_coordinator_info.py
│   ├── test_dbv0036_diag.py
│   ├── test_dbv0036_diag_rollups.py
│   ├── test_dbv0036_diag_rp_cs.py
│   ├── test_dbv0036_diag_time.py
│   ├── test_dbv0036_diag_time1.py
│   ├── test_dbv0036_diag_users.py
│   ├── test_dbv0036_error.py
│   ├── test_dbv0036_job.py
│   ├── test_dbv0036_job_array.py
│   ├── test_dbv0036_job_array_limits.py
│   ├── test_dbv0036_job_array_limits_max.py
│   ├── test_dbv0036_job_array_limits_max_running.py
│   ├── test_dbv0036_job_comment.py
│   ├── test_dbv0036_job_exit_code.py
│   ├── test_dbv0036_job_exit_code_signal.py
│   ├── test_dbv0036_job_het.py
│   ├── test_dbv0036_job_info.py
│   ├── test_dbv0036_job_mcs.py
│   ├── test_dbv0036_job_required.py
│   ├── test_dbv0036_job_reservation.py
│   ├── test_dbv0036_job_state.py
│   ├── test_dbv0036_job_step.py
│   ├── test_dbv0036_job_step_cpu.py
│   ├── test_dbv0036_job_step_cpu_requested_frequency.py
│   ├── test_dbv0036_job_step_nodes.py
│   ├── test_dbv0036_job_step_statistics.py
│   ├── test_dbv0036_job_step_statistics_cpu.py
│   ├── test_dbv0036_job_step_statistics_energy.py
│   ├── test_dbv0036_job_step_step.py
│   ├── test_dbv0036_job_step_step_het.py
│   ├── test_dbv0036_job_step_task.py
│   ├── test_dbv0036_job_step_tasks.py
│   ├── test_dbv0036_job_step_time.py
│   ├── test_dbv0036_job_step_tres.py
│   ├── test_dbv0036_job_step_tres_requested.py
│   ├── test_dbv0036_job_time.py
│   ├── test_dbv0036_job_time_system.py
│   ├── test_dbv0036_job_time_total.py
│   ├── test_dbv0036_job_time_user.py
│   ├── test_dbv0036_job_tres.py
│   ├── test_dbv0036_job_wckey.py
│   ├── test_dbv0036_qos.py
│   ├── test_dbv0036_qos_info.py
│   ├── test_dbv0036_qos_limits.py
│   ├── test_dbv0036_qos_limits_max.py
│   ├── test_dbv0036_qos_limits_max_accruing.py
│   ├── test_dbv0036_qos_limits_max_accruing_per.py
│   ├── test_dbv0036_qos_limits_max_jobs.py
│   ├── test_dbv0036_qos_limits_max_jobs_per.py
│   ├── test_dbv0036_qos_limits_max_tres.py
│   ├── test_dbv0036_qos_limits_max_tres_minutes.py
│   ├── test_dbv0036_qos_limits_max_tres_minutes_per.py
│   ├── test_dbv0036_qos_limits_max_tres_per.py
│   ├── test_dbv0036_qos_limits_max_wall_clock.py
│   ├── test_dbv0036_qos_limits_max_wall_clock_per.py
│   ├── test_dbv0036_qos_limits_min.py
│   ├── test_dbv0036_qos_limits_min_tres.py
│   ├── test_dbv0036_qos_limits_min_tres_per.py
│   ├── test_dbv0036_qos_preempt.py
│   ├── test_dbv0036_response_account_delete.py
│   ├── test_dbv0036_response_association_delete.py
│   ├── test_dbv0036_response_cluster_add.py
│   ├── test_dbv0036_response_cluster_delete.py
│   ├── test_dbv0036_response_qos_delete.py
│   ├── test_dbv0036_response_tres.py
│   ├── test_dbv0036_response_user_delete.py
│   ├── test_dbv0036_response_user_update.py
│   ├── test_dbv0036_response_wckey_add.py
│   ├── test_dbv0036_response_wckey_delete.py
│   ├── test_dbv0036_tres_info.py
│   ├── test_dbv0036_user.py
│   ├── test_dbv0036_user_associations.py
│   ├── test_dbv0036_user_default.py
│   ├── test_dbv0036_user_info.py
│   ├── test_dbv0036_wckey.py
│   ├── test_dbv0036_wckey_info.py
│   ├── test_openapi_api.py
│   └── test_slurm_api.py
├── test-requirements.txt
└── tox.ini
Comment 97 Bill Britt 2021-09-29 14:53:26 MDT
Created attachment 21520 [details]
OpenAPI file from after patching
Comment 98 Bill Britt 2021-09-29 15:22:39 MDT
We realized again that the openapi.json file we just uploaded was incorrect (it was from the build source).

Now, with the correct openapi.json downloaded via:

export $(scontrol token lifespan=99999); curl -v -s -H X-SLURM-USER-NAME:$(whoami) -H X-SLURM-USER-TOKEN:$SLURM_JWT https://REST_HOST/openapi > openapi.json

we run the script, but it hangs. Using curl to connect directly to the API also hangs:

Example curl:
export $(scontrol token lifespan=99999);curl -X 'GET' 'https://REST_HOST/slurmdb/v0.0.36/job/5868759'  -H 'accept: application/json' -H X-SLURM-USER-NAME:$(whoami) -H X-SLURM-USER-TOKEN:$SLURM_JWT
Comment 99 Nate Rini 2021-09-29 15:51:32 MDT
(In reply to Bill Britt from comment #98)
> Using curl to connect directly to the API also hangs:

Please attach the slurmrestd log.
Comment 100 Bill Britt 2021-09-30 09:21:59 MDT
Created attachment 21537 [details]
slurmrestd logs
Comment 101 Nate Rini 2021-09-30 10:14:52 MDT
Please place the curl headers inside quotes; somehow they are being converted to lower case:
> Sep 30 01:45:39 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:50580] Header: x-slurm-user-name Value: sadm_alin4
> Sep 30 01:45:39 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:50580] Header: x-slurm-user-token Value:

Please attach slurmctld log after.
Comment 102 Bill Britt 2021-09-30 11:14:47 MDT
Using this command:
export $(scontrol token lifespan=99999);curl -X 'GET' 'https://REST_HOST/slurmdb/v0.0.36/job/5868759'  -H 'accept: application/json' -H "X-SLURM-USER-NAME":"$(whoami)" -H "X-SLURM-USER-TOKEN":"$SLURM_JWT"


This is the output of the logs:

Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug:  parse_http: [[localhost]:54330] Accepted HTTP connection
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug:  _on_url: [[localhost]:54330] url path: /slurmdb/v0.0.36/job/5868759 query: (null)
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: Host Value: api-dev.cluster.ihme.washington.edu
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: X-Real-Ip Value: 10.158.154.56
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: X-Forwarded-For Value: 10.158.154.56
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: X-Frame-Options Value: SAMEORIGIN
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: X-Forwarded-Port Value: 443
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: Connection Value: upgrade
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: error: _on_header_value: [[localhost]:54330] ignoring unsupported header request: upgrade
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: user-agent Value: curl/7.68.0
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: accept Value: application/json
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: x-slurm-user-name Value: sadm_bbritt
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: x-slurm-user-token Value: ***REMOVED***
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: operations_router: [[localhost]:54330] GET /slurmdb/v0.0.36/job/5868759
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: No jobstep requested
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: No jobarray or hetjob requested
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug:  accounting_storage/slurmdbd: _connect_dbd_conn: Sent PersistInit msg

The curl call just hangs.
Comment 103 Nate Rini 2021-09-30 11:25:26 MDT
Please verify that Slurm is otherwise healthy:
>srun uptime
>sacct

If those work, please use gdb to get a backtrace and attach:
> gdb -ex 't a a bt full' -ex 'quit' -p $(pgrep slurmrestd)
Comment 104 Bill Britt 2021-09-30 12:05:10 MDT
Both srun and sacct work fine:

# srun -p all.q -A general -c 1 --mem 128 uptime
 18:04:28 up 49 days, 12:17,  0 users,  load average: 0.12, 0.12, 0.09
# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
5868783          uptime      all.q    general          1  COMPLETED      0:0
5868783.ext+     extern               general          1  COMPLETED      0:0
5868783.0        uptime               general          1  COMPLETED      0:0
Comment 105 Nate Rini 2021-09-30 12:21:33 MDT
(In reply to Nate Rini from comment #103)
> If those work, please use gdb to get a backtrace and attach:
> > gdb -ex 't a a bt full' -ex 'quit' -p $(pgrep slurmrestd)

Please attach the backtrace
Comment 106 Bill Britt 2021-09-30 14:13:52 MDT
Created attachment 21546 [details]
backtrace
Comment 108 Nate Rini 2021-09-30 16:04:38 MDT
Please attach the slurmdbd log. Looks like slurmrestd is waiting on an RPC reply from slurmdbd.
Comment 109 Bill Britt 2021-09-30 16:19:15 MDT
Update: a sample curl call eventually did return with an error:

# export $(scontrol token lifespan=99999);curl -X 'GET' 'https://api-dev.cluster.ihme.washington.edu/slurmdb/v0.0.36/job/5868759'  -H 'accept: application/json' -H "X-SLURM-USER-NAME":"$(whoami)" -H "X-SLURM-USER-TOKEN":"$SLURM_JWT"
{
   "meta": {
     "plugin": {
       "type": "openapi\/dbv0.0.36",
       "name": "REST DB v0.0.36"
     },
     "Slurm": {
       "version": {
         "major": 20,
         "micro": 8,
         "minor": 11
       },
       "release": "20.11.8"
     }
   },
   "errors": [
     {
       "description": "Unknown error with query",
       "error_number": 9000,
       "error": "Query empty or not RFC7320 compliant",
       "source": "slurmdb_jobs_get"
     }
   ],
   "jobs": [
   ]
 }%
# Sacct of the same job:
# sacct -j 5868759
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
5868759        hostname      all.q    general          8  COMPLETED      0:0
5868759.ext+     extern               general          8  COMPLETED      0:0
5868759.0      hostname               general          8  COMPLETED      0:0
Comment 110 Bill Britt 2021-09-30 16:21:05 MDT
Created attachment 21555 [details]
slurmdbd log
Comment 111 Nate Rini 2021-09-30 16:22:00 MDT
(In reply to Bill Britt from comment #109)
> Update: a sample curl call eventually did return with an error:
>        "description": "Unknown error with query",
>        "error_number": 9000,
>        "error": "Query empty or not RFC7320 compliant",
>        "source": "slurmdb_jobs_get"

Please attach slurmdbd and slurmrestd log after this error posted.
Comment 112 Nate Rini 2021-10-01 11:04:53 MDT
(In reply to Nate Rini from comment #111)
> (In reply to Bill Britt from comment #109)
> > Update: a sample curl call eventually did return with an error:
> >        "description": "Unknown error with query",
> >        "error_number": 9000,
> >        "error": "Query empty or not RFC7320 compliant",
> >        "source": "slurmdb_jobs_get"
> 
> Please attach slurmdbd and slurmrestd log after this error posted.

> Please attach slurmrestd log after this error posted.
Comment 113 Ali Nikkhah 2021-10-01 13:35:03 MDT
Created attachment 21576 [details]
slurmrestd log curl error

Sorry for the delay, here is the slurmrestd log for the curl error.
Comment 114 Nate Rini 2021-10-01 13:36:13 MDT
(In reply to Ali Nikkhah from comment #113)
> Created attachment 21576 [details]
> slurmrestd log curl error
> 
> Sorry for the delay, here is the slurmrestd log for the curl error.

Happy to work on your timeline here.
Comment 115 Nate Rini 2021-10-01 15:28:29 MDT
(In reply to Nate Rini from comment #114)
> (In reply to Ali Nikkhah from comment #113)
> > Created attachment 21576 [details]
> > slurmrestd log curl error
> > 
> > Sorry for the delay, here is the slurmrestd log for the curl error.
> 
> Happy to work on your timeline here.

Please attach the slurmdbd logs around Sep 30 21:48:15
Comment 116 Ali Nikkhah 2021-10-01 23:00:21 MDT
Created attachment 21581 [details]
slurmdbd log around curl time

This is the full slurmdbd log from around that time. There is not much to it.
Comment 117 Nate Rini 2021-10-04 09:01:38 MDT
(In reply to Ali Nikkhah from comment #116)
> Created attachment 21581 [details]
> slurmdbd log around curl time
> 
> This is the full slurmdbd log from around that time. There is not much to it.

We will have to increase the logging level for slurmdbd in slurmdbd.conf:
> DebugLevel=debug3
> DebugFlags=DB_QUERY,DB_JOB,network,protocol

Slurmdbd will need to be restarted to activate the higher logging level. Please run the query at least a couple of times, then revert the logging changes and restart slurmdbd. Please then upload the logs from slurmdbd and slurmrestd.
Comment 118 Ali Nikkhah 2021-10-04 16:10:52 MDT
Created attachment 21593 [details]
slurmdbd debug3 and debugflags log
Comment 119 Ali Nikkhah 2021-10-04 16:14:09 MDT
Created attachment 21594 [details]
slurmrestd debug3 log
Comment 120 Nate Rini 2021-10-05 10:18:46 MDT
> debug:  _conn_readable: poll for fd 9 timeout after 900000 msecs of total wait 900000 msecs.
> error: Getting response to message type: DBD_GET_JOBS_COND

Looks like slurmrestd is unable to contact slurmdbd and is timing out. Does calling 'sacct' on the same node as slurmrestd work? Does it work when setting 'SLURM_JWT' in the env?
Comment 121 Ali Nikkhah 2021-10-05 12:22:38 MDT
(In reply to Nate Rini from comment #120)
> > debug:  _conn_readable: poll for fd 9 timeout after 900000 msecs of total wait 900000 msecs.
> > error: Getting response to message type: DBD_GET_JOBS_COND
> 
> Looks like slurmrestd is unable to contact slurmdbd and is timing out. Does
> calling 'sacct' on the same node as slurmrestd work? Does it work when
> setting 'SLURM_JWT' in the env?

Calling 'sacct' from the slurmrestd node works with and without SLURM_JWT set in the environment. The results are identical.
Comment 122 Nate Rini 2021-10-05 12:44:58 MDT
Please attach strace to slurmrestd and attach the resultant logs after a test request:
> strace -p $(grep slurmrestd) -o /tmp/strace.slurmrestd -s999 -tt
Comment 123 Nate Rini 2021-10-05 12:50:09 MDT
(In reply to Nate Rini from comment #122)
> Please attach strace to slurmrestd and attach the resultant logs after a
> test request:

Slight typo:

> > strace -p $(pgrep slurmrestd) -o /tmp/strace.slurmrestd -s999 -tt
Comment 124 Ali Nikkhah 2021-10-05 14:07:17 MDT
Created attachment 21610 [details]
slurmrestd strace
Comment 125 Nate Rini 2021-10-06 08:48:19 MDT
(In reply to Ali Nikkhah from comment #124)
> Created attachment 21610 [details]
> slurmrestd strace

Was strace run for the duration of a curl request? The attached log is of an idle slurmrestd.
Comment 126 Ali Nikkhah 2021-10-06 12:25:59 MDT
(In reply to Nate Rini from comment #125)
> (In reply to Ali Nikkhah from comment #124)
> > Created attachment 21610 [details]
> > slurmrestd strace
> 
> Was strace run for the duration of a curl request? The attached log is of an
> idle slurmrestd.

Yes, it was. Looking closer at this, it seems that there is something else wrong with this particular cluster that we were doing the latest debugging on. Bill will provide a further update; it looks like we may be good now.
Comment 127 Sam Hu 2021-10-08 17:23:25 MDT
Created attachment 21681 [details]
j18261_job_output_of_slurmdbd_get_job

We have applied the patch for the missing tres value, and the tres values now show up; thank you. We have three follow-up questions. Please refer to the two files attached today:
1. In the cmd output file j18261_job_output_of_sacct, the 4K MaxRSS value should show up in the total tres section of j18261_job_output_of_slurmdbd_get_job (the section almost at the EOF), but we only see 1234. Shouldn't 4K (4096) show up at the total level?
2. We see "allocated" and "requested" in tres at the top level. Shouldn't "consumed" also be populated at the top level, with 'mem' populated within it?
3. We can see "time" values at the top level (just above total tres), but they are all 0s. However, each step's "time" has good >0 values. Why wasn't "time" totaled at the top level?
Comment 128 Sam Hu 2021-10-08 17:24:11 MDT
Created attachment 21682 [details]
j18261_job_output_of_sacct
Comment 129 Nate Rini 2021-10-12 09:48:36 MDT
(In reply to Sam Hu from comment #127)
> 1. For the cmd output file j18261_job_output_of_sacct, the 4K value for
> MaxRSS should show up in the j18261_job_output_of_slurmdbd_get_job's total
> tres section(the section almost to the EOF); but we only see 1234. Shouldn't
> 4K(4096) show up at the total level?

It does for the step:
> 203                                        'total': [
> 207                                                  {'count': 4096,⏎
> 208                                                   'id': 2,⏎
> 209                                                   'name': None,⏎
> 210                                                   'type': 'mem'},⏎

> 2. We see "allocated" and "requested" in tres at the top level. Shouldn't
> "consumed" be populated at the top level, and 'mem' gets populated there?
> 3. We can see "time" values at the top level(just above total tres), but
> they are 0s. However if we look into each step's "time", they have good >0
> values there. 

slurmrestd does not currently sum the steps' usage into a whole-job usage the way sacct does. If this is desired functionality, please submit an RFE; the expectation was that sites could easily add this up themselves, since slurmrestd provides all of the data.

> Why wasn't "time" totaled at the top level?

In most cases slurmrestd dumps the values provided by slurmctld or slurmdbd. In this case, the suspended, system, total, and user times are present in the RPC but not populated (i.e., zeroes). We may just remove them from the job tree since they are provided for each step directly.
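The per-site summation suggested above could look something like this (a sketch; the step records are assumed to follow the dbv0.0.36 "tres" shape with `count`/`type`/`name` entries, as in the attachments):

```python
from collections import defaultdict

def total_tres(steps, section="requested", metric="max"):
    """Sum one TRES metric (e.g. requested.max) across all steps of a job."""
    totals = defaultdict(int)
    for step in steps:
        for entry in step.get("tres", {}).get(section, {}).get(metric, []):
            # Key on (type, name) so named entries (e.g. gres) stay distinct.
            totals[(entry["type"], entry.get("name"))] += entry["count"]
    return dict(totals)

# Hypothetical step data shaped like the slurmdbd_get_job output:
steps = [
    {"tres": {"requested": {"max": [{"type": "mem", "name": None, "count": 4096},
                                    {"type": "cpu", "name": None, "count": 1}]}}},
    {"tres": {"requested": {"max": [{"type": "mem", "name": None, "count": 1234}]}}},
]
print(total_tres(steps))  # {('mem', None): 5330, ('cpu', None): 1}
```

Note that for a metric like MaxRSS, sacct reports the per-step maximum rather than a sum, so the aggregation (sum vs. max) should be chosen per metric.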
Comment 134 Nate Rini 2021-10-26 15:51:50 MDT
The corrective patches are now upstream for the upcoming 21.08.3 release:
> https://github.com/SchedMD/slurm/commit/7595e0f5409d8308471874f488197bf24403294e
> https://github.com/SchedMD/slurm/commit/b87275e4807d1c8681fc4ae5792cd2a84a47b410
Comment 138 Nate Rini 2021-10-26 16:18:37 MDT
Closing as the patches are now upstream. Please respond if there are any more questions.