Created attachment 21055 [details] API output called outside of openapi generated library.

Hello, we are unable to obtain TRES values using Python code generated by the following openapi-generator call on Slurm 20.11.8:

> docker run --rm \
>   --volume ${PWD}:/local \
>   --workdir=/local \
>   openapitools/openapi-generator-cli:v4.3.1 \
>   generate \
>   --input-spec /local/slurm-api.json \
>   --output /local/py_api_client \
>   --generator-name python \
>   --package-name slurm_urllib3

The attached sacct.png file shows the MaxRSS value returned by sacct, and the slurmdbd_get_job_result.txt file shows that the value is returned by the API when called directly with cURL against the /slurmdb/v0.0.36/job/{job_id} route; however, we cannot find the value in the result of "slurm_api.slurmdbd_get_job".

Version info:
> # /opt/slurm/sbin/slurmrestd -V
> slurm 20.11.8
> # sbatch -V
> slurm 20.11.8
Created attachment 21056 [details] sacct output with MaxRSS Value
(In reply to Bill Britt from comment #1)
> Created attachment 21056 [details]
> sacct output with MaxRSS Value

For future reference, we prefer 'sacct -p' formatted output attached to the ticket as a file instead of an image.
(In reply to Bill Britt from comment #0)
> the value is returned by the API
> when called directly with cURL to the /slurmdb/v0.0.36/job/{job_id} route;
> however we cannot find the result from "slurm_api.slurmdbd_get_job".

Please also attach this curl output.
Created attachment 21070 [details] sacct_output
Created attachment 21071 [details] curl_output
(In reply to Sam Hu from comment #10)
> Created attachment 21071 [details]
> curl_output

The auth token was included in the curl output. Please make sure to cycle the JWT key on the cluster unless this is a trivial test cluster.
Please also attach your slurm.conf
Created attachment 21083 [details] slurm.conf
Please call the following and attach the output (as a file):
> sacct -p -o all -j 1523 -D
Created attachment 21088 [details] Output of sacct -p -o all -j 1523 -D
Bill,

Still working on this issue. The raw values reported by slurmrestd appear to be incorrect. Still working on debugging the cause.

Thanks,
--Nate
Please attach your slurmdbd.conf (sans passwords).
Created attachment 21221 [details] slurmdbd.conf with password removed
Hi Nate, added the slurmdbd.conf file. If it will help I would be happy to schedule a screen sharing session to troubleshoot on the system live.
Please call the following in mysql against the Slurm db and attach the output:
> use slurm_acct_db;
> select * from tres_table;
> select * from general_step_table where job_db_inx = 1523\G
MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select * from tres_table;
+---------------+---------+------+----------------+------+
| creation_time | deleted | id   | type           | name |
+---------------+---------+------+----------------+------+
|    1623869945 |       0 |    1 | cpu            |      |
|    1623869945 |       0 |    2 | mem            |      |
|    1623869945 |       0 |    3 | energy         |      |
|    1623869945 |       0 |    4 | node           |      |
|    1623869945 |       0 |    5 | billing        |      |
|    1623869945 |       0 |    6 | fs             | disk |
|    1623869945 |       0 |    7 | vmem           |      |
|    1623869945 |       0 |    8 | pages          |      |
|    1623869945 |       1 | 1000 | dynamic_offset |      |
+---------------+---------+------+----------------+------+
9 rows in set (0.000 sec)

MariaDB [slurm_acct_db]> select * from general_step_table where job_db_inx = 1523\G
Empty set (0.000 sec)
(In reply to Bill Britt from comment #27)
> If it will help I would be happy to schedule a screen sharing session to
> troubleshoot on the system live.

We usually avoid live sessions outside purchased consulting sessions. It's also usually not helpful to have a customer wait while we write and test our requests.

(In reply to Bill Britt from comment #29)
> MariaDB [slurm_acct_db]> select * from general_step_table where job_db_inx = 1523\G
> Empty set (0.000 sec)

Looks like the job numbers have already wrapped on this cluster. Please call this instead:
> use slurm_acct_db;
> select * from general_step_table where job_db_inx in (select job_db_inx from general_job_table where id_job = 1523)\G
MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select * from general_step_table where job_db_inx in (select job_db_inx from general_job_table where id_job = 1523)\G
*************************** 1. row ***************************
               job_db_inx: 2643
                  deleted: 0
                exit_code: 0
                  id_step: -5
            step_het_comp: 4294967294
              kill_requid: -1
                 nodelist: gen-slurm-sarchive-s01.cluster.ihme.washington.edu
              nodes_alloc: 1
                 node_inx: 0
                    state: 3
                step_name: batch
                 task_cnt: 1
                task_dist: 0
               time_start: 1630003657
                 time_end: 1630003659
           time_suspended: 0
                 user_sec: 0
                user_usec: 6676
                  sys_sec: 0
                 sys_usec: 3509
              act_cpufreq: 1197
          consumed_energy: 0
          req_cpufreq_min: 0
              req_cpufreq: 0
          req_cpufreq_gov: 0
               tres_alloc: 1=3,2=178,4=1
        tres_usage_in_ave: 1=0,2=32768,3=0,6=0,7=82403328,8=0
        tres_usage_in_max: 1=0,2=32768,3=0,6=0,7=82403328,8=0
 tres_usage_in_max_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_max_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_min: 1=0,2=32768,3=0,6=0,7=82403328,8=0
 tres_usage_in_min_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_min_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_tot: 1=0,2=32768,3=0,6=0,7=82403328,8=0
       tres_usage_out_ave: 3=0,6=0
       tres_usage_out_max: 3=0,6=0
tres_usage_out_max_taskid: 6=0
tres_usage_out_max_nodeid: 3=0,6=0
       tres_usage_out_min: 3=0,6=0
tres_usage_out_min_taskid: 6=0
tres_usage_out_min_nodeid: 3=0,6=0
       tres_usage_out_tot: 3=0,6=0
*************************** 2. row ***************************
               job_db_inx: 2643
                  deleted: 0
                exit_code: 0
                  id_step: -4
            step_het_comp: 4294967294
              kill_requid: -1
                 nodelist: gen-slurm-sarchive-s01.cluster.ihme.washington.edu
              nodes_alloc: 1
                 node_inx: 0
                    state: 3
                step_name: extern
                 task_cnt: 1
                task_dist: 0
               time_start: 1630003657
                 time_end: 1630003659
           time_suspended: 0
                 user_sec: 0
                user_usec: 2093
                  sys_sec: 0
                 sys_usec: 0
              act_cpufreq: 1197
          consumed_energy: 0
          req_cpufreq_min: 0
              req_cpufreq: 0
          req_cpufreq_gov: 0
               tres_alloc: 1=3,2=178,3=18446744073709551614,4=1,5=3
        tres_usage_in_ave: 1=0,2=0,3=0,6=4292,7=8269824,8=0
        tres_usage_in_max: 1=0,2=0,3=0,6=4292,7=8269824,8=0
 tres_usage_in_max_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_max_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_min: 1=0,2=0,3=0,6=4292,7=8269824,8=0
 tres_usage_in_min_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_min_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_tot: 1=0,2=0,3=0,6=4292,7=8269824,8=0
       tres_usage_out_ave: 3=0,6=0
       tres_usage_out_max: 3=0,6=0
tres_usage_out_max_taskid: 6=0
tres_usage_out_max_nodeid: 3=0,6=0
       tres_usage_out_min: 3=0,6=0
tres_usage_out_min_taskid: 6=0
tres_usage_out_min_nodeid: 3=0,6=0
       tres_usage_out_tot: 3=0,6=0
2 rows in set (0.001 sec)
(In reply to Nate Rini from comment #8)
> (In reply to Bill Britt from comment #0)
> > the value is returned by the API
> > when called directly with cURL to the /slurmdb/v0.0.36/job/{job_id} route;
> > however we cannot find the result from "slurm_api.slurmdbd_get_job".
>
> Please also attach this curl output.

Please attach an updated query of this output.
Created attachment 21225 [details] slurmdbd_get_job_result_j1523_0910 Please see attached for the result of slurmdbd_get_job for j1523.
(In reply to Sam Hu from comment #38)
> Created attachment 21225 [details]
> slurmdbd_get_job_result_j1523_0910
>
> Please see attached for the result of slurmdbd_get_job for j1523.

Extracting the mem request for the job from the sacct output in comment#17:
> ReqMem  AllocTRES
> 178Mn   billing=3,cpu=3,mem=178M,node=1
> 178Mn   cpu=3,mem=178M,node=1
> 178Mn   billing=3,cpu=3,mem=178M,node=1

Matches the value provided in the attachment in comment#38:
> jobtres = {'allocated': [...
>   {'count': 178, 'id': 2, 'name': None, 'type': 'mem'},

Both are 178MiB per node.

Please also provide a new copy of the curl output from doing the query directly, same command as comment#10.
Created attachment 21231 [details] curl_output_0910

The latest curl_output no longer has the info we were seeing before. Not sure what happened. But the point remains: sacct can show a 32K MaxRSS, and the OpenAPI client does not provide that value.
Created attachment 21232 [details] sacct_output_0910
(In reply to Sam Hu from comment #40)
> Created attachment 21231 [details]
> curl_output_0910

This looks like a different bug. Somehow there is a "job_id": 0 with some values but not all that should be present. Please attach the slurmrestd logs from when this was generated.

> The latest curl_output no longer has the info we were seeing before. Not
> sure what happened.

Does this cluster purge jobs? If not, a job should never get lost by slurmdbd, which is concerning. Please also attach the slurmctld and slurmdbd logs from around the time of the query.

> But the point remains. The sacct can produce 32K MaxRSS,
> and the OpenAPI version does not provide that.

I have not been able to reproduce the issue locally. The RSS used in my tests always matches the value in sacct and slurmrestd's output. When I got the raw data from MySQL, it didn't match the original output, which suggests something else is going on, as slurmrestd shouldn't manipulate the TRES values reported.
The log level for both slurmctld and slurmdbd was set to info; what log level would you like us to use when we reproduce this issue?
Regarding the purging jobs question: I do not believe our cluster is configured to do this.
(In reply to Bill Britt from comment #44)
> Regarding the purging jobs question: I do not believe our cluster is
> configured to do this.

It was not present in the slurmdbd.conf, but your site may use direct sacctmgr dump commands to do it. I just want to be sure.
(In reply to Bill Britt from comment #43)
> The log level for both slurmctld and slurmdbd was set to info, what log
> level would you like for us to use when we reproduce this issue?

This is specific to slurmdbd, so please set this in slurmdbd.conf:
> DebugFlags=DB_TRES
> DebugLevel=debug3

Normal logs for slurmctld are sufficient. I just want to make sure it's not having issues sending updates to slurmdbd.

Also, the output of 'sdiag' would be helpful.
Created attachment 21272 [details] sacct_output0914
Created attachment 21273 [details] curl_output_0914
Created attachment 21274 [details] slurmdbd_get_job_result_0914
Reviewing logs
Created attachment 21278 [details] Another set of test (with job created today)
Created attachment 21279 [details] curl_output_0914_1
Created attachment 21280 [details] slurmdbd_get_job_result_0914_1
Created attachment 21282 [details] slurmdbd logs
Created attachment 21283 [details] slurmctld logs
Please also provide:
> sacct -p -o all -D -j 1523
> sacct -p -o all -D -j 16445
(In reply to Bill Britt from comment #55)
> Created attachment 21283 [details]
> slurmctld logs
>
> [2021-09-14T15:06:49.360] error: High latency for 1000 calls to gettimeofday(): 754 microseconds

This is probably unrelated to this ticket, but it is a known issue with certain clocks. Please see bug#11492 comment#3.
(In reply to Nate Rini from comment #46) > Also the output of 'sdiag' would be helpful. Please don't forget the sdiag output too.
root@gen-slurm-sctl-s01:~# sdiag
*******************************************************
sdiag output at Tue Sep 14 22:50:23 2021 (1631659823)
Data since      Tue Sep 14 15:06:53 2021 (1631632013)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 72
Jobs started:   72
Jobs completed: 71
Jobs canceled:  0
Jobs failed:    0

Job states ts:  Tue Sep 14 22:50:02 2021 (1631659802)
Jobs pending:   0
Jobs running:   1

Main schedule statistics (microseconds):
    Last cycle:   33
    Max cycle:    214949
    Total cycles: 535
    Mean cycle:   685
    Mean depth cycle:  0
    Cycles per minute: 1
    Last queue length: 0

Backfilling stats
    Total backfilled jobs (since last slurm start): 0
    Total backfilled jobs (since last stats cycle start): 0
    Total backfilled heterogeneous job components: 0
    Total cycles: 0
    Last cycle when: N/A
    Last cycle: 0
    Max cycle:  0
    Last depth cycle: 0
    Last depth cycle (try sched): 0
    Last queue length: 0
    Last table size: 0

Latency for 1000 calls to gettimeofday(): 754 microseconds

Remote Procedure Call statistics by message type
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:1120 ave_time:1306 total_time:1463634
    REQUEST_PARTITION_INFO           ( 2009) count:980 ave_time:455 total_time:446508
    REQUEST_JOB_INFO                 ( 2003) count:546 ave_time:3745 total_time:2045247
    REQUEST_NODE_INFO                ( 2007) count:516 ave_time:541 total_time:279520
    ACCOUNTING_UPDATE_MSG            (10001) count:110 ave_time:20582650 total_time:2264091594
    REQUEST_STEP_COMPLETE            ( 5016) count:77 ave_time:115817 total_time:8917960
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:72 ave_time:292 total_time:21062
    REQUEST_COMPLETE_PROLOG          ( 6018) count:72 ave_time:425 total_time:30623
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:67 ave_time:2219 total_time:148724
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:67 ave_time:1143 total_time:76635
    REQUEST_AUTH_TOKEN               ( 5039) count:14 ave_time:671 total_time:9394
    REQUEST_JOB_READY                ( 4019) count:10 ave_time:458 total_time:4584
    REQUEST_SHARE_INFO               ( 2022) count:6 ave_time:5251 total_time:31509
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:5 ave_time:894 total_time:4473
    REQUEST_JOB_STEP_CREATE          ( 5001) count:5 ave_time:1030 total_time:5152
    REQUEST_RESOURCE_ALLOCATION      ( 4001) count:5 ave_time:42775 total_time:213876
    ACCOUNTING_REGISTER_CTLD         (10003) count:1 ave_time:11462 total_time:11462
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:1 ave_time:8916282 total_time:8916282

Remote Procedure Call statistics by user
    root        (      0) count:3368 ave_time:3935 total_time:13253543
    slurm       (  64030) count:111 ave_time:20397324 total_time:2264103056
    samhu       ( 701264) count:103 ave_time:972 total_time:100211
    sadm_bbritt ( 700713) count:60 ave_time:2243 total_time:134598
    dhs2018     ( 700848) count:26 ave_time:349820 total_time:9095322
    sadm_alin4  ( 701322) count:6 ave_time:5251 total_time:31509

Pending RPC statistics
    No pending RPCs
(In reply to Nate Rini from comment #56)
> > sacct -p -o all -D -j 1523
> > sacct -p -o all -D -j 16445

Please also provide the above.
Created attachment 21288 [details] sacct_p_all_d_j_1523
Created attachment 21289 [details] sacct_p_all_d_j_16445
(In reply to Sam Hu from comment #48)
> Created attachment 21273 [details]
> curl_output_0914

The log looks odd:
> 'https://api-stage.cluster.ihme.washington.edu/slurmdb/v0.0.36/job/.1523'

Please verify that the query had '.1523' as the JobId. It looks like adding the '.' caused slurmrestd to query the wrong job:
> "job_id": 16153
Created attachment 21317 [details] curl_output_0916_j1523 updated

Interesting. New files attached.
Created attachment 21318 [details] curl_output_0916_j16445
(In reply to Sam Hu from comment #64) > Created attachment 21317 [details] > curl_output_0916_j1523 updated > > Interesting. New files attached. I have opened a new child ticket to fix the filter issue: bug#12507
(In reply to Sam Hu from comment #64)
> Created attachment 21317 [details]
> curl_output_0916_j1523 updated
>
> Interesting. New files attached.

Comparing to the attachment in comment#61:

slurmrestd:
> $ jq -C '.jobs[0].steps[0].tres.allocated[]|select(.type=="mem").count' curl_output_0916_j1523.json
> 178

sacct:
> AllocTres: mem=178M

That looks correct for the job memory allocation.

slurmrestd:
> $ jq -C '.jobs[0].steps[0].tres.requested.average[]|select(.type=="mem").count' curl_output_0916_j1523.json
> 32768
> $ jq -C '.jobs[0].steps[0].tres.requested.max[]|select(.type=="mem").count' curl_output_0916_j1523.json
> 32768

Converting units: 32768B -> 32KiB

sacct:
> TRESUsageInMin: mem=32K

The RSS usage is also correctly reported in the curl output. Can you please provide another dump from the openapi client, but please make sure that the requested JobID is correct.
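The jq extractions above can be sketched in Python for anyone cross-checking the JSON by hand. The response fragment below is a trimmed stand-in for the slurmrestd payload (assumption: only the fields needed for the check are reproduced), not the full attachment:

```python
# Trimmed stand-in for the slurmrestd /slurmdb/v0.0.36/job/{job_id} response;
# only the TRES fields used in the jq checks above are included here.
response = {
    "jobs": [{
        "steps": [{
            "tres": {
                "allocated": [{"type": "mem", "count": 178}],
                "requested": {
                    "average": [{"type": "mem", "count": 32768}],
                    "max": [{"type": "mem", "count": 32768}],
                },
            },
        }],
    }],
}

def mem_count(tres_list):
    """Return the count of the 'mem' entry, mirroring jq's select(.type=="mem")."""
    return next(t["count"] for t in tres_list if t["type"] == "mem")

tres = response["jobs"][0]["steps"][0]["tres"]
print(mem_count(tres["allocated"]))                 # 178 (MiB, job allocation)
print(mem_count(tres["requested"]["max"]))          # 32768 (bytes, max RSS)
print(mem_count(tres["requested"]["max"]) // 1024)  # 32 (KiB, matches sacct's 32K)
```

The unit conversion in the last line is the same 32768B -> 32KiB step described above.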
Created attachment 21348 [details] slurmdbd_get_job_result_j1523_j16445_0917
Created attachment 21349 [details] python source file

This Python source file generated the slurmdbd_get_job_result_j1523_j16445_0917 output file just uploaded. The script uses slurmdbd_get_job, but I have not seen any RSS info produced anywhere in the output set. E.g., if you search for 32 (for 32768 or 32KiB) in the output file, none can be found. In the curl output, however, you can find 32 (for 32768 or 32KiB). That's the basic question we have: where can we find the RSS (32) when using slurmdbd_get_job?
(In reply to Sam Hu from comment #71)
> where can we find the RSS (32) from using slurmdbd_get_job?

The output provided in comment #70 does not have it, yet the output from slurmrestd (comment #64) does. I'm looking into why, but this may be a bug in the OpenAPI generator's python parser.
Right: the output provided in comment #70 (which is generated using slurmdbd_get_job) does not have it, yet the output from slurmrestd (comment #64) does.

Our need is to use slurmdbd_get_job to gather usage data, so we need this function working. Please let me know if there is anything else I can help with.
(In reply to Sam Hu from comment #76)
> Our need is to use slurmdbd_get_job to gather usage data. Hence we need to
> have this function working.

For reasons that are currently unclear, the steps array is not being fully parsed. I suggest opening a mirror ticket with https://github.com/OpenAPITools/openapi-generator. If this is in fact a bug on their part, it is something they will have to fix, as we have no control or oversight over their project.
Created attachment 21389 [details] simple script to look for step TRES

> [fred@login ~]$ python3 showjob.py 5
> no tres for batch
> no tres for extern
(In reply to Nate Rini from comment #78)
> Created attachment 21389 [details]
> simple script to look for step TRES
>
> > [fred@login ~]$ python3 showjob.py 5
> > no tres for batch
> > no tres for extern

Modified one of the example scripts and found that the steps are populated, but the TRES values in each step are not. The python code looks like it should recursively parse the data, but the tres member is always None despite it being present in the slurmrestd response.
(In reply to Nate Rini from comment #79)
> Modified one of the example scripts and found that the steps are populated.
> The TRES values in each step are not. The python code looks like it should
> recursively parse the data but the tres member is always None despite it
> being present in the slurmrestd response.

I believe I found the issue. The TRES object in the OpenAPI specification is incorrectly placed in the step (id) object instead of the step object. Working on a patch now.
Created attachment 21396 [details] showjob script
Created attachment 21397 [details] patch for 20.11
(In reply to Nate Rini from comment #86)
> Created attachment 21397 [details]
> patch for 20.11

This patch should correct the OAS for dbv0.0.36 in 20.11. Please verify, if possible, that it corrects your issue.

(In reply to Nate Rini from comment #85)
> Created attachment 21396 [details]
> showjob script

This script can be used to verify. Here is an example output:
> [fred@login ~]$ sbatch --wrap 'memhog 5G' --mem=10G
> [fred@login ~]$ sacct -o MaxRSS,jobid -j 2
>     MaxRSS JobID
> ---------- ------------
>                 2
>       132K 2.batch
>          0 2.extern
> [fred@login ~]$ python3 showjob.py 2
> {'count': 135168,
>  'id': 2,
>  'name': None,
>  'node': 'node00',
>  'task': 0,
>  'type': 'mem'}
> {'count': 0, 'id': 2, 'name': None, 'node': 'node00', 'task': 0, 'type': 'mem'}
Created attachment 21475 [details] New set of test 0927

Still an issue after the patch.

New set of tests conducted 09/27. Basically, we are looking for the 4K (shown in sacct_output_j16632_0927) in the output of slurmdbd_get_job_result_j16632_0927.txt.
Created attachment 21476 [details] sacct_output_j16632_0927
Created attachment 21477 [details] test_slurm_worker_node_for_schedmd_4
Created attachment 21478 [details] slurmdbd_get_job_result_j16632_0927
(In reply to Sam Hu from comment #88)
> Still an issue after the patch.
>
> New set of tests conducted 09/27. Basically we are looking for 4K (shown in
> the sacct_output_j16632_0927) in the output of
> slurmdbd_get_job_result_j16632_0927.txt.

Was the installed openapi client rebuilt and reinstalled after applying the patch?

> AttributeError: 'Dbv0036JobStep' object has no attribute 'tres'

This error should have been corrected by the patch. An explicit uninstall of the pre-patched version may need to be done with the python installer.
Created attachment 21514 [details] conda_slurm_reference

In my testing I have the conda environment, and using "pip freeze" I can see that I am using the newly generated slurm_rest.
I also created a new python environment and used the script supplied here: https://bugs.schedmd.com/attachment.cgi?id=21396 with a newly generated client.

Is there a way we can verify the patch is applied correctly?
(In reply to Bill Britt from comment #94)
> I also created a new python environment and used the script supplied here:
> https://bugs.schedmd.com/attachment.cgi?id=21396
> with a newly generated client.
>
> Is there a way we can verify the patch is applied correctly?

The structure returned by slurmdbd_get_job should have job.steps[].tres.requested.max defined. If it doesn't, then the patch wasn't applied. I suggest removing the installed egg and then, after applying the patch, re-generating and reinstalling. Also verify that the openapi.json is different.
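One way to express that verification in a few lines of Python. The SimpleNamespace objects below are stand-ins (an assumption, not the real generated classes) for what slurm_api.slurmdbd_get_job returns, so the attribute-walk logic can be shown without a live cluster:

```python
# Sketch: check that a parsed job exposes steps[].tres.requested.max, i.e.
# that the client was generated from the patched OpenAPI spec. The objects
# below are hypothetical stand-ins for the generated model instances.
from types import SimpleNamespace

def has_step_tres(job):
    """Return True only if every step carries tres.requested.max."""
    return all(
        getattr(step, "tres", None) is not None
        and getattr(step.tres, "requested", None) is not None
        and step.tres.requested.max is not None
        for step in job.steps
    )

# Shaped like a response from a client built with the patched spec.
patched_job = SimpleNamespace(steps=[SimpleNamespace(
    tres=SimpleNamespace(requested=SimpleNamespace(
        max=[{"type": "mem", "count": 32768}])))])

# Shaped like the pre-patch behavior reported above: tres is always None.
unpatched_job = SimpleNamespace(steps=[SimpleNamespace(tres=None)])

print(has_step_tres(patched_job))    # True  -> patch applied
print(has_step_tres(unpatched_job))  # False -> client still built from old spec
```

Running the same kind of check against a real slurmdbd_get_job result should distinguish a patched client from a stale install.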
Hi Nate,

We figured out why we were getting the same results; we had made a copy of the openapi.json file and were re-downloading the old version.

However, we have a new issue: the newly generated client appears to be missing libraries. I receive the following error after generating the python client with the new openapi.json:

> python3 showjob.py3
> Traceback (most recent call last):
>   File "/Users/sadm_bbritt/Downloads/openapi-test/py_api_client/showjob.py3", line 17, in <module>
>     from openapi_client.models import V0036JobSubmission as jobSubmission
> ImportError: cannot import name 'V0036JobSubmission' from 'openapi_client.models' (/Users/sadm_bbritt/Downloads/openapi-test/py_api_client/openapi_client/models/__init__.py).

It appears the "Dbv0036Account" model file is missing. Below is a tree of the generated client. This was generated with:

```
docker run --rm \
  --volume ${PWD}:/local \
  --workdir=/local \
  openapitools/openapi-generator-cli:v4.3.1 \
  generate \
  --input-spec /local/openapi.json \
  --output /local/py_api_client \
  --generator-name python \
  --package-name openapi_client
```

and the attached "patch_openapi.json" file.

├── README.md
├── docs
│   ├── Dbv0036Account.md
│   ├── Dbv0036AccountInfo.md
│   ├── Dbv0036AccountResponse.md
│   ├── Dbv0036Association.md
│   ├── Dbv0036AssociationDefault.md
│   ├── Dbv0036AssociationMax.md
│   ├── Dbv0036AssociationMaxJobs.md
│   ├── Dbv0036AssociationMaxJobsPer.md
│   ├── Dbv0036AssociationMaxPer.md
│   ├── Dbv0036AssociationMaxPerAccount.md
│   ├── Dbv0036AssociationMaxTres.md
│   ├── Dbv0036AssociationMaxTresMinutes.md
│   ├── Dbv0036AssociationMaxTresMinutesPer.md
│   ├── Dbv0036AssociationMaxTresPer.md
│   ├── Dbv0036AssociationMin.md
│   ├── Dbv0036AssociationShortInfo.md
│   ├── Dbv0036AssociationUsage.md
│   ├── Dbv0036AssociationsInfo.md
│   ├── Dbv0036ClusterInfo.md
│   ├── Dbv0036ClusterInfoAssociations.md
│   ├── Dbv0036ClusterInfoController.md
│   ├── Dbv0036ConfigInfo.md
│   ├── Dbv0036ConfigResponse.md
│   ├── Dbv0036CoordinatorInfo.md
│   ├── Dbv0036Diag.md
│   ├── Dbv0036DiagRPCs.md
│   ├── Dbv0036DiagRollups.md
│   ├── Dbv0036DiagTime.md
│   ├── Dbv0036DiagTime1.md
│   ├── Dbv0036DiagUsers.md
│   ├── Dbv0036Error.md
│   ├── Dbv0036Job.md
│   ├── Dbv0036JobArray.md
│   ├── Dbv0036JobArrayLimits.md
│   ├── Dbv0036JobArrayLimitsMax.md
│   ├── Dbv0036JobArrayLimitsMaxRunning.md
│   ├── Dbv0036JobComment.md
│   ├── Dbv0036JobExitCode.md
│   ├── Dbv0036JobExitCodeSignal.md
│   ├── Dbv0036JobHet.md
│   ├── Dbv0036JobInfo.md
│   ├── Dbv0036JobMcs.md
│   ├── Dbv0036JobRequired.md
│   ├── Dbv0036JobReservation.md
│   ├── Dbv0036JobState.md
│   ├── Dbv0036JobStep.md
│   ├── Dbv0036JobStepCPU.md
│   ├── Dbv0036JobStepCPURequestedFrequency.md
│   ├── Dbv0036JobStepNodes.md
│   ├── Dbv0036JobStepStatistics.md
│   ├── Dbv0036JobStepStatisticsCPU.md
│   ├── Dbv0036JobStepStatisticsEnergy.md
│   ├── Dbv0036JobStepStep.md
│   ├── Dbv0036JobStepStepHet.md
│   ├── Dbv0036JobStepTask.md
│   ├── Dbv0036JobStepTasks.md
│   ├── Dbv0036JobStepTime.md
│   ├── Dbv0036JobStepTres.md
│   ├── Dbv0036JobStepTresRequested.md
│   ├── Dbv0036JobTime.md
│   ├── Dbv0036JobTimeSystem.md
│   ├── Dbv0036JobTimeTotal.md
│   ├── Dbv0036JobTimeUser.md
│   ├── Dbv0036JobTres.md
│   ├── Dbv0036JobWckey.md
│   ├── Dbv0036Qos.md
│   ├── Dbv0036QosInfo.md
│   ├── Dbv0036QosLimits.md
│   ├── Dbv0036QosLimitsMax.md
│   ├── Dbv0036QosLimitsMaxAccruing.md
│   ├── Dbv0036QosLimitsMaxAccruingPer.md
│   ├── Dbv0036QosLimitsMaxJobs.md
│   ├── Dbv0036QosLimitsMaxJobsPer.md
│   ├── Dbv0036QosLimitsMaxTres.md
│   ├── Dbv0036QosLimitsMaxTresMinutes.md
│   ├── Dbv0036QosLimitsMaxTresMinutesPer.md
│   ├── Dbv0036QosLimitsMaxTresPer.md
│   ├── Dbv0036QosLimitsMaxWallClock.md
│   ├── Dbv0036QosLimitsMaxWallClockPer.md
│   ├── Dbv0036QosLimitsMin.md
│   ├── Dbv0036QosLimitsMinTres.md
│   ├── Dbv0036QosLimitsMinTresPer.md
│   ├── Dbv0036QosPreempt.md
│   ├── Dbv0036ResponseAccountDelete.md
│   ├── Dbv0036ResponseAssociationDelete.md
│   ├── Dbv0036ResponseClusterAdd.md
│   ├── Dbv0036ResponseClusterDelete.md
│   ├── Dbv0036ResponseQosDelete.md
│   ├── Dbv0036ResponseTres.md
│   ├── Dbv0036ResponseUserDelete.md
│   ├── Dbv0036ResponseUserUpdate.md
│   ├── Dbv0036ResponseWckeyAdd.md
│   ├── Dbv0036ResponseWckeyDelete.md
│   ├── Dbv0036TresInfo.md
│   ├── Dbv0036User.md
│   ├── Dbv0036UserAssociations.md
│   ├── Dbv0036UserDefault.md
│   ├── Dbv0036UserInfo.md
│   ├── Dbv0036Wckey.md
│   ├── Dbv0036WckeyInfo.md
│   ├── OpenapiApi.md
│   └── SlurmApi.md
├── git_push.sh
├── openapi_client
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── __pycache__
│   │   ├── __init__.cpython-39.pyc
│   │   ├── api_client.cpython-39.pyc
│   │   ├── configuration.cpython-39.pyc
│   │   ├── exceptions.cpython-39.pyc
│   │   └── rest.cpython-39.pyc
│   ├── api
│   │   ├── __init__.py
│   │   ├── __init__.pyc
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-39.pyc
│   │   │   ├── openapi_api.cpython-39.pyc
│   │   │   └── slurm_api.cpython-39.pyc
│   │   ├── openapi_api.py
│   │   ├── openapi_api.pyc
│   │   └── slurm_api.py
│   ├── api_client.py
│   ├── api_client.pyc
│   ├── configuration.py
│   ├── configuration.pyc
│   ├── exceptions.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-39.pyc
│   │   │   ├── dbv0036_account.cpython-39.pyc
│   │   │   ├── dbv0036_account_info.cpython-39.pyc
│   │   │   ├── dbv0036_account_response.cpython-39.pyc
│   │   │   ├── dbv0036_association.cpython-39.pyc
│   │   │   ├── dbv0036_association_default.cpython-39.pyc
│   │   │   ├── dbv0036_association_max.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_jobs.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_jobs_per.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_per.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_per_account.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_tres.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_tres_minutes.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_tres_minutes_per.cpython-39.pyc
│   │   │   ├── dbv0036_association_max_tres_per.cpython-39.pyc
│   │   │   ├── dbv0036_association_min.cpython-39.pyc
│   │   │   ├── dbv0036_association_short_info.cpython-39.pyc
│   │   │   ├── dbv0036_association_usage.cpython-39.pyc
│   │   │   ├── dbv0036_associations_info.cpython-39.pyc
│   │   │   ├── dbv0036_cluster_info.cpython-39.pyc
│   │   │   ├── dbv0036_cluster_info_associations.cpython-39.pyc
│   │   │   ├── dbv0036_cluster_info_controller.cpython-39.pyc
│   │   │   ├── dbv0036_config_info.cpython-39.pyc
│   │   │   ├── dbv0036_config_response.cpython-39.pyc
│   │   │   ├── dbv0036_coordinator_info.cpython-39.pyc
│   │   │   ├── dbv0036_diag.cpython-39.pyc
│   │   │   ├── dbv0036_diag_rollups.cpython-39.pyc
│   │   │   ├── dbv0036_diag_rp_cs.cpython-39.pyc
│   │   │   ├── dbv0036_diag_time.cpython-39.pyc
│   │   │   ├── dbv0036_diag_time1.cpython-39.pyc
│   │   │   ├── dbv0036_diag_users.cpython-39.pyc
│   │   │   ├── dbv0036_error.cpython-39.pyc
│   │   │   ├── dbv0036_job.cpython-39.pyc
│   │   │   ├── dbv0036_job_array.cpython-39.pyc
│   │   │   ├── dbv0036_job_array_limits.cpython-39.pyc
│   │   │   ├── dbv0036_job_array_limits_max.cpython-39.pyc
│   │   │   ├── dbv0036_job_array_limits_max_running.cpython-39.pyc
│   │   │   ├── dbv0036_job_comment.cpython-39.pyc
│   │   │   ├── dbv0036_job_exit_code.cpython-39.pyc
│   │   │   ├── dbv0036_job_exit_code_signal.cpython-39.pyc
│   │   │   ├── dbv0036_job_het.cpython-39.pyc
│   │   │   ├── dbv0036_job_info.cpython-39.pyc
│   │   │   ├── dbv0036_job_mcs.cpython-39.pyc
│   │   │   ├── dbv0036_job_required.cpython-39.pyc
│   │   │   ├── dbv0036_job_reservation.cpython-39.pyc
│   │   │   ├── dbv0036_job_state.cpython-39.pyc
│   │   │   ├── dbv0036_job_step.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_cpu.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_cpu_requested_frequency.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_nodes.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_statistics.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_statistics_cpu.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_statistics_energy.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_step.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_step_het.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_task.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_tasks.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_time.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_tres.cpython-39.pyc
│   │   │   ├── dbv0036_job_step_tres_requested.cpython-39.pyc
│   │   │   ├── dbv0036_job_time.cpython-39.pyc
│   │   │   ├── dbv0036_job_time_system.cpython-39.pyc
│   │   │   ├── dbv0036_job_time_total.cpython-39.pyc
│   │   │   ├── dbv0036_job_time_user.cpython-39.pyc
│   │   │   ├── dbv0036_job_tres.cpython-39.pyc
│   │   │   ├── dbv0036_job_wckey.cpython-39.pyc
│   │   │   ├── dbv0036_qos.cpython-39.pyc
│   │   │   ├── dbv0036_qos_info.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_accruing.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_accruing_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_jobs.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_jobs_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_tres.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_tres_minutes.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_tres_minutes_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_tres_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_wall_clock.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_max_wall_clock_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_min.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_min_tres.cpython-39.pyc
│   │   │   ├── dbv0036_qos_limits_min_tres_per.cpython-39.pyc
│   │   │   ├── dbv0036_qos_preempt.cpython-39.pyc
│   │   │   ├── dbv0036_response_account_delete.cpython-39.pyc
│   │   │   ├── dbv0036_response_association_delete.cpython-39.pyc
│   │   │   ├── dbv0036_response_cluster_add.cpython-39.pyc
│   │   │   ├── dbv0036_response_cluster_delete.cpython-39.pyc
│   │   │   ├── dbv0036_response_qos_delete.cpython-39.pyc
│   │   │   ├── dbv0036_response_tres.cpython-39.pyc
│   │   │   ├── dbv0036_response_user_delete.cpython-39.pyc
│   │   │   ├── dbv0036_response_user_update.cpython-39.pyc
│   │   │   ├── dbv0036_response_wckey_add.cpython-39.pyc
│   │   │   ├── dbv0036_response_wckey_delete.cpython-39.pyc
│   │   │   ├── dbv0036_tres_info.cpython-39.pyc
│   │   │   ├── dbv0036_user.cpython-39.pyc
│   │   │   ├── dbv0036_user_associations.cpython-39.pyc
│   │   │   ├── dbv0036_user_default.cpython-39.pyc
│   │   │   ├── dbv0036_user_info.cpython-39.pyc
│   │   │   ├── dbv0036_wckey.cpython-39.pyc
│   │   │   └── dbv0036_wckey_info.cpython-39.pyc
│   │   ├── dbv0036_account.py
│   │   ├── dbv0036_account_info.py
│   │   ├── dbv0036_account_response.py
│   │   ├── dbv0036_association.py
│   │   ├── dbv0036_association_default.py
│   │   ├── dbv0036_association_max.py
│   │   ├── dbv0036_association_max_jobs.py
│   │   ├── dbv0036_association_max_jobs_per.py
│   │   ├── dbv0036_association_max_per.py
│   │   ├── dbv0036_association_max_per_account.py
│   │   ├── dbv0036_association_max_tres.py
│   │   ├── dbv0036_association_max_tres_minutes.py
│   │   ├── dbv0036_association_max_tres_minutes_per.py
│   │   ├── dbv0036_association_max_tres_per.py
│   │   ├── dbv0036_association_min.py
│   │   ├── dbv0036_association_short_info.py
│   │   ├── dbv0036_association_usage.py
│   │   ├── dbv0036_associations_info.py
│   │   ├── dbv0036_cluster_info.py
│   │   ├── dbv0036_cluster_info_associations.py
│   │   ├── dbv0036_cluster_info_controller.py
│   │   ├── dbv0036_config_info.py
│   │   ├── dbv0036_config_response.py
│   │   ├── dbv0036_coordinator_info.py
│   │   ├── dbv0036_diag.py
│   │   ├── dbv0036_diag_rollups.py
│   │   ├── dbv0036_diag_rp_cs.py
│   │   ├── dbv0036_diag_time.py
│   │   ├── dbv0036_diag_time1.py
│   │   ├── dbv0036_diag_users.py
│   │   ├── dbv0036_error.py
│   │   ├── dbv0036_job.py
│   │   ├── dbv0036_job_array.py
│   │   ├── dbv0036_job_array_limits.py
│   │   ├── dbv0036_job_array_limits_max.py
│   │   ├── dbv0036_job_array_limits_max_running.py
│   │   ├── dbv0036_job_comment.py
│   │   ├── dbv0036_job_exit_code.py
│   │   ├── dbv0036_job_exit_code_signal.py
│   │   ├── dbv0036_job_het.py
│   │   ├── dbv0036_job_info.py
│   │   ├── dbv0036_job_mcs.py
│   │   ├── dbv0036_job_required.py
│   │   ├── dbv0036_job_reservation.py
│   │   ├── dbv0036_job_state.py
│   │   ├── dbv0036_job_step.py
│   │   ├── dbv0036_job_step_cpu.py
│   │   ├── dbv0036_job_step_cpu_requested_frequency.py
│   │   ├── dbv0036_job_step_nodes.py
│   │   ├── dbv0036_job_step_statistics.py
│   │   ├── dbv0036_job_step_statistics_cpu.py
│   │   ├── dbv0036_job_step_statistics_energy.py
│   │   ├── dbv0036_job_step_step.py
│   │   ├── dbv0036_job_step_step_het.py
│   │   ├── dbv0036_job_step_task.py
│   │   ├── dbv0036_job_step_tasks.py
│   │   ├── dbv0036_job_step_time.py
│   │   ├── dbv0036_job_step_tres.py
│   │   ├── dbv0036_job_step_tres_requested.py
│   │   ├── dbv0036_job_time.py
│   │   ├── dbv0036_job_time_system.py
│   │   ├── dbv0036_job_time_total.py
│   │   ├── dbv0036_job_time_user.py
│   │   ├── dbv0036_job_tres.py
│   │   ├── dbv0036_job_wckey.py
│   │   ├── dbv0036_qos.py
│   │   ├── dbv0036_qos_info.py
│   │   ├── dbv0036_qos_limits.py
│   │   ├── dbv0036_qos_limits_max.py
│   │   ├── dbv0036_qos_limits_max_accruing.py
│   │   ├── dbv0036_qos_limits_max_accruing_per.py
│   │   ├── dbv0036_qos_limits_max_jobs.py
│   │   ├── dbv0036_qos_limits_max_jobs_per.py
│   │   ├── dbv0036_qos_limits_max_tres.py
│   │   ├── dbv0036_qos_limits_max_tres_minutes.py
│   │   ├── dbv0036_qos_limits_max_tres_minutes_per.py
│   │   ├── dbv0036_qos_limits_max_tres_per.py
│   │   ├── dbv0036_qos_limits_max_wall_clock.py
│   │   ├── dbv0036_qos_limits_max_wall_clock_per.py
│   │   ├── dbv0036_qos_limits_min.py
│   │   ├── dbv0036_qos_limits_min_tres.py
│   │   ├── dbv0036_qos_limits_min_tres_per.py
│   │   ├── dbv0036_qos_preempt.py
│   │   ├── dbv0036_response_account_delete.py
│   │   ├── dbv0036_response_association_delete.py
│   │   ├── dbv0036_response_cluster_add.py
│   │   ├── dbv0036_response_cluster_delete.py
│   │   ├── dbv0036_response_qos_delete.py
│   │   ├── dbv0036_response_tres.py
│   │   ├── dbv0036_response_user_delete.py
│   │   ├── dbv0036_response_user_update.py
│   │   ├── dbv0036_response_wckey_add.py
│   │   ├── dbv0036_response_wckey_delete.py
│   │   ├── dbv0036_tres_info.py
│   │   ├── dbv0036_user.py
│   │   ├── dbv0036_user_associations.py
│   │   ├── dbv0036_user_default.py
│   │   ├── dbv0036_user_info.py
│   │   ├── dbv0036_wckey.py
│   │   └── dbv0036_wckey_info.py
│   └── rest.py
├── requirements.txt
├── setup.cfg
├── setup.py
├── showjob.py3
├── test
│   ├── __init__.py
│   ├── test_dbv0036_account.py
│   ├── test_dbv0036_account_info.py
│   ├── test_dbv0036_account_response.py
│   ├── test_dbv0036_association.py
│   ├── test_dbv0036_association_default.py
│   ├── test_dbv0036_association_max.py
│   ├── test_dbv0036_association_max_jobs.py
│   ├── test_dbv0036_association_max_jobs_per.py
│   ├── test_dbv0036_association_max_per.py
│   ├── test_dbv0036_association_max_per_account.py
│   ├── test_dbv0036_association_max_tres.py
│   ├── test_dbv0036_association_max_tres_minutes.py
│   ├── test_dbv0036_association_max_tres_minutes_per.py
│   ├── test_dbv0036_association_max_tres_per.py
│   ├── test_dbv0036_association_min.py
│   ├── test_dbv0036_association_short_info.py
│   ├── test_dbv0036_association_usage.py
│   ├── test_dbv0036_associations_info.py
│   ├── test_dbv0036_cluster_info.py
│   ├── test_dbv0036_cluster_info_associations.py
│   ├── test_dbv0036_cluster_info_controller.py
│   ├── test_dbv0036_config_info.py
│   ├── test_dbv0036_config_response.py
│   ├── test_dbv0036_coordinator_info.py
│   ├── test_dbv0036_diag.py
│   ├── test_dbv0036_diag_rollups.py
│   ├── test_dbv0036_diag_rp_cs.py
│   ├── test_dbv0036_diag_time.py
│   ├── test_dbv0036_diag_time1.py
│   ├── test_dbv0036_diag_users.py
│   ├── test_dbv0036_error.py
│   ├── test_dbv0036_job.py
│   ├── test_dbv0036_job_array.py
│   ├── test_dbv0036_job_array_limits.py
│   ├── test_dbv0036_job_array_limits_max.py
│   ├── test_dbv0036_job_array_limits_max_running.py
│   ├── test_dbv0036_job_comment.py
│   ├── test_dbv0036_job_exit_code.py
│   ├── test_dbv0036_job_exit_code_signal.py
│   ├── test_dbv0036_job_het.py
│   ├── test_dbv0036_job_info.py
│   ├── test_dbv0036_job_mcs.py
│   ├── test_dbv0036_job_required.py
│   ├── test_dbv0036_job_reservation.py
│   ├── test_dbv0036_job_state.py
│   ├── test_dbv0036_job_step.py
│   ├── test_dbv0036_job_step_cpu.py
│   ├── test_dbv0036_job_step_cpu_requested_frequency.py
│   ├── test_dbv0036_job_step_nodes.py
│   ├── test_dbv0036_job_step_statistics.py
│   ├── test_dbv0036_job_step_statistics_cpu.py
│   ├── test_dbv0036_job_step_statistics_energy.py
│   ├── test_dbv0036_job_step_step.py
│   ├── test_dbv0036_job_step_step_het.py
│   ├── test_dbv0036_job_step_task.py
│   ├── test_dbv0036_job_step_tasks.py
│   ├── 
test_dbv0036_job_step_time.py │ ├── test_dbv0036_job_step_tres.py │ ├── test_dbv0036_job_step_tres_requested.py │ ├── test_dbv0036_job_time.py │ ├── test_dbv0036_job_time_system.py │ ├── test_dbv0036_job_time_total.py │ ├── test_dbv0036_job_time_user.py │ ├── test_dbv0036_job_tres.py │ ├── test_dbv0036_job_wckey.py │ ├── test_dbv0036_qos.py │ ├── test_dbv0036_qos_info.py │ ├── test_dbv0036_qos_limits.py │ ├── test_dbv0036_qos_limits_max.py │ ├── test_dbv0036_qos_limits_max_accruing.py │ ├── test_dbv0036_qos_limits_max_accruing_per.py │ ├── test_dbv0036_qos_limits_max_jobs.py │ ├── test_dbv0036_qos_limits_max_jobs_per.py │ ├── test_dbv0036_qos_limits_max_tres.py │ ├── test_dbv0036_qos_limits_max_tres_minutes.py │ ├── test_dbv0036_qos_limits_max_tres_minutes_per.py │ ├── test_dbv0036_qos_limits_max_tres_per.py │ ├── test_dbv0036_qos_limits_max_wall_clock.py │ ├── test_dbv0036_qos_limits_max_wall_clock_per.py │ ├── test_dbv0036_qos_limits_min.py │ ├── test_dbv0036_qos_limits_min_tres.py │ ├── test_dbv0036_qos_limits_min_tres_per.py │ ├── test_dbv0036_qos_preempt.py │ ├── test_dbv0036_response_account_delete.py │ ├── test_dbv0036_response_association_delete.py │ ├── test_dbv0036_response_cluster_add.py │ ├── test_dbv0036_response_cluster_delete.py │ ├── test_dbv0036_response_qos_delete.py │ ├── test_dbv0036_response_tres.py │ ├── test_dbv0036_response_user_delete.py │ ├── test_dbv0036_response_user_update.py │ ├── test_dbv0036_response_wckey_add.py │ ├── test_dbv0036_response_wckey_delete.py │ ├── test_dbv0036_tres_info.py │ ├── test_dbv0036_user.py │ ├── test_dbv0036_user_associations.py │ ├── test_dbv0036_user_default.py │ ├── test_dbv0036_user_info.py │ ├── test_dbv0036_wckey.py │ ├── test_dbv0036_wckey_info.py │ ├── test_openapi_api.py │ └── test_slurm_api.py ├── test-requirements.txt └── tox.ini
Created attachment 21520 [details] OpenAPI file after patching
We realized again that the openapi.json file we just uploaded was incorrect (it was from the build source). With the correct openapi.json, downloaded with:

> export $(scontrol token lifespan=99999); curl -v -s -H X-SLURM-USER-NAME:$(whoami) -H X-SLURM-USER-TOKEN:$SLURM_JWT https://REST_HOST/openapi > openapi.json

we run the script, but it hangs. Using curl to connect directly to the API also hangs.

Example curl:

> export $(scontrol token lifespan=99999); curl -X 'GET' 'https://REST_HOST/slurmdb/v0.0.36/job/5868759' -H 'accept: application/json' -H X-SLURM-USER-NAME:$(whoami) -H X-SLURM-USER-TOKEN:$SLURM_JWT
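For reference, the same authenticated request can be issued from Python without the generated client. Below is a minimal stdlib-only sketch; REST_HOST is a placeholder as in the curl commands above, and the helper names (`slurm_headers`, `get_job`) are ours, not part of the generated library:

```python
import json
import urllib.request

def slurm_headers(user, token):
    # Auth headers slurmrestd expects. HTTP header names are case-insensitive,
    # so a proxy lower-casing them should not by itself be an error.
    return {
        "Accept": "application/json",
        "X-SLURM-USER-NAME": user,
        "X-SLURM-USER-TOKEN": token,
    }

def get_job(base_url, job_id, user, token, timeout=30):
    # GET /slurmdb/v0.0.36/job/{job_id} and decode the JSON body.
    req = urllib.request.Request(
        f"{base_url}/slurmdb/v0.0.36/job/{job_id}",
        headers=slurm_headers(user, token),
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())

# Usage (placeholders):
# job = get_job("https://REST_HOST", 5868759, "myuser", my_jwt_token)
```

The explicit `timeout` avoids the indefinite hang seen with curl; a stalled slurmrestd then surfaces as an exception instead of a blocked script.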
(In reply to Bill Britt from comment #98)
> Using curl to connect directly to the API also hangs:

Please attach the slurmrestd log.
Created attachment 21537 [details] slurmrestd logs
Please place the curl headers inside quotes. Somehow they are getting converted to lower case:
> Sep 30 01:45:39 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:50580] Header: x-slurm-user-name Value: sadm_alin4
> Sep 30 01:45:39 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:50580] Header: x-slurm-user-token Value:

Please attach the slurmctld log afterwards.
Using this command:

> export $(scontrol token lifespan=99999); curl -X 'GET' 'https://REST_HOST/slurmdb/v0.0.36/job/5868759' -H 'accept: application/json' -H "X-SLURM-USER-NAME":"$(whoami)" -H "X-SLURM-USER-TOKEN":"$SLURM_JWT"

This is the output of the logs:

Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug: parse_http: [[localhost]:54330] Accepted HTTP connection
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug: _on_url: [[localhost]:54330] url path: /slurmdb/v0.0.36/job/5868759 query: (null)
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: Host Value: api-dev.cluster.ihme.washington.edu
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: X-Real-Ip Value: 10.158.154.56
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: X-Forwarded-For Value: 10.158.154.56
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: X-Frame-Options Value: SAMEORIGIN
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: X-Forwarded-Port Value: 443
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: Connection Value: upgrade
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: error: _on_header_value: [[localhost]:54330] ignoring unsupported header request: upgrade
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: user-agent Value: curl/7.68.0
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: accept Value: application/json
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: x-slurm-user-name Value: sadm_bbritt
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: _on_header_value: [[localhost]:54330] Header: x-slurm-user-token Value: ***REMOVED***
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: operations_router: [[localhost]:54330] GET /slurmdb/v0.0.36/job/5868759
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: No jobstep requested
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug2: No jobarray or hetjob requested
Sep 30 17:11:04 gen-slurm-sapi-d01 slurmrestd[2659091]: debug: accounting_storage/slurmdbd: _connect_dbd_conn: Sent PersistInit msg

The curl call just hangs.
Please verify that Slurm is otherwise healthy:
> srun uptime
> sacct

If those work, please use gdb to get a backtrace and attach it:
> gdb -ex 't a a bt full' -ex 'quit' -p $(pgrep slurmrestd)
Both srun and sacct work fine:

# srun -p all.q -A general -c 1 --mem 128 uptime
 18:04:28 up 49 days, 12:17, 0 users, load average: 0.12, 0.12, 0.09

# sacct
JobID        JobName    Partition  Account    AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
5868783      uptime     all.q      general    1          COMPLETED      0:0
5868783.ext+ extern                general    1          COMPLETED      0:0
5868783.0    uptime                general    1          COMPLETED      0:0
(In reply to Nate Rini from comment #103)
> If those work, please use gdb to get a backtrace and attach:
> > gdb -ex 't a a bt full' -ex 'quit' -p $(pgrep slurmrestd)

Please attach the backtrace.
Created attachment 21546 [details] backtrace
Please attach the slurmdbd log. Looks like slurmrestd is waiting on an RPC reply from slurmdbd.
Update: a sample curl call eventually did return with an error:

# export $(scontrol token lifespan=99999); curl -X 'GET' 'https://api-dev.cluster.ihme.washington.edu/slurmdb/v0.0.36/job/5868759' -H 'accept: application/json' -H "X-SLURM-USER-NAME":"$(whoami)" -H "X-SLURM-USER-TOKEN":"$SLURM_JWT"
{
  "meta": {
    "plugin": {
      "type": "openapi\/dbv0.0.36",
      "name": "REST DB v0.0.36"
    },
    "Slurm": {
      "version": {
        "major": 20,
        "micro": 8,
        "minor": 11
      },
      "release": "20.11.8"
    }
  },
  "errors": [
    {
      "description": "Unknown error with query",
      "error_number": 9000,
      "error": "Query empty or not RFC7320 compliant",
      "source": "slurmdb_jobs_get"
    }
  ],
  "jobs": [
  ]
}

sacct of the same job:

# sacct -j 5868759
JobID        JobName    Partition  Account    AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
5868759      hostname   all.q      general    8          COMPLETED      0:0
5868759.ext+ extern                general    8          COMPLETED      0:0
5868759.0    hostname              general    8          COMPLETED      0:0
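When scripting against this route, it is worth checking the errors array before trusting the jobs list, since slurmrestd returns HTTP 200 with an empty jobs array in this failure mode. A small sketch against the field names visible in the output above (the helper name is ours):

```python
def check_slurm_response(payload):
    """Raise if slurmrestd reported errors; otherwise return the job list.

    Field names ('errors', 'error', 'error_number', 'source', 'jobs')
    match the dbv0.0.36 output shown above.
    """
    errors = payload.get("errors") or []
    if errors:
        details = "; ".join(
            f"{e.get('source')}: {e.get('error')} (#{e.get('error_number')})"
            for e in errors
        )
        raise RuntimeError(f"slurmrestd query failed: {details}")
    return payload.get("jobs", [])
```

Fed the response above, this raises with "slurmdb_jobs_get: Query empty or not RFC7320 compliant (#9000)" instead of silently yielding zero jobs.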
Created attachment 21555 [details] slurmdbd log
(In reply to Bill Britt from comment #109)
> Update: a sample curl call eventually did return with an error:
> "description": "Unknown error with query",
> "error_number": 9000,
> "error": "Query empty or not RFC7320 compliant",
> "source": "slurmdb_jobs_get"

Please attach slurmdbd and slurmrestd log after this error posted.
(In reply to Nate Rini from comment #111)
> (In reply to Bill Britt from comment #109)
> > Update: a sample curl call eventually did return with an error:
> > "description": "Unknown error with query",
> > "error_number": 9000,
> > "error": "Query empty or not RFC7320 compliant",
> > "source": "slurmdb_jobs_get"
>
> Please attach slurmdbd and slurmrestd log after this error posted.

Please attach the slurmrestd log after this error is posted.
Created attachment 21576 [details] slurmrestd log curl error

Sorry for the delay, here is the slurmrestd log for the curl error.
(In reply to Ali Nikkhah from comment #113)
> Created attachment 21576 [details]
> slurmrestd log curl error
>
> Sorry for the delay, here is the slurmrestd log for the curl error.

Happy to work on your timeline here.
(In reply to Nate Rini from comment #114)
> (In reply to Ali Nikkhah from comment #113)
> > Created attachment 21576 [details]
> > slurmrestd log curl error
> >
> > Sorry for the delay, here is the slurmrestd log for the curl error.
>
> Happy to work on your timeline here.

Please attach the slurmdbd logs around Sep 30 21:48:15.
Created attachment 21581 [details] slurmdbd log around curl time

This is the full slurmdbd log from around that time. There is not much to it.
(In reply to Ali Nikkhah from comment #116)
> Created attachment 21581 [details]
> slurmdbd log around curl time
>
> This is the full slurmdbd log from around that time. There is not much to it.

We will have to increase the logging level for slurmdbd in slurmdbd.conf:
> DebugLevel=debug3
> DebugFlags=DB_QUERY,DB_JOB,network,protocol

Slurmdbd will need to be restarted to activate the higher logging level. Please run the query at least a couple of times, then revert the logging changes and restart slurmdbd. Please then upload the logs from slurmdbd and slurmrestd.
Created attachment 21593 [details] slurmdbd debug3 and debugflags log
Created attachment 21594 [details] slurmrestd debug3 log
> debug: _conn_readable: poll for fd 9 timeout after 900000 msecs of total wait 900000 msecs.
> error: Getting response to message type: DBD_GET_JOBS_COND

Looks like slurmrestd is unable to contact slurmdbd and is timing out. Does calling 'sacct' on the same node as slurmrestd work? Does it work when setting 'SLURM_JWT' in the env?
(In reply to Nate Rini from comment #120)
> > debug: _conn_readable: poll for fd 9 timeout after 900000 msecs of total wait 900000 msecs.
> > error: Getting response to message type: DBD_GET_JOBS_COND
>
> Looks like slurmrestd is unable to contact slurmdbd and is timing out. Does
> calling 'sacct' on the same node as slurmrestd work? Does it work when
> setting 'SLURM_JWT' in the env?

Calling 'sacct' from the slurmrestd node works with and without SLURM_JWT set in the environment. The results are identical.
Please attach strace to slurmrestd and attach the resultant logs after a test request:
> strace -p $(grep slurmrestd) -o /tmp/strace.slurmrestd -s999 -tt
(In reply to Nate Rini from comment #122)
> Please attach strace to slurmrestd and attach the resultant logs after a
> test request:

Slight typo:
> strace -p $(pgrep slurmrestd) -o /tmp/strace.slurmrestd -s999 -tt
Created attachment 21610 [details] slurmrestd strace
(In reply to Ali Nikkhah from comment #124)
> Created attachment 21610 [details]
> slurmrestd strace

Was strace run for the duration of a curl request? The attached log is of an idle slurmrestd.
(In reply to Nate Rini from comment #125)
> (In reply to Ali Nikkhah from comment #124)
> > Created attachment 21610 [details]
> > slurmrestd strace
>
> Was strace run for the duration of a curl request? The attached log is of an
> idle slurmrestd.

Yes, it was. Looking closer at this, it seems that there is something else wrong with the particular cluster we were doing the latest debugging on. Bill will provide a further update; it looks like we may be good now.
Created attachment 21681 [details] j18261_job_output_of_slurmdbd_get_job

We have applied the patch for the missing TRES values, and the TRES values now show up; thank you. There are 3 questions along those lines. Please refer to the 2 files attached today:

1. For the cmd output file j18261_job_output_of_sacct, the 4K value for MaxRSS should show up in j18261_job_output_of_slurmdbd_get_job's total tres section (the section almost at the EOF), but we only see 1234. Shouldn't 4K (4096) show up at the total level?

2. We see "allocated" and "requested" in tres at the top level. Shouldn't "consumed" also be populated at the top level, with 'mem' populated there?

3. We can see "time" values at the top level (just above total tres), but they are 0s. However, if we look into each step's "time", they have good >0 values there. Why wasn't "time" totaled at the top level?
Created attachment 21682 [details] j18261_job_output_of_sacct
(In reply to Sam Hu from comment #127)
> 1. For the cmd output file j18261_job_output_of_sacct, the 4K value for
> MaxRSS should show up in the j18261_job_output_of_slurmdbd_get_job's total
> tres section(the section almost to the EOF); but we only see 1234. Shouldn't
> 4K(4096) show up at the total level?

It does for the step:
> 203 'total': [
> 207 {'count': 4096,
> 208 'id': 2,
> 209 'name': None,
> 210 'type': 'mem'},

> 2. We see "allocated" and "requested" in tres at the top level. Shouldn't
> "consumed" be populated at the top level, and 'mem' gets populated there?
> 3. We can see "time" values at the top level(just above total tres), but
> they are 0s. However if we look into each step's "time", they have good >0
> values there.

slurmrestd does not currently sum the usage of the steps to produce a whole-job usage the way sacct does. If this is desired functionality, please submit an RFE; the expectation was that sites could easily add this up as they wished, since slurmrestd provides all of the data.

> Why wasn't "time" totaled at the top level?

In most cases slurmrestd is dumping the values provided by slurmctld or slurmdbd. In this case, the suspended, system, total and user times are in the RPC but not populated (aka zeroes). We may just remove them from the job tree since they are provided for each step directly.
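The site-side aggregation suggested above could be sketched as follows. This assumes the per-step layout shown in the quoted attachment (steps carrying a tres bucket whose 'total' is a list of {'type', 'name', 'id', 'count'} entries); note that a plain sum is only appropriate for additive counters — for a max-type metric like MaxRSS a site might take max() across steps instead:

```python
from collections import defaultdict

def sum_step_tres(job, bucket="consumed"):
    # Sum one TRES bucket (e.g. 'consumed') across all steps of a job record.
    # Assumes job['steps'][i]['tres'][bucket]['total'] is a list of
    # {'type', 'name', 'id', 'count'} dicts, as in the attached output.
    totals = defaultdict(int)
    for step in job.get("steps", []):
        for tres in step.get("tres", {}).get(bucket, {}).get("total", []):
            totals[(tres["type"], tres.get("name"))] += tres["count"]
    return dict(totals)
```

For the attachment's example, the mem entries (4096 and 1234) would be combined into one (type, name) keyed total per job.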
The corrective patches are now upstream for the upcoming 21.08.3 release:
> https://github.com/SchedMD/slurm/commit/7595e0f5409d8308471874f488197bf24403294e
> https://github.com/SchedMD/slurm/commit/b87275e4807d1c8681fc4ae5792cd2a84a47b410
Closing as the patches are now upstream. Please respond if there are any more questions.