| Summary: | Erroneous "Requested node configuration is not available" | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Nicholas McCollum <nmccollum> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | bart |
| Version: | 15.08.10 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | ASC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Our slurm.conf, Our job_submit.lua | | |
Description

Nicholas McCollum, 2017-06-09 10:33:02 MDT

Created attachment 4738 [details]: Our job_submit.lua

Attached job_submit.lua.
Below is the gres.conf for the nodes in that reservation, dmc[1,4]:

```
[root@dmc1 ~]# cat /etc/slurm/gres.conf
name=gpu type=kepler file=/dev/nvidia0
name=gpu type=kepler file=/dev/nvidia1
name=gpu type=kepler file=/dev/nvidia2
name=gpu type=kepler file=/dev/nvidia3
```
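For context, gres.conf only maps GRES names to device files; the per-node GPU counts the scheduler checks come from the node definitions in slurm.conf. The attached slurm.conf is not reproduced in this ticket, so the following is only a sketch of what the matching entries could look like: the node names and the four kepler GPUs per node come from this ticket, while the CPUs and RealMemory figures are placeholder assumptions.

```
# Sketch of slurm.conf entries matching the gres.conf above.
# CPUs and RealMemory are placeholders, not the site's actual hardware.
GresTypes=gpu
NodeName=dmc[1,4] Gres=gpu:kepler:4 CPUs=16 RealMemory=64000 State=UNKNOWN
```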
I'm going to ask Dominik to take a look at this on Monday; he's much more versed in these allocation issues than I am. In the meantime, are you able to attach the slurmctld log from around when that job is submitted? There should be some extra hints in there as to what's going on that may make reproducing this a bit easier on our side. - Tim

Sure, this issue is easy to recreate. Are there any debug flags you would like me to include?

Hi, a normal log should be sufficient. - Dominik

Hi, do you have any updates? - Dominik

```
[2017-06-21T13:23:28.336] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=2573
[2017-06-21T13:23:28.341] debug3: JobDesc: user_id=2573 job_id=N/A partition=gpu_kepler name=ls_test
[2017-06-21T13:23:28.341] debug3: cpus=10-4294967294 pn_min_cpus=-1 core_spec=-1
[2017-06-21T13:23:28.341] debug3: Nodes=1-[1] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2017-06-21T13:23:28.341] debug3: pn_min_memory_cpu=1000 pn_min_tmp_disk=-1
[2017-06-21T13:23:28.341] debug3: immediate=0 features=(null) reservation=class_gpu
[2017-06-21T13:23:28.341] debug3: req_nodes=(null) exc_nodes=(null) gres=gpu:2
[2017-06-21T13:23:28.341] debug3: time_limit=60-60 priority=-1 contiguous=0 shared=-1
[2017-06-21T13:23:28.342] debug3: kill_on_node_fail=-1 script=#!/bin/sh # # script to do some checks o...
[2017-06-21T13:23:28.342] debug3: argv="/mnt/homeapps/home/asnnam/ls_test"
[2017-06-21T13:23:28.342] debug3: environment=REMOTEHOST=nyx.asc.edu,MANPATH=/apps/dmc/apps/lmod_rhel/lmod/lmod/share/man:/opt/asn/apps/lua_5.3.4/man::/usr/man,XDG_SESSION_ID=c222213,...
[2017-06-21T13:23:28.342] debug3: stdin=/dev/null stdout=(null) stderr=(null)
[2017-06-21T13:23:28.342] debug3: work_dir=/mnt/homeapps/home/asnnam alloc_node:sid=uv:74561
[2017-06-21T13:23:28.342] debug3: sicp_mode=0 power_flags=
[2017-06-21T13:23:28.342] debug3: resp_host=(null) alloc_resp_port=0 other_port=0
[2017-06-21T13:23:28.342] debug3: dependency=(null) account=class qos=class_gpu comment=(null)
[2017-06-21T13:23:28.342] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=10 open_mode=0 overcommit=-1 acctg_freq=(null)
[2017-06-21T13:23:28.342] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2017-06-21T13:23:28.342] debug3: end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
[2017-06-21T13:23:28.342] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2017-06-21T13:23:28.342] debug3: mem_bind=65534:(null) plane_size:65534
[2017-06-21T13:23:28.342] debug3: array_inx=(null)
[2017-06-21T13:23:28.342] debug3: burst_buffer=(null)
[2017-06-21T13:23:28.342] debug3: _find_assoc_rec: not the right user 2573 != 1469
[2017-06-21T13:23:28.342] debug3: found correct association
[2017-06-21T13:23:28.342] debug3: found correct qos
[2017-06-21T13:23:28.342] debug3: before alteration asking for nodes 1-1 cpus 10-4294967294
[2017-06-21T13:23:28.342] debug3: after alteration asking for nodes 1-1 cpus 10-4294967294
[2017-06-21T13:23:28.369] debug2: initial priority for job 99624 is 16058
[2017-06-21T13:23:28.369] job_test_resv: job:99624 reservation:class_gpu nodes:dmc[1,4]
[2017-06-21T13:23:28.369] debug2: found 1 usable nodes from config containing dmc1
[2017-06-21T13:23:28.369] debug2: found 1 usable nodes from config containing dmc4
[2017-06-21T13:23:28.369] job_test_resv: job:99624 reservation:class_gpu nodes:dmc[1,4]
[2017-06-21T13:23:28.369] debug3: _pick_best_nodes: job 99624 idle_nodes 6 share_nodes 56
[2017-06-21T13:23:28.369] debug2: select_p_job_test for job 99624
[2017-06-21T13:23:28.369] debug2: select_p_job_test for job 99624
[2017-06-21T13:23:28.369] debug2: select_p_job_test for job 99624
[2017-06-21T13:23:28.369] _pick_best_nodes: job 99624 never runnable
[2017-06-21T13:23:28.369] debug3: powercapping: checking job 99624 : skipped, not eligible
[2017-06-21T13:23:28.369] error: slurm_jobcomp plugin context not initialized
[2017-06-21T13:23:28.369] _slurm_rpc_submit_batch_job: Requested node configuration is not available
```

Hi, the problem is deeper than I thought. If possible, could you send me the slurmctld log with the "Select Type" debug flag enabled, along with your gres.conf? - Dominik

Hi, any news? - Dominik

Sorry for the delay; we have decided to bump up to Slurm 17 here in a week or so in order to get to the most current version. Feel free to close this ticket. If the problem persists in Slurm 17, I will re-open a ticket.
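For reference, the JobDesc fields in the log above (partition=gpu_kepler, reservation=class_gpu, gres=gpu:2, num_tasks=10, time_limit=60, pn_min_memory_cpu=1000, account=class, qos=class_gpu) correspond to a batch submission roughly like the sketch below; the script body is a stand-in, since the original ls_test script appears only truncated in the log.

```
#!/bin/sh
# Sketch of a submission matching the logged JobDesc; the echo line
# is a stand-in for the original (truncated) ls_test script body.
#SBATCH --job-name=ls_test
#SBATCH --partition=gpu_kepler
#SBATCH --reservation=class_gpu
#SBATCH --gres=gpu:2
#SBATCH --ntasks=10
#SBATCH --time=60
#SBATCH --mem-per-cpu=1000
#SBATCH --account=class
#SBATCH --qos=class_gpu
echo "stand-in for the site's ls_test checks"
```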
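The "Select Type" flag Dominik asks for is spelled SelectType among Slurm's DebugFlags. A minimal sketch of enabling it, assuming this Slurm version supports live toggling via scontrol (otherwise set it in slurm.conf and restart or reconfigure slurmctld):

```
# Persistent: add to slurm.conf, then restart/reconfigure slurmctld:
#   DebugFlags=SelectType
# Live toggle on the controller, no restart needed:
scontrol setdebugflags +SelectType
# ...reproduce the failing submission, then turn the flag back off:
scontrol setdebugflags -SelectType
```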