Ticket 7648 - CoreSpecCount breaks slurmctld with FastSchedule=0 and enabled cons_res
Summary: CoreSpecCount breaks slurmctld with FastSchedule=0 and enabled cons_res
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 19.05.2
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-08-28 09:33 MDT by Taras Shapovalov
Modified: 2019-08-29 03:32 MDT

See Also:
Site: -Other-
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Configuration file (2.48 KB, text/plain)
2019-08-28 09:33 MDT, Taras Shapovalov
Details

Description Taras Shapovalov 2019-08-28 09:33:45 MDT
Created attachment 11388 [details]
Configuration file

When CoreSpecCount=1 is set for some node, FastSchedule=0, and SelectType=select/cons_res with SelectTypeParameters=CR_CPU, slurmctld crashes:
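For reference, the settings described above combine in slurm.conf roughly as follows. This is a sketch, not the attached configuration; the node name, CPU count, and partition name are placeholders:

```
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
FastSchedule=0
NodeName=node001 CPUs=4 CoreSpecCount=1 State=UNKNOWN
PartitionName=defq Nodes=node001 Default=YES State=UP
```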

[root@ts-tr1 ~]# slurmctld -D -v
slurmctld: slurmctld version 19.05.2 started on cluster slurm_cluster
slurmctld: Munge credential signature plugin loaded
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 1
slurmctld: select/cons_tres loaded with argument 1
slurmctld: Linear node selection plugin loaded with argument 1
slurmctld: preempt/none loaded
slurmctld: ExtSensors NONE plugin loaded
slurmctld: Accounting storage SLURMDBD plugin loaded
slurmctld: slurmdbd: recovered 0 pending RPCs
slurmctld: No memory enforcing mechanism configured.
slurmctld: layouts: no layout to initialize
slurmctld: topology NONE plugin loaded
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: route default plugin loaded
slurmctld: layouts: loading entities/relations information
slurmctld: Recovered state of 1 nodes
slurmctld: Recovered information about 0 jobs
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: _preserve_plugins: backup_controller not specified
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Running as primary controller
slurmctld: Registering slurmctld at port 6817 with slurmdbd.
slurmctld: No parameter for mcs plugin, default values set
slurmctld: mcs: MCSParameters = (null). ondemand set.
slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed.
Aborted
[root@ts-tr1 ~]#
Comment 2 Jacob Jenson 2019-08-28 10:16:58 MDT
Taras,

We can't duplicate this issue on our systems with these settings, so something else must be contributing to it.

If Bright would like to start including Slurm support as part of its offering with the cluster manager then SchedMD would be able to allocate engineering time to help with these types of issues as well as help optimize the Slurm configurations for each client. Please let me know if there is an opportunity for our businesses to work together. 

Thanks,
Jacob
Comment 3 Taras Shapovalov 2019-08-29 03:32:54 MDT
Hi Jacob,

Thank you for attempting to reproduce it. Can you confirm you used 19.05.2 on CentOS 7 together with the slurm.conf that I attached (perhaps with modified paths)?


Best regards,

Taras