Ticket 7648

Summary: CoreSpecCount breaks slurmctld with FastSchedule=0 and enabled cons_res
Product: Slurm    Reporter: Taras Shapovalov <taras.shapovalov>
Component: slurmctld    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID    QA Contact: ---
Severity: 6 - No support contract
Priority: ---
Version: 19.05.2   
Hardware: Linux   
OS: Linux   
Site: -Other-
Attachments: Configuration file

Description Taras Shapovalov 2019-08-28 09:33:45 MDT
Created attachment 11388 [details]
Configuration file

When CoreSpecCount=1 is set for some node, FastSchedule=0, and SelectType=select/cons_res with SelectTypeParameters=CR_CPU, slurmctld crashes:
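The attachment itself is not reproduced in this ticket, but from the settings named above, the relevant portion of the attached slurm.conf would presumably look something like the following sketch (the node and partition names here are illustrative placeholders, not taken from the actual attachment):

```
# Sketch of the reported configuration -- not the actual attachment.
# NodeName/PartitionName values are assumed placeholders.
ClusterName=slurm_cluster
SlurmctldHost=ts-tr1
FastSchedule=0
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=node001 CoreSpecCount=1 State=UNKNOWN
PartitionName=defq Nodes=node001 Default=YES State=UP
```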

[root@ts-tr1 ~]# slurmctld -D -v
slurmctld: slurmctld version 19.05.2 started on cluster slurm_cluster
slurmctld: Munge credential signature plugin loaded
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 1
slurmctld: select/cons_tres loaded with argument 1
slurmctld: Linear node selection plugin loaded with argument 1
slurmctld: preempt/none loaded
slurmctld: ExtSensors NONE plugin loaded
slurmctld: Accounting storage SLURMDBD plugin loaded
slurmctld: slurmdbd: recovered 0 pending RPCs
slurmctld: No memory enforcing mechanism configured.
slurmctld: layouts: no layout to initialize
slurmctld: topology NONE plugin loaded
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: route default plugin loaded
slurmctld: layouts: loading entities/relations information
slurmctld: Recovered state of 1 nodes
slurmctld: Recovered information about 0 jobs
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: _preserve_plugins: backup_controller not specified
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Running as primary controller
slurmctld: Registering slurmctld at port 6817 with slurmdbd.
slurmctld: No parameter for mcs plugin, default values set
slurmctld: mcs: MCSParameters = (null). ondemand set.
slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed.
Aborted
[root@ts-tr1 ~]#
Comment 2 Jacob Jenson 2019-08-28 10:16:58 MDT
Taras,

We can't duplicate this issue on our systems with these settings, so something else must be contributing to it.

If Bright would like to start including Slurm support as part of its offering with the cluster manager then SchedMD would be able to allocate engineering time to help with these types of issues as well as help optimize the Slurm configurations for each client. Please let me know if there is an opportunity for our businesses to work together. 

Thanks,
Jacob
Comment 3 Taras Shapovalov 2019-08-29 03:32:54 MDT
Hi Jacob,

Thank you for attempting to reproduce it. Can you confirm you used 19.05.2 on CentOS 7 together with the slurm.conf that I attached (with modified paths, perhaps)?


Best regards,

Taras