| Summary: | CoreSpecCount breaks slurmctld with FastSchedule=0 and cons_res enabled | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Taras Shapovalov <taras.shapovalov> |
| Component: | slurmctld | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | | |
| Version: | 19.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Configuration file | | |
Created attachment 11388 [details]
Configuration file

When CoreSpecCount=1 is set for some node, together with FastSchedule=0, SelectType=select/cons_res, and SelectTypeParameters=CR_CPU, slurmctld crashes:

```
[root@ts-tr1 ~]# slurmctld -D -v
slurmctld: slurmctld version 19.05.2 started on cluster slurm_cluster
slurmctld: Munge credential signature plugin loaded
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 1
slurmctld: select/cons_tres loaded with argument 1
slurmctld: Linear node selection plugin loaded with argument 1
slurmctld: preempt/none loaded
slurmctld: ExtSensors NONE plugin loaded
slurmctld: Accounting storage SLURMDBD plugin loaded
slurmctld: slurmdbd: recovered 0 pending RPCs
slurmctld: No memory enforcing mechanism configured.
slurmctld: layouts: no layout to initialize
slurmctld: topology NONE plugin loaded
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: route default plugin loaded
slurmctld: layouts: loading entities/relations information
slurmctld: Recovered state of 1 nodes
slurmctld: Recovered information about 0 jobs
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: _preserve_plugins: backup_controller not specified
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Running as primary controller
slurmctld: Registering slurmctld at port 6817 with slurmdbd.
slurmctld: No parameter for mcs plugin, default values set
slurmctld: mcs: MCSParameters = (null). ondemand set.
slurmctld: bitstring.c:292: bit_nclear: Assertion `(start) < ((b)[1])' failed.
Aborted
[root@ts-tr1 ~]#
```

---

Taras,

We can't duplicate this issue on our systems with these settings, so something else must be contributing to it. If Bright would like to start including Slurm support as part of its offering with the cluster manager, then SchedMD would be able to allocate engineering time to help with these types of issues, as well as to help optimize the Slurm configurations for each client. Please let me know if there is an opportunity for our businesses to work together.

Thanks,
Jacob

---

Hi Jacob,

Thank you for attempting to reproduce this. Can you confirm that you used 19.05.2 on CentOS 7, together with the slurm.conf that I attached (with modified paths, perhaps)?

Best regards,
Taras
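For reference, a minimal slurm.conf fragment combining the settings named in the report. Only the four highlighted options (FastSchedule, SelectType, SelectTypeParameters, CoreSpecCount) come from the report itself; the hostnames, node name, and partition name are placeholders, and the full attached configuration may differ:

```
# Hypothetical minimal fragment; only the settings marked "from report"
# are taken from the bug description, everything else is a placeholder.
ClusterName=slurm_cluster
SlurmctldHost=ts-tr1
FastSchedule=0                      # from report
SelectType=select/cons_res          # from report
SelectTypeParameters=CR_CPU         # from report
NodeName=node001 CoreSpecCount=1 State=UNKNOWN   # CoreSpecCount=1 from report
PartitionName=defq Nodes=node001 Default=YES MaxTime=INFINITE State=UP
```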