Slurm version: 14.11.8

Hi,

While creating a reservation, slurmctld stopped:

service slurm status
slurmctld is stopped
slurmctld is stopped

It stops again immediately after every restart, so our cluster is not usable right now. Here is the output from /var/log/slurm/slurmcontrol.log:

[2018-01-24T03:16:02.494] restoring original state of nodes
[2018-01-24T03:16:02.494] restoring original partition state
[2018-01-24T03:16:02.495] cons_res: select_p_node_init
[2018-01-24T03:16:02.496] cons_res: preparing for 4 partitions
[2018-01-24T03:16:02.500] init_requeue_policy: kill_invalid_depend is set to 0
[2018-01-24T03:16:02.595] cons_res: select_p_reconfigure
[2018-01-24T03:16:02.595] cons_res: select_p_node_init
[2018-01-24T03:16:02.595] cons_res: preparing for 4 partitions
[2018-01-24T03:16:04.260] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
[2018-01-24T03:16:04.260] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:04.261] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:04.261] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:04.262] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:04.263] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:14.273] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:14.274] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:14.275] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:14.275] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:14.276] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:19.888] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:19.889] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:19.889] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
...
[2018-01-24T18:27:06.441] Recovered JobID=7019931 State=0x1 NodeCnt=0 Assoc=1212
[2018-01-24T18:27:06.441] Recovered JobID=7019932 State=0x1 NodeCnt=0 Assoc=1212
[2018-01-24T18:27:06.441] Recovered JobID=7019933 State=0x0 NodeCnt=0 Assoc=1212
[2018-01-24T18:27:06.441] Recovered information about 299 jobs
[2018-01-24T18:27:06.442] cons_res: select_p_node_init
[2018-01-24T18:27:06.442] cons_res: preparing for 4 partitions
[2018-01-24T18:27:06.449] init_requeue_policy: kill_invalid_depend is set to 0

We would appreciate your help.

Thank you,
-Sanjaya
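For context, the repeated error above is slurmctld comparing the core bitmap stored in the saved reservation state against the cluster's current total core count, and discarding the core reservation on a mismatch. The sketch below is an illustrative Python model of that consistency check, not Slurm source; the function name and return-value convention are assumptions:

```python
def core_bitmap_matches(saved_bitmap_bits: int, current_total_cores: int) -> bool:
    """Hypothetical model of the check behind the 'Bad core_bitmap size'
    error: a reservation restored from the state save files is only
    honored if its core bitmap has exactly one bit per core that the
    cluster currently has."""
    return saved_bitmap_bits == current_total_cores

# The reservation was saved when the cluster had 2236 cores, but only
# 2044 cores exist at restart, so the core reservation is ignored:
print(core_bitmap_matches(2236, 2044))  # False
```

This is why the error repeats on every scheduling pass: the stale reservation state never matches the shrunken cluster until the state is cleared or the reservation deleted.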
That's a really old version of Slurm. Our support model requires customers to stay within the last two major releases (currently 17.02 or 17.11), so you should upgrade as soon as possible. Will you attach your slurm.conf and the full logs? Will you also start slurmctld at a high debug level (debug3) and attach those logs? Are there any core files? If so, will you give us the backtraces from them?
Were you able to get the cluster back up?
Hi Brian,

We had to issue the command "service slurm startclean" to bring the cluster back, though it killed the running jobs. This is our old cluster (RedCat) and utilization was low. The issue seems to be the result of a reservation made earlier, when the cluster had more cores (2236) than it did when the reservation actually started (2044 cores), after we transitioned 10 nodes to our new cluster (Rider).

Thank you,
-Sanjaya
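To make the mismatch in the log concrete, the two numbers come straight from the error message; the per-node core count is not stated in the thread, so none is assumed here:

```python
cores_when_reserved = 2236  # cluster size when the reservation was created
cores_when_started = 2044   # cluster size when the reservation took effect
removed = cores_when_reserved - cores_when_started
print(removed)  # 192 cores left with the 10 nodes transitioned to Rider
```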
Hey Sanjaya,

Glad to hear that you're back up. I'm not able to reproduce the crash in 17.11. Let us know if you have any other questions.

Thanks,
Brian