Ticket 4680

Summary: Slurm won't restart in
Product: Slurm
Reporter: Sanjaya Gajurel <sxg125>
Component: slurmctld
Assignee: Brian Christiansen <brian>
Status: RESOLVED INFOGIVEN
Severity: 1 - System not usable
Version: - Unsupported Older Versions
Hardware: Linux
OS: Linux
Site: Case

Description Sanjaya Gajurel 2018-01-24 16:48:10 MST
slurm version - 14.11.8 

Hi,

While creating a reservation, the slurmctld stopped:

service slurm status
slurmctld is stopped
slurmctld is stopped

It stops immediately after being restarted, so our cluster is not usable right now.

Here is the relevant output from /var/log/slurm/slurmcontrol.log:

[2018-01-24T03:16:02.494] restoring original state of nodes
[2018-01-24T03:16:02.494] restoring original partition state
[2018-01-24T03:16:02.495] cons_res: select_p_node_init
[2018-01-24T03:16:02.496] cons_res: preparing for 4 partitions
[2018-01-24T03:16:02.500] init_requeue_policy: kill_invalid_depend is set to 0
[2018-01-24T03:16:02.595] cons_res: select_p_reconfigure
[2018-01-24T03:16:02.595] cons_res: select_p_node_init
[2018-01-24T03:16:02.595] cons_res: preparing for 4 partitions
[2018-01-24T03:16:04.260] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
[2018-01-24T03:16:04.260] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:04.261] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:04.261] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:04.262] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:04.263] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:14.273] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:14.274] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:14.275] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:14.275] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:14.276] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:19.888] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:19.889] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
[2018-01-24T03:16:19.889] error: Bad core_bitmap size for reservation (null) (2236 != 2044), ignoring core reservation
...
[2018-01-24T18:27:06.441] Recovered JobID=7019931 State=0x1 NodeCnt=0 Assoc=1212
[2018-01-24T18:27:06.441] Recovered JobID=7019932 State=0x1 NodeCnt=0 Assoc=1212
[2018-01-24T18:27:06.441] Recovered JobID=7019933 State=0x0 NodeCnt=0 Assoc=1212
[2018-01-24T18:27:06.441] Recovered information about 299 jobs
[2018-01-24T18:27:06.442] cons_res: select_p_node_init
[2018-01-24T18:27:06.442] cons_res: preparing for 4 partitions
[2018-01-24T18:27:06.449] init_requeue_policy: kill_invalid_depend is set to 0

We would appreciate your help.

Thank you,

-Sanjaya
Comment 1 Brian Christiansen 2018-01-24 17:07:22 MST
That's a really old version of Slurm. Our support model requires customers to stay within the last two major releases (currently 17.02 or 17.11). You should upgrade ASAP.

Would you attach your slurm.conf and the full logs?
Would you start the slurmctld at a high debug level (debug3) and attach that output as well?
Are there any core files? If so, please provide the backtraces from them.
Comment 2 Brian Christiansen 2018-01-25 09:12:17 MST
Were you able to get the cluster back up?
Comment 3 Sanjaya Gajurel 2018-01-25 10:01:26 MST
Hi Brian,

We had to issue "service slurm startclean" to bring the cluster back,
though it killed the running jobs. This is our old cluster (RedCat) and
utilization was low. The issue appears to stem from a reservation made
when the cluster had more cores (2236) than it did when the reservation
actually started (2044 cores), after we transitioned 10 nodes to our new
cluster (Rider).

Thank you,

-Sanjaya
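
The mismatch described above can be sketched as a simple size check: a saved reservation carries a core bitmap sized to the cluster's core count at the time it was written, and on restart slurmctld rejects it if the current core count differs. The snippet below is a minimal illustration with hypothetical names, not Slurm's actual code:

```python
def core_bitmap_valid(saved_bits: int, current_cores: int) -> bool:
    """Return True if a saved reservation's core bitmap still fits the cluster.

    Hypothetical sketch of the guard behind the "Bad core_bitmap size"
    message: the bitmap was sized when the reservation was saved, so any
    change in total core count makes it unusable.
    """
    if saved_bits != current_cores:
        print(f"error: Bad core_bitmap size for reservation "
              f"({saved_bits} != {current_cores}), ignoring core reservation")
        return False
    return True

# Reservation saved when the cluster had 2236 cores; after 10 nodes moved
# to the new cluster only 2044 remain, so the check fails on restart.
core_bitmap_valid(2236, 2044)
core_bitmap_valid(2044, 2044)
```

In this situation the stale reservation's core assignment is discarded rather than remapped, which is why deleting and recreating the reservation (or starting clean, as above) clears the errors.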
Comment 4 Brian Christiansen 2018-01-25 13:17:39 MST
Hey Sanjaya,

Glad to hear that you're back up. I'm not able to reproduce the crash in 17.11. Let us know if you have any other questions.

Thanks,
Brian