| Summary: | slurmctld crashes in _rm_job_from_res after upgrade from 14.11 to 15.08 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Gianluca Castellani <gianluca.castellani> |
| Component: | slurmctld | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | alex |
| Version: | 15.08.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | KAUST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Gianluca Castellani
2016-04-12 23:45:43 MDT
Hi Gianluca,

We think this was fixed in 15.08.10:

> Fixed backfill scheduler race condition that could cause invalid pointer in
> select/cons_res plugin. Bug introduced in 15.08.9, commit efd9d35.
The scenario is as follows:

1. The backfill scheduler is running, then releases its locks.
2. The main scheduling loop starts a job "A".
3. The backfill scheduler resumes, finds job "A" in its queue, and resets its partition pointer.
4. Job "A" completes and tries to remove its resource allocation record from the select/cons_res data structures, but fails to find it because it is looking in the table for the wrong partition.
5. Job "A"'s record gets purged from slurmctld.
6. The select/cons_res plugin attempts to operate on the resource allocation data structure, follows a pointer into the now-purged record of job "A", and aborts or gets a SEGV.
Commit here:
https://github.com/SchedMD/slurm/commit/d8b18ff8dc6f2237db3729f4ab085b003e54c89e
Could you please try to upgrade to 15.08.10? Thanks.
(In reply to Gianluca Castellani from comment #0)
> Hello, we did a Slurm upgrade and now slurmctld crashes from time to time.

Note: this problem is a duplicate of bug 2603.

> I am also getting a flood of:
> error: slurm_receive_msg: Zero Bytes were transmitted or received
>
> Do you think that these two are related?

These are probably not related. Are the "Zero Bytes" error messages spread out through time, or do they all arrive at the same time? Do they happen immediately after restarting the slurmctld daemon?

Hey Moe, the RPC errors arrived continuously (even after a slurmctld restart). Yesterday we were able to trace these messages to several nodes where munge was down (this was a bit surprising and I had overlooked it). After we had munge running on all the nodes and restarted Slurm, we no longer saw any "error: slurm_receive_msg: Zero Bytes were transmitted or received" messages. We have now upgraded to 15.08.10 on the master and slave node (thanks Alejandro). So far so good.

Gianluca, you might consider configuring a node health check program. Among other things, there are implementations that take care of services such as munge being available on the nodes and take action accordingly, such as setting the node state to DOWN if the service is not running. Please let us know if we can close this bug or if you have any more questions.

Gianluca, closing this bug. Please reopen if needed. Thanks.

I have been a bit busy this week. Anyway, no slurmctld crashes. Best.
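For reference, the node health check suggestion maps onto the `HealthCheckProgram` and `HealthCheckInterval` parameters in slurm.conf. A minimal sketch, assuming the LBNL Node Health Check (NHC) tool is installed at the path shown (the path and the nhc.conf rule are assumptions, not from this thread):

```
# slurm.conf (fragment): run the health check script every 5 minutes
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300

# /etc/nhc/nhc.conf (fragment): mark the node bad if munged is not running
# * || check_ps_service -S -u munge munged
```

With a setup like this, slurmd invokes the check on each node and NHC can drain or down a node where munge has died, instead of the controller surfacing "Zero Bytes were transmitted or received" errors.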