Ticket 24049

Summary: sacctmgr hangs in a logrotate cron job
Product: Slurm Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: User Commands    Assignee: Ricard Zarco Badia <ricard>
Status: OPEN --- QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: ricard
Version: 25.05.4   
Hardware: Linux   
OS: Linux   
Site: DTU Physics Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Ole.H.Nielsen@fysik.dtu.dk 2025-11-05 01:37:00 MST
Besides daily database dumps, we have also started dumping the associations etc. using "sacctmgr dump clustername" to have complete backups.  A cron job can do this with an entry:

15 7 * * * /usr/bin/sacctmgr --quiet dump niflheim file=/var/log/slurm/niflheim.cfg 2>/dev/null

(Note: We need to redirect stderr due to the issue in bug 24010.)  Thanks to the 25.05 fix in bug 22951, sacctmgr now works in a cron job.

We would like to run the sacctmgr command from a logrotate script in order to record the associations history, so I created this logrotate script:

$ cat /etc/logrotate.d/slurm_assoc_backup
/var/log/slurm/niflheim.cfg {
    daily
    dateext
    dateyesterday
    rotate 8
    nocompress
    missingok
    create 640 slurm slurm
    postrotate
      # Dump Slurm association data for cluster "niflheim"
      /usr/bin/yes | /usr/bin/sacctmgr dump niflheim file=/var/log/slurm/niflheim.cfg 2>/dev/null
    endscript
}

I have tried many variants of the "postrotate" script, adding "yes" to provide stdin.  In all my attempts the crond daemon hangs during execution of logrotate:

$ systemctl status crond
● crond.service - Command Scheduler
   Loaded: loaded (/usr/lib/systemd/system/crond.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2025-11-04 11:53:24 CET; 20h ago
 Main PID: 45527 (crond)
    Tasks: 10 (limit: 1643732)
   Memory: 499.4M
   CGroup: /system.slice/crond.service
           ├─45527 /usr/sbin/crond -n
           ├─50265 /usr/sbin/anacron -s
           ├─50318 /bin/bash /bin/run-parts /etc/cron.daily
           ├─50706 /bin/sh /etc/cron.daily/logrotate
           ├─50707 sed 1i\ /etc/cron.daily/logrotate:\
           ├─50708 /usr/sbin/logrotate /etc/logrotate.conf
           ├─50719 sh -c       # Dump Slurm association data for cluster "niflheim"       /usr/bin/yes | /usr/bin/sacctmgr dump niflheim file=/var/log/slurm/niflheim.cfg 2>/dev/null logrotate_script /var/log/slurm/niflheim.cfg
           ├─50720 /usr/bin/yes
           └─50721 /usr/bin/sacctmgr dump niflheim file=/var/log/slurm/niflheim.cfg

For some reason the sacctmgr command hangs and must be killed manually.  Note that a regular cron job works as expected.  It would seem that a logrotate cron job behaves differently and isn't covered by the fix in bug 22951.

Can you suggest a workaround and perhaps also a fix?

Thanks,
Ole
Comment 1 Ricard Zarco Badia 2025-11-05 08:23:06 MST
Hello Ole,

I have been trying to reproduce this via raw logrotate execution (without cron) and I have not been able to get the same behavior. I still have to check whether there is any difference when executing it from a cron context.

Just to clarify, does this happen every time? Also, I assume that you have tried that same sacctmgr without the /usr/bin/yes pipe and the behavior stays the same, right?

Best regards, Ricard.
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2025-11-05 12:05:56 MST
Hi Ricard,

(In reply to Ricard Zarco Badia from comment #1)
> I have been trying to reproduce this via raw logrotate execution (without
> cron) and I have not been able to get the same behavior. I have to check if
> there is any difference when executing it from a cron context yet.

The logrotate scripts always run from the cron service on EL Linux.

Running the logrotate command from the CLI seems to work correctly:

# logrotate  -v /etc/logrotate.d/slurm_assoc_backup
reading config file /etc/logrotate.d/slurm_assoc_backup
Reading state from file: /var/lib/logrotate/logrotate.status
Allocating hash table for state file, size 64 entries
Creating new state
  (19 more identical "Creating new state" lines omitted)

Handling 1 logs

rotating pattern: /var/log/slurm/niflheim.cfg  after 1 days (8 rotations)
empty log files are rotated, old logs are removed
considering log /var/log/slurm/niflheim.cfg
  Now: 2025-11-05 19:59
  Last rotated at 2025-11-05 03:15
  log does not need rotating (log has been rotated at 2025-11-5 3:15, that is not day ago yet)
set default create context to system_u:object_r:logrotate_var_lib_t:s0


> Just to clarify, does this happen every time? Also, I assume that you have
> tried that same sacctmgr without the /usr/bin/yes pipe and the behavior
> stays the same, right?

I think I tried all combinations.  I'll try again without the "yes" command during the coming night's logrotate cron job.

Best regards,
Ole
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2025-11-06 00:13:15 MST
Hi Ricard,

I changed the /etc/logrotate.d/slurm_assoc_backup file to remove the "yes" command:

/var/log/slurm/niflheim.cfg {
    daily
    dateext
    dateyesterday
    rotate 8
    nocompress
    missingok
    create 640 slurm slurm
    postrotate
      # Dump Slurm association data for cluster "niflheim"
      /usr/bin/sacctmgr dump niflheim file=/var/log/slurm/niflheim.cfg 2>/dev/null
    endscript
}

When the crond logrotate runs during the night, the sacctmgr command is still hanging:

# systemctl status crond
● crond.service - Command Scheduler
   Loaded: loaded (/usr/lib/systemd/system/crond.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2025-11-04 11:53:24 CET; 1 day 20h ago
 Main PID: 45527 (crond)
    Tasks: 9 (limit: 1643732)
   Memory: 996.5M
   CGroup: /system.slice/crond.service
           ├─45527 /usr/sbin/crond -n
           ├─56957 /usr/sbin/anacron -s
           ├─57032 /bin/bash /bin/run-parts /etc/cron.daily
           ├─57420 /bin/sh /etc/cron.daily/logrotate
           ├─57421 sed 1i\ /etc/cron.daily/logrotate:\
           ├─57422 /usr/sbin/logrotate /etc/logrotate.conf
           ├─57434 sh -c       # Dump Slurm association data for cluster "niflheim"       /usr/bin/sacctmgr dump niflheim file=/var/log/slurm/niflheim.cfg 2>/dev/null logrotate_script /var/log/slurm/niflheim.cfg
           └─57435 /usr/bin/sacctmgr dump niflheim file=/var/log/slurm/niflheim.cfg

When I kill the PID 57435 the job ends, and crond sends an error message mail:

/etc/cron.daily/logrotate:

logrotate_script: line 2: 57435 Terminated              /usr/bin/sacctmgr dump niflheim file=/var/log/slurm/niflheim.cfg 2> /dev/null
error: error running non-shared postrotate script for /var/log/slurm/niflheim.cfg of '/var/log/slurm/niflheim.cfg '

I don't know why sacctmgr refuses to run in the crond/logrotate environment.

Best regards,
Ole
Comment 4 Ricard Zarco Badia 2025-11-06 04:37:18 MST
Hello,

I have been trying to reproduce this within a cron job and I am still not able to get the same behavior. However, my testing rig runs Ubuntu and I do not have the exact same cron service. I added an extra sleep so I could inspect the process tree and compare it:

● cron.service - Regular background program processing daemon
     Loaded: loaded (/usr/lib/systemd/system/cron.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-10-13 09:07:09 CEST; 3 weeks 3 days ago
       Docs: man:cron(8)
   Main PID: 1698 (cron)
      Tasks: 6 (limit: 18728)
     Memory: 2.8M ()
     CGroup: /system.slice/cron.service
             ├─   1698 /usr/sbin/cron -f -P
             ├─1155555 /usr/sbin/CRON -f -P
             ├─1155556 /bin/sh -c "logrotate -f /etc/logrotate.d/t24049"
             ├─1155557 logrotate -f /etc/logrotate.d/t24049
             ├─1155558 sh -c "\n      sleep 60\n      # Dump Slurm association data for cluster \"niflheim\"\n      /usr/bi>
             └─1155559 sleep 60
 
Seeing that I am not able to reproduce this at the moment, could you try the following?

1 - Set the time for your logrotate cron to another time during your workday. Try any hour + X minutes, to avoid running it during any other periodic rollovers or similar operations.

2 - Check if you still get the stuck sacctmgr once that cron runs. If you do not, the issue might be related to some unfortunate timing of your daily cron. If it gets stuck, then I will ask you to get the backtrace of that process:

>> gdb -p <PID_of_sacctmgr>
>> //Once inside GDB:
>> t a a bt full

That should at least tell us the general nature of this hang.

Best regards, Ricard.
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2025-11-07 00:52:30 MST
Hi Ricard,

(In reply to Ricard Zarco Badia from comment #4)
> I have been trying to reproduce this within a cron job and I am still not
> able to get the same behavior. However, my testing rig runs Ubuntu and I do
> not have the exact same cron service. I put an extra sleep to see the
> process tree and compare it:

Probably logrotate differs between Rocky 8 and Ubuntu.  In EL8 the cron hourly script /etc/cron.hourly/0anacron (as installed by the cronie-anacron package) starts the anacron task.
The file /etc/anacrontab starts all tasks in the /etc/cron.daily/ folder, which runs /etc/cron.daily/logrotate and executes the command /usr/sbin/logrotate /etc/logrotate.conf

The anacron process tree runs the sacctmgr command which is hanging:

  |-anacron -s
  |   `-run-parts /bin/run-parts /etc/cron.daily
  |       |-logrotate /etc/cron.daily/logrotate
  |       |   `-logrotate /etc/logrotate.conf
  |       |       `-sh -c...
  |       |           `-sacctmgr dump niflheim file=/var/log/slurm/niflheim.cfg
  |       |               `-(sacctmgr)
  |       `-sed 1i\\\012/etc/cron.daily/logrotate:\\\012


> Seeing that I am not able to reproduce this at the moment, could you try the
> following?
> 
> 1 - Set the time for your logrotate cron to another time during your
> workday. Try any hour + X minutes, to avoid running it during any other
> periodic rollovers or similar operations.

If possible, I would like to avoid reconfiguring anacron because it runs many other tasks.  I hope we can wait for the automatic overnight logrotate job.

> 2 - Check if you still get the stuck sacctmgr once that cron runs. If you do
> not, the issue might be related to some unfortunate timing of your daily
> cron. If it gets stuck, then I will ask you to get the backtrace of that
> process:
> 
> >> gdb -p <PID_of_sacctmgr>
> >> //Once inside GDB:
> >> t a a bt full
> 
> That should at least tell us the general nature of this hang.

That's a good idea!

$ gdb -p 69568
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-20.el8
(lines deleted)

(gdb) t a a bt full

Thread 1 (Thread 0x7f9864fa23c0 (LWP 69568)):
#0  0x00007f98642b3722 in read () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f98649c1734 in read (__nbytes=4, __buf=0x7fff6da5f63c, __fd=<optimized out>) at /usr/include/bits/unistd.h:36
No locals.
#2  _fetch_parent (pid=69569) at fetch_config.c:84
        remaining = 4
        ptr = 0x7fff6da5f63c ""
        rc = <optimized out>
        len = 0
        buffer = <optimized out>
        config = 0x0
        status = 0
        len = <optimized out>
        buffer = <optimized out>
        config = <optimized out>
        status = <optimized out>
        __func__ = "_fetch_parent"
        remaining = <optimized out>
        ptr = <optimized out>
        rc = <optimized out>
        remaining = <optimized out>
        ptr = <optimized out>
        rc = <optimized out>
#3  fetch_config (conf_server=conf_server@entry=0x0, flags=flags@entry=0, sackd_port=sackd_port@entry=0, ca_cert_file=ca_cert_file@entry=0x0) at fetch_config.c:291
        env_conf_server = <optimized out>
        controllers = 0x0
        pid = 69569
        sack_jwks = 0x0
        sack_key = 0x0
        statbuf = {st_dev = 0, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {
            tv_sec = 0, tv_nsec = 0}, __glibc_reserved = {0, 0, 0}}
        __func__ = "fetch_config"
#4  0x00007f9864a0fd3d in _establish_config_source (memfd=<synthetic pointer>, config_file=0x7fff6da5f748) at read_config.c:3077
        stat_buf = {st_dev = 0, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {
            tv_sec = 0, tv_nsec = 0}, __glibc_reserved = {0, 0, 0}}
        config_tmp = <optimized out>
        config = 0x0
        stat_buf = <optimized out>
        config_tmp = <optimized out>
        config = <optimized out>
        __func__ = "_establish_config_source"
#5  slurm_conf_init (file_name=file_name@entry=0x0) at read_config.c:3208
        config_file = 0x0
        memfd = false
        __func__ = "slurm_conf_init"
#6  0x00007f986498187c in slurm_init (conf=conf@entry=0x0) at init.c:49
No locals.
#7  0x0000000000427afe in main (argc=4, argv=0x7fff6da5f9b8) at sacctmgr.c:129
        error_code = 0
        opt_char = <optimized out>
        opts = {stderr_level = LOG_LEVEL_INFO, syslog_level = <optimized out>, logfile_level = <optimized out>, prefix_level = <optimized out>, buffered = <optimized out>, raw = <optimized out>, logfile_fmt = <optimized out>}
        local_exit_code = 0
        option_index = 32664
        persist_conn_flags = 0
        long_options = {{name = 0x437aeb "autocomplete", has_arg = 1, flag = 0x0, val = 256}, {name = 0x4378ce "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x43264c "usage", has_arg = 0, flag = 0x0, val = 104}, {name = 0x437af8 "immediate", 
            has_arg = 0, flag = 0x0, val = 105}, {name = 0x437b02 "noheader", has_arg = 0, flag = 0x0, val = 110}, {name = 0x4378d8 "oneliner", has_arg = 0, flag = 0x0, val = 111}, {name = 0x437b0b "parsable", has_arg = 0, flag = 0x0, val = 112}, {
            name = 0x437b14 "parsable2", has_arg = 0, flag = 0x0, val = 80}, {name = 0x4378e1 "quiet", has_arg = 0, flag = 0x0, val = 81}, {name = 0x437a3a "readonly", has_arg = 0, flag = 0x0, val = 114}, {name = 0x4378bc "associations", has_arg = 0, 
            flag = 0x0, val = 115}, {name = 0x4379ff "verbose", has_arg = 0, flag = 0x0, val = 118}, {name = 0x437a6a "version", has_arg = 0, flag = 0x0, val = 86}, {name = 0x437a9c "json", has_arg = 2, flag = 0x0, val = 257}, {name = 0x437aaf "yaml", 
            has_arg = 2, flag = 0x0, val = 258}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        __func__ = "main"
(gdb) 

If I read this correctly, the hang occurs beneath the slurm_init() function, at fetch_config.c:84:

        safe_read(to_parent[0], &len, sizeof(int));

Maybe this is related to our use of Configless Slurm?  The /etc/slurm folder doesn't contain any slurm.conf since the server is only running slurmdbd:

$ ls -la /etc/slurm/
total 24
drwxr-xr-x.   2 root  root     27 Oct  6 19:37 .
drwxr-xr-x. 112 root  root  16384 Nov  7 08:30 ..
-rw-------.   1 slurm slurm   504 Feb 28  2023 slurmdbd.conf

If this is the culprit, then I don't understand how "sacctmgr dump" can work from the command line as well as in a simple crontab job, but apparently fails in an anacron job.  The Configless Slurm should provide the slurm.conf file from the slurmctld server.

Do you have any ideas?

Thanks,
Ole
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2025-11-07 01:04:54 MST
Note added: This is our Configless server DNS record:

$ dig +short +search +ndots=2 -t SRV -n _slurmctld._tcp
0 0 6817 que2.fysik.dtu.dk.
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2025-11-07 01:36:54 MST
Regarding anacron, I found a useful page about it: https://linuxconfig.org/how-to-run-commands-periodically-with-anacron-on-linux
Comment 8 Ole.H.Nielsen@fysik.dtu.dk 2025-11-07 01:54:26 MST
I've tracked down the issue to SELinux!  The logrotate environment is more restricted than the CLI and cron job environments.

SELinux enforces that logrotate may only create files in the /var/log/ folder and below. If logrotate tries to create files in other locations, it will get permission-denied errors, which appear in /var/log/audit/audit.log. See the logrotate_selinux manual page https://linux.die.net/man/8/logrotate_selinux and this Red Hat solution: https://access.redhat.com/solutions/39006
I had actually written about this in my Wiki page at https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_database/#backup-script-with-logrotate

We're getting these SELinux audit events from sacctmgr:

$ ausearch -i | grep sacctmgr
(lines deleted)
type=PROCTITLE msg=audit(11/07/2025 03:45:57.784:19328) : proctitle=/usr/bin/sacctmgr dump niflheim file=/var/log/slurm/niflheim.cfg 
type=SYSCALL msg=audit(11/07/2025 03:45:57.784:19328) : arch=x86_64 syscall=write success=no exit=EACCES(Permission denied) a0=0x5 a1=0x24435e0 a2=0x4a a3=0x0 items=0 ppid=69568 pid=69569 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=sacctmgr exe=/usr/bin/sacctmgr subj=system_u:system_r:logrotate_t:s0-s0:c0.c1023 key=(null) 
type=AVC msg=audit(11/07/2025 03:45:57.784:19328) : avc:  denied  { write } for  pid=69569 comm=sacctmgr path=/memfd:slurm.conf (deleted) dev="tmpfs" ino=4291579 scontext=system_u:system_r:logrotate_t:s0-s0:c0.c1023 tcontext=system_u:object_r:tmpfs_t:s0 tclass=file permissive=0 

The audit record shows sacctmgr attempting to write a file that isn't below the /var/log/ folder: "syscall=write success=no exit=EACCES(Permission denied)".

Specifically, sacctmgr attempts to write its local in-memory copy of the slurm.conf file: path=/memfd:slurm.conf (deleted)
I found a memfd manual page at https://man7.org/linux/man-pages/man2/memfd_create.2.html

Now I don't know what the cleanest and most maintainable approach for solving this SELinux issue would be.

Best regards,
Ole
Comment 9 Ole.H.Nielsen@fysik.dtu.dk 2025-11-07 02:40:28 MST
I propose that sacctmgr check for errors (e.g. due to SELinux) when creating the path=/memfd:slurm.conf file in memory, print an error message and exit.  This seems relevant for any tool running under logrotate!

My resolution of this issue is to remove sacctmgr from the logrotate script altogether! I will create "sacctmgr dump" files using a regular cron job, and then rotate the dump files using logrotate.

Examples are shown in my Wiki page at https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_database/#backup-and-restore-of-slurm-associations

Best regards,
Ole
Comment 10 Ole.H.Nielsen@fysik.dtu.dk 2025-11-07 03:18:14 MST
Note added: The SELinux restrictions apply only when SELinux is configured as enforcing:

$ getenforce 
Enforcing
Comment 11 Ricard Zarco Badia 2025-11-07 04:50:31 MST
Hello Ole,

I came back to this ticket thinking that I had a full investigation ahead of me and I see that you already found the culprit! Glad to know that you noticed it was SELinux, I did not know that you were using it.

I agree that we should be catching this, but the backtrace reveals something weird. Note that the process has been stuck in a 4-byte read syscall against a pipe for ages. To provide more context: in that specific part of the code we fork a child process, the child being the one actually responsible for fetching the config.

Something must have happened in the child process, causing it to never write to the pipe connecting parent and child. The parent then gets stuck trying to read from that pipe. Right now this is all I can say, since we do not know the state of that child process.

I will take a look at the code and try to see where that child process could potentially be hanging. Knowing that the issue is SELinux here, I will probably be able to reproduce this if needed. I will let you know what I find out or if I need something else from your side.

Best regards, Ricard.
Comment 12 Ricard Zarco Badia 2025-11-07 08:08:30 MST
Hello Ole,

I have finished localizing the issue. SELinux is not technically the problem; it is just what triggers the condition for sacctmgr to hang. I managed to force the same behavior you were getting by (very liberally) modifying the code and reaching the same codepath you did, no SELinux involved. The fix is simple, and I will be proposing it for review shortly.

If you are curious about what was happening... it was a classic oversight of pipe management. In my last comment I described the communication between the parent and child processes via a pipe. I thought that the child process was stuck somewhere, but it probably just died before writing anything (which by itself is not the problem; we treat that case as an error).

That read should simply return 0 if no one is on the other side. Well, it turns out that both ends of the pipe were open in both processes. So the parent tries to read from the pipe, the child dies before writing anything, but the parent stays stuck on that read because there is technically another possible writer... itself (gasp in horror).

This can be confirmed once the hang happens by checking the open file descriptors of the parent process:

>> $ lsof | grep sacctmgr
>> ...
>> sacctmgr  1443029       rzarco    3r     FIFO      0,14       0t0  6517184 pipe
>> sacctmgr  1443029       rzarco    4w     FIFO      0,14       0t0  6517184 pipe

And that was it. With the fix, it should not hang anymore. I will let you know once it passes the QA process and gets merged.

Best regards, Ricard.
Comment 14 Ole.H.Nielsen@fysik.dtu.dk 2025-11-07 11:46:59 MST
Hi Ricard,

Thanks for your wonderful debugging of the issue!  This has indeed been a complex problem, going from the error symptoms all the way to your final analysis and bug fix.  Congratulations!

Even after your fix has landed, I suspect that SELinux (Enforcing) is going to cause the same problem as in comment 8 when sacctmgr tries to write to the file path=/memfd:slurm.conf.

Can you confirm that the write may still get rejected by SELinux?  As suggested in comment 9, such an error should be detected and handled by sacctmgr.

(In reply to Ricard Zarco Badia from comment #12)
> I think I have finished localizing the issue. SELinux is not technically the
> problem, it is just the thing that triggers the condition for sacctmgr to
> hang. I managed to force the same behavior you were getting by (very
> liberally) modifying the code and getting to the same codepath you were, no
> SELinux involved. The fix is simple, I will be proposing it for review
> shortly.
> 
> If you are curious about what was happening... it was just a classic
> oversight of pipe management. In my last comment, I was talking about this
> communication between parent and child process via pipe. I thought that the
> child process would be stuck somewhere, but it probably just died before
> writing anything (which is not strictly the problem, we treat that case as
> an error).
> 
> That read should just return 0 if no one was on the other side. Well, it
> turns out that both ends of the pipe were open in both processes. So the
> parent tries to read from that pipe, the child dies before writing anything,
> but the parent is still stuck on that read because there is technically
> another possible writer... himself (gasp in horror).

Best regards,
Ole