I know we have no support contract, but I want to make you aware of a possible deadlock in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5. When we submit a job array of more than a few (e.g. 16) jobs that start at the same time on one node, only one or two of them start; the others hang.

I was able to avoid the problem by replacing

slurm_mutex_lock(&dummy_lock);
slurm_cond_wait(&conf->prolog_running_cond, &dummy_lock);
slurm_mutex_unlock(&dummy_lock);

with

sleep(1);

I know this is not a real solution, but it works until this is fixed. My guess is that conf->prolog_running_cond is signaled while _prolog_is_running(job_id) has not finished yet, so the wakeup is lost. I think slurm_cond_wait should be used with a real mutex, not a dummy one.

Best regards,
Thomas
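To illustrate the suggestion above, here is a minimal sketch in plain pthreads of waiting on a condition with a real mutex that also protects the flag being tested. It is not the slurmd code and not the actual fix: prolog_lock, prolog_cond, prolog_running, wait_for_prolog, finish_prolog and run_demo are hypothetical names chosen for this example. Because the flag is checked and the wait is entered under the same lock that the signaling side holds, the broadcast cannot slip in between the check and the wait.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_JOBS 64

/* Hypothetical stand-ins for the daemon's state (not Slurm's real fields). */
static pthread_mutex_t prolog_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t prolog_cond = PTHREAD_COND_INITIALIZER;
static bool prolog_running = true;
static int jobs_started = 0;

/* Job side: test the flag and wait under the SAME mutex the signaler uses. */
static void *wait_for_prolog(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&prolog_lock);
	while (prolog_running)	/* loop also guards against spurious wakeups */
		pthread_cond_wait(&prolog_cond, &prolog_lock);
	jobs_started++;
	pthread_mutex_unlock(&prolog_lock);
	return NULL;
}

/* Prolog side: change the flag and broadcast while holding the mutex. */
static void *finish_prolog(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&prolog_lock);
	prolog_running = false;
	pthread_cond_broadcast(&prolog_cond);
	pthread_mutex_unlock(&prolog_lock);
	return NULL;
}

/* Start n_jobs waiters plus one prolog thread; return how many jobs ran. */
int run_demo(int n_jobs)
{
	pthread_t jobs[MAX_JOBS];
	pthread_t prolog;

	assert(n_jobs >= 0 && n_jobs <= MAX_JOBS);

	/* Reset the shared state so the demo can be run more than once. */
	pthread_mutex_lock(&prolog_lock);
	prolog_running = true;
	jobs_started = 0;
	pthread_mutex_unlock(&prolog_lock);

	for (int i = 0; i < n_jobs; i++)
		pthread_create(&jobs[i], NULL, wait_for_prolog, NULL);
	pthread_create(&prolog, NULL, finish_prolog, NULL);

	pthread_join(prolog, NULL);
	for (int i = 0; i < n_jobs; i++)
		pthread_join(jobs[i], NULL);

	return jobs_started;
}
```

With this pattern, calling run_demo(16) lets all 16 waiters proceed regardless of how the threads are scheduled. With a per-thread dummy mutex, the check of the flag and the wait are not atomic with respect to the broadcast, so a waiter can block after the only wakeup has already been sent.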
Hi,

Thanks for your report. As you can see, we solved this in commit:
https://github.com/SchedMD/slurm/commit/b40bd8d35ef851

Dominik

*** This ticket has been marked as a duplicate of ticket 3977 ***