The process limit is not propagated properly when we run the program below: the stack limit is unlimited on some nodes, but 1073741824 on others. We set PropagateResourceLimitsExcept=CPU. We use Slurm 2.6.2 and Intel MPI 4.1.1.

*Job Script
=========================
#!/bin/sh
#SBATCH -p test
#SBATCH -N 11
#SBATCH -n 220
#SBATCH -J test
#SBATCH -o stdout.%J
#SBATCH -e stderr.%J

PROG=./test.out

export I_MPI_DEBUG=5
export I_MPI_DEBUG_OUTPUT=debug.${SLURM_JOB_ID}
export SLURM_TASKS_PER_NODE='1,4(x10)'

mpirun -np 41 $PROG
=========================

*Program
=========================
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    int res;
    struct rlimit rlim;

    getrlimit(RLIMIT_STACK, &rlim);
    printf("process %2d out of %2d on %s, (0) Before: cur=%d,hard=%d\n",
           rank, numprocs, processor_name,
           (int)rlim.rlim_cur, (int)rlim.rlim_max);

    rlim.rlim_cur = RLIM_INFINITY;
    rlim.rlim_max = RLIM_INFINITY;
    res = setrlimit(RLIMIT_STACK, &rlim);

    getrlimit(RLIMIT_STACK, &rlim);
    printf("process %2d out of %2d on %s, (1) After: res=%d,cur=%d,hard=%d\n",
           rank, numprocs, processor_name, res,
           (int)rlim.rlim_cur, (int)rlim.rlim_max);

    MPI_Finalize();
}
=========================

*Output
=========================
Hello from process 0 out of 41 on e004, After: res=0,cur=-1,hard=-1
Hello from process 0 out of 41 on e004, Before: cur=-1,hard=-1
Hello from process 1 out of 41 on e005, After: res=0,cur=-1,hard=-1
Hello from process 1 out of 41 on e005, Before: cur=-1,hard=-1
Hello from process 2 out of 41 on e005, After: res=0,cur=-1,hard=-1
Hello from process 2 out of 41 on e005, Before: cur=-1,hard=-1
Hello from process 3 out of 41 on e005, After: res=0,cur=-1,hard=-1
Hello from process 3 out of 41 on e005, Before: cur=-1,hard=-1
Hello from process 4 out of 41 on e005, After: res=0,cur=-1,hard=-1
Hello from process 4 out of 41 on e005, Before: cur=-1,hard=-1
Hello from process 5 out of 41 on e006, After: res=0,cur=-1,hard=-1
Hello from process 5 out of 41 on e006, Before: cur=-1,hard=-1
Hello from process 6 out of 41 on e006, After: res=0,cur=-1,hard=-1
Hello from process 6 out of 41 on e006, Before: cur=-1,hard=-1
Hello from process 7 out of 41 on e006, After: res=0,cur=-1,hard=-1
Hello from process 7 out of 41 on e006, Before: cur=-1,hard=-1
Hello from process 8 out of 41 on e006, After: res=0,cur=-1,hard=-1
Hello from process 8 out of 41 on e006, Before: cur=-1,hard=-1
Hello from process 9 out of 41 on e007, After: res=0,cur=-1,hard=-1
Hello from process 9 out of 41 on e007, Before: cur=-1,hard=-1
Hello from process 10 out of 41 on e007, After: res=0,cur=-1,hard=-1
Hello from process 10 out of 41 on e007, Before: cur=-1,hard=-1
Hello from process 11 out of 41 on e007, After: res=0,cur=-1,hard=-1
Hello from process 11 out of 41 on e007, Before: cur=-1,hard=-1
Hello from process 12 out of 41 on e007, After: res=0,cur=-1,hard=-1
Hello from process 12 out of 41 on e007, Before: cur=-1,hard=-1
Hello from process 13 out of 41 on e008, After: res=0,cur=-1,hard=-1
Hello from process 13 out of 41 on e008, Before: cur=-1,hard=-1
Hello from process 14 out of 41 on e008, After: res=0,cur=-1,hard=-1
Hello from process 14 out of 41 on e008, Before: cur=-1,hard=-1
Hello from process 15 out of 41 on e008, After: res=0,cur=-1,hard=-1
Hello from process 15 out of 41 on e008, Before: cur=-1,hard=-1
Hello from process 16 out of 41 on e008, After: res=0,cur=-1,hard=-1
Hello from process 16 out of 41 on e008, Before: cur=-1,hard=-1
Hello from process 17 out of 41 on e009, After: res=0,cur=-1,hard=-1
Hello from process 17 out of 41 on e009, Before: cur=-1,hard=-1
Hello from process 18 out of 41 on e009, After: res=0,cur=-1,hard=-1
Hello from process 18 out of 41 on e009, Before: cur=-1,hard=-1
Hello from process 19 out of 41 on e009, After: res=0,cur=-1,hard=-1
Hello from process 19 out of 41 on e009, Before: cur=-1,hard=-1
Hello from process 20 out of 41 on e009, After: res=0,cur=-1,hard=-1
Hello from process 20 out of 41 on e009, Before: cur=-1,hard=-1
Hello from process 21 out of 41 on e010, After: res=0,cur=-1,hard=-1
Hello from process 21 out of 41 on e010, Before: cur=-1,hard=-1
Hello from process 22 out of 41 on e010, After: res=0,cur=-1,hard=-1
Hello from process 22 out of 41 on e010, Before: cur=-1,hard=-1
Hello from process 23 out of 41 on e010, After: res=0,cur=-1,hard=-1
Hello from process 23 out of 41 on e010, Before: cur=-1,hard=-1
Hello from process 24 out of 41 on e010, After: res=0,cur=-1,hard=-1
Hello from process 24 out of 41 on e010, Before: cur=-1,hard=-1
Hello from process 25 out of 41 on e011, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 25 out of 41 on e011, Before: cur=1073741824,hard=1073741824
Hello from process 26 out of 41 on e011, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 26 out of 41 on e011, Before: cur=1073741824,hard=1073741824
Hello from process 27 out of 41 on e011, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 27 out of 41 on e011, Before: cur=1073741824,hard=1073741824
Hello from process 28 out of 41 on e011, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 28 out of 41 on e011, Before: cur=1073741824,hard=1073741824
Hello from process 29 out of 41 on e012, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 29 out of 41 on e012, Before: cur=1073741824,hard=1073741824
Hello from process 30 out of 41 on e012, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 30 out of 41 on e012, Before: cur=1073741824,hard=1073741824
Hello from process 31 out of 41 on e012, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 31 out of 41 on e012, Before: cur=1073741824,hard=1073741824
Hello from process 32 out of 41 on e012, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 32 out of 41 on e012, Before: cur=1073741824,hard=1073741824
Hello from process 33 out of 41 on e013, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 33 out of 41 on e013, Before: cur=1073741824,hard=1073741824
Hello from process 34 out of 41 on e013, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 34 out of 41 on e013, Before: cur=1073741824,hard=1073741824
Hello from process 35 out of 41 on e013, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 35 out of 41 on e013, Before: cur=1073741824,hard=1073741824
Hello from process 36 out of 41 on e013, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 36 out of 41 on e013, Before: cur=1073741824,hard=1073741824
Hello from process 37 out of 41 on e014, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 37 out of 41 on e014, Before: cur=1073741824,hard=1073741824
Hello from process 38 out of 41 on e014, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 38 out of 41 on e014, Before: cur=1073741824,hard=1073741824
Hello from process 39 out of 41 on e014, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 39 out of 41 on e014, Before: cur=1073741824,hard=1073741824
Hello from process 40 out of 41 on e014, After: res=-1,cur=1073741824,hard=1073741824
Hello from process 40 out of 41 on e014, Before: cur=1073741824,hard=1073741824
=========================
This should answer your question: http://slurm.schedmd.com/faq.html#rlimit

The Google search on our web site should answer most of your questions; it should be your first tool for getting questions answered.
-> After: res=-1,cur=1073741824,hard=1073741824

setrlimit() fails on some of your nodes (res = -1); printing the errno using perror() should help you understand why. Also, I would print the rlim values with %lu rather than casting them to int:

getrlimit(RLIMIT_STACK, &rl);
printf("max %lu curr %lu\n", rl.rlim_max, rl.rlim_cur);

David
I guess that setrlimit() fails to set the "unlimited" value on ranks >= 25 because the hard limit is already set to 1073741824. All nodes have the same limits.conf, so I believed that all nodes (and all ranks) would have the same limits, but in fact they do not. Why does this problem occur?
Perhaps the slurmd was started with different limits?
All nodes are booted from the same image and share the same slurm.conf, so I believe the limits are the same, but I will confirm. Should I check the limits of SlurmUser, or of root?
Check both, but root is the one that matters, since slurmstepd sets the limits as root. This is explained in the link my colleague sent this morning. David
The most common reasons for limits differing between nodes are:
1) different limit configurations on different nodes;
2) some slurmd daemons restarted manually by a user, so that the user's limits are inherited (even if sudo was used before starting the daemon).
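One way to check for cause 2 on a live node is to compare the limits a fresh shell inherits with the limits the running slurmd inherited when it was (re)started. A sketch, assuming a standard Linux /proc and that the daemon's process name is simply "slurmd":

```shell
# Limits inherited by the current shell (what a user session sees):
grep -i "stack" /proc/self/limits

# Limits the running slurmd inherited at (re)start time --
# uncomment on a compute node where slurmd is actually running:
# pid=$(pgrep -o slurmd)
# grep -i "stack" /proc/"$pid"/limits
```

If the slurmd line shows 1073741824 while a fresh root shell shows unlimited, the daemon was started under the restrictive limit.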
Information given. David
Some slurmd daemons had been restarted in an environment where the stack size was limited. Please close this bug.