When we set the following in slurm.conf:

    MaxMemPerCPU=2048

and run the following command:

    srun --mem-per-cpu=4G hostname

we get the following on the command line:

    srun: job 27801 queued and waiting for resources
    srun: job 27801 has been allocated resources
    srun: Job step creation temporarily disabled, retrying

With slurmctld and slurmd both running with '-vvv' we see the following in the log files:

slurmctld.log:

    [2013-06-14T14:42:53-05:00] debug2: select_p_job_test for job 27801
    [2013-06-14T14:42:53-05:00] debug2: got 1 threads to send out
    [2013-06-14T14:42:53-05:00] debug2: _adjust_limit_usage: job 27801: MPC: job_memory set to 4096
    [2013-06-14T14:42:53-05:00] debug2: Tree head got back 0 looking for 3
    [2013-06-14T14:42:53-05:00] sched: Allocate JobId=27801 NodeList=n32 #CPUs=2
    [2013-06-14T14:42:53-05:00] debug2: Spawning RPC agent for msg_type 4002
    [2013-06-14T14:42:53-05:00] debug2: Performing full system state save
    [2013-06-14T14:42:53-05:00] debug2: got 1 threads to send out
    [2013-06-14T14:42:53-05:00] debug2: Tree head got back 1
    [2013-06-14T14:42:53-05:00] debug2: Tree head got back 2
    [2013-06-14T14:42:53-05:00] debug2: Tree head got back 3
    [2013-06-14T14:42:53-05:00] debug2: Tree head got them all
    [2013-06-14T14:42:53-05:00] debug2: _slurm_rpc_job_ready(27801)=3 usec=6
    [2013-06-14T14:42:53-05:00] debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=0
    [2013-06-14T14:42:53-05:00] debug: Configuration for job 27801 complete
    [2013-06-14T14:42:53-05:00] _slurm_rpc_job_step_create for job 27801: Requested nodes are busy
    [2013-06-14T14:42:53-05:00] debug2: node_did_resp n31
    [2013-06-14T14:42:53-05:00] debug2: node_did_resp n33
    [2013-06-14T14:42:53-05:00] debug2: node_did_resp n32
    [2013-06-14T14:42:53-05:00] debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=0
    [2013-06-14T14:42:53-05:00] debug: Configuration for job 27801 complete
    [2013-06-14T14:42:53-05:00] _slurm_rpc_job_step_create for job 27801: Requested nodes are busy
    [2013-06-14T14:42:54-05:00] debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=0
    [2013-06-14T14:42:54-05:00] debug: Configuration for job 27801 complete
    [2013-06-14T14:42:54-05:00] _slurm_rpc_job_step_create for job 27801: Requested nodes are busy
    [2013-06-14T14:42:54-05:00] debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=0
    [2013-06-14T14:42:54-05:00] debug: Configuration for job 27801 complete
    [2013-06-14T14:42:54-05:00] _slurm_rpc_job_step_create for job 27801: Requested nodes are busy
    [2013-06-14T14:42:55-05:00] debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=0
    [2013-06-14T14:42:55-05:00] debug: Configuration for job 27801 complete
    [2013-06-14T14:42:55-05:00] _slurm_rpc_job_step_create for job 27801: Requested nodes are busy
    [2013-06-14T14:42:57-05:00] debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=0
    [2013-06-14T14:42:57-05:00] debug: Configuration for job 27801 complete
    [2013-06-14T14:42:57-05:00] _slurm_rpc_job_step_create for job 27801: Requested nodes are busy
    [2013-06-14T14:43:02-05:00] debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=0
    [2013-06-14T14:43:02-05:00] debug: Configuration for job 27801 complete
    [2013-06-14T14:43:02-05:00] _slurm_rpc_job_step_create for job 27801: Requested nodes are busy

slurmd.log:

    [2013-06-14T14:42:53-05:00] debug2: got this type of message 1011
    [2013-06-14T14:42:53-05:00] debug2: Processing RPC: REQUEST_HEALTH_CHECK
    [2013-06-14T14:42:53-05:00] debug: attempting to run health_check [/srv/slurm/sbin/healthcheck.sh]

It looks as though the problem lies solely with slurmctld, as slurmd never seems to receive any request for the job. I would expect the job submission to simply be rejected.
Could you attach your slurm.conf configuration file? It is also helpful to record the specific Slurm version in the trouble ticket; I believe it is v2.5.7 in your case.
Created attachment 312 [details]
Slurm.conf with MaxMemPerCPU commented out

Here is the config with the MaxMemPerCPU commented out. I get the same behaviour with 2.5.6 and 2.5.7.

Chris
Created attachment 315 [details]
Disable setting implicit value of a job's cpus_per_task value

This removes logic added three years ago that would automatically set a job's cpus_per_task value in order to reset a job's mem_per_cpu value, scaling cpus_per_task by the same factor. Equivalent logic did not exist in the step allocation logic. The code now just returns an error instead.

This change will be made in Slurm version 2.6, but this patch is made for version 2.5.

The original patch introducing the problem is commit cc00cc70b9c90816afc511e0261e449857176332.

This fix is commit e3b7c2be4393d921679f3e0cddcb9ca7943fb1f6.
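For anyone following along, the implicit adjustment being removed can be sketched roughly as follows. This is an illustrative Python sketch of the scaling arithmetic only, not the actual Slurm C code; the function name and exact rounding are my assumptions:

```python
def scale_job_memory(mem_per_cpu, cpus_per_task, max_mem_per_cpu):
    """Sketch of the removed job-allocation logic: if the requested
    memory per CPU exceeds MaxMemPerCPU, grow cpus_per_task so each
    CPU stays under the limit, and shrink mem_per_cpu by the same
    factor. After the patch, Slurm rejects the job instead."""
    if mem_per_cpu <= max_mem_per_cpu:
        return mem_per_cpu, cpus_per_task
    factor = -(-mem_per_cpu // max_mem_per_cpu)  # ceiling division
    return mem_per_cpu // factor, cpus_per_task * factor

# Matches the log above: 4G per CPU against MaxMemPerCPU=2048
# yields an allocation with #CPUs=2.
print(scale_job_memory(4096, 1, 2048))  # (2048, 2)
```

Because only the job allocation performed this scaling, the step request still asked for 4G per CPU and could never fit the allocation, hence the endless "Requested nodes are busy" retries.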
See attached patch
Thanks, tested in our dev environment, confirmed fixed.