Hello, Slurm Support team!

I am Toru Matsuoka, an engineer at Cray Japan.

I have a question about the following. Our customer wants to use the sstat command, but the following error occurred.

■ sstat command

Note: the sstat command requires that the jobacct_gather plugin be installed and operational.

[root@mgmt2 slurm]# sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 27005
AveCPU AvePages AveRSS AveVMSize JobID
---------- ---------- ---------- ---------- ------------
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: slurm_job_step_stat: unknown return given from e035: 9001 rc = Communication connection failure
sstat: error: slurm_job_step_stat: unknown return given from e036: 9001 rc = Communication connection failure
sstat: error: slurm_job_step_stat: unknown return given from e034: 9001 rc = Communication connection failure
sstat: error: problem getting step_layout for 27005.0: Communication connection failure

■ sacct command

The sacct command appears to work.

[root@mgmt2 slurm]# sacct --j 27061
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
27061 prog4 ye016uta7+ root 40 RUNNING 0:0
27061.0 pmi_proxy root 2 RUNNING 0:0

In slurm.conf:

ProctrackType=proctrack/pgid
The JobAcctGatherType parameter is not set.

Is JobAcctGatherType required in slurm.conf, or is there some other cause for this problem?

Best regards,
Toru Matsuoka
Sorry, Bug 853 has the same content as Bug 852. Please close Bug 852, as it is a duplicate.
*** Ticket 852 has been marked as a duplicate of this ticket. ***
(In reply to toru matsuoka from comment #0)
> Hello, Slurm Support team!
>
> I am Toru Matsuoka, an engineer at Cray Japan.
>
> I have a question about the following. Our customer wants to use the sstat command, but the following error occurred.
>
> ■ sstat command
>
> Note: the sstat command requires that the jobacct_gather plugin be installed and operational.
>
> [root@mgmt2 slurm]# sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 27005
> AveCPU AvePages AveRSS AveVMSize JobID
> ---------- ---------- ---------- ---------- ------------
> sstat: error: Malformed RPC of type 5020 received
> sstat: error: slurm_receive_msgs: Header lengths are longer than data received
> sstat: error: Malformed RPC of type 5020 received
> sstat: error: slurm_receive_msgs: Header lengths are longer than data received
> sstat: error: Malformed RPC of type 5020 received
> sstat: error: slurm_receive_msgs: Header lengths are longer than data received
> sstat: error: slurm_job_step_stat: unknown return given from e035: 9001 rc = Communication connection failure
> sstat: error: slurm_job_step_stat: unknown return given from e036: 9001 rc = Communication connection failure
> sstat: error: slurm_job_step_stat: unknown return given from e034: 9001 rc = Communication connection failure
> sstat: error: problem getting step_layout for 27005.0: Communication connection failure

This is almost certainly due to differences in your slurm.conf file between nodes. Make sure that the slurm.conf file on the node where you execute sstat is identical to the one on the compute nodes where job 27005 is running. RPC type 5020 is RESPONSE_JOB_STEP_STAT from the compute node.

> ■ sacct command
>
> The sacct command appears to work.
>
> [root@mgmt2 slurm]# sacct --j 27061
> JobID JobName Partition Account AllocCPUS State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 27061 prog4 ye016uta7+ root 40 RUNNING 0:0
> 27061.0 pmi_proxy root 2 RUNNING 0:0
>
> In slurm.conf:
>
> ProctrackType=proctrack/pgid
> The JobAcctGatherType parameter is not set.
>
> Is JobAcctGatherType required in slurm.conf, or is there some other cause for this problem?

Without a plugin to collect accounting information (e.g. memory, CPU use, etc.), that information will be reported as zero by sstat.
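As a rough aid for the comparison suggested above, the following Python sketch lists the effective lines that differ between two copies of slurm.conf. It assumes the two files have already been read into strings (how the compute node's copy is fetched is not covered here). The function names `effective_lines` and `conf_differences` and the sample contents are hypothetical, and the parsing is simplified: every `#` is treated as the start of a comment.

```python
def effective_lines(conf_text):
    """Return the set of non-empty slurm.conf lines with comments removed.

    Simplification (assumption): everything after a '#' is a comment,
    and runs of whitespace are collapsed to a single space.
    """
    lines = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            lines.add(" ".join(line.split()))
    return lines


def conf_differences(local_text, remote_text):
    """Return (only_local, only_remote) as sorted lists of differing lines."""
    local = effective_lines(local_text)
    remote = effective_lines(remote_text)
    return sorted(local - remote), sorted(remote - local)


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    mgmt_conf = (
        "ProctrackType=proctrack/pgid\n"
        "# accounting\n"
        "JobAcctGatherType=jobacct_gather/linux\n"
    )
    node_conf = "ProctrackType=proctrack/pgid\n"
    only_mgmt, only_node = conf_differences(mgmt_conf, node_conf)
    print("only on management node:", only_mgmt)
    print("only on compute node:", only_node)
```

Two empty lists would suggest that the files agree apart from comments and spacing; it says nothing about whether the running daemons have loaded that file.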
Hello, Slurm Support team!

Thank you for your support. I have some follow-up questions.

===================================================
Our system is a Cray CCS (CS300) product, and the management node and all compute nodes basically use the same slurm.conf file.
(/opt/slurm/slurm/2.6.2/etc/sysconfig/slurm)

Under these conditions, what would cause this error message to be output?

For reference, the result was the same when the sstat command was executed on a node where the job is running.

[root@e037 ~]# cd /opt/slurm/2.6.2/bin
[root@e037 bin]# ./sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 27179
AveCPU AvePages AveRSS AveVMSize JobID
---------- ---------- ---------- ---------- ------------
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: slurm_job_step_stat: unknown return given from e037: 9001 rc = Communication connection failure
sstat: error: slurm_job_step_stat: unknown return given from e038: 9001 rc = Communication connection failure
sstat: error: slurm_job_step_stat: unknown return given from e036: 9001 rc = Communication connection failure
sstat: error: problem getting step_layout for 27179.0: Communication connection failure

I understand that if the JobAcctGatherType parameter is not set in slurm.conf, all the information is displayed as 0. If the JobAcctGatherType parameter is set in slurm.conf, will the sstat command then display normally and become usable?
===================================================

Best regards,
Hello, Slurm Support team!

Thank you for your support!

I would like to urgently confirm the following.

Is the JobAcctGatherType plugin installed together with Slurm?
Or does it need to be installed separately?
Or is it enough just to specify it in slurm.conf?

===============================================
・sstat does not run

# squeue -l (part of the Slurm job list)
27416 ye016uta7 prog4 uuta0026 RUNNING 20:15 3-00:00:00 2 e[042-043]

[root@mgmt2 ~]# ssh e042
Last login: Wed May 28 13:27:01 2014 from mgmt2
[root@e042 ~]# cd /opt/slurm/2.6.2./bin
[root@e042 bin]# ls
generate_pbs_nodefile  sacct     sinfo        sstat
mpiexec                sacctmgr  sjobexitmod  strigger
pbsnodes               salloc    sjstat       sview
qdel                   sattach   smap         weekqueue.sh
qhold                  sbatch    sprio        weekqueue.sh.20140203
qrls                   sbcast    squeue       weekqueue.sh.20140314
qstat                  scancel   sreport      weekqueue.sh.20140319
qsub                   scontrol  srun
queue_up_down.sh       sdiag     sshare
[root@e042 bin]# ./sstat
sstat: error: You didn't give me any jobs to stat.
[root@e042 bin]# ./sstat --j 27416
JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AveCPUFreq ReqCPUFreq ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: slurm_job_step_stat: unknown return given from e042: 9001 rc = Communication connection failure
sstat: error: slurm_job_step_stat: unknown return given from e043: 9001 rc = Communication connection failure
sstat: error: problem getting step_layout for 27416.0: Communication connection failure

Best regards,
Toru Matsuoka
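To see at a glance which compute nodes returned an error, the `slurm_job_step_stat` lines in transcripts like the one above can be scanned with a small script. This is only a sketch: the regular expression is derived from the error lines quoted in this ticket, not from any documented sstat output format, and the function name `failing_nodes` is hypothetical.

```python
import re

# Pattern based on the error lines quoted in this ticket (an assumption,
# not a documented format).
NODE_ERROR = re.compile(
    r"slurm_job_step_stat: unknown return given from (?P<node>\S+): (?P<code>\d+)"
)


def failing_nodes(sstat_stderr):
    """Return a sorted list of (node, return_code) pairs found in the text."""
    found = set()
    for match in NODE_ERROR.finditer(sstat_stderr):
        found.add((match.group("node"), int(match.group("code"))))
    return sorted(found)


if __name__ == "__main__":
    sample = (
        "sstat: error: slurm_job_step_stat: unknown return given from e042: "
        "9001 rc = Communication connection failure\n"
        "sstat: error: slurm_job_step_stat: unknown return given from e043: "
        "9001 rc = Communication connection failure\n"
    )
    for node, code in failing_nodes(sample):
        print(node, code)
```

The resulting node list is a natural starting point for checking each node's slurm.conf and slurmd.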
(In reply to toru matsuoka from comment #5)
> Hello, Slurm Support team!
>
> Thank you for your support!
>
> I would like to urgently confirm the following.
>
> Is the JobAcctGatherType plugin installed together with Slurm?
> Or does it need to be installed separately?
> Or is it enough just to specify it in slurm.conf?

I do not understand your questions.
You do need to configure JobAcctGatherType in slurm.conf.
Did you change slurm.conf and not restart the Slurm daemons (so they would be using old configuration information)?
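A quick way to confirm whether a given copy of slurm.conf actually sets JobAcctGatherType is sketched below. The helper names `get_param` and `accounting_plugin_configured` are hypothetical, the parsing is simplified (every `#` starts a comment, parameter names are compared case-insensitively), and `jobacct_gather/linux` in the sample is just one example of a gathering plugin. The check only inspects the file; it cannot tell whether the daemons have been restarted since the file was changed.

```python
def get_param(conf_text, name):
    """Return the value of the last `name=value` line, or None if absent."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        key, sep, rest = line.partition("=")
        if sep and key.strip().lower() == name.lower():
            value = rest.strip()
    return value


def accounting_plugin_configured(conf_text):
    """True if JobAcctGatherType is set to something other than 'none'."""
    value = get_param(conf_text, "JobAcctGatherType")
    return value is not None and value.lower() != "jobacct_gather/none"


if __name__ == "__main__":
    # Hypothetical slurm.conf contents, for illustration only.
    before = "ProctrackType=proctrack/pgid\n"
    after = before + "JobAcctGatherType=jobacct_gather/linux\n"
    print(accounting_plugin_configured(before))
    print(accounting_plugin_configured(after))
```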
Thank you for your support. I understand.

> You do need to configure JobAcctGatherType in slurm.conf.
> Did you change slurm.conf and not restart the slurm daemons
> (so they would be using old configuration information)?

The customer and we configured JobAcctGatherType in slurm.conf, but we changed slurm.conf without restarting the Slurm daemons.

We will run scontrol reconfigure after configuring JobAcctGatherType in slurm.conf.

Best regards,
Toru Matsuoka
Please keep in mind that a simple scontrol reconfig will not load this setting correctly. You need to restart the daemons. Restart slurmctld as well as every slurmd after the configuration change.
Hello, Slurm Support team!

I understood the following and explained it to the customer.

> Please keep in mind a simple scontrol reconfig will not load the setting correctly.
> You need to restart the daemons.
> Restart the slurmctld as well as all the slurmd's after the configuration change.

Thank you!
If any further questions arise, I will open a new case.
Please close this case.

Best regards,
Toru Matsuoka
Closed per customer request. The Slurm daemons needed to be restarted to enable the change in the accounting information collection plugin.