Ticket 853

Summary: sstat command cannot be used
Product: Slurm
Reporter: toru matsuoka <tmatsuoka>
Component: Other
Assignee: Moe Jette <jette>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: da
Version: 2.6.2
Hardware: Linux
OS: Linux
Site: CRAY

Description toru matsuoka 2014-06-02 20:30:01 MDT
Hello, Slurm Support team!

I am Toru Matsuoka, an engineer at Cray Japan.

Please advise on the following.

Our customer wants to use the sstat command.

However, the following error occurred.

■sstat command

Note: the sstat  command requires that the jobacct_gather plugin be installed and operational.

[root@mgmt2 slurm]# sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 27005
    AveCPU   AvePages     AveRSS  AveVMSize        JobID
---------- ---------- ---------- ---------- ------------
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: slurm_job_step_stat: unknown return given from e035: 9001 rc = Communication connection failure
sstat: error: slurm_job_step_stat: unknown return given from e036: 9001 rc = Communication connection failure
sstat: error: slurm_job_step_stat: unknown return given from e034: 9001 rc = Communication connection failure
sstat: error: problem getting step_layout for 27005.0: Communication connection failure

■sacct command 

The sacct command appears to work.

[root@mgmt2 slurm]# sacct --j 27061
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
27061             prog4 ye016uta7+       root         40    RUNNING      0:0
27061.0       pmi_proxy                  root          2    RUNNING      0:0


In slurm.conf,

ProctrackType=proctrack/pgid
The JobAcctGatherType parameter is not set.

Is JobAcctGatherType required in slurm.conf, or is something else causing the problem?

Best Regards...
Toru Matsuoka
Comment 1 toru matsuoka 2014-06-02 21:15:43 MDT
Sorry, Bug 853 has the same contents as Bug 852.

Please close this case (Bug 852).

The contents are duplicated.
Comment 2 Moe Jette 2014-06-03 02:36:52 MDT
*** Ticket 852 has been marked as a duplicate of this ticket. ***
Comment 3 Moe Jette 2014-06-03 02:46:54 MDT
(In reply to toru matsuoka from comment #0)
> Hello, Slurm Support team!
> 
> I am Toru Matsuoka, an engineer at Cray Japan.
> 
> Please advise on the following.
> 
> Our customer wants to use the sstat command.
> 
> However, the following error occurred.
> 
> ■sstat command
> 
> Note: the sstat  command requires that the jobacct_gather plugin be
> installed and operational.
> 
> [root@mgmt2 slurm]# sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j
> 27005
>     AveCPU   AvePages     AveRSS  AveVMSize        JobID
> ---------- ---------- ---------- ---------- ------------
> sstat: error: Malformed RPC of type 5020 received
> sstat: error: slurm_receive_msgs: Header lengths are longer than data
> received
> sstat: error: Malformed RPC of type 5020 received
> sstat: error: slurm_receive_msgs: Header lengths are longer than data
> received
> sstat: error: Malformed RPC of type 5020 received
> sstat: error: slurm_receive_msgs: Header lengths are longer than data
> received
> sstat: error: slurm_job_step_stat: unknown return given from e035: 9001 rc =
> Communication connection failure
> sstat: error: slurm_job_step_stat: unknown return given from e036: 9001 rc =
> Communication connection failure
> sstat: error: slurm_job_step_stat: unknown return given from e034: 9001 rc =
> Communication connection failure
> sstat: error: problem getting step_layout for 27005.0: Communication
> connection failure

This is almost certainly due to differences in your slurm.conf file between nodes. Make sure that your slurm.conf file on the node where you execute sstat is identical to that of the compute nodes where job 27005 is running. RPC type 5020 is RESPONSE_JOB_STEP_STAT from the compute node.
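
One way to check for such differences is to fingerprint each node's copy of slurm.conf and compare the results. The following is a minimal sketch, not a Slurm tool: the function names `conf_fingerprint` and `same_conf` are hypothetical, and it assumes `sha256sum` is available and that no setting contains a literal `#` character.

```shell
#!/usr/bin/env bash
# Sketch: fingerprint a slurm.conf so copies from different nodes can be compared.
# Comments, trailing whitespace and blank lines are ignored, so only the
# effective settings influence the result.
# Assumption: no value in the file contains a literal '#'.

conf_fingerprint() {
    # $1: path to a slurm.conf file
    sed -e 's/#.*$//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1" \
        | sha256sum | cut -d' ' -f1
}

same_conf() {
    # $1, $2: two slurm.conf copies, for example fetched from two nodes
    [ "$(conf_fingerprint "$1")" = "$(conf_fingerprint "$2")" ]
}
```

For example, after copying the file from a compute node to the node where sstat is run (the paths here are placeholders), `same_conf /path/to/slurm.conf /tmp/e035.slurm.conf && echo identical` reports whether the effective settings match.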


> ■sacct command 
> 
> The sacct command appears to work.
> 
> [root@mgmt2 slurm]# sacct --j 27061
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 27061             prog4 ye016uta7+       root         40    RUNNING      0:0
> 27061.0       pmi_proxy                  root          2    RUNNING      0:0
> 
> 
> In slurm.conf,
> 
> ProctrackType=proctrack/pgid
> The JobAcctGatherType parameter is not set.
> 
> Is JobAcctGatherType required in slurm.conf, or is something else causing the problem?

Without a plugin to collect accounting information (e.g. memory, cpu use, etc.), that information will be reported as zero by sstat.
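
To see which accounting-gather plugin a given slurm.conf selects, the file can be inspected directly. This is a minimal sketch: the helper name `jobacct_gather_type` is hypothetical, and it assumes the parameter appears at most once in the file. When the parameter is absent it prints `jobacct_gather/none`, which is Slurm's default.

```shell
#!/usr/bin/env bash
# Sketch: report the JobAcctGatherType configured in a slurm.conf file.
# Assumption: the parameter appears at most once; the first match is used.

jobacct_gather_type() {
    # $1: path to a slurm.conf file
    value=$(sed -e 's/#.*$//' "$1" \
        | grep -i '^[[:space:]]*JobAcctGatherType[[:space:]]*=' \
        | head -n 1 \
        | sed -e 's/^[^=]*=[[:space:]]*//' -e 's/[[:space:]]*$//')
    # With no setting present, Slurm falls back to jobacct_gather/none,
    # which collects no data.
    echo "${value:-jobacct_gather/none}"
}
```

On a running system, `scontrol show config | grep -i JobAcctGatherType` shows the value the controller has actually loaded, which may differ from the file if the daemons have not been restarted since the last edit.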
Comment 4 toru matsuoka 2014-06-03 04:53:31 MDT
Hello, Slurm Support team!

Thank you for your support.

Please advise on the following.

===================================================

Our system is a Cray CCS (CS300) product, and the management node and all compute nodes basically use the same slurm.conf file.
(/opt/slurm/slurm/2.6.2/etc/sysconfig/slurm)
 
What could cause this error message under these conditions?

Incidentally, the result was the same when the sstat command was run on a node where the job is running.


[root@e037 ~]# cd /opt/slurm/2.6.2/bin
[root@e037 bin]# ./sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 27179
    AveCPU   AvePages     AveRSS  AveVMSize        JobID
---------- ---------- ---------- ---------- ------------
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: slurm_job_step_stat: unknown return given from e037: 9001 rc = Communication connection failure
sstat: error: slurm_job_step_stat: unknown return given from e038: 9001 rc = Communication connection failure
sstat: error: slurm_job_step_stat: unknown return given from e036: 9001 rc = Communication connection failure
sstat: error: problem getting step_layout for 27179.0: Communication connection failure


If the JobAcctGatherType parameter is not set in slurm.conf, all of the information is displayed as 0. But if JobAcctGatherType is set in slurm.conf, will the sstat command then display its output normally and become usable?

===================================================
 
Best Regards..
Comment 5 toru matsuoka 2014-06-03 21:00:54 MDT
Hello, Slurm Support team!

Thank you for your support!

I would like to urgently confirm the following.

Is the JobAcctGatherType plugin installed together with Slurm?
Or does it need to be installed separately?
Or is it sufficient to specify it in slurm.conf?

===============================================
・sstat does not run:

# squeue -l (excerpt showing the Slurm job)

             27416 ye016uta7    prog4 uuta0026  RUNNING      20:15 3-00:00:00      2 e[042-043]

[root@mgmt2 ~]# ssh e042
Last login: Wed May 28 13:27:01 2014 from mgmt2

[root@e042 ~]# cd /opt/slurm/2.6.2./bin
[root@e042 bin]# ls
generate_pbs_nodefile  sacct     sinfo        sstat
mpiexec                sacctmgr  sjobexitmod  strigger
pbsnodes               salloc    sjstat       sview
qdel                   sattach   smap         weekqueue.sh
qhold                  sbatch    sprio        weekqueue.sh.20140203
qrls                   sbcast    squeue       weekqueue.sh.20140314
qstat                  scancel   sreport      weekqueue.sh.20140319
qsub                   scontrol  srun
queue_up_down.sh       sdiag     sshare

[root@e042 bin]# ./sstat
sstat: error: You didn't give me any jobs to stat.

[root@e042 bin]# ./sstat --j 27416
       JobID  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks AveCPUFreq ReqCPUFreq ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask  AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: Malformed RPC of type 5020 received
sstat: error: slurm_receive_msgs: Header lengths are longer than data received
sstat: error: slurm_job_step_stat: unknown return given from e042: 9001 rc = Communication connection failure
sstat: error: slurm_job_step_stat: unknown return given from e043: 9001 rc = Communication connection failure
sstat: error: problem getting step_layout for 27416.0: Communication connection failure


Best Regards..
Toru Matsuoka
Comment 6 Moe Jette 2014-06-04 03:08:46 MDT
(In reply to toru matsuoka from comment #5)
> Hello, Slurm Support team!
> 
> Thank you for your support!
> 
> I would like to urgently confirm the following.
> 
> Is the JobAcctGatherType plugin installed together with Slurm?
> Or does it need to be installed separately?
> Or is it sufficient to specify it in slurm.conf?

I do not understand your questions.

You do need to configure JobAcctGatherType in slurm.conf.

Did you change slurm.conf and not restart the slurm daemons (so they would be using old configuration information)?
Comment 7 toru matsuoka 2014-06-04 12:11:46 MDT
Thank you for your support.

I understood it.

>You do need to configure JobAcctGatherType in slurm.conf.
>Did you change slurm.conf and not restart the slurm daemons 
>(so they would be using old configuration information)?

The customer and we configured JobAcctGatherType in slurm.conf,
but we did not restart the Slurm daemons after changing slurm.conf.

We will run scontrol reconfigure after configuring
JobAcctGatherType in slurm.conf.

Best Regards..
Toru Matsuoka
Comment 8 Danny Auble 2014-06-04 16:06:06 MDT
Please keep in mind a simple scontrol reconfig will not load the setting correctly.

You need to restart the daemons.

Restart the slurmctld as well as all the slurmd's after the configuration change.
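
The restart sequence can be sketched as a dry run that only prints the commands to execute. This is an illustration under stated assumptions: the function name `restart_plan` is hypothetical, the default restart command `/etc/init.d/slurm restart` is an assumption about how the daemons are managed at a given site, and the host names in the usage note are examples.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the commands that restart slurmctld on the
# controller and slurmd on every compute node after a slurm.conf change.
# Assumption: RESTART_CMD defaults to the init script; substitute the
# site's own service command if the daemons are managed differently.

restart_plan() {
    # $1: controller host; remaining arguments: compute nodes
    restart_cmd=${RESTART_CMD:-/etc/init.d/slurm restart}
    controller=$1
    shift
    echo "ssh $controller $restart_cmd"
    for node in "$@"; do
        echo "ssh $node $restart_cmd"
    done
}
```

For instance, `restart_plan mgmt2 e042 e043` prints one ssh line per host, assuming mgmt2 is the controller. After reviewing the output it can be piped to `sh`, and `scontrol show config | grep -i JobAcctGatherType` then confirms the loaded value.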
Comment 9 toru matsuoka 2014-06-04 16:42:08 MDT
Hello, Slurm Support team!

I understood the following and explained it to the customer.

>Please keep in mind a simple scontrol reconfig will not load the setting correctly.
>You need to restart the daemons.
>Restart the slurmctld as well as all the slurmd's after the configuration change.

Thank you!

If further questions arise, I will open a new case.

Please close this case!

Best Regards..
Toru Matsuoka
Comment 10 Moe Jette 2014-06-05 03:45:59 MDT
Closed per customer request. The Slurm daemons needed to be restarted to enable the change in the accounting information collection plugin.