| Summary: | seff reporting impossibly high memory consumption for completed jobs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Alex Mamach <alex.mamach> |
| Component: | User Commands | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | ich+schedmd |
| Version: | 20.02.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Northwestern | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf | | |
Description

Alex Mamach 2021-08-02 20:08:36 MDT

Oriol Vilarrubi:

Hello Alex,

What seff does is basically query Slurm for the data, parse it, and present it to the user, so if you see strange data, most probably the problem is in the data stored in Slurm. To debug this further, could you please execute seff in debug mode for one of the jobs that you see have impossible data? To do so, give seff the -d option, like this:

seff -d 1234

The only difference is that it will print the raw data; that way we can see whether some of the data in the Slurm database is strange.

Could you please also attach the slurm.conf, so that I can see all the other relevant settings, like JobAcctGatherType.

Greetings.

Oriol Vilarrubi:

Hello Alex,

Did you have the time to execute seff in debug mode?

Greetings.

Alex Mamach:

Hi Oriol,

Sorry for the delay. I needed to comb through the jobs to find a good candidate (some of the jobs I was previously looking at began reporting 0% memory utilization instead of their previously > 100% memory utilization).

seff -d 9007814
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 9007814  lmc0633 lmc0633 TIMEOUT quest 16 1 16 104857600 1 294 14709 1214633856 0

Job ID: 9007814
Cluster: quest
User/Group: lmc0633/lmc0633
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:04:54
CPU Efficiency: 0.12% of 2-17:22:24 core-walltime
Job Wall-clock time: 04:05:09
Memory Utilized: 1.13 TB (estimated maximum)
Memory Efficiency: 1158.37% of 100.00 GB (100.00 GB/node)

I've also attached our slurm.conf.

Created attachment 21152 [details]
slurm.conf
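For reference, the efficiency figures seff prints can be reproduced directly from the two raw fields in the debug dump above. This is a minimal sketch, assuming (as the printed totals suggest, not taken from the seff source) that the Reqmem and Mem fields are in KB:

```python
# Reproduce seff's memory figures from the raw debug fields.
# Assumption: Reqmem and Mem are in KB, which is consistent with the
# totals seff prints for this job.

reqmem_kb = 104_857_600     # Reqmem: 100 GB requested (per node)
mem_kb = 1_214_633_856      # Mem: reported maximum usage

mem_tb = mem_kb / 1024**3                   # KB -> TB (binary units)
efficiency = 100.0 * mem_kb / reqmem_kb

print(f"Memory Utilized: {mem_tb:.2f} TB")        # 1.13 TB
print(f"Memory Efficiency: {efficiency:.2f}%")    # 1158.37%
```

Since the 1158.37% figure follows mechanically from the stored Mem value, the overcount must already be present in the accounting data rather than in seff's arithmetic.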
Oriol Vilarrubi:

Hello Alex,

As I suspected, the problem is not seff: the raw data that comes from the DB is wrong (the Mem field, which in your example is 1214633856 [roughly 1158 GB]).

So now we know that the problem comes from either the data acquisition or the storage of that data in the DB. Do you think you could run job 9007814 again, so that we can see if this happens again?

Greetings.

Oriol Vilarrubi:

(In reply to Oriol Vilarrubi from comment #5)
> Hello Alex,
>
> As I suspected, the problem is not seff: the raw data that comes from
> the DB is wrong (the Mem field, which in your example is 1214633856
> [roughly 1158 GB]).
>
> So now we know that the problem comes from either the data acquisition
> or the storage of that data in the DB. Do you think you could run job
> 9007814 again, so that we can see if this happens again?
>
> Greetings.

Also, in order to better isolate the problem, could you get the data from sacct directly? The command would be something like this:

sacct -j 9007814 -o JobID,TRESUsageInMax

Alex Mamach:

Hi Oriol,
I'll work on tracking down the job submission script and seeing if we can re-run it.
In the meantime here's the output from sacct you requested:
JobID TRESUsageInMax
-------------- -----------------------------------------------------
9007814
9007814.extern cpu=00:00:04,energy=0,fs/disk=214304173,mem=1952K,pages=6,vmem=140516K
9007814.0 cpu=02:56:20,energy=0,fs/disk=11064919239,mem=75914616K,pages=39,vmem=79772964K
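TRESUsageInMax is a comma-separated list of tres=value pairs. The following parser sketch makes the step data easier to compare against the seff output; treating a trailing "K" as kilobytes is an assumption based on sacct's typical formatting, not on the sacct source:

```python
# Parse a TRESUsageInMax string as printed by sacct into a dict.
# Assumption: a trailing "K" on memory counters means kilobytes (KiB).

def parse_tres(tres: str) -> dict:
    out = {}
    for pair in tres.split(","):
        key, _, val = pair.partition("=")
        if val.endswith("K"):
            out[key] = int(val[:-1]) * 1024   # -> bytes
        elif val.isdigit():
            out[key] = int(val)
        else:
            out[key] = val                    # e.g. cpu=02:56:20
    return out

step = parse_tres("cpu=02:56:20,energy=0,fs/disk=11064919239,"
                  "mem=75914616K,pages=39,vmem=79772964K")
print(step["mem"] / 1024**3)   # step max RSS in GiB: about 72.4
```

Notably, the step maximum of 75914616K is far below the 1214633856K that seff showed for the job, and 75914616 multiplied by 16 (the job's task count) is exactly 1214633856, which may hint that a per-task maximum was multiplied by the number of tasks somewhere during aggregation.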
Alex Mamach:

Hi, I'm going to close this for now, since after upgrading Slurm to 20.11 and doing some database optimizations we haven't seen this again. Thanks for your time!

Another reporter:

Hi, we are running into the same issue:

$ seff 680215
Job ID: 680215
Cluster: ag_gagneur
Use of uninitialized value $user in concatenation (.) or string at /bin/seff line 154, <DATA> line 602.
User/Group: /ag_gagneur
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 6-16:21:26
CPU Efficiency: 33.08% of 20-04:46:30 core-walltime
Job Wall-clock time: 2-00:28:39
Memory Utilized: 501.90 GB
Memory Efficiency: 392.11% of 128.00 GB

Probably this comes from shared-memory usage. Is there anything we can do about it?

More details:

seff -d 680215
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Use of uninitialized value $user in concatenation (.) or string at /usr/bin/seff line 147, <DATA> line 602.
Slurm data: 680215  ag_gagneur COMPLETED ag_gagneur 10 1 1 134217728 1 577286 174519 526284124 0
[...]
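The second report's figures are internally consistent in the same way: seff is faithfully reproducing the stored Mem value. A quick check under the same KB-units assumption as before:

```python
# Check the second site's seff figures, assuming Reqmem/Mem are in KB.

reqmem_kb = 134_217_728   # Reqmem: 128 GB requested
mem_kb = 526_284_124      # Mem field from seff -d

print(f"{mem_kb / 1024**2:.2f} GB")            # 501.90 GB utilized
print(f"{100.0 * mem_kb / reqmem_kb:.2f}%")    # 392.11% efficiency
```

So any shared-memory or aggregation overcounting would have to be addressed on the accounting side (e.g. the JobAcctGatherType configuration mentioned earlier), not in seff itself.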