Ticket 943

Summary: The job is running in the wrong partition
Product: Slurm
Reporter: kunihiko Katayanagi <kuni>
Component: Accounting
Assignee: Moe Jette <jette>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 2 - High Impact
Priority: ---
CC: brian.gilmer, da
Version: 2.6.5
Hardware: Linux
OS: Linux
Site: CRAY
Slinky Site: ---
Linux Distro: ---
Machine Name: CCS12207
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---

Description kunihiko Katayanagi 2014-07-06 14:22:52 MDT
Hello Support,
This is kuni of Cray Japan Inc.

I have a question about the partition in which jobs are running.

The user doan submitted several jobs after changing his group ID.

See the accounting output below. His jobs ran in both the normal and wrf partitions.
Job 106365 ran in the wrf partition while the group was RCM.
Job 108711 is running in the normal partition while the group is WRF.


  JobID  GID      Group   UID      User              Submit               Start                 End    Elapsed NCPUS NNode CPUTimeRAW     State  Partition
------- ---- ---------- ----- --------- ------------------- ------------------- ------------------- ---------- ----- ----- ---------- ---------- ----------
  99838 30121        RCM 30556      doan 2014-07-02T02:41:22 2014-07-04T01:29:09 2014-07-05T01:28:13   23:59:04    96     6    8289024    TIMEOUT     normal
 106365 30121        RCM 30556      doan 2014-07-04T22:33:47 2014-07-04T22:33:47 2014-07-05T20:27:20   21:53:33    60     3    4728780  COMPLETED        wrf *****
 107350 30121        RCM 30556      doan 2014-07-05T12:15:44 2014-07-05T12:15:44 2014-07-06T10:48:53   22:33:09   128     8   10392192  COMPLETED     normal
 108711 30076        WRF 30556      doan        wrf 2014-07-06T14:18:23 2014-07-06T14:18:29             Unknown   19:19:29    32     2    2226208    RUNNING     normal *****

The user doan belongs to two groups (RCM and WRF).
He changed his group ID with the newgrp command before submitting the jobs.

Here is the partition information for normal and wrf.

[root@mgmt1 ~]# scontrol show partition
PartitionName=normal
   AllocNodes=ALL AllowGroups=DWFHYP,RCM,TKBNDFT,LES,GEO,GALAXIES,LATTICE,MCSM,NUMLIB2,MMTDMKSM,NONHOMO,XMP,XQCD,GORILLA2,GALEVO,BBILQCD,CDEF,H4ES,COSMTURB,BRIDGEPP,ASTRO,ARGOT,LSC,NUCLDFT,TDDFT,LATNUC,TCAPROFC,NIMSCQ,ARARAT,RSDFTC,RKYDENG,ADPFFFF,PSTCP,SCAT,KPIPI2,cray,CRAY Default=YES
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 MaxCPUsPerNode=16
   Nodes=coma-[001-241]
   Priority=1 RootOnly=NO ReqResv=NO Shared=EXCLUSIVE PreemptMode=OFF
   State=UP TotalCPUs=4820 TotalNodes=241 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerCPU=15000


PartitionName=wrf
   AllocNodes=ALL AllowGroups=WRF,cray,CRAY Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 MaxCPUsPerNode=UNLIMITED
   Nodes=coma-[246-249]
   Priority=500 RootOnly=NO ReqResv=NO Shared=EXCLUSIVE PreemptMode=OFF
   State=UP TotalCPUs=80 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

I do not understand why this happens.

When the group ID is set to WRF, I expect the job to run in the wrf partition.
Likewise, when the group ID is set to RCM, I expect the job to run in the normal partition.

If you need more information (for example, logs), please let me know.

Best Regards
Kuni
Comment 1 Moe Jette 2014-07-07 03:20:20 MDT
Duplicate of bug 921.

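Any user who belongs to at least one of the groups listed in AllowGroups can use the partition. Slurm ignores the user's current group ID, by design. Since doan is a member of both RCM and WRF, he can use either partition regardless of which group newgrp has made current.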
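
The rule above can be modeled roughly as a set-membership test. This is an illustrative sketch only, not Slurm's implementation: the function name partition_allowed is hypothetical, and the AllowGroups sets are abbreviated from the scontrol output in the description.

```python
def partition_allowed(user_groups, allow_groups):
    """Return True if the user is a member of at least one allowed group.

    Hypothetical model of the AllowGroups rule: the user's *current* group
    is deliberately not an input, only the full set of memberships.
    """
    return not set(user_groups).isdisjoint(allow_groups)


# All groups doan belongs to, per the ticket description.
doan_groups = {"RCM", "WRF"}

# Abbreviated AllowGroups lists (assumption: only the relevant entries kept).
partitions = {
    "normal": {"RCM", "cray", "CRAY"},
    "wrf": {"WRF", "cray", "CRAY"},
}

for name, allowed in partitions.items():
    print(name, partition_allowed(doan_groups, allowed))
# normal True
# wrf True
```

Under this model, running newgrp changes nothing, because the result depends only on the set of groups the user belongs to.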

Linux file access control works the same way. In the example below, my gid is 1000 (jette), but I can still read a file that is readable only by root and by group adm, because adm is one of my supplementary groups:

jette@jette:~$ id
uid=1000(jette) gid=1000(jette) groups=1000(jette),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),112(lpadmin),124(sambashare)
jette@jette:~$  ls -ld /tmp/bug943
-r--r----- 1 root adm 5 Jul  7 08:16 /tmp/bug943
jette@jette:~$ cat /tmp/bug943
test
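
The shell session above can be restated as a small check. This is a simplified sketch, not the kernel's actual permission algorithm: it looks only at the group class and ignores the owner and other classes, root privileges, and ACLs. The function name group_read_allowed is hypothetical; the numeric IDs and the mode are taken from the id and ls output above.

```python
import stat


def group_read_allowed(file_gid, file_mode, process_gids):
    """Simplified group-class read check (hypothetical helper).

    Access is granted when the file is group-readable and the file's group
    appears anywhere in the process's group set, not just as the current gid.
    """
    return bool(file_mode & stat.S_IRGRP) and file_gid in process_gids


# Groups from the `id` output: current gid 1000 plus supplementary groups.
jette_gids = {1000, 4, 24, 27, 30, 46, 112, 124}

# /tmp/bug943 is -r--r----- root:adm, i.e. mode 0o440 with gid 4 (adm).
print(group_read_allowed(4, 0o440, jette_gids))  # True
print(group_read_allowed(4, 0o440, {1000}))      # False
```

The second call shows what would happen if only the current gid were considered, which is the behavior the reporter expected from Slurm.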

*** This ticket has been marked as a duplicate of ticket 921 ***