| Summary: | Slurm's job record gets modified to change workdir to /root when an scrontab job fails. | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Aditi Gaur <agaur> |
| Component: | User Commands | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex, bart, csamuel |
| Version: | 20.11.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 21.08.5 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
On further investigation, this may or may not be related to application failures. In this case we saw that when Slurm created a new job_id, the job record's working directory changed to /root:

```
329998|/global/homes/k/kadidia|2021-11-10T02:45:03
329999|/global/homes/k/kadidia|2021-11-10T02:45:03
329999|/global/homes/k/kadidia|2021-11-10T02:46:03
329998|/global/homes/k/kadidia|2021-11-10T02:46:03
329998|/global/homes/k/kadidia|2021-11-10T02:47:00
329999|/global/homes/k/kadidia|2021-11-10T02:47:00
329998|/global/homes/k/kadidia|2021-11-10T02:48:03
329999|/global/homes/k/kadidia|2021-11-10T02:48:03
329998|/global/homes/k/kadidia|2021-11-10T02:49:03
329999|/global/homes/k/kadidia|2021-11-10T02:49:03
329998|/global/homes/k/kadidia|2021-11-10T02:50:03
329999|/global/homes/k/kadidia|2021-11-10T02:50:03
329998|/global/homes/k/kadidia|2021-11-10T02:51:03
329999|/global/homes/k/kadidia|2021-11-10T02:51:03
329998|/global/homes/k/kadidia|2021-11-10T02:52:01
329999|/global/homes/k/kadidia|2021-11-10T02:52:03
329999|/global/homes/k/kadidia|2021-11-10T02:53:03
329998|/global/homes/k/kadidia|2021-11-10T02:53:03
329998|/global/homes/k/kadidia|2021-11-10T02:54:03
329999|/global/homes/k/kadidia|2021-11-10T02:54:03
329998|/global/homes/k/kadidia|2021-11-10T02:55:04
329999|/global/homes/k/kadidia|2021-11-10T02:55:04
329998|/global/homes/k/kadidia|2021-11-10T02:55:59
329999|/global/homes/k/kadidia|2021-11-10T02:55:59
330371|/root|2021-11-10T02:56:04
330372|/root|2021-11-10T02:56:04
330372|/root|2021-11-10T02:57:04
330371|/root|2021-11-10T02:57:04
330371|/root|2021-11-10T02:58:04
330372|/root|2021-11-10T02:58:04
330371|/root|2021-11-10T02:59:02
330372|/root|2021-11-10T02:59:02
330371|/root|2021-11-10T03:00:04
330372|/root|2021-11-10T03:00:04
```

Hi,
The job_id of a cron job is never changed by slurmctld.
It can change only due to a crontab update, which creates a new job and cancels the old one.
Did someone update those jobs as root?
When work_dir is not explicitly set, scrontab sets it to getenv("HOME").
In the case where root does 'scrontab -u ...', this will be "/root".
Dominik
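The fallback Dominik describes can be sketched in shell (a hypothetical illustration of the logic, not Slurm source): when the crontab entry sets no working directory, the invoking process's own HOME is recorded, so root running `scrontab -u <user>` bakes in /root rather than the target user's home.

```
#!/bin/sh
# Hypothetical illustration (not Slurm source): when no workdir is
# given, the *invoking* process's HOME wins, not the target user's home.

default_workdir() {
    # $1: workdir requested by the crontab entry (may be empty)
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "${HOME:-/}"   # caller's HOME, e.g. /root for root
    fi
}

HOME=/root                           # root running "scrontab -u kadidia"
default_workdir ""                   # prints /root
```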
Hi Dominik,
thanks for your answer.
Scrontab man page says:
```
OPTIONS
-e Edit the crontab. If a crontab does not exist already, a default example (without any defined entries) will be provided in
the editor.
-l List the crontab. (Prints directly to stdout.)
-r Remove the crontab. Any currently running crontab-defined jobs will continue to run but will no longer recur. All other
crontab-defined jobs will be cancelled.
-u <user>
Edit or view a different user's crontab. Listing is permitted for Operators and Admins. Editing/removal is only permitted for
root and the SlurmUser account.
```
So if I do `scrontab -u <user>` but don't actually edit the scrontab, why should that change the job itself?
Based on the options above, `scrontab -u` seems to be the only command an administrator has to look at a user's cron job if they need to for whatever reason, and it appears to support only viewing, which is presumably what an administrator did here.
Should that cause the job to change when the administrator has not actually modified it?
Hi,
If you only want to view a user's crontab, you should use the -l option; otherwise, you will send an update-crontab request. I will see if we can make obtaining the workdir smarter when -u is used.
Dominik

Hi,
This commit prevents changing the working directory to the root/SlurmUser home: https://github.com/SchedMD/slurm/commit/3512b0c1e859
I'll go ahead and close this out. Feel free to reopen if needed.
Dominik
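The commit is described as preventing the working directory from being set to the root/SlurmUser home. A hedged sketch of that idea in shell (hypothetical, not the actual patch) is to derive the default workdir from the target user's passwd entry rather than from the caller's HOME:

```
#!/bin/sh
# Hypothetical sketch (not the actual commit): resolve the default
# workdir from the *target* user's passwd entry instead of trusting
# the invoking user's HOME.

workdir_for_user() {
    # $1: target user name (the argument of "scrontab -u")
    dir=$(getent passwd "$1" | cut -d: -f6)
    if [ -n "$dir" ]; then
        printf '%s\n' "$dir"      # target user's home directory
    else
        printf '%s\n' "$HOME"     # last-resort fallback
    fi
}

workdir_for_user root             # prints root's own home, e.g. /root
```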
Hello,

We have recently noticed that when a user's scrontab job fails (for a reason specific to their code), subsequent runs of the same cron job get modified so that the working directory changes to one in /root. For example, this job:

```
perlmutter:login01:~ # scontrol show job=330388
JobId=330388 JobName=cron_job2
   UserId=kadidia(79923) GroupId=kadidia(79923) MCS_label=N/A
   Priority=69119 Nice=0 Account=nstaff QOS=cron
   JobState=CANCELLED Reason=BeginTime Dependency=(null)
   Requeue=0 Restarts=2 BatchFlag=1 Reboot=0 ExitCode=0:1
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2021-11-10T03:13:10 EligibleTime=2021-11-10T03:14:00
   AccrueTime=Unknown
   StartTime=2021-11-10T03:13:30 EndTime=2021-11-10T03:13:30 Deadline=N/A
   CrontabSpec="* * * * *"
   PreemptEligibleTime=2021-11-10T03:13:30 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-11-10T03:13:06
   Partition=cron AllocNode:Sid=login01:4294967294
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=login01
   NumNodes=1 NumCPUs=1 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=cron DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=$SCRATCH/ldms_perlmutter_2021.sh
   WorkDir=/root
   AdminComment={"batchHost":"login01","partition":"cron","packJobOffset":0,"priority":69119,"submitTime":1636513990,"jobId":330388,"gresRequest":"cpu=1,mem=2G,node=1,billing=1","features":"cron","resizing":0,"jobAccount":"nstaff","startTime":1636514010,"arrayTaskId":4294967294,"qos":"cron","endTime":1636514010,"jobDerivedExitCode":0,"argv":["$SCRATCH\/ldms_perlmutter_2021.sh"],"timeLimit":10,"cluster":"perlmutter","workingDirectory":"\/root","uid":79923,"packJobId":0,"arrayJobId":0,"allocCpus":0,"jobExitCode":1,"name":"cron_job2","nodes":"login01","allocNodes":1,"tresRequest":"1=2,2=4096,3=18446744073709551614,4=1,5=2","restartCnt":2,"reboot":0}
   StdErr=/root/slurm-330388.out
   StdIn=/dev/null
   StdOut=/root/slurm-330388.out
   Power=
   NtasksPerTRES:0
```

Previous iterations of this job failed for unrelated reasons. We also verified that this job did not have its working directory set to /root, yet when it gets requeued its working directory, stdout, and stderr all change to /root. We had another user complain about this as well: there recently was a GPFS outage, after which a user saw their cron jobs fail because the working directory had changed to /root. We think this happens when one or more iterations of the cron job fail; subsequent iterations then get their directory and stdout/err changed. Please let us know if you can reproduce this issue on your end.
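For anyone trying to reproduce this, a minimal scrontab entry along the lines described above might look like the following (hypothetical fragment; the partition and time limit are invented, and the job simply exits non-zero every minute):

```
# Hypothetical scrontab entry (edit with "scrontab -e" as the user):
#SCRON -p cron -t 00:10:00
* * * * * /bin/false

# Then, as root, view (but do not edit) the user's crontab and watch
# whether the requeued job's WorkDir flips to /root:
#   scrontab -u <user> -l
#   scontrol show job=<jobid> | grep WorkDir
```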