| Summary: | Cannot Release Jobs with JobHeldUser | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Will French <will> |
| Component: | Scheduling | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | alex, davide.vanzo |
| Version: | 15.08.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Vanderbilt | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, slurmdbd.conf, cgroup.conf, slurmctld logs | | |
Created attachment 2730 [details]
slurm.conf
Created attachment 2731 [details]
slurmdbd.conf
Created attachment 2732 [details]
cgroup.conf
To provide an update since Friday: these jobs did end up getting scheduled and ran correctly. There are still a few lingering issues:

1. Why were the jobs marked as JobHeldUser in the first place? The user never put a hold on these jobs, and it appears this happened as a result of the job(s) landing on a node that was dying.

2. Why was the JobHeldUser field not updated after admins released those jobs for scheduling? It's also unclear whether releasing the jobs was a necessary step at all. At the time, the user had run up against their GrpCPU limit, so these jobs may simply not have been starting due to resource restrictions.

Hi Will,

(In reply to Will French from comment #4)
> To provide an update since Friday, these jobs did end up getting scheduled
> and ran correctly. There are still a few lingering issues:
>
> 1. Why were the jobs marked as JobHeldUser in the first place? The user
> never put a hold on these jobs and it appears that this happened as a result
> of the job(s) landing on a node that was dying.

On a prolog or job launch failure, the job may end up marked as JobHeldUser. So if the node suffered a hard drive failure, this behavior is expected.

> 2. Why was the JobHeldUser field not updated after admins released those
> jobs for scheduling? It's also unclear if releasing the jobs was a necessary
> step at all. At the time, the user had run up against its GrpCPU limit so
> these jobs may have not been starting simply due to resource restrictions.

Let me investigate why JobHeldUser was not updated after admins released these jobs. I believe releasing the jobs is a necessary step.

Will,
could you please attach your slurmctld.log file?
I want to look for any of these messages:
info("sched: update_job: releasing hold for job_id %u uid %u", job_ptr->job_id, uid);
info("ignore priority reset request on held job %u", job_ptr->job_id);
debug("%s: job %d already release ignoring request", __func__, job_ptr->job_id);
I believe the Reason should have been changed to WAIT_NO_REASON ("None") after the release command, as coded in src/slurmctld/job_mgr.c at line 10639, inside the _update_job() function. Meanwhile, I'll try to reproduce this myself.
Created attachment 2736 [details]
slurmctld logs
This is actually all of /var/log/messages on our primary Slurm controller server for the last week.
So, filtering the slurmctld.log, I see these messages for these two tasks:
alex@pc:~/Downloads$ grep -E "7131402_18|7131402_19" messages-20160214
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp717
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: Requeuing JobID=7131402_18(7136844) State=0x0 NodeCnt=0
Feb 12 13:11:57 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_19 (7136862) in production on vmp717
Feb 12 17:00:50 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=112888 usec=413
Feb 12 17:00:51 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=112888 usec=380
Feb 12 17:43:52 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=0 usec=567
Feb 12 17:44:00 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=0 usec=466
Feb 13 08:19:12 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp478
Feb 13 08:35:09 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_19 (7136862) in production on vmp424
Feb 13 13:27:14 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_18(7136844) State=0x1 NodeCnt=1 WEXITSTATUS 0
Feb 13 13:27:14 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_18(7136844) State=0x8003 NodeCnt=1 done
Feb 13 13:31:56 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_19(7136862) State=0x1 NodeCnt=1 WEXITSTATUS 0
Feb 13 13:31:56 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_19(7136862) State=0x8003 NodeCnt=1 done
alex@pc:~/Downloads$
I've tried to reproduce this myself by creating a Prolog script that exits with a non-zero value.
$ scontrol show config | grep -w Prolog
Prolog = /path/to/prolog
$ cat /path/to/prolog
#!/bin/bash
exit 1
Check that sinfo shows the node idle:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 idle compute1
Submit array batch job with 2 tasks:
$ sbatch --array=0-1 --wrap="hostname"
Submitted batch job 20026
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20026_[0-1] part1 wrap alex PD 0:00 1 (Resources,JobHeldUser)
$ scontrol show job 20026 | grep -E "JobId|JobState|Reason"
JobId=20026 ArrayJobId=20026 ArrayTaskId=1 JobName=wrap
JobState=PENDING Reason=Resources Dependency=(null)
JobId=20027 ArrayJobId=20026 ArrayTaskId=0 JobName=wrap
JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null)
We see that task 0 has Reason=launch_failed_requeued_held. The slurmctld log shows:
slurmctld: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
slurmctld: error: Prolog failure on node compute1, draining the node
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 drain compute1
$ scontrol release 20026_0
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20026_[0-1] part1 wrap alex PD 0:00 1 (Resources)
slurmctld: sched: update_job: releasing hold for job_id 20027 uid 1000
$ scontrol update nodename=compute1 state=resume
slurmctld: update_node: node compute1 state set to IDLE
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 drain compute1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20026_[0-1] part1 wrap alex PD 0:00 1 (Resources,JobHeldUser)
Trying to reproduce your context, what I see is that 'scontrol release <jobid>' works as expected. Probably what happened is that the jobs were released, but the node was still failing for whatever reason (prolog/node failures), and the job tasks were requeued again.
Would this make sense?
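On a live system, one quick way to check whether a released task got re-held is to look at its state and reason right after the release, e.g. with `squeue -h -j <jobid> -o "%T %r"`. Here is a minimal self-contained sketch that classifies such a line; the sample value is a captured stand-in, not live output:

```shell
# Classify a "STATE REASON" line as squeue -h -j <jobid> -o "%T %r" would
# print it. The sample below is a hard-coded stand-in so the sketch runs
# anywhere; on a real cluster you would capture it from squeue instead.
sample='PENDING launch_failed_requeued_held'

state=${sample%% *}     # first field: job state
reason=${sample#* }     # remainder: reason string

case "$reason" in
  JobHeldUser|launch_failed_requeued_held)
    echo "job is held again: $reason" ;;   # prints: job is held again: launch_failed_requeued_held
  *)
    echo "job not held (state=$state reason=$reason)" ;;
esac
```

Polling this a few times after `scontrol release` would show whether the job immediately flips back to a held reason.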
> Trying to reproduce your context, what I see is that 'scontrol release
> <jobid>' works as expected. Probably what happened is that jobs were
> released but node was still failing for whatever reason (prolog/node
> failures) and job tasks were requeued again.
>
> Would this make sense?
Do you mean that these jobs kept going to the same failing node repeatedly? If so, that shouldn't have happened since we downed the node before releasing these jobs.
(In reply to Will French from comment #9)
> > Trying to reproduce your context, what I see is that 'scontrol release
> > <jobid>' works as expected. Probably what happened is that jobs were
> > released but node was still failing for whatever reason (prolog/node
> > failures) and job tasks were requeued again.
> >
> > Would this make sense?
>
> Do you mean that these jobs kept going to the same failing node repeatedly?
> If so, that shouldn't have happened since we downed the node before
> releasing these jobs.

What I'm suggesting, though I may be wrong, is that the job kept going to the same or another failing node, and was then requeued and marked as JobHeldUser again. If you are sure that the node was down before releasing, maybe the job allocated a different node. In fact, looking at the logs, the job is allocated different nodes:

Feb 12 13:09:43 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp717
Feb 13 08:19:12 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp478

I've also tried submitting an array of 2 tasks: one of the tasks was marked as JobHeldUser and the node state changed to drain; I then changed the node to down and released the held job, and the Reason was changed properly to Resources.
So I think that, at least in my case, this is correct; maybe in your system context there's something we are missing to reproduce:

$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 idle compute1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
$ sbatch --array=0-1 --wrap="hostname"
Submitted batch job 20043
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20043_[0-1] part1 wrap alex PD 0:00 1 (Resources,JobHeldUser)
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 drain compute1
$ scontrol update nodename=compute1 state=down reason="test"
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 drain compute1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20043_[0-1] part1 wrap alex PD 0:00 1 (Resources,JobHeldUser)
$ scontrol update nodename=compute1 state=resume
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 idle compute1
$ scontrol update nodename=compute1 state=down reason="test"
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
part1* up 1:00:00 1 down compute1
$ scontrol release 20043_0
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20043_[0-1] part1 wrap alex PD 0:00 1 (Resources,BeginTime)

It's also strange that I can't find in the logs a message like this:

Feb 12 17:00:49 slurmsched1 slurmctld[12411]: sched: update_job: releasing hold for job_id 7136808 uid 112888

but for job_id 7131402_18 or 7131402_19:

$ grep -E "7131402_18|7131402_19|hold for job_id 7131402" messages-20160214

Will,

I think we've managed to clearly identify what's going on here. When I grepped your slurmctld.log file in my previous comments, I was only taking into account the array_job_id and the array_task_id, and didn't grep for the job_id itself.
If I add the job_id of both tasks to the grep, I can find the release message:

alex@pc:~/Downloads$ grep -E "7131402_18|7131402_19|7136844|7136862" messages-20160214
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp717
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: Requeuing JobID=7131402_18(7136844) State=0x0 NodeCnt=0
Feb 12 13:11:57 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_19 (7136862) in production on vmp717
Feb 12 17:00:50 slurmsched1 slurmctld[12411]: sched: update_job: releasing hold for job_id 7136844 uid 112888
Feb 12 17:00:50 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=112888 usec=413
Feb 12 17:00:51 slurmsched1 slurmctld[12411]: sched: update_job: releasing hold for job_id 7136862 uid 112888
Feb 12 17:00:51 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=112888 usec=380
Feb 12 17:10:03 slurmsched1 slurmctld[12411]: sched: update_job: setting priority to 293 for job_id 7136862
Feb 12 17:10:03 slurmsched1 slurmctld[12411]: sched: update_job: setting priority to 296 for job_id 7136844
Feb 12 17:43:52 slurmsched1 slurmctld[12411]: sched: update_job: setting priority to 338 for job_id 7136844
Feb 12 17:43:52 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=0 usec=567
Feb 12 17:44:00 slurmsched1 slurmctld[12411]: sched: update_job: setting priority to 335 for job_id 7136862
Feb 12 17:44:00 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=0 usec=466
Feb 13 08:19:12 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp478
Feb 13 08:35:09 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_19 (7136862) in production on vmp424
Feb 13 13:27:14 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_18(7136844) State=0x1 NodeCnt=1 WEXITSTATUS 0
Feb 13 13:27:14 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_18(7136844) State=0x8003 NodeCnt=1 done
Feb 13 13:31:56 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_19(7136862) State=0x1 NodeCnt=1 WEXITSTATUS 0
Feb 13 13:31:56 slurmsched1 slurmctld[12411]: job_complete: JobID=7131402_19(7136862) State=0x8003 NodeCnt=1 done
alex@pc:~/Downloads$

So, as you can see, both tasks started on vmp717, which failed, and both got requeued. You then released them, and finally they completed on vmp478 and vmp424 respectively. So the release worked as expected.

Regarding the Reason, Slurm had not yet updated it when you executed squeue right after the scontrol release command, but it does get updated with time.

Hope this makes things clear to you now.

(In reply to Alejandro Sanchez from comment #11)
> Will,
>
> I think we've managed to clearly identify what's going on here. When I
> grepped your slurmctld.log file in my previous comments, I was just taking
> into account the array_job_id and the array_task_id, but didn't grep for the
> job_id itself. If I add the job_id of both tasks to the grep, I can find the
> release message:
>
> [grep output quoted above trimmed]
>
> So as you can see, both tasks started in vmp717, which failed, and both got
> requeued. Then you released them and finally they completed in vmp478 and
> vmp424 respectively.
>
> So the release worked as expected.

Yes, that's my interpretation as well.

> Regarding the REASON, Slurm had not yet updated the REASON when you executed
> squeue after the scontrol release command. But with time it gets updated.

We have a cron job that runs about every hour, checks for held jobs, and emails admins when it finds any. Based on those email alerts, it appears that after the jobs were released, but while they were still in the PENDING state (~6-8 hours depending on the job), the Reason remained JobHeldUser. If that's normal or expected then so be it; I just want to provide all the details in case this is not the intended behavior.

The Reason should be changed right after the info message:
[...]
info("sched: update_job: releasing hold for job_id %u uid %u", job_ptr->job_id, uid);
job_ptr->state_reason = WAIT_NO_REASON;
job_ptr->job_state &= ~JOB_SPECIAL_EXIT;
[...]
src/slurmctld/job_mgr.c 15279L
case WAIT_NO_REASON:
return "None";
src/common/slurm_protocol_defs.c 4305L
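The identifier subtlety uncovered above (array tasks appearing in slurmctld logs under both the array task ID, e.g. 7131402_18, and the raw JobId, e.g. 7136844) can be shown with a tiny self-contained demo. The log lines here are abbreviated stand-ins for the attached /var/log/messages:

```shell
# Array tasks are logged under two identifiers: the array task ID and the
# raw JobId. A grep on the array ID alone misses messages keyed by the raw
# id, such as the "releasing hold for job_id ..." line.
log='backfill: Started JobId=7131402_18 (7136844) in production on vmp717
sched: update_job: releasing hold for job_id 7136844 uid 112888
_slurm_rpc_update_job complete JobId=7131402_18 uid=112888 usec=413'

# Array task ID only: matches 2 of the 3 lines, missing the release message.
printf '%s\n' "$log" | grep -c -E '7131402_18'            # prints 2

# Array task ID or raw JobId: matches all 3 lines.
printf '%s\n' "$log" | grep -c -E '7131402_18|7136844'    # prints 3
```

When auditing what happened to an array task, grepping for both identifiers avoids the false conclusion that a release never occurred.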
So I'm closing this for now as resolved/infogiven. If you encounter more jobs stuck in JobHeldUser after release, please reopen the ticket and attach the slurmctld.log and the relevant slurmd.log files.
We had a node die (hard drive failure) today, which we identified when we noticed several jobs being put in either a JobHeldUser or "launch failed requeued held" state. After downing the node, we were able to release the jobs with "launch failed requeued held" status, but not the jobs that list JobHeldUser:

[root@vmps11 ~]# squeue | grep -i held | tail -2
7131402_18 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)
7131402_19 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)
[root@vmps11 ~]# scontrol release 7131402_18
[root@vmps11 ~]# squeue | grep -i held | tail -2
7131402_18 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)
7131402_19 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)
[root@vmps11 ~]# scontrol release 7131402_19
[root@vmps11 ~]# squeue | grep -i held | tail -2
7131402_18 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)
7131402_19 productio BCell XXXXXXX PD 0:00 1 (JobHeldUser)

We have about 150 jobs (from two different users, and all job arrays, if that's important) in this stuck state where we cannot release them for scheduling. We have also tried releasing the jobs while logged in as the user. No luck.

The JobHeldUser state is especially interesting since the Slurm docs appear to indicate that it is only listed when a user places a hold on his/her own job. However, both users have confirmed that they did not initiate the hold.
Here are some logs from slurmctld:

root@slurmsched1:~# grep 7131402_18 /var/log/messages
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_18 (7136844) in production on vmp717
Feb 12 13:09:43 slurmsched1 slurmctld[12411]: Requeuing JobID=7131402_18(7136844) State=0x0 NodeCnt=0
Feb 12 17:00:50 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=112888 usec=413
Feb 12 17:43:52 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_18 uid=0 usec=567
root@slurmsched1:~# grep 7131402_19 /var/log/messages
Feb 12 13:11:57 slurmsched1 slurmctld[12411]: backfill: Started JobId=7131402_19 (7136862) in production on vmp717
Feb 12 17:00:51 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=112888 usec=380
Feb 12 17:44:00 slurmsched1 slurmctld[12411]: _slurm_rpc_update_job complete JobId=7131402_19 uid=0 usec=466

Here is info about one of these jobs:

root@slurmsched1:~# scontrol show job 7131402_19 -dd
JobId=7136862 ArrayJobId=7131402 ArrayTaskId=19 JobName=BCell
UserId= GroupId=
Priority=336 Nice=0 Account=chgr QOS=normal
JobState=PENDING Reason=JobHeldUser Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
SubmitTime=2016-02-12T13:11:59 EligibleTime=2016-02-12T13:14:00
StartTime=Unknown EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=production AllocNode:Sid=vmps09:23221
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) BatchHost=vmp717
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=19200,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Nodes=vmp717 CPU_IDs=4-5 Mem=19200
MinCPUsNode=1 MinMemoryNode=19200M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
StdIn=/dev/null
BatchScript=
#!/bin/bash
##set job-name to match directory name containing split.raw files
##update wall time as needed
##
#SBATCH --job-name=BCell
#SBATCH --mail-type=ALL
#SBATCH --time=0-10:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=19200
#SBATCH --array=0-24
#SBATCH --account=chgr
setpkgs -a R_3.2.0
cd /home/rinkerd/scripts/R/
cp runPheWAS_MASTER_split runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}
sed -i "s/TISSUE/${SLURM_JOB_NAME}/g" runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}
sed -i "s/XX/${SLURM_ARRAY_TASK_ID}/g" runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}
echo `date`
time R --vanilla <runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}> runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}.R.out
echo `date`
rm runPheWAS_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}

Thanks,
Will
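For admins hitting the same symptom with many jobs, releasing each job by hand (as in the transcript above) gets tedious. A bulk-release pipeline can be sketched as follows; the `squeue -h -t PD -o "%i %r"` invocation uses standard format options, but here a captured sample stands in for it and the scontrol commands are echoed rather than executed, so the sketch is safe to run anywhere:

```shell
# Bulk-release every pending job whose reason is JobHeldUser.
# On a live cluster the first stage would be:
#   squeue -h -t PD -o "%i %r"
# (-h: no header, -t PD: pending jobs, %i: job id, %r: reason)

held_jobids() {
  # Keep only the job IDs of user-held jobs.
  awk '$2 == "JobHeldUser" { print $1 }'
}

printf '%s\n' \
  '7131402_18 JobHeldUser' \
  '7131402_19 JobHeldUser' \
  '7131500_1 Resources' \
| held_jobids \
| while read -r jobid; do
    echo "scontrol release $jobid"   # drop the echo to actually release
  done
```

Running this prints one `scontrol release` command per held job; piping the filtered IDs to `xargs -r scontrol release` would be an equivalent one-liner.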