| Summary: | scontrol fails to update a partition for a job | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Gianluca Castellani <gianluca.castellani> |
| Component: | Other | Assignee: | David Bigagli <david> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | KAUST | Slinky Site: | --- |
|
Description
Gianluca Castellani
2015-03-02 22:43:56 MST
Hi Gianluca,
I tried to reproduce this but did not see the error.
Who owns job 24885? Could you please send the output of
'scontrol show job 24885', or of any other job ID you can reproduce the
problem with? Is there any error logged in the slurmctld log file?
David
Hi,
that job has finished, but the same thing just happened when I
retried with another one.
[root@slurm01 ~]# scontrol update job=24854 partition=defaultq
Invalid user id for job 24854
Although the message complains about an invalid user id, the job's partition
has been changed:
[root@slurm01 ~]# scontrol show job 24854
JobId=24854 JobName=BTTT4PCBM-wB97XD-opt01403TDroot19
UserId=zhany0d(135420) GroupId=noor-users(1001)
Priority=27347 Nice=0 Account=idle QOS=default
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:01:11 TimeLimit=8-23:55:00 TimeMin=N/A
SubmitTime=2015-03-03T11:28:38 EligibleTime=2015-03-03T11:28:38
StartTime=2015-03-04T08:17:33 EndTime=2015-03-13T08:12:33
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=defaultq AllocNode:Sid=rcfen02:27197
Looking at the log file:
[root@slurm01 ~]# grep 24854 /var/log/slurm/slurmctld.log
[2015-03-03T11:28:38.106] _slurm_rpc_submit_batch_job JobId=24854 usec=391
[2015-03-04T08:17:19.032] job_submit.lua: slurm_job_modify: for job 24854 from uid 0, setting default comment value: ***TEST_COMMENT***
[2015-03-04T08:17:19.032] update_job: setting partition to defaultq for job_id 24854
[2015-03-04T08:17:33.302] sched: Allocate JobId=24854 NodeList=ca119 #CPUs=16
I am guessing that the culprit is my job_submit.lua (I did not modify the
slurm_job_modify function):
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    -- If the update request carries no comment, set a default one and log it.
    if job_desc.comment == nil then
        local comment = "***TEST_COMMENT***"
        slurm.log_info("slurm_job_modify: for job %u from uid %u, setting default comment value: %s",
                       job_rec.job_id, modify_uid, comment)
        job_desc.comment = comment
    end
    return slurm.SUCCESS
end
I think that you can close the bug.
Best,
Gianluca
OK, go ahead. You can try without the job submit plugin and let us know the results.
Cheers,
David
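
In this thread scontrol update printed an error while the job's partition did change, so it may help to check the job's actual state instead of relying on the message alone. A minimal sketch of that check in shell, assuming the Partition= field appears in 'scontrol show job' output as in the listing earlier in the thread; job_partition is a hypothetical helper name, not part of Slurm:

```shell
#!/bin/sh
# job_partition: read `scontrol show job <id>` output on stdin and print the
# value of the first Partition= field (prints nothing if the field is absent).
# Hypothetical helper for illustration only; it is not a Slurm command.
job_partition() {
    # Put each KEY=VALUE pair on its own line, then keep only Partition's value.
    tr ' ' '\n' | sed -n 's/^Partition=//p' | head -n 1
}
```

On a live cluster this could be used right after the update attempt, for example: scontrol show job 24854 | job_partition. Comparing the printed value with the requested partition shows whether the update took effect regardless of the message scontrol returned.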
|