Ticket 10487 - slurm job error
Summary: slurm job error
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 20.11.0
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
 
Reported: 2020-12-18 06:37 MST by Nandini
Modified: 2021-01-12 13:11 MST

See Also:
Site: TTU


Attachments

Description Nandini 2020-12-18 06:37:42 MST
Hello,

A job that one of our users ran on a GPU node failed, and I am trying to find out what caused the failure. This is the output in the log file:

Slurm job error: Failed to invoke task plugins: task_p_pre_launch error
[2020-12-17T12:36:50.486] [15244.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256

Can you tell me what status:256 means in the log file?

This is the job submission script, and it works when I run it a second time. I would like to know more about this error, as it happens often with different jobs.

thanks
Nandini
Texas Tech University
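
The "status:256" in that log line appears to be the raw wait(2) status of the batch step rather than the exit code itself: the exit code is carried in the high byte, so 256 decodes to exit code 1, which matches the 1:0 ExitCode in the sacct output in the next comment. A minimal shell check of that decoding, using the value from this ticket:

status=256
echo "exit code: $(( (status >> 8) & 0xff )), signal: $(( status & 0x7f ))"
# prints: exit code: 1, signal: 0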
Comment 1 Nandini 2020-12-18 06:59:05 MST
Additional information

sacct --format="JobID,user,account,elapsed, Timelimit,MaxRSS,ReqMem,MaxVMSize,ncpus,ExitCode" -j 15244
       JobID      User    Account    Elapsed  Timelimit     MaxRSS     ReqMem  MaxVMSize      NCPUS ExitCode
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------
15244           rliang    default   00:00:01 2-00:00:00                9639Mc                     1      1:0
15244.batch               default   00:00:01                  104K     9639Mc    143380K          1      1:0
15244.extern              default   00:00:01                     0     9639Mc      4352K          1      0:0

What do those exit codes mean?
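
sacct reports ExitCode as <exit code>:<signal>. Here 1:0 means the batch step's script returned exit code 1 and was not killed by a signal; the extern step's 0:0 means it finished cleanly. A sketch of a narrower query for just those fields (15244 is the job ID from this ticket; DerivedExitCode is the highest exit code across all of the job's steps):

sacct -j 15244 --format=JobID,State,ExitCode,DerivedExitCode
# ExitCode = "<return code>:<signal>", so 1:0 means "exited with code 1, no signal"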
Comment 2 Nandini 2020-12-18 07:10:56 MST
Also, is it not possible to get full information for a finished job, such as the location of the job submission script? So far I couldn't find it.

thanks
Nandini
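
On the submission directory: while the job record is still in slurmctld's memory (within MinJobAge of the job finishing), scontrol can show the working directory and script path; after that, the accounting database still records the working directory but not the script itself. A sketch using the job ID from this ticket (the WorkDir field requires a reasonably recent sacct):

scontrol show job 15244 | grep -E 'Command=|WorkDir='   # only works shortly after the job ends
sacct -j 15244 --format=JobID,WorkDir%60                # working directory from the accounting DB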
Comment 3 Michael Hinton 2020-12-18 15:13:09 MST
Hi Nandini,

Could you attach the slurmd log for the node, as well as the slurmctld.log?

Can you also attach your current slurm.conf?

Thanks,
-Michael
Comment 5 Nandini 2020-12-18 15:27:13 MST
Created attachment 17230 [details]
slurmctld.log
Comment 6 Nandini 2020-12-18 15:27:13 MST
Created attachment 17231 [details]
slurmd.log
Comment 7 Michael Hinton 2020-12-18 15:31:48 MST
Ok, I went ahead and marked those attachments as private. I'll look into them and get back to you.

Thanks!
-Michael
Comment 9 Michael Hinton 2020-12-18 16:11:51 MST
Could you set SlurmdDebug=debug in slurm.conf, restart the slurmds, reproduce the issue, and then attach the relevant slurmd.log portion? That should give us more information about why it's failing.

Thanks,
-Michael
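
What that entails on the node side, sketched under the assumption of a systemd-managed slurmd and a default log path (the actual location is whatever SlurmdLogFile points to):

# In slurm.conf, set:  SlurmdDebug=debug
# Then, on the affected GPU node(s):
sudo systemctl restart slurmd
# Reproduce the failing job, then grab the lines around the failure, e.g.:
grep -B5 -A20 'task_p_pre_launch' /var/log/slurmd.log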
Comment 10 Nandini 2020-12-18 16:15:42 MST
Will restarting the slurmds affect running jobs? What is the impact of restarting slurmd?

Comment 11 Michael Hinton 2020-12-18 16:31:58 MST
(In reply to Nandini from comment #10)
> Will restarting the slurmds affect running jobs? What is the impact of
> restarting slurmd?
No, restarting the slurmds should not affect already running jobs.
Comment 12 Nandini 2020-12-19 08:22:08 MST
But before that, how can I use the failed job's ID to find the folder where the job submission script is located? In UGE we have the option to look at the full details of a failed job. In Slurm I have tried all the commands that I know of, and none of them gives me the information I need.

Thanks

Comment 13 Michael Hinton 2020-12-21 10:25:57 MST
(In reply to Nandini from comment #12)
> But before that, how can I use the failed job's ID to find the folder where
> the job submission script is located? In UGE we have the option to look at
> the full details of a failed job. In Slurm I have tried all the commands
> that I know of, and none of them gives me the information I need.
By default, you can't access the submission script with Slurm. However, you could set up a job completion plugin to save the submission script and I think the job command. Otherwise, the temporary copy of the submission script is dropped once the job goes through.
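
Michael's suggestion is the answer for the 20.11 release in this ticket; later releases added a built-in option. Slurm 21.08+ can store the batch script in the accounting database and return it from sacct. A sketch of that setup (not applicable to 20.11 itself, and only for jobs submitted after the flag is enabled):

# slurm.conf on Slurm 21.08 or newer:
#   AccountingStoreFlags=job_script
# Afterwards, retrieve a finished job's stored batch script with:
sacct -j <jobid> --batch-script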
Comment 15 Michael Hinton 2020-12-22 10:21:47 MST
Hi Nandini,

We can't proceed until you do what I asked in comment 9.

Thanks,
-Michael
Comment 16 Michael Hinton 2021-01-06 14:56:39 MST
Hi Nandini,

I'm going to go ahead and close this out due to inactivity. Feel free to reopen if you want to pursue this further.

Thanks!
-Michael