Hello,

A job that one of our users ran on a GPU node failed, and I am trying to find out what caused the failure. This is the output in the log file:

Slurm job error: Failed to invoke task plugins: task_p_pre_launch error
[2020-12-17T12:36:50.486] [15244.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256

Can you tell me what status:256 means in the log file? The job submission script itself works when I try running it a second time. I would like to understand this error better, as it happens often across different jobs.

Thanks,
Nandini
Texas Tech University
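For reference, the status value in this log line is the raw wait(2) status of the batch step: the exit code sits in the high byte and any terminating signal in the low byte. A quick shell check of the standard decoding (a sketch, not specific to any Slurm version):

    # Decode a raw wait() status such as 256:
    status=256
    echo "exit code: $(( status >> 8 ))"    # -> 1: the batch script exited with code 1
    echo "signal:    $(( status & 0x7f ))"  # -> 0: the step was not killed by a signal

So status:256 here simply means the job's batch script exited with code 1.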
Additional information:

sacct --format="JobID,user,account,elapsed,Timelimit,MaxRSS,ReqMem,MaxVMSize,ncpus,ExitCode" -j 15244

JobID             User    Account    Elapsed  Timelimit     MaxRSS     ReqMem  MaxVMSize      NCPUS ExitCode
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------
15244           rliang    default   00:00:01 2-00:00:00                9639Mc                     1      1:0
15244.batch               default   00:00:01                  104K     9639Mc    143380K          1      1:0
15244.extern              default   00:00:01                     0     9639Mc      4352K          1      0:0

What do those exit codes mean?
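In sacct output, the ExitCode column is reported as exitcode:signal, so these values decode directly. A minimal example of pulling just those fields (standard sacct options; DerivedExitCode assumes a reasonably recent Slurm):

    sacct -j 15244 --format=JobID,State,ExitCode,DerivedExitCode
    # ExitCode 1:0 -> the step exited with code 1 and was not signaled
    # ExitCode 0:0 -> the step completed cleanly

The 1:0 on the .batch step matches the status:256 in the slurmd log above: the batch script returned exit code 1.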
Also, is it possible to get full information for a finished job, including the location of the job submission script? I have not been able to find it so far.

Thanks,
Nandini
Hi Nandini, Could you attach the slurmd log for the node, as well as the slurmctld.log? Can you also attach your current slurm.conf? Thanks, -Michael
Created attachment 17229 [details]
slurm.conf

Please check the attached logs.

Thanks,
Nandini
Created attachment 17230 [details] slurmctld.log
Created attachment 17231 [details] slurmd.log
Ok, I went ahead and marked those attachments as private. I'll look into them and get back to you. Thanks! -Michael
Good, we do not want those made public, please.

Thanks,
Nandini
Could you set SlurmdDebug=debug in slurm.conf, restart the slurmds, reproduce the issue, and then attach the relevant slurmd.log portion? That should give us more information about why it's failing. Thanks, -Michael
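For anyone following along, a minimal sketch of that change (assuming a systemd-managed slurmd; the log path below is an assumption, use whatever SlurmdLogFile points to on your nodes):

    # slurm.conf (on the compute nodes, or the shared copy):
    SlurmdDebug=debug

    # Restart slurmd on the affected node(s):
    systemctl restart slurmd

    # Then reproduce the failure and collect the relevant log portion, e.g.:
    tail -n 200 /var/log/slurm/slurmd.log   # assumed SlurmdLogFile location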
Will restarting the slurmds affect running jobs? What is the impact of restarting slurmd?
(In reply to Nandini from comment #10)
> Will restarting the slurmds affect running jobs? What is the impact of
> restarting slurmd?
No, restarting the slurmds should not affect already running jobs.
But before that, how can I use a failed job's ID to find the folder where the job submission script is located? In UGE we have the option to look at the full details of a failed job. In Slurm I have tried all the commands I know of, and none of them gets me the information I need.

Thanks,
Nandini
(In reply to Nandini from comment #12)
> But before that, how can I use a failed job's ID to find the folder where
> the job submission script is located?
By default, you can't access the submission script with Slurm. However, you could set up a job completion plugin to save the submission script and, I think, the job command. Otherwise, the temporary copy of the submission script is dropped once the job completes.
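For reference, a sketch of what is available here (the job ID is this ticket's example; the jobcomp settings are an optional slurm.conf addition, not something already configured on this cluster, and the log path is an assumption):

    # While slurmctld still tracks the job (i.e. within MinJobAge of completion),
    # scontrol reports the script path and working directory:
    scontrol show job 15244 | grep -E 'Command=|WorkDir='

    # The accounting database keeps the working directory after that:
    sacct -j 15244 --format=JobID,WorkDir%60

    # Optional slurm.conf addition: log completion records (including WorkDir)
    # to a flat file via the filetxt job completion plugin:
    #   JobCompType=jobcomp/filetxt
    #   JobCompLoc=/var/log/slurm/job_completions.log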
Hi Nandini, We can't proceed until you do what I asked in comment 9. Thanks, -Michael
Hi Nandini, I'm going to go ahead and close this out due to inactivity. Feel free to reopen if you want to pursue this further. Thanks! -Michael