| Summary: | Jobs left in PD state with reason BeginTime | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Roy <proutyr1> |
| Component: | Scheduling | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | bart, champ, nate, proutyr1, randy |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | UMBC | | |
| Attachments: | slurm.conf file; slurmd log from cnode011; slurmd log from cnode111; slurmctld log from head node | | |
Please attach the slurmctld log and at least a few slurmd logs.
Please try running a trivial job:
> srun -vvv /usr/bin/uptime
Here is the output:

[~]$ srun -vvv --partition=batch --account=test_conp --time=1 /usr/bin/uptime
srun: defined options
srun: -------------------- --------------------
srun: account : test_conp
srun: partition : batch
srun: time : 00:01:00
srun: verbose : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=18446744073709551615
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=26000
srun: debug: propagating RLIMIT_NOFILE=131072
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0007
srun: debug2: srun PMI messages to port=38091
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 38666
srun: debug: Entering _msg_thr_internal
srun: debug: Munge authentication plugin loaded
srun: Waiting for nodes to boot (delay looping 7650 times @ 0.100000 secs x index)
srun: debug: Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec
srun: debug2: eio_message_socket_accept: got message connection from 10.2.15.254:61406 7
srun: PrologSlurmctld failed, job killed
srun: debug2: eio_message_socket_accept: got message connection from 10.2.15.254:61408 7
srun: Force Terminated job 3978159
srun: error: Job allocation 3978159 has been revoked
[~]$

-----

[root log]# grep 3978159 /var/log/slurmctld
[2021-06-17T11:02:45.141] sched: _slurm_rpc_allocate_resources JobId=3978159 NodeList=cnode102 usec=1171
[2021-06-17T11:02:45.324] error: prolog_slurmctld JobId=3978159 prolog exit status 99:0
[2021-06-17T11:02:45.324] unable to requeue JobId=3978159: Only batch jobs are accepted or processed
[root log]#

Logs en route

Created attachment 19992 [details]
slurmd log from cnode011
Created attachment 19993 [details]
slurmd log from cnode111
Created attachment 19994 [details]
slurmctld log from head node.
(In reply to Roy from comment #3)
> srun: PrologSlurmctld failed, job killed
>
> PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob

Looks like the Bright-provided prolog script is failing (and causing all the jobs to re-queue).

(In reply to Nate Rini from comment #7)
> (In reply to Roy from comment #3)
> > srun: PrologSlurmctld failed, job killed
> >
> > PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
>
> Looks like the Bright provided prolog script is failing (and causing all the
> jobs to re-queue).

Please comment out the PrologSlurmctld line in the slurm.conf and restart all of the Slurm daemons to see if jobs work without the script.

Hi Nate,

That did the trick; the previously PD jobs are now running.

Should I move this issue to Bright to determine the cause, or are there other diagnostics we could address here?

(In reply to Roy from comment #9)
> That did the trick, the previously PD jobs are now running.
>
> Should I move this issue to Bright to determine the cause or are there other
> diagnostics we could address here?

That will likely be the most effective route. Slurm honors the result of the prolog script and will requeue any job when it fails.

Alright. Thank you for your help with this.

Roy
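For reference, one way to see why the prolog exits non-zero (the log shows exit status 99) is to point PrologSlurmctld at a small wrapper that runs the real script and records its result. This is only a sketch; the helper name and log path are assumptions, not something from this ticket. The prolog path is the one quoted above.

```shell
# Hypothetical helper for debugging a failing PrologSlurmctld script.
# $1: path to the real prolog (e.g. /cm/local/apps/cmd/scripts/prolog-prejob)
# $2: log file to append to (path is an assumption)
run_prolog_logged() {
    "$1"
    status=$?
    # slurmctld exports SLURM_JOB_ID to the prolog environment.
    echo "JobId=${SLURM_JOB_ID:-unknown} prolog exit=$status" >> "$2"
    # Propagate the status unchanged so Slurm still sees the real result.
    return "$status"
}
```

A wrapper script installed as PrologSlurmctld could then call, say, `run_prolog_logged /cm/local/apps/cmd/scripts/prolog-prejob /var/log/prolog-debug.log` and exit with its return value, leaving a per-job trail of exit codes without altering Slurm's behavior.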
I'm going to close this ticket per your response. Please respond if you have any more questions.
We also generally suggest against running with the config:
> DebugFlags=NO_CONF_HASH
This could result in unexpected issues if one of the nodes gets out of sync with the slurm.conf on the controller.
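The out-of-sync condition that NO_CONF_HASH suppresses warnings about can be checked directly by comparing the controller's slurm.conf against a copy fetched from a node (for example with scp). A minimal sketch; the helper name and file paths are assumptions:

```shell
# Compare two slurm.conf copies by checksum.
# $1: controller's slurm.conf, $2: a node's slurm.conf (fetched locally)
conf_in_sync() {
    a=$(md5sum "$1" | awk '{print $1}')
    b=$(md5sum "$2" | awk '{print $1}')
    # Succeeds (exit 0) only when the two files are byte-identical.
    [ "$a" = "$b" ]
}
```

For example: `scp cnode011:/etc/slurm/slurm.conf /tmp/cnode011.conf && conf_in_sync /etc/slurm/slurm.conf /tmp/cnode011.conf || echo "cnode011 out of sync"`.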
Thanks,
--Nate
Created attachment 19991 [details]
slurm.conf file

Hi,

Running 19.05.8 -- there are plans to upgrade in the upcoming weeks. In the meantime, we're seeing that all new jobs submitted to our system drop immediately into a Pending state with Reason=(BeginTime). I see a few curious entries in the slurmctld log.

Many requeues:

[2021-06-17T09:21:33.201] Requeuing JobId=3977254

Many failed prolog runs:

[2021-06-17T09:21:33.204] error: prolog_slurmctld JobId=3977980 prolog exit status 99:0

Many attempted allocations that still result in PD state:

[2021-06-17T09:21:32.791] sched: Allocate JobId=3977407 NodeList=cnode038 #CPUs=1 Partition=high_mem

Many issues with RPC counts:

[2021-06-17T09:21:30.716] sched: 233 pending RPCs at cycle end, consider configuring max_rpc_cnt

I've attached our slurm.conf; please note we run Slurm along with Bright Computing. I will attach the log in a future message.

I have looked through some other bug reports on this issue and have not been able to find a solution that pertains to 19.05.8. I have attempted to hold and release jobs, as well as pushing back the start time of jobs. Both cases lead test jobs to the same result: Pending with BeginTime as the reason. We do run with a multifactor priority scheme using age, and I see hints that this could be related.

Currently, no users can run new jobs on the system. We do still have some jobs running from before whatever hiccup started this.
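When many jobs are stuck PD, it can help to tally the pending reasons at a glance. On the live cluster that would be `squeue -h -t PD -o '%r' | sort | uniq -c | sort -rn` (squeue's `%r` prints each job's Reason). A small sketch of the same pipeline as a reusable helper; the function name is an assumption:

```shell
# Count pending-state reasons fed on stdin, most common first.
# Intended input: one Reason per line, e.g. from `squeue -h -t PD -o '%r'`.
summarize_reasons() {
    sort | uniq -c | sort -rn
}
```

On the cluster described here this would have shown BeginTime dominating the pending queue.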