I'm seeing some odd behavior lately: jobs failing with either exit code 0:9 or exit code 1:0. This occurs during a stress test we run to ensure the system still functions as it should and can handle a large number of job requests. In the case where we see exit code 0:9, for example:

Jul 1 21:26:43 j1c10 EPILOG-ROOT: Job 3638483: (86) Finished: Job 3638483 (serial) finished for user Raghu.Reddy in partition juno with exit code 0:9
Jul 1 21:26:43 j1c10 EPILOG-ROOT: Job 3638491: (86) Finished: Job 3638491 (serial) finished for user Raghu.Reddy in partition juno with exit code 0:9
Jul 1 21:26:43 j1c10 EPILOG-ROOT: Job 3638492: (86) Finished: Job 3638492 (serial) finished for user Raghu.Reddy in partition juno with exit code 0:9
Jul 1 21:26:43 j1c13 EPILOG-ROOT: Job 3638480: (86) Finished: Job 3638480 (serial) finished for user Raghu.Reddy in partition juno with exit code 0:9
Jul 1 21:26:43 j1c13 EPILOG-ROOT: Job 3638481: (86) Finished: Job 3638481 (serial) finished for user Raghu.Reddy in partition juno with exit code 0:9

There are some corresponding entries in the slurmctld log at the time of the above messages:

61539703 [2020-07-01T21:26:43.207] debug: Note large processing time from _slurm_rpc_epilog_complete: usec=1858446 began=21:26:41.348
...
61539713 [2020-07-01T21:26:43.209] debug: Note large processing time from _slurm_rpc_dump_job_single: usec=1691571 began=21:26:41.516

These jobs do tax the slurmctld: they are short, and they chain themselves by submitting a copy of themselves via the sbatch command just before they end. The jobs with exit code 0:9 appear to fail at the point where the sbatch command is fired off to launch the chained job. Each job uses only one core, and other jobs of the same type share the node, up to 40 jobs (one per core). Out of 85K jobs processed, we have experienced about 80 failures. While 85K jobs in about 20 hours may seem like a lot, we have run up to 150K jobs in 24 hours in the past without issue.
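The chaining pattern described above can be sketched as a small shell function run at the end of each job script. This is only an illustration: the real test harness is not shown in this ticket, and the function name, the generation-count stopping condition, and the --export usage are all my assumptions.

```shell
# chain_next: sketch of a self-resubmitting ("chained") stress-test job.
# Just before the current job exits, it submits a copy of itself; this is
# the sbatch call at which the 0:9 jobs appear to fail under load.
# Names and the MAX_CHAIN stopping condition are illustrative only.
chain_next() {
    count=${1:-0}    # generations completed so far
    max=${2:-100}    # cap so the chain eventually stops
    if [ "$count" -lt "$max" ]; then
        sbatch --export=ALL,CHAIN_COUNT=$((count + 1)) "$0"
    fi
}
```

A job script would call something like `chain_next "${CHAIN_COUNT:-0}" "$MAX_CHAIN"` as its last step.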
I do believe this is a communication issue among the nodes. BUT, I have tuned the system per the high-throughput guidelines SchedMD has provided, and this is a very small cluster - two server nodes and only 14 compute nodes in action. What limit have I reached that may be causing this intermittent issue? Please let me know what additional information I can provide to assist in tuning the system.
Hi,

Can you send me slurm.conf and the output from sdiag for this system?

Dominik
Created attachment 14906 [details] Juno slurm.conf file
Created attachment 14907 [details]
sdiag output from 0628 through 0702

This file contains 5 sdiag files. We collect these nightly at 23:59 UTC.
Hi,

Sorry for the late response. In sdiag I noticed:

REQUEST_SUBMIT_BATCH_JOB seems to be heavy. Could this be caused by JobSubmitPlugins=lua?

There are a lot of REQUEST_NODE_INFO and REQUEST_JOB_INFO_SINGLE RPCs. Do these come from the prolog/epilog? If yes, what info do you require in those scripts?

Can you grab some perf data for ~10 minutes (both perf.data.tar.bz2 and perf.data)? Maybe this will show us some bottleneck, e.g.:

perf record -s --call-graph dwarf -p `pidof slurmctld`
perf archive perf.data

then send both perf.data.tar.bz2 and perf.data.

Dominik
Hi,

Any news?

Dominik
(In reply to Dominik Bartkiewicz from comment #5)

Hi Dominik,

Sorry for the delay. I've had some personal medical issues to attend to since your reply and haven't had an opportunity to get back to this. We're going to update our system to 20.02.4 today and rerun our stress tests. I'll collect that info during that time and send it to you.

Best,

Tony.
(In reply to Dominik Bartkiewicz from comment #4)
> Can you grab some perf data from ~10 minutes (both perf.data.tar.bz2 and
> perf.data)?
> Maybe this will show us some bottleneck.
>
> eg.:
> perf record -s --call-graph dwarf -p `pidof slurmctld`
> perf archive perf.data

Dominik,

I'm not very familiar with perf, so I'll come up to speed on it. We didn't even have it on the system; I just installed it via yum install perf. Regardless, I don't believe it's working:

[root@bqs7 ~]# perf record -s --call-graph dwarf -p `pidof slurmctld`
WARNING: Ignored open failure for pid 74924
WARNING: Ignored open failure for pid 74936
WARNING: Ignored open failure for pid 74950
WARNING: Ignored open failure for pid 74957
WARNING: Ignored open failure for pid 74960

And, even though perf.data exists with a size of 552K, there is no perf.data.tar.bz2 file. Trying to read perf.data:

[root@bqs7 ~]# perf script -i perf.data > perf.data.txt
WARNING: The perf.data file's data size field is 0 which is unexpected.

Please let me know how you'd like to proceed.

Tony
Hi,

To generate perf.data.tar.bz2 you need to use "perf archive perf.data".

Dominik
Created attachment 15470 [details]
perf data file (about 10 min)

Compressed the perf.data file using "tar -czvf perf.data.tgz perf.data", as necessary to fit within the file size limit.
Created attachment 15471 [details] file created via: perf archive perf.data
Created attachment 15472 [details] Refreshed sdiag as of this morning directly after perf run
Hi,

I am afraid that perf.data is broken (the internet reports that this can happen with big processes). Could you capture perf.data one more time (the perf.data.tar.bz2 you already sent is fine)?

Could you also send me the prolog and epilog scripts from slurmctld and slurmd?

Dominik
Created attachment 15490 [details] All my pro/epilog files
Created attachment 15491 [details]
perf data file refreshed 2020-08-18

See if this works out.
Hi,

Unfortunately, I also can't read this perf data. Could you check whether you can locally generate any useful output? e.g.:

perf report --hierarchy -T -i perf.data

You can also try to use LBR to grab perf.data (it only works on newer Intel CPUs), e.g.:

perf record --call-graph=lbr -p `pidof slurmctld`

I noticed that every prolog and epilog generates one "scontrol show job <job_id>". Using the s-tools (squeue, scontrol) in a script can always cause performance issues on a high-throughput system. The ideal is to use data from environment variables.

Dominik
(In reply to Dominik Bartkiewicz from comment #15)
> Could you check if you can locally generate any useful output?
> eg.:
> perf report --hierarchy -T -i perf.data

I get the same as you:

[root@bqs7 ~]# perf report --hierarchy -T -i perf.data
0x3407f2a0 [0xffff]: failed to process type: -1596379697
Warning: 71 out of order events recorded.
Error: failed to process sample
# To display the perf.data header info, please use --header/--header-only options.

> You can also try to use LBR to grab perf.data (it only works on newer Intel
> CPU)
> eg.:
> perf record --call-graph=lbr -p `pidof slurmctld`

[root@bqs7 ~]# perf record --call-graph=lbr -p `pidof slurmctld`
Lowering default frequency rate to 3000.
Please consider tweaking /proc/sys/kernel/perf_event_max_sample_rate.
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.245 MB perf.data (803 samples) ]

I've run this for less than a minute on a quiet system, and it does seem to provide a call-graph report:

[root@bqs7 ~]# perf report --hierarchy -T -i perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 803 of event 'cycles:ppp'
# Event count (approx.): 44852584
# ....

After seeing that, I also retried:

perf record -s --call-graph dwarf -p `pidof slurmctld`

and it too was successful for a short period on a quiet system. Let me restart the job load to see what happens.

> I noticed that every prolog and epilog generate one "scontrol show job
> <job_id>"
> Using s-tools (squeue, scontrol) in a script can always cause some
> performance issues in a high throughput system. The ideal is to use data
> from environment variables.

That would be great if the data were there. But I have no way of knowing whether the job is on a shared node or not. This is needed so that we can determine whether or not the cleanup procedure can be run on the node (both in the prolog and in the epilog).
If you know of another way to detect this condition, I would gladly give up the "scontrol show job" calls. Ideally, there should be a way for all job information to be available at the node(s) without having to call scontrol. I'd be interested in hearing recommendations from SchedMD.
Created attachment 15561 [details]
perf data file refreshed 2020-08-24

This seems to work on my end. See if it is useful to you.
Hi,

The PrologSlurmctld and EpilogSlurmctld scripts seem to do just logging. Could you check if disabling PrologSlurmctld and EpilogSlurmctld helps?

Have you considered using pam_slurm_adopt instead of the cleaning procedure in the prolog/epilog?

I have tried to locate all the data needed by these scripts that is currently taken from scontrol:

isShared(): NUMTASKS, NUMCPUS, PARTITION (this env currently exists only in the prolog)
epilog.root.sh: BATCHHOST

Do you think it is possible to change this approach slightly to eliminate scontrol from these scripts? If you describe which data you need, perhaps we will be able to add it to the prolog and epilog environment in 20.11.

Have you checked the high-throughput guide?
https://slurm.schedmd.com/high_throughput.html

Dominik
(In reply to Dominik Bartkiewicz from comment #18)

> PrologSlurmctld and EpilogSlurmctld scripts seem to do just logging.
> Could you check if disabling PrologSlurmctld and EpilogSlurmctld help?

I can do that and will let you know what happens. But I'm having trouble understanding why this would be troublesome, since it should be a "local" RPC call.

> Have you considered using pam_slurm_adopt instead of the cleaning procedure
> in prolog/epilog?

We are already using pam_slurm_adopt. How is this helpful? As I understand it, that simply prevents other users from getting onto a node if they don't already have a job there. The more pressing issue is that there may be leftover processes, memory (/dev/shm), or shared storage (/tmp) still allocated, especially from an ill-terminated job. Our past experience has forced us in this direction to ensure nodes are clean after each job and prior to the next job.

> I have tried to locate all data needed by those scripts that are currently
> taken from scontrol:
>
> isShared() NUMTASKS, NUMCPUS, PARTITION (now env exists only in prolog)
> epilog.root.sh BATCHHOST

Since I need to clean up after a job and before a job, I need the information in both the prolog.root and epilog.root phases. Our way of determining whether a job is on a shared node is to naively test:

[[ $NUMTASKS -eq 1 ]] && [[ $NUMCPUS -eq 1 ]] && [[ $NUMNODES -eq 1 ]]

since we permit single-core jobs to be placed on shared nodes; jobs with two or more cores get exclusive access to a node. If there's a better way of determining this, I wouldn't have any reservations about dropping my approach. In fact, we really want a more robust method of determining whether or not a job is on a node shared with other jobs, since the customer would prefer to have multiple jobs on nodes if the resources are available.

> Do you think it is possible to change this approach slightly to eliminate
> scontrol from this script?

Our goal is to provide the user clean nodes to work on.
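The shared-node heuristic above can be kept in one place so both root scripts use the same test. A sketch, using the portable [ ] form of the same [[ ]] test quoted in this ticket:

```shell
# is_shared: succeed (exit 0) when the job is presumed to share its node.
# Site heuristic from this ticket: only single-task, single-CPU,
# single-node jobs are co-scheduled; anything larger is node-exclusive.
is_shared() {
    # $1 = NumTasks, $2 = NumCPUs, $3 = NumNodes
    [ "$1" -eq 1 ] && [ "$2" -eq 1 ] && [ "$3" -eq 1 ]
}
```

Usage in a prolog/epilog might then read `if is_shared "$NUMTASKS" "$NUMCPUS" "$NUMNODES"; then ...; fi` to decide whether full-node cleanup may run.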
I would love to give up scontrol if I could be provided with the necessary information at each node in both the prolog and epilog. We do use the output of "scontrol show job" for troubleshooting purposes by logging that info to the syslog. It would be helpful if SchedMD would simply provide that info to each node upon allocation. That would solve this issue altogether.

> If you describe which data you need, perhaps we will be able to add them to
> prolog and epilog environment in 20.11.

See above.

> Have you checked the high throughput guide?
> https://slurm.schedmd.com/high_throughput.html

For the most part, I've been pretty cognizant of this document - at least for the proc filesystem:

cat /proc/sys/fs/file-max
19468965
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
2048
cat /proc/sys/net/ipv4/tcp_syncookies
1
cat /proc/sys/net/ipv4/tcp_synack_retries
5
cat /proc/sys/net/core/somaxconn
1024
cat /proc/sys/net/ipv4/ip_local_port_range
32768 60999

The exception is txqueuelen, which at the moment remains at 1000. If you think this is a contributor to my issues, we would have to make a cluster-wide change, as I would think it should apply to all slurmd nodes as well as the controllers.

As for slurmctld limits, most are set to unlimited (as you can see below), but the open file count might need some tweaking. It is set in systemd to LimitNOFILE=65536. Is this sufficient? Do you see anything else that should be tuned?
[root@bqs7 ~]# cat /proc/2645/limits
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            unlimited     unlimited     bytes
Max core file size        unlimited     unlimited     bytes
Max resident set          unlimited     unlimited     bytes
Max processes             766912        766912        processes
Max open files            65536         65536         files
Max locked memory         65536         65536         bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       766912        766912        signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us

I'll take some time to review the slurm.conf configuration. Most of the settings we have were reviewed by Tim Wickberg about two years ago, but it's time for another review now that we have more experience under our belt.

As for slurmdbd, we don't purge any of our data; the customer wants the data to be persistent for the life of the system. If you think this is a major contributor, we can have that discussion with the customer - please let me know. We don't use CommitDelay, but may consider setting it to 1 for testing. The customer is very risk averse and does not want to lose any data.
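The kernel settings quoted earlier in this comment could be pinned in an /etc/sysctl.d fragment so they survive reboots. A sketch only: these values are simply the current ones reported above, not tuned recommendations, and the fragment path is my choice - review everything against the SchedMD high-throughput guide before rolling it out cluster-wide.

```shell
# write_sysctl_fragment: write the current high-throughput-relevant kernel
# settings (values as reported in this ticket, NOT SchedMD-recommended
# numbers) to the sysctl.d fragment path given as $1.
write_sysctl_fragment() {
    cat > "$1" <<'EOF'
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 5
net.core.somaxconn = 1024
EOF
}
# Apply with, e.g.:
#   write_sysctl_fragment /etc/sysctl.d/90-slurm-throughput.conf
#   sysctl --system
```

Note that txqueuelen is a per-interface setting (changed with `ip link set dev <iface> txqueuelen <n>`), not a sysctl, which is why it needs a separate cluster-wide change.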
Hi,

Slurm doesn't have "local" RPCs; all requests are threaded the same way. An "scontrol show job <>" request requires a job read lock, and it interrupts the scheduler/backfill. Besides that, PrologSlurmctld creates a separate slurmctld thread for every job start, which also requires acquiring a job write lock. This is a costly operation that can severely limit scheduler throughput.

Dominik
Hi,

I don't think slurmdbd is causing any problems in this case. For sure, enabling the slurmctld prolog can significantly limit scheduling throughput.

We have a long-term plan to extend and unify the set of environment variables available in prolog/epilog scripts. But for now, I recommend limiting the use of "scontrol show job" in scripts to a minimum.

The cons_tres plugin can also be notably slower than cons_res (e.g. with preempt/qos). Do you plan to use features exclusively available in cons_tres?

Does disabling the slurmctld prolog help?

Dominik
(In reply to Dominik Bartkiewicz from comment #22)

Hi Dominik,

> For sure, enabling slurmctld prolog can significantly limit scheduling
> throughput.
> Does disabling slurmctld prolog help?

Yes - I've disabled both the prolog and epilog slurmctld scripts. It did provide a more stable environment. Thank you for your hints there.

> We have a long term plan to extend and unified set of env available in
> prolog epilog scripts. But for now, I recommend limit using "scontrol show
> job" in scripts to a minimum.

Yes. It is now limited to only the root prolog and epilog, and we use it only once per script. This is still troublesome: when many jobs try to reach the controller at the same time, or even a single job with many nodes - each node making a call to the slurmctld at the same time - there is a noticeable impact on the responsiveness of the controller.

Will any of the "long term plan ... env" work be available in 20.11? Our inability to efficiently determine whether a job is on a shared node (in the prolog and epilog) is really affecting us. It would be very helpful to have this capability sooner rather than later.

> cons_tres plugin can also be notable slower (eg:. preempt/qos) then cons_res.
> Do you plan to use features exclusively available in cons_tres?

We do use preemption on a limited basis and expect the need for it to increase in the future. I know the customer wants to take advantage of many more of Slurm's features.

Tony.
(In reply to Anthony DelSorbo from comment #23)
>
> Yes. It is now limited to only the root prolog and epilog. We use it only
> once per script. This is still troublesome as when there are many jobs
> trying to get at the controller at the same time, or even a job with many
> nodes - each making a call to the slurmctld at the same time, there is a
> noticeable impact on the responsiveness of the controller.

Could you send me the output from sdiag and take another perf capture?

> Will any of the "long term plan ... env" be available in 20.11?

We can add a particular field in 20.11, but a bigger rewrite of this code will be done in 21.08.

preempt/qos is available in both select plugins, but it can work much slower in cons_tres. Do you know which of the features they are considering?

Dominik
Hi,

Any news? Do you still notice problems during the stress tests? If yes, could you send me the data mentioned in the previous comment?

Dominik
Hi,

Any news? If there is no update here, I will close this in a few days as a timeout.

Dominik
(In reply to Dominik Bartkiewicz from comment #26)
> Hi
>
> Any news?

Dominik - Apologies for the delay. The test system is still being used to test the 20.11 pre-release for the enhancements we requested, and I'm unsure when I will get the system back.

I can tell you that since we prevented the slurmctld pro/epilog scripts from running (since they do "scontrol show job" too), it performs much better. But again, I am now unable to use a feature of Slurm because of this performance degradation.

Is there any news on your end on providing the necessary job information to the pro/epilog scripts at the nodes (and the controller) so that we don't have to query the controller to get it?

Best,

Tony.
Hi,

If you give me a list of the fields/information required in the prolog, I will try to add them to the available environment variables. Unfortunately, adding env to the epilog is more complicated, and if you can find an alternative way to obtain this info, that would be best.

Dominik
Hi,

You can get a list of the jobs and steps currently running on a slurmd host with 'scontrol show slurmd'. This connects directly to slurmd and has zero impact on slurmctld, e.g.:

SLURMD_NODENAME=<host_name> scontrol show slurmd | grep 'Active Steps'

Dominik
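Building on this, a prolog/epilog could count how many distinct jobs remain on the node by parsing that line locally. A sketch only: the exact formatting of the "Active Steps" line (comma-separated <jobid>.<step> entries, or "none") is an assumption here and should be verified against the Slurm version in use.

```shell
# count_active_jobs: count distinct job IDs on the "Active Steps" line of
# `scontrol show slurmd` output read from stdin. The line format assumed
# here (e.g. "Active Steps: 3638483.batch,3638484.0" or
# "Active Steps: none") is not confirmed by this ticket -- verify it.
count_active_jobs() {
    awk -F': *' '/Active Steps/ {
        if ($2 == "none" || $2 == "") { print 0; exit }
        n = split($2, steps, ",")
        for (i = 1; i <= n; i++) {
            split(steps[i], parts, ".")   # keep only the job ID
            jobs[parts[1]] = 1
        }
        c = 0
        for (j in jobs) c++
        print c
        exit
    }'
}
```

An epilog might then run `scontrol show slurmd | count_active_jobs` and skip full-node cleanup while the count is above zero.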
(In reply to Dominik Bartkiewicz from comment #29)
> Hi
>
> You can get a list of job and step currently running on slurmd host by
> 'scontrol show slurmd'. This connects directly to slurmd and has zero impact
> on slurmctld.

Now, that's interesting - I did not know that. I thought every "scontrol" command went through the slurmctld. I'll need to investigate that. Thanks for that valuable information!

Here's what we currently key on in the pro/epilogs:

STDOUT=${jobData['StdOut']}
EXITCODE=${jobData['ExitCode']}
PARTITION=${jobData['Partition']}
NUMNODES=${jobData['NumNodes']}
NUMCPUS=${jobData['NumCPUs']}
NUMTASKS=${jobData['NumTasks']}

If these can be made available in both the prolog and epilog, then we would be able to forego the "scontrol show job" call, even though all the other job info is useful for troubleshooting. The one piece of information I really need is whether or not the job is on a shared node. If that can be made available, it would make what we do so much easier.

Tony.
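For fields like the ones listed above, a script can at least limit itself to a single scontrol call per job and extract values from its Key=Value output. A sketch: the helper name job_field is hypothetical, and it assumes scontrol's space-separated Key=Value token layout (values containing spaces, such as some paths, would need sturdier parsing).

```shell
# job_field: print the value of one Key=Value token from `scontrol show
# job` output read on stdin. Helper name and parsing are a sketch; it
# assumes space-separated Key=Value tokens and breaks on values that
# themselves contain spaces.
job_field() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}
```

With that, a prolog could capture `scontrol show job "$SLURM_JOB_ID"` once into a variable and pipe it through job_field for each of Partition, NumCPUs, NumTasks, ExitCode, and StdOut.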
Just to chime in, we would find it useful, and could significantly reduce the complexity of our prolog scripts, if the following fields from a job's attributes were available to the script without a call to 'scontrol show job':

AdminComment - to pass site-specific JSON to the job script
AllocNode - to determine the entry point to the cluster, implying security policy
Feature - to determine a secondary job allocation that may not be part of the common scheduling policy
Requeue - to affect behavior on errors encountered within the prolog
Reservation - to influence decisions based on the above

Ideally, if a job were running within a reservation, the reservation attributes would also be extremely useful, especially:

State - to determine if we kept running after the reservation finished
MaxStartDelay - to influence the overlap-into-a-reservation / cancel policy
EndTime - to influence the signal-catching & checkpoint behavior remaining
Hi,

Sorry for the late response. Unfortunately, after talking with Tim, we decided that we can't add any new env to 20.11. As I mentioned before, we plan to rewrite this part of the code in 21.08. After that, both the prolog and epilog will have the same, bigger set of environment variables.

Dominik
Hi,

Please let me know if there is anything else I can do to help, or if this is OK to close.

Dominik
(In reply to Dominik Bartkiewicz from comment #35)

Dominik,

We were disappointed to read your response in comment 34, to be sure. We need your support in providing systems engineers with the job information needed in the prologs and epilogs in order to make appropriate decisions on how to manage the nodes. If you have a roadmap of when those features would be available, so that we can inform our management team, that would be helpful.

Tony.
Hi Tony -

I'm updating some details on this ticket to reflect an outstanding request to expand the details available in the prolog/epilog, and to ensure those are synced up where appropriate.

While this may seem like a simple request, it's unfortunately a bit complicated due to our architecture and the current RPC patterns. We cannot expand the prolog without some significant refactoring. While I'd like to tackle that at some point, without sponsorship I can make no guarantee of a timeframe for any related changes.

If sponsoring such work is of interest to NOAA, I can get an SoW over to you sometime in January that would ensure this is done by the 21.08 release next August.

- Tim
(In reply to Tim Wickberg from comment #38)
>
> If sponsoring such work is of interest to NOAA, I can get an SoW over to you
> sometime in January that would ensure this is done by the 21.08 release next
> August.

Tim,

Thanks for getting back to us on this. We'd be interested, but will need time to get it through the process. So the sooner you can get it to us - so that we can have it reviewed by our team, with a bit of coordination and review by your team - the better. As you know, the process here takes time for approval and allocation of funds.

Best,

Tony
(In reply to Anthony DelSorbo from comment #39)
> (In reply to Tim Wickberg from comment #38)
> >
> > If sponsoring such work is of interest to NOAA, I can get an SoW over to you
> > sometime in January that would ensure this is done by the 21.08 release next

Tim,

If you sent me an SOW on this, I must have missed it. But we're still interested. Would you be able to resend it so that we can try to work this into our next fiscal cycle?

Thanks,

Tony.
Work related to this ticket was performed for another customer as part of bug 12110, and will be available in 22.05 when released. Marking this closed as a duplicate of that later ticket.

- Tim

*** This ticket has been marked as a duplicate of ticket 12110 ***