Created attachment 5427 [details] slurm.conf and slurmctld.log

Hi Support,

Socket timeouts on send/recv operations happen from time to time (users report them many times per day). They occur for simple commands like 'squeue' or when users try to submit jobs with 'sbatch'. Could you help verify our configuration and try to correlate the problem with anything in the log? I have increased the log level to SlurmctldDebug=6 for this case.

For example, it happened today (timestamp: 2017-10-25 16:30:06.057624) on the client side. The time sync between the client and the master is quite accurate. I have attached both slurm.conf and slurmctld.log (the full log is huge, so I extracted one hour of it).

Thanks a lot!

Regards,
Kevin Wong
Would it be possible to get "sdiag" output?

I see some entries in the log complaining about large processing times, i.e.:

[2017-10-25T16:00:11.353] debug:  Note large processing time from _slurm_rpc_dump_jobs: usec=1727436 began=16:00:09.625
[2017-10-25T16:00:11.353] debug:  Note large processing time from _slurm_rpc_dump_nodes: usec=2021955 began=16:00:09.331
[2017-10-25T16:00:11.422] debug:  Note large processing time from _slurm_rpc_dump_jobs: usec=1187098 began=16:00:10.235
[2017-10-25T16:00:11.515] debug:  Note large processing time from _slurm_rpc_dump_jobs: usec=1207597 began=16:00:10.308

These operations in the controller take seconds, and some could exceed MessageTimeout, which is 10 seconds in your case since you haven't set it and that is the default. Increasing MessageTimeout could be a workaround, but it is not the solution.

I suppose that StateSaveLocation=/localdisk/slurm/ctld is a local filesystem. Having StateSaveLocation on a shared filesystem could be a problem, and I also see that you use this path to store job completion information, and possibly also for the slurmd spool dir if it is shared. This could be a problem. Could you check the I/O stats of this local disk at the OS level?

What debug level did you have before changing it to 6? Setting the debug level to debug5 (6) is extremely verbose, some of the information is not useful, and it adds overhead to the daemon. On a very busy system even 'debug' may be too high, but it should be fine in most environments; better yet, just set the level to 'info'.
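For reference, the adjustments discussed above would look like this in slurm.conf. The values shown are a sketch based on this thread, not tuned recommendations for every site:

```
# Workaround only: give slow RPCs more headroom (default is 10 seconds).
MessageTimeout=20

# Reduce controller logging overhead; 'info' is enough for normal operation.
SlurmctldDebug=info

# Keep controller state on fast local storage, not a shared filesystem.
StateSaveLocation=/localdisk/slurm/ctld
```

Note that MessageTimeout only masks slow RPC handling; the underlying load on slurmctld still needs to be addressed.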
(In reply to Felip Moll from comment #1)
> Would it be possible to get an "sdiag" output?

I have put the sdiag output as text at the end of this reply. It was generated only this morning. However, I ran 'scontrol reconfigure' earlier this morning; I hope that did not reset the values you are looking for.

> I see in the log some entries complaining about large processing time [...]
> These operations in the controller take seconds and some could exceed the
> MessageTimeout, that is 10 seconds for you since you haven't set it and it
> is the default. Increasing MessageTimeout could be a workaround but not the
> solution.

Just to experiment with the workaround, I have changed MessageTimeout to 20 for now.

> I suppose that this is a local filesystem:
> StateSaveLocation=/localdisk/slurm/ctld [...] This could be a problem.

Yes, this is a local ext4 filesystem on top of LVM. The hardware is RAID-1 with a pair of 10krpm disks. This host runs only Slurm and the mysqld for Slurm. The responsiveness of the filesystem seems very good.

> Could you see the I/O stats of this localdisk at OS level?

I don't notice any particular issue in the I/O stats. But of course, when the timeout happened I was not watching the stats ;-)

> How did you have the debug level before changing to 6? [...] or even better
> just set level to 'info'.

The debug level was set to '5'. Now I have changed it to 'info'.

A few more questions:

SchedulerParameters=sched_min_interval=1: how should I tune this value for this socket timeout issue?

Most of the clients are still using the slurm-torque binaries to talk to the master. Do these binaries slow down slurmctld a lot (e.g. qstat vs squeue)?

We may have to change MaxArraySize to 1000000 soon. If there is no big change in the overall number of jobs per day, will the change to 1000000 make the timeout situation worse?

*******************************************************
sdiag output at Thu Oct 26 08:42:22 2017
Data since      Thu Oct 26 08:00:00 2017
*******************************************************
Server thread count: 4
Agent queue size:    0

Jobs submitted: 3873
Jobs started:   4326
Jobs completed: 5155
Jobs canceled:  0
Jobs failed:    0

Main schedule statistics (microseconds):
    Last cycle:        782
    Max cycle:         42501
    Total cycles:      5175
    Mean cycle:        1184
    Mean depth cycle:  16
    Cycles per minute: 123
    Last queue length: 0

Backfilling stats
    Total backfilled jobs (since last slurm start): 5432
    Total backfilled jobs (since last stats cycle start): 106
    Total cycles: 26
    Last cycle when: Thu Oct 26 08:40:25 2017
    Last cycle:   3585
    Max cycle:    2309029
    Mean cycle:   731765
    Last depth cycle: 4
    Last depth cycle (try sched): 4
    Depth Mean: 144
    Depth Mean (try depth): 144
    Last queue length: 4
    Queue length mean: 242

Remote Procedure Call statistics by message type
    REQUEST_JOB_INFO                 ( 2003) count:2247219 ave_time:487830 total_time:1096262750273
    REQUEST_NODE_INFO                ( 2007) count:2127723 ave_time:22397  total_time:47655723557
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:256905  ave_time:83462  total_time:21442012727
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:177022  ave_time:40186  total_time:7113983040
    REQUEST_PARTITION_INFO           ( 2009) count:123153  ave_time:3001   total_time:369603946
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:54685   ave_time:50468  total_time:2759890770
    REQUEST_UPDATE_JOB               ( 3001) count:32772   ave_time:34382  total_time:1126797038
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:24593   ave_time:2677   total_time:65845643
    REQUEST_RESERVATION_INFO         ( 2024) count:17426   ave_time:36560  total_time:637098420
    REQUEST_PING                     ( 1008) count:8770    ave_time:135    total_time:1187254
    REQUEST_STATS_INFO               ( 2035) count:8762    ave_time:2042   total_time:17893244
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:2889    ave_time:21023  total_time:60735905
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:2725    ave_time:21891  total_time:59653905
    REQUEST_JOB_NOTIFY               ( 4022) count:195     ave_time:84189  total_time:16417035
    REQUEST_JOB_REQUEUE              ( 5023) count:6       ave_time:128763 total_time:772582
    REQUEST_RECONFIGURE              ( 1003) count:2       ave_time:207833 total_time:415667
    REQUEST_TRIGGER_PULL             ( 2030) count:1       ave_time:104    total_time:104

Remote Procedure Call statistics by user
    nikbt2_t   ( 10006)   count:2736579 ave_time:390197 total_time:1067806297927
    v_bazine   ( 2021)    count:1436329 ave_time:46677  total_time:67044764669
    root       ( 0)       count:601245  ave_time:51332  total_time:30863613022
    y_chamich  ( 994428)  count:95794   ave_time:19649  total_time:1882303428
    a_graton   ( 2023)    count:50852   ave_time:52013  total_time:2644991404
    e_geffroy  ( 311066)  count:46082   ave_time:30536  total_time:1407201554
    slurm      ( 411)     count:41543   ave_time:27152  total_time:1127984396
    m_haboucha ( 985883)  count:21629   ave_time:72940  total_time:1577634152
    a_clairon  ( 962434)  count:21391   ave_time:103675 total_time:2217719128
    h_azzouz   ( 1175)    count:15512   ave_time:34169  total_time:530036982
    mouchbt    ( 171)     count:9015    ave_time:8868   total_time:79948977
    m_durant   ( 203)     count:3219    ave_time:29728  total_time:95694730
    modelmgr_t ( 3025)    count:2899    ave_time:45484  total_time:131859083
    m_maupe    ( 529933)  count:1566    ave_time:81294  total_time:127307031
    r_huang    ( 992184)  count:1011    ave_time:41633  total_time:42091240
    gadmin     ( 50005)   count:88      ave_time:125936 total_time:11082414
    l_berget   ( 3166826) count:88      ave_time:2584   total_time:227466
    d_wang     ( 3140314) count:6       ave_time:3917   total_time:23507
> I hope that didn't remove the values that you are looking for.

It would be better to have the sdiag output from the time the problem arises, but with what you sent we can make some assumptions.

> Just to experiment the workaround, I have changed MessageTimeout to 20 now.

OK, but we should try to solve the issue first without changing MessageTimeout.

> I don't notice any particular issue in the I/O stats. But of course when the
> timeout happened, I was not watching the stats ;-)

Good. If it happens again, just take a quick look, but given the hardware description you gave I think it should be fine.

> The debug level was set to '5'. Now I have changed it to 'info'.

Wow, that was the maximum. This is important and can be at least part of the cause of the problems. Keep it at info. Remember you can change the debug level on the fly with "scontrol setdebug debug2" if needed at a particular time.

> SchedulerParameters=sched_min_interval=1, how should I tune this value for
> this issue of socket timeout?

The bigger the value, the lower the execution frequency of the scheduler. If you have a high frequency of submissions/terminations, the main scheduling loop will execute and test pending jobs very often, which can indeed cause performance problems. By default it is deactivated in 17.02, meaning the scheduler is triggered on every event. The units are microseconds. If submissions/terminations increase a lot, consider setting, for example, sched_min_interval=500000 (0.5 seconds). It is somewhat heuristic, so you will have to experiment a little.

> Most of the clients are still using the binaries of slurm-torque to talk to
> the master. Do these binaries slow down slurmctld a lot (e.g. qstat vs
> squeue)?

I don't think so. Some of these wrappers use Slurm tools directly and some use the API, but it shouldn't be a problem.

> We may have to change MaxArraySize to 1000000 soon. If there is no big
> change in the overall number of jobs per day, will the change of 1000000
> make the timeout situation worse?

Probably yes, if the user-submitted array sizes really increase; 1000000 is a huge increase. You will probably need some tuning in the scheduler, and you will also want to increase MaxJobCount, since each array task counts as a separate job. I recommend reading through the possible parameters and the high-throughput guide.

> *******************************************************
> sdiag output at Thu Oct 26 08:42:22 2017
> Data since Thu Oct 26 08:00:00 2017
> *******************************************************

Everything seems fine up to here.

> Remote Procedure Call statistics by message type
>     REQUEST_JOB_INFO  ( 2003) count:2247219 ave_time:487830 total_time:1096262750273
>     REQUEST_NODE_INFO ( 2007) count:2127723 ave_time:22397  total_time:47655723557
>
> Remote Procedure Call statistics by user
>     nikbt2_t ( 10006) count:2736579 ave_time:390197 total_time:1067806297927
>     v_bazine ( 2021)  count:1436329 ave_time:46677  total_time:67044764669
>     root     ( 0)     count:601245  ave_time:51332  total_time:30863613022

In these lines I see very large counts for REQUEST_JOB_INFO and REQUEST_NODE_INFO. The users nikbt2_t and v_bazine are performing a lot of queries against the controller, and this can be an issue. Can you investigate what these users are doing? It looks like it could be a script that monitors jobs and their associated nodes, or something similar. On another system with thousands of nodes, after 18 hours of execution I saw a maximum of around ~1 million RPCs for the busiest user. Your users have 2.7 and 1.4 million queries in 42 minutes.

Let me know if it is still happening with the changes and all the provided info.
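To put those counts in perspective: the sdiag window above runs from 08:00:00 to 08:42:22, i.e. 2542 seconds, so a rough per-second RPC rate for the busiest user can be computed with plain shell arithmetic (a quick back-of-the-envelope sketch, not a Slurm tool):

```shell
# Window covered by the sdiag output above: 08:00:00 -> 08:42:22 = 2542 seconds.
count=2736579   # RPC count for user nikbt2_t from "RPC statistics by user"
window=2542     # seconds in the sdiag sampling window
rate=$((count / window))
echo "${rate} RPCs/sec"   # on the order of a thousand controller RPCs per second from one user
```

A sustained four-digit RPC rate from a single user is a strong hint of a polling script hammering the controller.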
(In reply to Felip Moll from comment #3)
> > The debug level was set to '5'. Now I have changed it to 'info'.
>
> Wow, this was the maximum. This is important and can be at least part of the
> cause for the problems. Keep it to info.
>
> Remember you can change the debug level on the fly with "scontrol setdebug
> debug2" if needed at a particular time.

My mistake here: level 5 is "debug", not "debug5", so maybe it wasn't so harmful. I think it is clearer to use the names rather than the numbers as debug levels: "quiet", "fatal", "error", "info", "verbose", "debug", "debug2", "debug3", "debug4", "debug5".

Let's see how it progresses anyway; just keep me up to date.
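As a quick illustration of the numeric-to-name mapping implied by that list (my assumption is that the numeric level is the 0-based index into it, which matches "level 5 is debug"):

```shell
# Debug level names in order; the numeric level is taken as the 0-based index.
# NOTE: this indexing is an assumption inferred from the list above.
levels=(quiet fatal error info verbose debug debug2 debug3 debug4 debug5)
echo "level 5 = ${levels[5]}"   # -> level 5 = debug
echo "level 9 = ${levels[9]}"   # -> level 9 = debug5
```

Under that reading, the originally configured SlurmctldDebug=6 would map to "debug2", still more verbose than the recommended "info".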
Hi Kevin,

Have you had a chance to review the comments above, apply the configuration changes, and see how it behaves now?

Thank you,
Felip M
Hello Kevin,

Are you still receiving the error messages after applying the modifications?

Thanks,
Felip M
(In reply to Felip Moll from comment #7)
> Are you still receiving the error messages after applying the modifications?

There have been no more reports of socket timeouts since we applied the modifications. Thanks for the hints!

One more question related to MaxArraySize: does MaxJobCount have to be greater than MaxArraySize? What will happen when a user tries to submit jobs while MaxJobCount is just 50% of MaxArraySize?

Regards,
Kevin
(In reply to Kevin WONG from comment #8)
> Additional to the question MaxArraySize, is MaxJobCount has to be greater
> than MaxArraySize? What will happen when user tries to submit jobs while
> MaxJobCount is just 50% of MaxArraySize?

I am glad the problem is solved.

Extracted from 'man slurm.conf':

MaxJobCount defines the maximum number of jobs Slurm can have in its active database at one time. The value of MaxJobCount should be much larger than MaxArraySize. The default value is 1001. Once the MaxJobCount limit is reached, requests to submit additional jobs will fail.

So, because each task of a job array counts as one job, if a user submits an array larger than MaxJobCount, the request will fail. You must increase MaxJobCount to be larger than MaxArraySize.

If you are OK with it, may I close the ticket?

Best regards,
Felip M
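For illustration, a consistent pair of limits for the planned array size might look like this in slurm.conf. The MaxJobCount value here is only an example to show the required relationship; size it to your actual daily job volume:

```
# Allow job arrays with up to 1000000 tasks (indices 0-999999).
MaxArraySize=1000000

# Must be much larger than MaxArraySize, since every array task counts
# as one job against this limit. Example value only; tune to your site.
MaxJobCount=2000000
```

With MaxJobCount at only 50% of MaxArraySize, a single full-size array submission would exceed the active-job limit and be rejected.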
(In reply to Felip Moll from comment #9)
> If you are ok, may I close the ticket?

Hi Kevin,

I will assume everything is solved and close the ticket. If you think we need to keep it open, please just comment here.

Best regards,
Felip M