Ticket 4304

Summary: Socket timed out on send/recv operation
Product: Slurm Reporter: Kevin WONG <kevin.k.wong>
Component: slurmctld Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: felip.moll
Version: 17.02.2   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6030
Site: BNP Paribas
Version Fixed: 17.02.2
Attachments: slurm.conf and slurmctld.log

Description Kevin WONG 2017-10-25 04:46:06 MDT
Created attachment 5427 [details]
slurm.conf and slurmctld.log

Hi Support,

The "Socket timed out on send/recv operation" error happens from time to time (users report it many times per day). It happens with simple commands like 'squeue' or when users try to submit jobs with 'sbatch'.

Could you help verify our configuration and try to correlate any problem from the log? I have increased the log level to SlurmctldDebug=6 for this case.

For example, it happened today (timestamp: 2017-10-25 16:30:06.057624) on the client side. The clocks are well synchronized between the client and the master. I have attached both slurm.conf and slurmctld.log (the full log is huge, so I have extracted one hour of it).

Thanks a lot!

Regards,
Kevin Wong
Comment 1 Felip Moll 2017-10-25 05:56:44 MDT
Would it be possible to get an "sdiag" output?

I see in the log some entries complaining about large processing time, i.e.:

[2017-10-25T16:00:11.353] debug:  Note large processing time from _slurm_rpc_dump_jobs: usec=1727436 began=16:00:09.625
[2017-10-25T16:00:11.353] debug:  Note large processing time from _slurm_rpc_dump_nodes: usec=2021955 began=16:00:09.331
[2017-10-25T16:00:11.422] debug:  Note large processing time from _slurm_rpc_dump_jobs: usec=1187098 began=16:00:10.235
[2017-10-25T16:00:11.515] debug:  Note large processing time from _slurm_rpc_dump_jobs: usec=1207597 began=16:00:10.308

These operations in the controller take seconds, and some could exceed MessageTimeout, which is 10 seconds in your case since you haven't set it and that is the default. Increasing MessageTimeout could be a workaround, but not the solution.
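As a concrete illustration of that workaround (the value here is an example, not a recommendation for your site), the relevant slurm.conf line would look like:

```
# slurm.conf fragment - illustrative only; raising the timeout masks
# slow RPC handling rather than fixing it (default is 10 s when unset)
MessageTimeout=20
```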

I suppose this is a local filesystem: StateSaveLocation=/localdisk/slurm/ctld. Having StateSaveLocation on a shared filesystem could be a problem, but I also see that you use this path to store job completion information, and if it is also shared with the slurmd spool dir, that could be a problem.

Could you see the I/O stats of this localdisk at OS level?

What was the debug level before you changed it to 6? Setting the level to debug5 (6) is extremely verbose, much of the information is not useful, and it adds overhead to the daemon. On a very busy system even 'debug' may be too high, though it should be fine in most environments; better yet, just set the level to 'info'.
Comment 2 Kevin WONG 2017-10-25 19:16:01 MDT
(In reply to Felip Moll from comment #1)
> Would it be possible to get an "sdiag" output?

I put the sdiag output as text at the end of this reply. It was generated just this morning; however, I ran 'scontrol reconfigure' earlier this morning, and I hope that didn't reset the values you are looking for.

> 
> I see in the log some entries complaining about large processing time, i.e.:
> 
> [2017-10-25T16:00:11.353] debug:  Note large processing time from
> _slurm_rpc_dump_jobs: usec=1727436 began=16:00:09.625
> [2017-10-25T16:00:11.353] debug:  Note large processing time from
> _slurm_rpc_dump_nodes: usec=2021955 began=16:00:09.331
> [2017-10-25T16:00:11.422] debug:  Note large processing time from
> _slurm_rpc_dump_jobs: usec=1187098 began=16:00:10.235
> [2017-10-25T16:00:11.515] debug:  Note large processing time from
> _slurm_rpc_dump_jobs: usec=1207597 began=16:00:10.308
> 
> This operations in the controller takes seconds and some could exceed the
> MessageTimeout, that is 10 seconds for you since you haven't set it
> and it is the default. Increasing MessageTimeout could be a workaround but
> not the solution.

Just to try the workaround, I have changed MessageTimeout to 20 for now.

> 
> I suppose that this is a local filesystem:
> StateSaveLocation=/localdisk/slurm/ctld  , having the StateSaveLoc in a
> shared fs could be a problem but I also see that you use this path to store
> job completion information, and if its shared also for slurmd spool dir.
> This could be a problem.

Yes, this is a local ext4 filesystem on top of LVM. The hardware is RAID-1 with a pair of 10k rpm disks. This host runs only Slurm and the mysqld for Slurm. The responsiveness of the filesystem seems very good.

> 
> Could you see the I/O stats of this localdisk at OS level?

I don't notice any particular issue in the I/O stats. But of course, when the timeout happened I was not watching them ;-)

> 
> How did you have the debug level before changing to 6? Setting debug to
> debug5 (6) is extremelly verbose and some informations are not useful, plus
> it causes an overhead to the daemon. On a very busy system 'debug' may be
> too high, but should be good in most environments, or even better just set
> level to 'info'.

The debug level was set to '5'. Now I have changed it to 'info'.

A few more questions.

We have SchedulerParameters=sched_min_interval=1. How should I tune this value for this socket timeout issue?

Most of the clients still use the slurm-torque wrapper binaries to talk to the master. Do these binaries slow down slurmctld much (e.g. qstat vs squeue)?

We may have to change MaxArraySize to 1000000 soon. If there is no big change in the overall number of jobs per day, will changing it to 1000000 make the timeout situation worse?


*******************************************************
sdiag output at Thu Oct 26 08:42:22 2017
Data since      Thu Oct 26 08:00:00 2017
*******************************************************
Server thread count: 4
Agent queue size:    0

Jobs submitted: 3873
Jobs started:   4326
Jobs completed: 5155
Jobs canceled:  0
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:   782
        Max cycle:    42501
        Total cycles: 5175
        Mean cycle:   1184
        Mean depth cycle:  16
        Cycles per minute: 123
        Last queue length: 0

Backfilling stats
        Total backfilled jobs (since last slurm start): 5432
        Total backfilled jobs (since last stats cycle start): 106
        Total cycles: 26
        Last cycle when: Thu Oct 26 08:40:25 2017
        Last cycle: 3585
        Max cycle:  2309029
        Mean cycle: 731765
        Last depth cycle: 4
        Last depth cycle (try sched): 4
        Depth Mean: 144
        Depth Mean (try depth): 144
        Last queue length: 4
        Queue length mean: 242

Remote Procedure Call statistics by message type
        REQUEST_JOB_INFO                        ( 2003) count:2247219 ave_time:487830 total_time:1096262750273
        REQUEST_NODE_INFO                       ( 2007) count:2127723 ave_time:22397  total_time:47655723557
        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:256905 ave_time:83462  total_time:21442012727
        REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:177022 ave_time:40186  total_time:7113983040
        REQUEST_PARTITION_INFO                  ( 2009) count:123153 ave_time:3001   total_time:369603946
        REQUEST_JOB_INFO_SINGLE                 ( 2021) count:54685  ave_time:50468  total_time:2759890770
        REQUEST_UPDATE_JOB                      ( 3001) count:32772  ave_time:34382  total_time:1126797038
        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:24593  ave_time:2677   total_time:65845643
        REQUEST_RESERVATION_INFO                ( 2024) count:17426  ave_time:36560  total_time:637098420
        REQUEST_PING                            ( 1008) count:8770   ave_time:135    total_time:1187254
        REQUEST_STATS_INFO                      ( 2035) count:8762   ave_time:2042   total_time:17893244
        REQUEST_CANCEL_JOB_STEP                 ( 5005) count:2889   ave_time:21023  total_time:60735905
        MESSAGE_EPILOG_COMPLETE                 ( 6012) count:2725   ave_time:21891  total_time:59653905
        REQUEST_JOB_NOTIFY                      ( 4022) count:195    ave_time:84189  total_time:16417035
        REQUEST_JOB_REQUEUE                     ( 5023) count:6      ave_time:128763 total_time:772582
        REQUEST_RECONFIGURE                     ( 1003) count:2      ave_time:207833 total_time:415667
        REQUEST_TRIGGER_PULL                    ( 2030) count:1      ave_time:104    total_time:104

Remote Procedure Call statistics by user
        nikbt2_t        (   10006) count:2736579 ave_time:390197 total_time:1067806297927
        v_bazine        (    2021) count:1436329 ave_time:46677  total_time:67044764669
        root            (       0) count:601245 ave_time:51332  total_time:30863613022
        y_chamich       (  994428) count:95794  ave_time:19649  total_time:1882303428
        a_graton        (    2023) count:50852  ave_time:52013  total_time:2644991404
        e_geffroy       (  311066) count:46082  ave_time:30536  total_time:1407201554
        slurm           (     411) count:41543  ave_time:27152  total_time:1127984396
        m_haboucha      (  985883) count:21629  ave_time:72940  total_time:1577634152
        a_clairon       (  962434) count:21391  ave_time:103675 total_time:2217719128
        h_azzouz        (    1175) count:15512  ave_time:34169  total_time:530036982
        mouchbt         (     171) count:9015   ave_time:8868   total_time:79948977
        m_durant        (     203) count:3219   ave_time:29728  total_time:95694730
        modelmgr_t      (    3025) count:2899   ave_time:45484  total_time:131859083
        m_maupe         (  529933) count:1566   ave_time:81294  total_time:127307031
        r_huang         (  992184) count:1011   ave_time:41633  total_time:42091240
        gadmin          (   50005) count:88     ave_time:125936 total_time:11082414
        l_berget        ( 3166826) count:88     ave_time:2584   total_time:227466
        d_wang          ( 3140314) count:6      ave_time:3917   total_time:23507
Comment 3 Felip Moll 2017-10-26 04:39:59 MDT
> morning. I hope that didn't remove the values that you are looking for.

It would be better to have the sdiag output from the time the problem arises, but from what you sent we can make some assumptions.


> Just to experiment the workaround, I have changed MessageTimeout to 20 now.

OK, but we should try to solve the issue first without changing MessageTimeout.


> I don't notice any particular issue in the I/O stats. But of course when the
> timeout happened, I was not watching the stats ;-)

Good. If it happens again, just take a quick look, but given the hardware description you gave I think it should be fine.



> The debug level was set to '5'. Now I have changed it to 'info'.
> 

Wow, this was the maximum. This is important and can be at least part of the cause of the problems. Keep it at 'info'.

Remember you can change the debug level on the fly with "scontrol setdebug debug2" if needed at a particular time.


> SchedulerParameters=sched_min_interval=1, how should I tune this value for
> this issue of socket timeout?

The bigger the value, the lower the execution frequency of the scheduler. If you have a high rate of submissions/terminations, the main scheduling loop will run and test pending jobs very often, and this can indeed cause performance problems. By default it is disabled in 17.02, meaning the scheduler is triggered on every event.

The units are microseconds. If your submission/termination rate grows a lot, consider increasing it to, for example, sched_min_interval=500000 (0.5 seconds). It is somewhat heuristic, so you will have to experiment a little.
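To make the units concrete, here is a small back-of-the-envelope sketch (plain Python, nothing Slurm-specific; the interval values are just examples) of how sched_min_interval caps the main scheduler's run frequency:

```python
# sched_min_interval is a floor, in microseconds, between consecutive runs
# of the main scheduling loop; it therefore caps runs per minute.
def max_sched_runs_per_minute(sched_min_interval_us: int) -> float:
    """Upper bound on main-scheduler invocations per minute."""
    if sched_min_interval_us <= 0:
        return float("inf")  # no floor: purely event-driven
    return 60_000_000 / sched_min_interval_us

print(max_sched_runs_per_minute(1))        # current setting: effectively uncapped
print(max_sched_runs_per_minute(500_000))  # 0.5 s floor -> at most 120 runs/min
```

For comparison, the sdiag output in comment 2 reports 123 main-scheduler cycles per minute, so a 0.5 s floor would already throttle it slightly.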


> Most of the clients are still using the binaries of slurm-torque to talk to
> the master. Do these binaries slow down slurmctld a lot (e.g. qstat vs
> squeue)?

I don't think so. Some of these wrappers use the Slurm tools directly and some use the API, but it shouldn't be a problem.


> We may have to change MaxArraySize to 1000000 soon. If there is no big
> change in the overall number of jobs per day, will the change of 1000000
> make the timeout situation worse?
> 

Probably yes, if the user-submitted array sizes really do increase; 1000000 is a huge jump.

You will probably need some scheduler tuning, and you will also want to increase MaxJobCount, since each array task counts as a separate job.

I recommend reading through the possible SchedulerParameters and the high-throughput computing guide.


> *******************************************************
> sdiag output at Thu Oct 26 08:42:22 2017
> Data since      Thu Oct 26 08:00:00 2017
> *******************************************************

Everything seems fine until here.


> Remote Procedure Call statistics by message type
>         REQUEST_JOB_INFO                        ( 2003) count:2247219
> ave_time:487830 total_time:1096262750273
>         REQUEST_NODE_INFO                       ( 2007) count:2127723
> ave_time:22397  total_time:47655723557


> Remote Procedure Call statistics by user
>         nikbt2_t        (   10006) count:2736579 ave_time:390197
> total_time:1067806297927
>         v_bazine        (    2021) count:1436329 ave_time:46677 
> total_time:67044764669
>         root            (       0) count:601245 ave_time:51332 
> total_time:30863613022



In these lines I see very large counts for REQUEST_JOB_INFO and REQUEST_NODE_INFO. The users nikbt2_t and v_bazine are issuing a lot of queries to the controller. This can be an issue.

Can you investigate what these users are doing? 

It seems to me that it could be a script that monitors jobs and their associated nodes, or something similar. On another system with thousands of nodes, the heaviest user reached around ~1 million RPCs after 18 hours of execution. Your users have 2.7 and 1.4 million queries in 42 minutes.
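For scale, here is a quick rate calculation (plain Python; the counts are copied from the sdiag "by user" section above, over the reported 08:00:00 to 08:42:22 window):

```python
# Per-second RPC rate of the top users over the 2542 s sdiag window.
window_s = 42 * 60 + 22  # 08:00:00 -> 08:42:22

rpc_counts = {
    "nikbt2_t": 2_736_579,
    "v_bazine": 1_436_329,
    "root":       601_245,
}

for user, count in rpc_counts.items():
    print(f"{user:10s} {count / window_s:7.1f} RPCs/s")
```

At roughly a thousand job/node-info RPCs per second from a single user, slow _slurm_rpc_dump_jobs responses and the resulting client-side timeouts are unsurprising.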


Let me know if it is still happening with the changes and all the provided info.
Comment 5 Felip Moll 2017-10-26 06:32:42 MDT
(In reply to Felip Moll from comment #3)
> > The debug level was set to '5'. Now I have changed it to 'info'.
> 
> Wow, this was the maximum. This is important and can be at least part of the
> cause of the problems. Keep it at 'info'.
My mistake here: level 5 is "debug", not "debug5", so maybe it wasn't so harmful.

I think it is clearer to use the named debug levels: "quiet", "fatal", "error", "info", "verbose", "debug", "debug2", "debug3", "debug4", "debug5".

Let's see how it progresses anyway; just keep me up to date.
Comment 6 Felip Moll 2017-11-02 10:24:39 MDT
Hi Kevin,

Have you had a chance to go through the comments, make the modifications to the config, and see how it behaves now?

Thank you,
Felip M
Comment 7 Felip Moll 2017-11-07 06:04:28 MST
Hello Kevin, 

Are you still receiving the error messages after applying the modifications?

Thanks,
Felip M
Comment 8 Kevin WONG 2017-11-08 00:45:49 MST
(In reply to Felip Moll from comment #7)
> Hello Kevin, 
> 
> Are you still receiving the error messages after applying the modifications?
> 
> Thanks,
> Felip M

There have been no more reports of socket timeouts since we applied the modifications. Thanks for the hints!

In addition to the MaxArraySize question: does MaxJobCount have to be greater than MaxArraySize? What will happen when a user tries to submit jobs while MaxJobCount is just 50% of MaxArraySize?

Regards,
Kevin
Comment 9 Felip Moll 2017-11-08 02:26:00 MST
(In reply to Kevin WONG from comment #8)
> (In reply to Felip Moll from comment #7)
> > Hello Kevin, 
> > 
> > Are you still receiving the error messages after applying the modifications?
> > 
> > Thanks,
> > Felip M
> 
> There is no more report of socket timeout after we applied the
> modifications. Thanks for the hints!
> 
> Additional to the question MaxArraySize, is MaxJobCount has to be greater
> than MaxArraySize? What will happen when user tries to submit jobs while
> MaxJobCount is just 50% of MaxArraySize?
> 
> Regards,
> Kevin

I am glad the problem is solved.

Extracted from 'man slurm.conf':

MaxJobCount defines the maximum number of jobs Slurm can have in its active database at one time. The value of MaxJobCount should be much larger than MaxArraySize. The default value is 1001. Once MaxJobCount limit is reached, requests to submit additional jobs will fail.

So, because each task of a job array counts as one job, if a user submits an array larger than MaxJobCount, the request will fail.

You must increase MaxJobCount to be larger than MaxArraySize.
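As a sketch only (illustrative numbers, not values tuned for your site), the pairing in slurm.conf might look like:

```
# slurm.conf fragment - example values only
MaxArraySize=1000000    # largest index usable in a single job array
MaxJobCount=2000000     # should be well above MaxArraySize; each array
                        # task counts as one job against this limit
```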


If that is OK with you, may I close the ticket?

Best regards
Felip M
Comment 10 Felip Moll 2017-11-14 11:13:53 MST

Hi Kevin,

I will assume everything is solved and close the ticket. If you think we need to keep it open, please just comment to that effect.

Best regards,
Felip M