Make batchSize configurable or increase the value to improve performance #4314
Comments
Related issue: #3439
This is interesting for us too: we have tables with 500M+ rows, and this would certainly help with extraction speed.
@davinchia @marcosmarxm where are we at with this? This seems like a massive benefit to the majority of users.
Gathering from the feedback from everyone on this thread, would people prefer this on a per-connection, per-stream, or global basis?
@davinchia what do you think about making this a parameter that can be updated in the connector's specification? That way, for the first sync you can prepare your source database for heavier requests, and once you switch to incremental syncs you can reduce the batch size again.
That's a good idea. This might affect other syncs using the same spec, so a user would have to manage that carefully. I was thinking a per-stream setting makes the most sense, but that would require us to change part of the catalog, which is more effort. Maybe adding it to the spec is sufficient. @jrhizor @sherifnada ?
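For illustration only, a minimal sketch of what "adding it to the spec" could look like in the connector code, assuming a hypothetical `fetch_size` config field (this field does not exist in the current spec):

```java
import com.fasterxml.jackson.databind.JsonNode;

// Hypothetical sketch: read a user-supplied fetch size from the connector's
// config JSON and fall back to the current default. The "fetch_size" field
// name is an assumption, not an existing Airbyte spec property.
public final class ConfigurableFetchSize {

  private static final int DEFAULT_FETCH_SIZE = 1_000;

  public static int fromConfig(JsonNode config) {
    return config.hasNonNull("fetch_size")
        ? config.get("fetch_size").asInt(DEFAULT_FETCH_SIZE)
        : DEFAULT_FETCH_SIZE;
  }
}
```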
It's reasonable to make this configurable, but the default should also be increased significantly. For the default, we should also consider estimating memory usage and using that as the limit instead of the number of rows.
This issue is addressed in #12400, which uses a dynamic batch size based on how large the average row is in each table. We cannot set one batch size per connector, because each table may require a different size.
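As a rough illustration of the dynamic-sizing idea described above (not the actual implementation in #12400), a batch size can be derived from a memory budget divided by the estimated average row size; the constants and names below are assumptions:

```java
// Illustrative sketch only: choose a per-table fetch size from an assumed
// memory budget and an estimated average row size, clamped to sane bounds.
public final class DynamicFetchSize {

  private static final long MEMORY_BUDGET_BYTES = 200L * 1024 * 1024; // assumed per-stream budget
  private static final int MIN_FETCH_SIZE = 1_000;
  private static final int MAX_FETCH_SIZE = 1_000_000;

  /** Returns a fetch size that keeps one batch roughly within the memory budget. */
  public static int forAverageRowSize(long avgRowSizeBytes) {
    if (avgRowSizeBytes <= 0) {
      return MIN_FETCH_SIZE; // fall back when no estimate is available
    }
    long rows = MEMORY_BUDGET_BYTES / avgRowSizeBytes;
    return (int) Math.max(MIN_FETCH_SIZE, Math.min(MAX_FETCH_SIZE, rows));
  }
}
```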
Has anyone seen speed improvements? MSSQL -> Snowflake, 5.46 GB, 1.45 million rows.
@jbowlen, thank you for the feedback. This change is mainly meant to prevent out-of-memory issues; its impact on performance is not consistent. It does improve the runtime for some tables but has no effect on others, depending on the schema and the amount of data. We will do a more comprehensive analysis of database connector performance. The issue is here.
Tell us about the problem you're trying to solve
Increase sync performance from JDBC databases.
Describe the solution you’d like
An advanced settings page where I can configure the batchSize, or alternatively a larger default value.
This can be tricky because it depends on the source's resources.
Hudson made this modification and saw a good improvement by changing the default of 1000 to 50k: an 18M-row sync went from 55 minutes to only 10 minutes (see the sketch below).
Link to the Slack conversation
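For context, the batch size discussed here corresponds to the JDBC fetch size set on the statement. A minimal, self-contained sketch of applying a larger value; the connection URL, query, and table name are placeholders, and 50_000 is the value from the experiment mentioned above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal sketch of raising the JDBC fetch size; not the connector's actual code.
public final class FetchSizeExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/db", "user", "pass");
         PreparedStatement stmt = conn.prepareStatement("SELECT * FROM big_table")) {
      conn.setAutoCommit(false);  // some drivers (e.g. PostgreSQL) only honor fetch size with autocommit off
      stmt.setFetchSize(50_000);  // previous hard-coded default in the connector was 1000
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          // each next() call reads from an in-memory batch of at most fetchSize rows
        }
      }
    }
  }
}
```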
Describe the alternative you’ve considered or used
Continue using the default value
Additional context
Add any other context or screenshots about the feature request here.