
Make batchSize configurable or increase the value to improve performance #4314

Closed
marcosmarxm opened this issue Jun 24, 2021 · 10 comments · Fixed by #12400
Assignees
Labels
team/triage type/enhancement New feature or request

Comments

@marcosmarxm
Member

Tell us about the problem you're trying to solve

Increase performance when syncing from a JDBC database source.

Describe the solution you’d like

An advanced page where I can configure the batchSize, or maybe use a bigger default value.
This can be tricky because it depends on the source's resources.

Hudson made this modification and got a good improvement, changing the default of 1000 to 50k.
An 18M-row sync went from 55 min to only 10 min.
Link to the Slack conversation

Describe the alternative you’ve considered or used

Continue using the default value

Additional context

Add any other context or screenshots about the feature request here.
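
For context, the batchSize discussed here corresponds to the JDBC fetch size, i.e. how many rows the driver pulls from the database per round trip. Below is a minimal sketch of how a connector typically sets it; the connection URL, table, and 50k value are illustrative, not Airbyte's actual code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeExample {

  // Illustrative value: the larger fetch size discussed above
  // (the JDBC source default at the time was 1000).
  private static final int FETCH_SIZE = 50_000;

  public static void main(String[] args) throws Exception {
    try (Connection connection = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
      // Some drivers (e.g. PostgreSQL) only honor the fetch size and stream
      // results when auto-commit is disabled.
      connection.setAutoCommit(false);

      try (PreparedStatement statement = connection.prepareStatement("SELECT * FROM big_table")) {
        // The fetch size controls how many rows are pulled per round trip;
        // this is the "batchSize" discussed in this issue.
        statement.setFetchSize(FETCH_SIZE);

        try (ResultSet resultSet = statement.executeQuery()) {
          while (resultSet.next()) {
            // emit each row as a record...
          }
        }
      }
    }
  }
}
```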

@marcosmarxm marcosmarxm added the type/enhancement New feature or request label Jun 24, 2021
@davinchia
Contributor

Related issue #3439

@andresbravog
Copy link
Contributor

This is interesting for us too. We have tables with 500M+ rows, and this would certainly help with extraction speed.

@danieldiamond
Contributor

@davinchia @marcosmarxm where are we at with this? This seems like a massive benefit to the majority of users.

@davinchia
Contributor

Gathering feedback from everyone on the thread: would people prefer this on a per-connection, per-stream, or global basis?

@marcosmarxm
Member Author

@davinchia what do you think about making it a parameter that can be updated in the connector's specification? That way, for the first sync you can prepare your source database for heavier requests, and later, once you switch to incremental syncs, you can reduce the batch size.
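
A rough sketch of what reading such a spec-level parameter could look like, assuming a hypothetical optional fetch_size field in the connector config; the field name and default are illustrative, not part of any existing spec.

```java
import com.fasterxml.jackson.databind.JsonNode;

public class FetchSizeConfig {

  private static final int DEFAULT_FETCH_SIZE = 1_000;

  // Reads a hypothetical optional "fetch_size" field from the connector's
  // config JSON; falls back to the default when the user leaves it unset.
  public static int getFetchSize(final JsonNode config) {
    if (config.hasNonNull("fetch_size")) {
      return config.get("fetch_size").asInt(DEFAULT_FETCH_SIZE);
    }
    return DEFAULT_FETCH_SIZE;
  }
}
```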

@davinchia
Contributor

That's a good idea. This might affect other syncs using the same spec, so a user would have to manage that properly. I was thinking that having this on a per-stream basis makes the most sense, but that would require changing part of the catalog, which is more effort.

Maybe adding it to the spec is sufficient. @jrhizor @sherifnada ?

@jrhizor
Contributor

jrhizor commented Sep 7, 2021

It's reasonable to make this configurable, but the default should also be increased significantly. For the default, we should also consider estimating memory usage and using that as the limit instead of the number of rows.
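
A back-of-the-envelope sketch of that idea, deriving the row limit from a memory budget and an estimated average row size instead of hard-coding a row count; all numbers and bounds here are illustrative, not taken from the Airbyte codebase.

```java
public class MemoryBasedFetchSize {

  // Illustrative bounds on the fetch size.
  private static final int MIN_FETCH_SIZE = 1_000;
  private static final int MAX_FETCH_SIZE = 100_000;

  // Derive the number of rows to fetch per round trip from a memory budget
  // and an estimated (or sampled) average row size in bytes.
  public static int fetchSizeForBudget(final long memoryBudgetBytes, final long estimatedRowBytes) {
    final long rows = memoryBudgetBytes / Math.max(1, estimatedRowBytes);
    return (int) Math.min(MAX_FETCH_SIZE, Math.max(MIN_FETCH_SIZE, rows));
  }

  public static void main(String[] args) {
    // e.g. a 200 MB buffer and ~5 KB rows -> roughly 40,000 rows per fetch
    System.out.println(fetchSizeForBudget(200L * 1024 * 1024, 5 * 1024));
  }
}
```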

@tuliren
Contributor

tuliren commented Apr 27, 2022

This issue is addressed in #12400, which uses a dynamic batch size based on how large the average row is in each table. We cannot set a single batch size per connector, because each table may require a different size.
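
As a simplified illustration of the approach (not the actual #12400 implementation): sample the rows already read from the current table, estimate the average row size, and derive that table's fetch size from a memory budget, so a wide table gets a smaller batch than a narrow one.

```java
import java.util.ArrayList;
import java.util.List;

public class DynamicFetchSizeEstimator {

  // Illustrative constants, not values from the actual change.
  private static final int SAMPLE_SIZE = 10;
  private static final long MEMORY_BUDGET_BYTES = 200L * 1024 * 1024;
  private static final int MIN_FETCH_SIZE = 10;
  private static final int MAX_FETCH_SIZE = 100_000;

  private final List<Long> sampledRowBytes = new ArrayList<>();

  // Record the serialized size of a row from the table currently being synced.
  public void accept(final long rowBytes) {
    if (sampledRowBytes.size() < SAMPLE_SIZE) {
      sampledRowBytes.add(rowBytes);
    }
  }

  // Per-table fetch size derived from the sampled average row size.
  public int currentFetchSize() {
    if (sampledRowBytes.isEmpty()) {
      return MIN_FETCH_SIZE;
    }
    final double avgRowBytes =
        sampledRowBytes.stream().mapToLong(Long::longValue).average().orElse(1.0);
    final long rows = (long) (MEMORY_BUDGET_BYTES / Math.max(1.0, avgRowBytes));
    return (int) Math.min(MAX_FETCH_SIZE, Math.max(MIN_FETCH_SIZE, rows));
  }
}
```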

@jbowlen

jbowlen commented May 2, 2022

Does anyone see speed improvements?
I see the higher/adjusted fetch size in the logs, but the sync time is about the same.

MSSQL -> Snowflake
160 GB, 91 million rows: old 8h 48m, new 8h 46m
5.46 GB, 1.45 million rows: old 0h 30m, new 0h 33m

@tuliren
Contributor

tuliren commented May 3, 2022

@jbowlen, thank you for the feedback.

This change is mainly meant to prevent out-of-memory issues. Its impact on performance is not consistent: it does improve the runtime for some tables, but has no effect on others, depending on the schema and the amount of data.

We will do a more comprehensive analysis of database connector performance. The issue is here.
