Make batchSize configurable or increase the value to improve performance #4314
Comments
Related issue: #3439
This is interesting for us too: we have tables with 500M+ rows, and this would certainly help with extraction speed.
@davinchia @marcosmarxm where are we at with this? This seems like a massive benefit to the majority of users.
Gathering from the feedback from everyone on this thread, would people prefer this on a per-connection, per-stream, or global basis?
@davinchia what do you think about making this a parameter that can be updated in the connector's specification? That way, for the first sync you can prepare your source database for heavier requests, and once you switch to incremental syncs you can reduce the batch size again.
That's a good idea. This might affect other syncs using the same spec, so a user would have to manage that carefully. I was thinking a per-stream setting makes the most sense, but that would require us to change part of the catalog, which is more effort. Maybe adding it to the spec is sufficient. @jrhizor @sherifnada ?
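For illustration only, a minimal sketch of what "adding it to the spec" could look like in the connector code, assuming a hypothetical `fetch_size` config field (this field does not exist in the current spec):

```java
import com.fasterxml.jackson.databind.JsonNode;

// Hypothetical sketch: read a user-supplied fetch size from the connector's
// config JSON and fall back to the current default. The "fetch_size" field
// name is an assumption, not an existing Airbyte spec property.
public final class ConfigurableFetchSize {

  private static final int DEFAULT_FETCH_SIZE = 1_000;

  public static int fromConfig(JsonNode config) {
    return config.hasNonNull("fetch_size")
        ? config.get("fetch_size").asInt(DEFAULT_FETCH_SIZE)
        : DEFAULT_FETCH_SIZE;
  }
}
```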
It's reasonable to make this configurable, but the default should also be increased significantly. For the default, we should also consider estimating memory usage and using that as the limit instead of the number of rows.
This issue is addressed in #12400, which uses a dynamic batch size based on how large the average row is in each table. We cannot set one batch size per connector, because each table may require a different size.
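As a rough illustration of the dynamic-sizing idea described above (not the actual implementation in #12400), a batch size can be derived from a memory budget divided by the estimated average row size; the constants and names below are assumptions:

```java
// Illustrative sketch only: choose a per-table fetch size from an assumed
// memory budget and an estimated average row size, clamped to sane bounds.
public final class DynamicFetchSize {

  private static final long MEMORY_BUDGET_BYTES = 200L * 1024 * 1024; // assumed per-stream budget
  private static final int MIN_FETCH_SIZE = 1_000;
  private static final int MAX_FETCH_SIZE = 1_000_000;

  /** Returns a fetch size that keeps one batch roughly within the memory budget. */
  public static int forAverageRowSize(long avgRowSizeBytes) {
    if (avgRowSizeBytes <= 0) {
      return MIN_FETCH_SIZE; // fall back when no estimate is available
    }
    long rows = MEMORY_BUDGET_BYTES / avgRowSizeBytes;
    return (int) Math.max(MIN_FETCH_SIZE, Math.min(MAX_FETCH_SIZE, rows));
  }
}
```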
Has anyone seen speed improvements? MSSQL -> Snowflake, 5.46 GB, 1.45 million rows.
@jbowlen, thank you for the feedback. This change is mainly meant to prevent out-of-memory issues; its impact on performance is not consistent. It does improve the runtime for some tables but has no effect on others, depending on the schema and the amount of data. We will do a more comprehensive analysis of database connector performance. The issue is here.
Tell us about the problem you're trying to solve
Increase sync performance from JDBC databases.
Describe the solution you’d like
An advanced settings page where I can configure the batchSize, or alternatively a larger default value.
This can be tricky because it depends on the source's resources.
Hudson made this modification and saw a good improvement by changing the default of 1000 to 50k: an 18M-row sync went from 55 minutes to only 10 minutes (see the sketch below).
Link to the Slack conversation
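For context, the batch size discussed here corresponds to the JDBC fetch size set on the statement. A minimal, self-contained sketch of applying a larger value; the connection URL, query, and table name are placeholders, and 50_000 is the value from the experiment mentioned above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal sketch of raising the JDBC fetch size; not the connector's actual code.
public final class FetchSizeExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/db", "user", "pass");
         PreparedStatement stmt = conn.prepareStatement("SELECT * FROM big_table")) {
      conn.setAutoCommit(false);  // some drivers (e.g. PostgreSQL) only honor fetch size with autocommit off
      stmt.setFetchSize(50_000);  // previous hard-coded default in the connector was 1000
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          // each next() call reads from an in-memory batch of at most fetchSize rows
        }
      }
    }
  }
}
```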
Describe the alternative you’ve considered or used
Continue using the default value
Additional context
Add any other context or screenshots about the feature request here.