Fetching query results concurrently #51
Hello @jonashaag ,
Are there any measurements of how much this actually improves speed? From what I can gather, the implementation does rebind the buffer in between calls to
I'm just rambling some thoughts; sorry if I am confusing you with details. Here is something interesting, though: I would not be surprised to learn that the majority of time is spent in
Best, Markus
Sorry, a detail I've left out is that I'm comparing this to fetching Arrow data from turbodbc, not Python objects. It does actually improve performance by almost 2x. I'll do some profiling of both libraries.
Interesting. Either that means my mental model is wrong, or the cost of converting into Arrow and the cost of SQLFetch are almost perfectly balanced. The latter seems like an amazing coincidence.
Unfortunately I can't really profile on this machine, but I see 180% CPU usage (in htop) for the fetching process when using
More profiling revealed that most of the time is spent creating Arrow tables in turbodbc (
I guess the easiest way to make this faster is by processing batches in parallel. Or, if you have a very low-overhead thread pool, you could also write the columns of each batch in parallel. Although at the end of the day you'll probably write the batches to Parquet, which will be the bottleneck; you can make that parallel as well: apache/arrow#33656 apache/datafusion#7562
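The "process batches in parallel" idea can be sketched with a plain thread pool. This is only an illustration of the shape of the approach, not turbodbc or arrow-odbc code: `convert_batch` is a hypothetical stand-in for the per-batch Arrow conversion, and the work is simulated with a sum.

```python
from concurrent.futures import ThreadPoolExecutor

def convert_batch(rows):
    # Stand-in for the per-batch Arrow conversion; in a real pipeline this
    # would call into turbodbc/arrow-odbc. Summing just simulates CPU work.
    return sum(rows)

# Three raw batches of four rows each, as they might come off the wire.
batches = [list(range(i, i + 4)) for i in range(0, 12, 4)]

# Convert batches on worker threads instead of one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(convert_batch, batches))

print(results)  # [6, 22, 38]
```

Note that for pure-Python conversion work the GIL would limit the gains; the real conversions here happen in native code, which can release the GIL and run truly in parallel.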
We could rephrase this issue as "provide benchmarking". With some timings emitted by
Yeah, I'll have to reproduce this on my local machine, and for that I'll have to find a large enough dataset first.
Hello @jonashaag , thanks for providing the benchmarks and insights. I would be interested in the rough schema of the data you used, but that is purely out of curiosity. I am more convinced than before that fetching query results concurrently may yield some benefits. My full-time job is pretty taxing on me, so I am not sure when I will be able to act on these ideas. I will keep the issue open, though, as a reminder for the next time I am itching to do some open source.
Best, Markus
Hello @jonashaag , you may want to check out the latest release
Best, Markus
Cool, this works and gives a 33% speedup. I'll do some profiling to see what the bottleneck is now.
I think this is because numbers are loaded as decimals from Snowflake. Will investigate. In any case, I think this ticket is fixed 🚀
Sure, feel free to open another issue or discussion for the performance issue. If you enable the log output, I can provide insight into what is happening "under the hood". The relevant piece of code, which decides what buffer type to use for fetching based on the Arrow schema and the relational SQL type, can be found here: https://github.com/pacman82/arrow-odbc/blob/dc38224d72798af5a12b923cb559135820993d6f/src/reader.rs#L165
I was wondering if there is any way to use more CPU cores for fetching query results.
If you have a very fast database, fetching results can be bottlenecked by the single thread that arrow-odbc uses. One way around this is to manually partition the query and spawn multiple instances of arrow-odbc, but that adds complexity and might not be as efficient as fetching concurrently.
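The manual partitioning workaround could be sketched as below. Everything here is hypothetical: the modulo predicates assume an integer key column, and `fetch_partition` stands in for running one arrow-odbc reader per partition with a `WHERE` clause; the sketch filters a local list so it stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partition predicates over an integer key column; in practice
# you would pick a column that splits the table roughly evenly.
PREDICATES = ["id % 3 = 0", "id % 3 = 1", "id % 3 = 2"]

def fetch_partition(predicate):
    # Stand-in for one reader running "SELECT ... WHERE <predicate>".
    # Here we simulate the database with a local range of ids.
    rows = range(10)
    modulus = int(predicate.split("=")[-1])
    return [r for r in rows if r % 3 == modulus]

# Each partition is fetched on its own thread, then results are merged.
with ThreadPoolExecutor(max_workers=len(PREDICATES)) as pool:
    partitions = list(pool.map(fetch_partition, PREDICATES))

merged = sorted(r for part in partitions for r in part)
print(merged)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The downsides mentioned above show up directly: you open one connection per partition, and a skewed partition column leaves some threads idle while others do most of the work.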
In turbodbc you can use `use_async_io`, which is nice but also limited to a 2x speedup.