Fetching query results concurrently #51

Closed
jonashaag opened this issue Sep 16, 2023 · 14 comments

@jonashaag

I was wondering if there is any way to use more CPU cores for fetching query results.

If you have a very fast database, fetching results can be bottlenecked by the single thread used by arrow-odbc. One way around this is to manually partition the query and spawn multiple instances of arrow-odbc, but that adds complexity and might not be as efficient as fetching concurrently.

In turbodbc you can use use_async_io, which is nice but also limited to at most a 2x speedup.
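For reference, a minimal sketch of the turbodbc option mentioned above, assuming the Python API built with Arrow support; the DSN and query are placeholders, and exact signatures may differ between turbodbc versions:

```python
from turbodbc import connect, make_options

# use_async_io=True lets turbodbc fetch the next ODBC buffer in a background
# thread while the previous buffer is converted, which caps the gain at ~2x.
options = make_options(use_async_io=True)
connection = connect(dsn="MY_DSN", turbodbc_options=options)  # "MY_DSN" is a placeholder

cursor = connection.cursor()
cursor.execute("SELECT * FROM my_table")  # placeholder query
table = cursor.fetchallarrow()            # pyarrow.Table with the full result set
```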

@pacman82
Owner

Hello @jonashaag ,

In turbodbc you can use use_async_io which is nice but also limited to a 2x speedup

Are there any measurements of how much this actually improves speed? From what I can gather, the implementation rebinds the buffer between calls to fetch. ODBC function calls can be quite costly, and it is not at all clear to me that rebinding the buffer is cheaper than just copying it away to someplace else. No disagreement with the statement here, but I would be truly interested to learn whether there are situations in which this actually yields a speedup close to 2x. I guess that would require converting to Python objects to take roughly as long as the fetch?

After the fetch, arrow_odbc converts the data from the ODBC representation to the Arrow representation. Depending on the type of the column this is some work; in other cases it is the identity operation. The data is always copied from the ODBC buffer into the Arrow target array, though, partly because it is simple and partly because it is enforced by Arrow arrays being immutable and shared. The only way to avoid the copy would be to move the buffer into the Arrow array and bind a new one for the next fetch, which I think might be more expensive than just reusing the same buffer for fetch over and over again.

I'm just rambling some thoughts, sorry if I am confusing you with details. Here is something interesting, though: I would not be surprised to learn that the majority of the time is spent in the SQLFetch function (again, knowing little about your use case; your mileage may vary). This function is ultimately not implemented in turbodbc or arrow_odbc, and to run it with more cores you would need to talk to the implementers of the driver. The only way to call SQLFetch in parallel is to have multiple result sets active at the same time, which also avoids having to rebind the buffers over and over again. If your domain allows it, splitting the query at the application level might be the most effective thing to do.

Best, Markus
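A rough sketch of the application-level splitting suggested above, using the Python arrow-odbc bindings with one statement per partition. The partitioning column, connection string, and parameter names are illustrative assumptions, and whether threads pay off depends on the driver releasing the GIL while SQLFetch blocks:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
from arrow_odbc import read_arrow_batches_from_odbc

CONNECTION_STRING = "Driver={...};Server=...;"          # placeholder
PARTITIONS = [(0, 25), (25, 50), (50, 75), (75, 100)]   # e.g. ranges of a numeric key


def fetch_partition(bounds):
    lower, upper = bounds
    # Each worker opens its own statement, so SQLFetch can run truly in parallel
    # and no buffer rebinding is needed within a partition.
    reader = read_arrow_batches_from_odbc(
        query=f"SELECT * FROM my_table WHERE id >= {lower} AND id < {upper}",
        connection_string=CONNECTION_STRING,
        batch_size=100_000,
    )
    return list(reader)  # list of pyarrow.RecordBatch


with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
    all_batches = [b for batches in pool.map(fetch_partition, PARTITIONS) for b in batches]

result = pa.Table.from_batches(all_batches)
```

If the driver does not release the GIL, the same pattern can be run with processes instead of threads, at the cost of serializing the resulting batches.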

@jonashaag
Author

Sorry, a detail I've left out is that I'm comparing this to fetching Arrow data from turbodbc, not Python objects. It does actually improve performance by almost 2x. I'll do some profiling of both libraries.

@pacman82
Owner

It does actually improve performance by almost 2x

Interesting. Either my mental model is wrong, or the effort of converting into Arrow and the SQLFetch call are almost perfectly in balance. The latter seems like an amazing coincidence.

@jonashaag
Author

Unfortunately I can't really profile on this machine, but I see 180% CPU usage (in htop) for the fetching process with use_async_io=True, and 95% CPU usage with use_async_io=False.

@jonashaag
Author

More profiling revealed that most of the time is spent in creating Arrow tables in turbodbc (Append() etc).

I guess the easiest way to make this faster is to process batches in parallel. Or, if you have a very low-overhead thread pool, you could also write the columns of each batch in parallel.

At the end of the day, though, you'll probably write the batches to Parquet, which will then become the bottleneck. You can make that parallel as well (apache/arrow#33656, apache/datafusion#7562); see the sketch below.
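A sketch of that pipelining idea: fetching and Arrow conversion stay on the main thread, while Parquet writing runs on a background thread fed through a small queue. The arrow-odbc call, its parameters, and the reader.schema attribute are assumptions about the Python bindings:

```python
import queue
import threading

import pyarrow as pa
import pyarrow.parquet as pq
from arrow_odbc import read_arrow_batches_from_odbc

reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM my_table",                 # placeholder query
    connection_string="Driver={...};Server=...;",   # placeholder
    batch_size=100_000,
)

batches = queue.Queue(maxsize=4)  # small buffer so fetching can stay ahead of writing


def write_parquet():
    # Background thread: drain the queue and append each batch to one Parquet file.
    with pq.ParquetWriter("out.parquet", reader.schema, compression="zstd") as writer:
        while (batch := batches.get()) is not None:
            writer.write_table(pa.Table.from_batches([batch]))


writer_thread = threading.Thread(target=write_parquet)
writer_thread.start()

for batch in reader:      # fetch + convert to Arrow on the main thread
    batches.put(batch)
batches.put(None)         # sentinel: tell the writer we are done
writer_thread.join()
```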

@pacman82
Owner

We could rephrase this issue as "provide benchmarking". With some timings emitted by arrow-odbc itself, we could make more informed decisions about whether there is something to speed up.

@jonashaag
Author

Yeah, I'll have to reproduce this on my local machine, and for that I'll have to find a large enough dataset first.

@jonashaag
Author

[Screenshot: synthetic profile, 2023-09-19]

Here's a synthetic profile that doesn't completely resemble the real one, but you get the idea. Here it should be possible to get a ~1.3x speedup by using a background thread.

@jonashaag
Author

jonashaag commented Sep 19, 2023

[Screenshot: odbc2parquet profile with zstd compression, 2023-09-19]

Here's another profile with the same data, but using odbc2parquet with zstd compression. I know it's somewhat unrelated, but I wanted to show that there are real benefits to fetching data, converting it to Arrow, and writing Parquet in parallel.

@pacman82
Owner

Hello @jonashaag ,

thanks for providing the benchmarks and insights. I would be interested in the rough schema of the data you used, but that is purely out of curiosity. I am more convinced than before that fetching query results concurrently may yield some benefits. My full-time job is pretty taxing on me, so I am not sure when I will be able to act on these ideas. I will keep the issue open, though, as a reminder for the next time I am itching to do some open source.

Best, Markus

@pacman82
Owner

pacman82 commented Oct 8, 2023

Hello @jonashaag ,

you may want to check out the latest release, arrow-odbc 1.3.0, which features a method called fetch_concurrently on the BatchReader. Tell me how it works for you.

Best, Markus
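A minimal usage sketch of the new method from Python, assuming the read_arrow_batches_from_odbc entry point; the query, connection string, and batch size are placeholders:

```python
import pyarrow as pa
from arrow_odbc import read_arrow_batches_from_odbc

reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM my_table",                 # placeholder query
    connection_string="Driver={...};Server=...;",   # placeholder
    batch_size=100_000,
)
# After this call, fetching the next ODBC batch overlaps with converting the
# current one to Arrow on the calling thread.
reader.fetch_concurrently()

table = pa.Table.from_batches(list(reader))
```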

@jonashaag
Author

Cool, this works and gives a 33% speedup. I'll do some profiling to see what the bottleneck is now.

@jonashaag
Author

jonashaag commented Oct 10, 2023

 64.00%  67.00%   33.09s    36.41s   _$LT$arrow_odbc..reader..Decimal$u20$as$u20$arrow_odbc..reader..ReadStrategy$GT$::fill_arrow_array::h79802edb8743c187 (arrow_odbc/arrow_odbc/native.so)

I think this is because numbers are loaded as decimal from Snowflake. Will investigate.

In any case I think this ticket is fixed 🚀
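If the loss of exactness is acceptable, one way to sidestep the expensive Decimal read strategy is to cast on the SQL side, so the driver reports a floating-point column and arrow-odbc can pick a cheaper buffer type. A hedged sketch with made-up table and column names:

```python
from arrow_odbc import read_arrow_batches_from_odbc

# Snowflake reports plain NUMBER columns as decimals; casting in the query makes
# the driver report a floating-point column instead, avoiding decimal conversion
# on the client side.
reader = read_arrow_batches_from_odbc(
    query="SELECT id, CAST(amount AS DOUBLE) AS amount FROM my_table",
    connection_string="Driver={...};Server=...;",   # placeholder
    batch_size=100_000,
)
```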

@pacman82
Owner

Sure, feel free to open another issue or discussion for the performance issue. If you enable the log output, I can provide insight into what is happening "under the hood".

The relevant piece of code, which decides which buffer type to use for fetching based on the Arrow schema and the relational SQL type, can be found here: https://github.com/pacman82/arrow-odbc/blob/dc38224d72798af5a12b923cb559135820993d6f/src/reader.rs#L165
