Fetching query results concurrently #51

Closed
jonashaag opened this issue Sep 16, 2023 · 14 comments

@jonashaag

I was wondering if there is any way to use more CPU cores for fetching query results.

If you have a very fast database, fetching results can be bottlenecked by the single thread used by arrow-odbc. One way around this is to manually partition the query and spawn multiple instances of arrow-odbc, but that adds complexity and might not be as efficient as fetching concurrently.

In turbodbc you can use use_async_io, which is nice but also limited to at most a 2x speedup.
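For reference, a minimal sketch of the turbodbc option mentioned above, assuming the Python API built with Arrow support; the DSN and query are placeholders, and exact signatures may differ between turbodbc versions:

```python
from turbodbc import connect, make_options

# use_async_io=True lets turbodbc fetch the next ODBC buffer in a background
# thread while the previous buffer is converted, which caps the gain at ~2x.
options = make_options(use_async_io=True)
connection = connect(dsn="MY_DSN", turbodbc_options=options)  # "MY_DSN" is a placeholder

cursor = connection.cursor()
cursor.execute("SELECT * FROM my_table")  # placeholder query
table = cursor.fetchallarrow()            # pyarrow.Table with the full result set
```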

@pacman82
Owner

Hello @jonashaag ,

In turbodbc you can use use_async_io which is nice but also limited to a 2x speedup

Are there any measurements of how much this actually improves speed? From what I can gather, the implementation rebinds the buffer between calls to fetch. ODBC function calls can be quite costly, and it is not at all clear to me that rebinding the buffer is cheaper than just copying it away to someplace else. No disagreement with the statement here, but I would be truly interested to learn whether there are situations in which this actually yields a speedup close to 2x. I guess that would require converting to Python objects to take roughly as long as the fetch?

After the fetch, arrow_odbc converts the data from the ODBC representation to the Arrow representation. Depending on the type of the column this is some work; in other cases it is the identity operation. The data is always copied from the ODBC buffer into the Arrow target array, though, partly because it is simple and partly because it is enforced by Arrow arrays being immutable and shared. The only way to avoid the copy would be to move the buffer into the Arrow array and bind a new one for the next fetch, which I think might be more expensive than just reusing the same buffer for fetch over and over again.

I'm just rambling some thoughts, sorry if I am confusing you with details. Here is something interesting, though: I would not be surprised to learn that the majority of the time is spent in the SQLFetch function (again, knowing little about your use case; your mileage may vary). This function is ultimately not implemented in turbodbc or arrow_odbc, and to run it with more cores you would need to talk to the implementers of the driver. The only way to call SQLFetch in parallel is to have multiple result sets active at the same time, which also avoids having to rebind the buffers over and over again. If your domain allows it, splitting the query at the application level might be the most effective thing to do.

Best, Markus
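A rough sketch of the application-level splitting suggested above, using the Python arrow-odbc bindings with one statement per partition. The partitioning column, connection string, and parameter names are illustrative assumptions, and whether threads pay off depends on the driver releasing the GIL while SQLFetch blocks:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
from arrow_odbc import read_arrow_batches_from_odbc

CONNECTION_STRING = "Driver={...};Server=...;"          # placeholder
PARTITIONS = [(0, 25), (25, 50), (50, 75), (75, 100)]   # e.g. ranges of a numeric key


def fetch_partition(bounds):
    lower, upper = bounds
    # Each worker opens its own statement, so SQLFetch can run truly in parallel
    # and no buffer rebinding is needed within a partition.
    reader = read_arrow_batches_from_odbc(
        query=f"SELECT * FROM my_table WHERE id >= {lower} AND id < {upper}",
        connection_string=CONNECTION_STRING,
        batch_size=100_000,
    )
    return list(reader)  # list of pyarrow.RecordBatch


with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
    all_batches = [b for batches in pool.map(fetch_partition, PARTITIONS) for b in batches]

result = pa.Table.from_batches(all_batches)
```

If the driver does not release the GIL, the same pattern can be run with processes instead of threads, at the cost of serializing the resulting batches.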

@jonashaag
Author

Sorry, a detail I've left out is that I'm comparing this to fetching Arrow data from turbodbc, not Python objects. It does actually improve performance by almost 2x. I'll do some profiling of both libraries.

@pacman82
Owner

It does actually improve performance by almost 2x

Interesting. Either my mental model is wrong, or the effort of converting into Arrow and the SQLFetch call are almost perfectly in balance. The latter seems like an amazing coincidence.

@jonashaag
Author

Unfortunately I can't really profile on this machine, but I see 180% CPU usage (in htop) for the fetching process with use_async_io=True, and 95% CPU usage with use_async_io=False.

@jonashaag
Author

More profiling revealed that most of the time is spent in creating Arrow tables in turbodbc (Append() etc).

I guess the easiest way to make this faster is to process batches in parallel. Or, if you have a very low-overhead thread pool, you could also write the columns of each batch in parallel.

At the end of the day, though, you'll probably write the batches to Parquet, which will then become the bottleneck. You can make that parallel as well (apache/arrow#33656, apache/datafusion#7562); see the sketch below.
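A sketch of that pipelining idea: fetching and Arrow conversion stay on the main thread, while Parquet writing runs on a background thread fed through a small queue. The arrow-odbc call, its parameters, and the reader.schema attribute are assumptions about the Python bindings:

```python
import queue
import threading

import pyarrow as pa
import pyarrow.parquet as pq
from arrow_odbc import read_arrow_batches_from_odbc

reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM my_table",                 # placeholder query
    connection_string="Driver={...};Server=...;",   # placeholder
    batch_size=100_000,
)

batches = queue.Queue(maxsize=4)  # small buffer so fetching can stay ahead of writing


def write_parquet():
    # Background thread: drain the queue and append each batch to one Parquet file.
    with pq.ParquetWriter("out.parquet", reader.schema, compression="zstd") as writer:
        while (batch := batches.get()) is not None:
            writer.write_table(pa.Table.from_batches([batch]))


writer_thread = threading.Thread(target=write_parquet)
writer_thread.start()

for batch in reader:      # fetch + convert to Arrow on the main thread
    batches.put(batch)
batches.put(None)         # sentinel: tell the writer we are done
writer_thread.join()
```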

@pacman82
Owner

We could rephrase this issue as "provide benchmarking". With some timings emitted by arrow-odbc itself, we could make more informed decisions about whether there is something to speed up.

@jonashaag
Author

Yeah, I'll have to reproduce this on my local machine, and for that I'll have to find a large enough dataset first.

@jonashaag
Author

[Screenshot: synthetic profile, 2023-09-19]

Here's a synthetic profile that doesn't completely resemble the real one, but you get the idea. Here it should be possible to get a ~1.3x speedup by using a background thread.

@jonashaag
Author

jonashaag commented Sep 19, 2023

[Screenshot: odbc2parquet profile with zstd compression, 2023-09-19]

Here's another profile with the same data, but using odbc2parquet with zstd compression. I know it's somewhat unrelated, but I wanted to show that there are real benefits to fetching data, converting it to Arrow, and writing Parquet in parallel.

@pacman82
Owner

Hello @jonashaag ,

thanks for providing the benchmarks and insights. I would be interested in the rough schema of the data you used, but that is purely out of curiosity. I am more convinced than before that fetching query results concurrently may yield some benefits. My full-time job is pretty taxing on me, so I am not sure when I will be able to act on these ideas. I will keep the issue open, though, as a reminder for the next time I am itching to do some open source.

Best, Markus

@pacman82
Owner

pacman82 commented Oct 8, 2023

Hello @jonashaag ,

you may want to check out the latest release, arrow-odbc 1.3.0, which features a method called fetch_concurrently on the BatchReader. Tell me how it works for you.

Best, Markus
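A minimal usage sketch of the new method from Python, assuming the read_arrow_batches_from_odbc entry point; the query, connection string, and batch size are placeholders:

```python
import pyarrow as pa
from arrow_odbc import read_arrow_batches_from_odbc

reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM my_table",                 # placeholder query
    connection_string="Driver={...};Server=...;",   # placeholder
    batch_size=100_000,
)
# After this call, fetching the next ODBC batch overlaps with converting the
# current one to Arrow on the calling thread.
reader.fetch_concurrently()

table = pa.Table.from_batches(list(reader))
```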

@jonashaag
Author

Cool, this works and gives a 33% speedup. I'll do some profiling to see what the bottleneck is now.

@jonashaag
Author

jonashaag commented Oct 10, 2023

 64.00%  67.00%   33.09s    36.41s   _$LT$arrow_odbc..reader..Decimal$u20$as$u20$arrow_odbc..reader..ReadStrategy$GT$::fill_arrow_array::h79802edb8743c187 (arrow_odbc/arrow_odbc/native.so)

I think this is because numbers are loaded as decimal from Snowflake. Will investigate.

In any case I think this ticket is fixed 🚀
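If the loss of exactness is acceptable, one way to sidestep the expensive Decimal read strategy is to cast on the SQL side, so the driver reports a floating-point column and arrow-odbc can pick a cheaper buffer type. A hedged sketch with made-up table and column names:

```python
from arrow_odbc import read_arrow_batches_from_odbc

# Snowflake reports plain NUMBER columns as decimals; casting in the query makes
# the driver report a floating-point column instead, avoiding decimal conversion
# on the client side.
reader = read_arrow_batches_from_odbc(
    query="SELECT id, CAST(amount AS DOUBLE) AS amount FROM my_table",
    connection_string="Driver={...};Server=...;",   # placeholder
    batch_size=100_000,
)
```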

@pacman82
Owner

Sure, feel free to open another issue or discussion for the performance issue. If you enable the log output, I can provide insight into what is happening "under the hood".

The relevant piece of code, which decides which buffer type to use for fetching based on the Arrow schema and the relational SQL type, can be found here: https://github.com/pacman82/arrow-odbc/blob/dc38224d72798af5a12b923cb559135820993d6f/src/reader.rs#L165
