
[c++] Optimizing indexer for pandas and pyarrow #2159

Merged
merged 1 commit into main, Mar 1, 2024

Conversation

beroy
Collaborator

@beroy beroy commented Feb 21, 2024

Issue and/or context:
#2099

Changes:

Notes for Reviewer:

@beroy beroy requested a review from nguyenv February 21, 2024 18:43
@beroy beroy force-pushed the optimize_indexer_for_pandas branch 2 times, most recently from 3472f5d to b7f14a4 Compare February 21, 2024 18:45
@beroy beroy changed the title Optimizing indexer for panda by removing std::vector map_locations Optimizing indexer for pandas by removing std::vector map_locations Feb 21, 2024
@beroy beroy requested a review from bkmartinjr February 21, 2024 18:53
@bkmartinjr
Member

@beroy - do you have post-change benchmarking numbers for all cases? Just want to ensure removing this does not add any significant regression. I noted the post-change measurement of Pandas types, but not the others (e.g., Arrow).

Most concerned about:

  • Arrow Array & ChunkedArray
  • Pandas Series
  • NDArray
  • Python list (of least concern, but it does see some use)

I'd appreciate it if you could check these.

Somewhat related: the pytests do not seem to test all types (e.g., Arrow types). It would be excellent if we had a unit test to ensure the expected types work in a basic manner.

@beroy
Collaborator Author

beroy commented Feb 22, 2024

@bkmartinjr, I posted all the benchmarks that I performed in the issue for all cases. I will add correctness benchmarks for all of the cases you mentioned as well.

@bkmartinjr
Member

posted all the benchmarks

I don't see "after change" benchmarks for all cases. Are they identical to "before change" numbers? (I'm being explicit as I'm just looking for any potential regression)

@beroy beroy force-pushed the optimize_indexer_for_pandas branch from b7f14a4 to 26d7585 Compare February 22, 2024 20:45
@beroy beroy marked this pull request as ready for review February 22, 2024 20:46
@beroy beroy force-pushed the optimize_indexer_for_pandas branch from 26d7585 to 6e53ea4 Compare February 22, 2024 20:48
@beroy
Collaborator Author

beroy commented Feb 22, 2024

Yes, for the rest of the cases the benchmark results are identical (I noted that in the comments). Actually, I only posted the one with the major regression; LMK if you need all the results there. The only remaining concern for me is the chunked array: I made my pa.chunked_array 3x larger than the pa.array, but the overall runtime (both setup and lookup) is almost twice pa.array's runtime, which is still in an acceptable range.

@beroy
Collaborator Author

beroy commented Feb 22, 2024

Updated all results

@beroy
Collaborator Author

beroy commented Feb 22, 2024

@bkmartinjr I published all the results plus pandas' results. Some takeaways:

  • We are clearly way better in lookup across the board.
  • pandas does not perform well in a few cases: Python list and pa.chunked_array.
  • pandas' setup (hash creation) is significantly faster than ours. Something that definitely needs investigation.

@bkmartinjr
Member

Re: results - just took a look at the results in the original issue, excerpted here:

pyarrow array:
keys = pa.array(list(range(1, 100000000)))
Setup time: 0.00011958299728576094
Lookup time: 2.5583114580076654

pyarrow Chunked Array:
keys = pa.chunked_array([list(range(1, 100000000)), list(range(100000000, 200000000)), list(range(300000000, 400000000))])
Setup time: 0.5131658750033239
Lookup time: 29.461485624997295

The ChunkedArray result doesn't make sense to me - it should more or less be identical if we handle it correctly, as it should be zero copy (and there are only three chunks, so the overhead from this should be minimal).

The results imply it is not zero copy -- i.e., it is being converted to some other structure (copied). Given that ChunkedArrays are commonly used in our API, I think we need to chase this down as well. In principle, it should be very close to the same as np.ndarray or pa.array.

I suspect you may need to provide a ChunkedArray-specific handler to do this.

@johnkerl johnkerl changed the title Optimizing indexer for pandas by removing std::vector map_locations [c++] Optimizing indexer for pandas by removing std::vector map_locations Feb 22, 2024
@beroy
Collaborator Author

beroy commented Feb 22, 2024

A couple of points:

  • The results you mentioned are from pandas runs
  • The size of the chunked_array is 3 times bigger

I'll create a spreadsheet with all the results and the arrays all of the same size soon. Easier to discuss that way.

@bkmartinjr
Member

The results you mentioned are from pandas runs

Not sure I understand, as the items I pasted are Arrow.

chunked_array is 3 times bigger

That completely escaped me - apologies! Results should still scale roughly linearly, right? My rationale: setup and lookup do not inherently require a flattened vector or a copy - they should be approximately the same speed regardless of chunked or flat. Under the covers, they are all just 1-n_chunks flat vectors, so any overhead is caused by either:

  • your interface forcing a single flat vector rather than a list of vectors (aka a ChunkedArray)
  • unexpected type conversions which do a copy

I suspect the ChunkedArray issue is primarily the first. Given that this is the standard Arrow array type we receive from Table, etc, it is worth handling correctly.

@beroy
Collaborator Author

beroy commented Feb 23, 2024

@bkmartinjr you're absolutely right. When I changed the chunked array to the same size with 10 chunks, my results became consistent with np.array. Here is the spreadsheet with all details. For me, the one general concern right now is our setup time vs. pandas.

@bkmartinjr
Member

spreadsheet

Thanks for posting that, including the benchmark code. There is a bug in the benchmark which explains some of the odd results (e.g., time == 0 for pandas setup) - sent you a Slack message with the details.
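The exact bug isn't stated here, but one classic way to see time == 0 for pandas setup is that pd.Index builds its lookup hash table lazily on first use, so timing only the constructor measures almost nothing. A sketch of the pitfall and a fairer measurement (an illustration, not the benchmark code from the spreadsheet):

```python
from time import perf_counter
import numpy as np
import pandas as pd

keys = np.arange(1_000_000, dtype=np.int64)
lookups = np.random.default_rng(0).choice(keys, size=100_000)

# Naive "setup" timing: the constructor alone returns almost instantly,
# and the real hash-table cost lands in the first lookup instead.
t0 = perf_counter()
index = pd.Index(keys)
naive_setup = perf_counter() - t0

# Fairer timing: force the hash table to be built before the measured
# lookup phase by performing one tiny lookup during setup.
t0 = perf_counter()
index = pd.Index(keys)
index.get_indexer(keys[:1])   # triggers engine/hash-table construction
warm_setup = perf_counter() - t0

t0 = perf_counter()
result = index.get_indexer(lookups)
lookup_time = perf_counter() - t0

print(f"naive setup: {naive_setup:.6f}s  "
      f"warm setup: {warm_setup:.6f}s  lookup: {lookup_time:.6f}s")
```

Timing setup and lookup this way keeps the comparison against the reindexer apples-to-apples, since the C++ indexer pays its hash-creation cost eagerly.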

@beroy
Collaborator Author

beroy commented Feb 23, 2024

@bkmartinjr updated the doc with random large lookups. There are several major observations:

  • The fix takes care of the pandas setup slowdown
  • We do worse than pandas on pyarrow (maybe worth looking into)
  • Both we and pandas do really badly on Python lists (we still do much better than pandas)

@bkmartinjr
Member

@beroy - much better results! I'd focus on pyarrow, numpy and pandas keys, as they should all be roughly equivalent in speed (all are flat int64 arrays underneath, so any difference is due to the wrappers and/or excess copies). The list is another beast and low priority IMHO.

@beroy beroy force-pushed the optimize_indexer_for_pandas branch from 6e53ea4 to 77b2edb Compare February 27, 2024 01:59
@beroy beroy marked this pull request as draft February 27, 2024 02:08
@beroy beroy force-pushed the optimize_indexer_for_pandas branch from 77b2edb to 3e95f00 Compare February 27, 2024 02:46

codecov bot commented Feb 27, 2024

Codecov Report

Merging #2159 (ce62bbc) into main (3a83746) will decrease coverage by 6.45%.
The diff coverage is n/a.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2159      +/-   ##
==========================================
- Coverage   78.52%   72.08%   -6.45%     
==========================================
  Files         136      102      -34     
  Lines       10687     6881    -3806     
  Branches      215      215              
==========================================
- Hits         8392     4960    -3432     
+ Misses       2196     1822     -374     
  Partials       99       99              
Flag Coverage Δ
libtiledbsoma 67.62% <ø> (ø)
python ?
r 74.68% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api ∅ <ø> (∅)
libtiledbsoma 48.75% <ø> (ø)

@beroy beroy self-assigned this Feb 27, 2024
@beroy beroy marked this pull request as ready for review February 27, 2024 03:02
@beroy beroy force-pushed the optimize_indexer_for_pandas branch 2 times, most recently from 76e4a10 to 3ac83b3 Compare February 27, 2024 23:44
@beroy beroy changed the title [c++] Optimizing indexer for pandas by removing std::vector map_locations [c++] Optimizing indexer for pandas and pyarrow Feb 28, 2024
@beroy beroy force-pushed the optimize_indexer_for_pandas branch 3 times, most recently from e44765f to d5c61cd Compare February 29, 2024 21:46
@@ -0,0 +1,99 @@
from time import perf_counter
Member

Do we want to start adding perf tests to pytest unit tests? Without some sort of pass/fail criteria, they don't seem very useful. Historically we have separated unit tests (function, correctness) from performance, and only the former were run in CI.

CC: @johnkerl for any thoughts he may have

Collaborator Author

@bkmartinjr I moved the perf benchmark outside the CI. I totally agree they should not be in the CI. I also have a correctness version of those tests (smaller memory footprint) in the CI.

Member

@bkmartinjr bkmartinjr left a comment

Overall LGTM. Not sure the perf test belongs in our suite of unit tests without clear pass/fail criteria, but I'll leave that decision to John, et al.

- Removed std::vector-based lookup function to speed up the pandas lookup
- Added a specialized lookup for pyarrow

Signed-off-by: Behnam Robatmili <[email protected]>
@beroy beroy force-pushed the optimize_indexer_for_pandas branch from d5c61cd to ce62bbc Compare March 1, 2024 00:19
@beroy beroy merged commit 84f0815 into main Mar 1, 2024
21 checks passed
@beroy beroy deleted the optimize_indexer_for_pandas branch March 1, 2024 01:10
@@ -0,0 +1,99 @@
from time import perf_counter
Member

@beroy is test_indexed_dtatye_perf.py a typo? What does dtatye mean?

Collaborator Author

It's a typo. It should say data_types.

from tiledbsoma.options._soma_tiledb_context import _validate_soma_tiledb_context

"""
Performance test evaluating the reindexer performance compared to pnadas.Index for different data types
Member

typo: pnadas -> pandas

@johnkerl
Copy link
Member

johnkerl commented Mar 1, 2024

@beroy there are several post-merge comments here -- let's get a follow-up PR going

@johnkerl
Copy link
Member

johnkerl commented Mar 1, 2024

@beroy I don't want to tag 1.7.3 until we have these questions resolved

Member

@nguyenv nguyenv left a comment

I don't think my comments need to be addressed immediately, but we should definitely correct the typos @johnkerl pointed out.

Comment on lines 152 to +161
.def(
"get_indexer",
[](IntIndexer& indexer, py::array_t<int64_t> lookups) {
auto input_buffer = lookups.request();
int64_t* input_ptr = static_cast<int64_t*>(input_buffer.ptr);
size_t size = input_buffer.shape[0];
auto results = py::array_t<int64_t>(size);
auto results_buffer = results.request();
size_t results_size = results_buffer.shape[0];

int64_t* results_ptr = static_cast<int64_t*>(
results_buffer.ptr);

indexer.lookup(input_ptr, results_ptr, size);
return results;
return get_indexer_general(indexer, lookups);
})
// Perform lookup for a large input array of keys and writes the looked
// up values into previously allocated array (works for the cases in
// which python and R pre-allocate the array)
.def(
"get_indexer",
[](IntIndexer& indexer,
py::array_t<int64_t> lookups,
py::array_t<int64_t>& results) {
auto input_buffer = lookups.request();
int64_t* input_ptr = static_cast<int64_t*>(input_buffer.ptr);
size_t size = input_buffer.shape[0];

auto results_buffer = results.request();
int64_t* results_ptr = static_cast<int64_t*>(
results_buffer.ptr);
size_t results_size = input_buffer.shape[0];
indexer.lookup(input_ptr, input_ptr, size);
});
// If the input is not arrow (does not have _export_to_c attribute),
// it will be handled using a general input method.
.def("get_indexer", [](IntIndexer& indexer, py::object py_arrow_array) {
return get_indexer_py_arrow(indexer, py_arrow_array);
});
Member

I think these should have distinct names; as we discussed, Pybind11 will not handle dispatching to the correct overload based on the argument types. You can also simplify the binding by just passing in the function name:

.def("get_indexer_general", get_indexer_general)
.def("get_indexer_py_arrow", get_indexer_py_arrow);

Instead of doing the type checking within the Pybind11 function, it should be done in the Python code, which would then call the correct get_indexer_general or get_indexer_py_arrow for that object.

Collaborator Author

This has huge implications for our entire indexer API! Changing the API is a huge change with ripple effects. I would strongly rather not change the API, especially given that our API intentionally mirrors the pandas one.

Member

@nguyenv nguyenv Mar 1, 2024

This should not have any effect on the API. You would implement it in Python by doing

def get_indexer(obj):
    if isinstance(obj, (pa.Array, pa.ChunkedArray)):
        return get_indexer_py_arrow(obj)
    else:
        return get_indexer_general(obj)

Member

Oh, I see the issue now. I think just keep it how you have it for now, but we should keep the clib stuff internal. In the future, have a class IntIndex that holds a member referencing clib.IntIndexer.

Comment on lines +89 to +95
// Check if it is not a pyarrow array or pyarrow chunked array
if (!py::hasattr(py_arrow_array, "_export_to_c") &&
!py::hasattr(py_arrow_array, "chunks") &&
!py::hasattr(py_arrow_array, "combine_chunks")) {
// Handle the general case (no py arrow objects)
return get_indexer_general(indexer, py_arrow_array);
}
Member

I know I said something different over DMs a few days ago, but I think it's better to remove this check from here and do it on the Python side.

Collaborator Author

Doing it in Python is not practical, as the API call goes directly to the indexer, meaning there's no Python layer between the Python get_indexer and this.

Collaborator Author

I will create one PR with minor changes needed for this and other PRs.
