
[c++] Optimizing indexer for pandas and pyarrow #2159

Merged
merged 1 commit into main, Mar 1, 2024

Conversation

beroy
Collaborator

@beroy beroy commented Feb 21, 2024

Issue and/or context:
#2099

Changes:

Notes for Reviewer:

@beroy beroy requested a review from nguyenv February 21, 2024 18:43
@beroy beroy force-pushed the optimize_indexer_for_pandas branch 2 times, most recently from 3472f5d to b7f14a4 Compare February 21, 2024 18:45
@beroy beroy changed the title Optimizing indexer for panda by removing std::vector map_locations Optimizing indexer for pandas by removing std::vector map_locations Feb 21, 2024
@beroy beroy requested a review from bkmartinjr February 21, 2024 18:53
@bkmartinjr
Member

@beroy - do you have post-change benchmarking numbers for all cases? Just want to ensure removing this does not add any significant regression. I noted the post-change measurement of Pandas types, but not the others (e.g., Arrow).

Most concerned about:

  • Arrow Array & ChunkedArray
  • Pandas Series
  • NDArray
  • Python list (of least concern, but it does see some use)

I'd appreciate it if you could check these.

Somewhat related: the pytests do not seem to test all types (e.g., Arrow types). It would be excellent if we had a unit test to ensure the expected types work in a basic manner.

@beroy
Collaborator Author

beroy commented Feb 22, 2024

@bkmartinjr, I posted all the benchmarks that I performed in the issue for all cases. I will add correctness benchmarks for all of the cases you mentioned as well.

@bkmartinjr
Member

posted all the benchmarks

I don't see "after change" benchmarks for all cases. Are they identical to "before change" numbers? (I'm being explicit as I'm just looking for any potential regression)

@beroy beroy force-pushed the optimize_indexer_for_pandas branch from b7f14a4 to 26d7585 Compare February 22, 2024 20:45
@beroy beroy marked this pull request as ready for review February 22, 2024 20:46
@beroy beroy force-pushed the optimize_indexer_for_pandas branch from 26d7585 to 6e53ea4 Compare February 22, 2024 20:48
@beroy
Collaborator Author

beroy commented Feb 22, 2024

Yes, for the rest of the cases the benchmark results are identical (I noted that in the comments). Actually, I only posted the one with the major regression; LMK if you need all the results there. The only remaining concern for me is the chunked array: I made my pa.chunked_array 3x larger than the pa.array, but the overall runtime (both setup and lookup) is almost twice pa.array's runtime, which is still in an acceptable range.

@beroy
Collaborator Author

beroy commented Feb 22, 2024

Updated all results

@beroy
Collaborator Author

beroy commented Feb 22, 2024

@bkmartinjr I published all the results plus pandas' results. Some takeaways:

  • We are clearly way better in lookup across the board.
  • pandas does not perform well in a few cases: Python list and pa.chunked_array.
  • pandas' setup (hash creation) is significantly faster than ours. Something that definitely needs investigation.

@bkmartinjr
Member

Re: results - just took a look at the results in the original issue, excerpted here:

pyarrow array:
keys = pa.array(list(range(1, 100000000)))
Setup time: 0.00011958299728576094
Lookup time: 2.5583114580076654

pyarrow Chunked Array:
keys = pa.chunked_array([list(range(1, 100000000)), list(range(100000000, 200000000)), list(range(300000000, 400000000))])
Setup time: 0.5131658750033239
Lookup time: 29.461485624997295

The ChunkedArray result doesn't make sense to me - it should more or less be identical if we handle it correctly, as it should be zero copy (and there are only three chunks, so the overhead from this should be minimal).

The results imply it is not zero copy -- i.e., it is being converted to some other structure (copied). Given that ChunkedArrays are commonly used in our API, I think we need to chase this down as well. In principle, it should be very close to the same as np.ndarray or pa.array.

I suspect you may need to provide a ChunkedArray-specific handler to do this.

@johnkerl johnkerl changed the title Optimizing indexer for pandas by removing std::vector map_locations [c++] Optimizing indexer for pandas by removing std::vector map_locations Feb 22, 2024
@beroy
Collaborator Author

beroy commented Feb 22, 2024

A couple of points:

  • The results you mentioned are from pandas runs
  • The size of the chunked_array is 3 times bigger

I'll create a spreadsheet with all the results and the arrays all of the same size soon. Easier to discuss that way.

@bkmartinjr
Member

The results you mentioned are from pandas runs

Not sure I understand, as the items I pasted are Arrow.

chunked_array is 3 times bigger

That completely escaped me - apologies! Results should still scale roughly linearly, right? My rationale: setup and lookup do not inherently require a flattened vector or a copy - they should be approximately the same speed regardless of chunked or flat. Under the covers, they are all just 1-n_chunks flat vectors, so any overhead is caused by either:

  • your interface forcing a single flat vector rather than a list of vectors (aka a ChunkedArray)
  • unexpected type conversions which do a copy

I suspect the ChunkedArray issue is primarily the first. Given that this is the standard Arrow array type we receive from Table, etc, it is worth handling correctly.

@beroy
Collaborator Author

beroy commented Feb 23, 2024

@bkmartinjr you're absolutely right. When I changed the chunked array to the same size with 10 chunks, my results became consistent with np.array. Here is the spreadsheet with all details. For me, the one general concern right now is our setup time vs. pandas.

@bkmartinjr
Member

spreadsheet

Thanks for posting that, including the benchmark code. There is a bug in the benchmark which explains some of the odd results (e.g., time == 0 for pandas setup) - sent you a Slack message with the details.
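The exact bug isn't stated here, but one classic way to see time == 0 for pandas setup is that pd.Index builds its lookup hash table lazily on first use, so timing only the constructor measures almost nothing. A sketch of the pitfall and a fairer measurement (an illustration, not the benchmark code from the spreadsheet):

```python
from time import perf_counter
import numpy as np
import pandas as pd

keys = np.arange(1_000_000, dtype=np.int64)
lookups = np.random.default_rng(0).choice(keys, size=100_000)

# Naive "setup" timing: the constructor alone returns almost instantly,
# and the real hash-table cost lands in the first lookup instead.
t0 = perf_counter()
index = pd.Index(keys)
naive_setup = perf_counter() - t0

# Fairer timing: force the hash table to be built before the measured
# lookup phase by performing one tiny lookup during setup.
t0 = perf_counter()
index = pd.Index(keys)
index.get_indexer(keys[:1])   # triggers engine/hash-table construction
warm_setup = perf_counter() - t0

t0 = perf_counter()
result = index.get_indexer(lookups)
lookup_time = perf_counter() - t0

print(f"naive setup: {naive_setup:.6f}s  "
      f"warm setup: {warm_setup:.6f}s  lookup: {lookup_time:.6f}s")
```

Timing setup and lookup this way keeps the comparison against the reindexer apples-to-apples, since the C++ indexer pays its hash-creation cost eagerly.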

@beroy
Collaborator Author

beroy commented Feb 23, 2024

@bkmartinjr updated the doc with random large lookups. There are several major observations:

  • The fix takes care of the pandas setup slowdown
  • We do worse than pandas on pyarrow (maybe worth looking into)
  • Both we and pandas do really badly on Python lists (we still do much better than pandas)

@bkmartinjr
Member

@beroy - much better results! I'd focus on pyarrow, numpy and pandas keys, as they should all be roughly equivalent in speed (all are flat int64 arrays underneath, so any difference is due to the wrappers and/or excess copies). The list is another beast and low priority IMHO.

@beroy beroy force-pushed the optimize_indexer_for_pandas branch from 6e53ea4 to 77b2edb Compare February 27, 2024 01:59
@beroy beroy marked this pull request as draft February 27, 2024 02:08
@beroy beroy force-pushed the optimize_indexer_for_pandas branch from 77b2edb to 3e95f00 Compare February 27, 2024 02:46

codecov bot commented Feb 27, 2024

Codecov Report

Merging #2159 (ce62bbc) into main (3a83746) will decrease coverage by 6.45%.
The diff coverage is n/a.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2159      +/-   ##
==========================================
- Coverage   78.52%   72.08%   -6.45%     
==========================================
  Files         136      102      -34     
  Lines       10687     6881    -3806     
  Branches      215      215              
==========================================
- Hits         8392     4960    -3432     
+ Misses       2196     1822     -374     
  Partials       99       99              
Flag Coverage Δ
libtiledbsoma 67.62% <ø> (ø)
python ?
r 74.68% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api ∅ <ø> (∅)
libtiledbsoma 48.75% <ø> (ø)

@beroy beroy self-assigned this Feb 27, 2024
@beroy beroy marked this pull request as ready for review February 27, 2024 03:02
@beroy beroy force-pushed the optimize_indexer_for_pandas branch 2 times, most recently from 76e4a10 to 3ac83b3 Compare February 27, 2024 23:44
@beroy beroy changed the title [c++] Optimizing indexer for pandas by removing std::vector map_locations [c++] Optimizing indexer for pandas and pyarrow Feb 28, 2024
@beroy beroy force-pushed the optimize_indexer_for_pandas branch 3 times, most recently from e44765f to d5c61cd Compare February 29, 2024 21:46
@@ -0,0 +1,99 @@
from time import perf_counter
Member

Do we want to start adding perf tests to pytest unit tests? Without some sort of pass/fail criteria, they don't seem very useful. Historically we have separated unit tests (function, correctness) from performance, and only the former were run in CI.

CC: @johnkerl for any thoughts he may have

Collaborator Author

@bkmartinjr I moved the perf benchmark outside the CI. I totally agree they should not be in the CI. I also have a correctness version of those tests (smaller memory footprint) in the CI.

Member

@bkmartinjr bkmartinjr left a comment

Overall LGTM. Not sure the perf test belongs in our suite of unit tests without clear pass/fail criteria, but I'll leave that decision to John, et al.

- Removed std::vector-based lookup function to speed up the pandas lookup
- Added a specialized lookup for pyarrow

Signed-off-by: Behnam Robatmili <[email protected]>
@beroy beroy force-pushed the optimize_indexer_for_pandas branch from d5c61cd to ce62bbc Compare March 1, 2024 00:19
@beroy beroy merged commit 84f0815 into main Mar 1, 2024
21 checks passed
@beroy beroy deleted the optimize_indexer_for_pandas branch March 1, 2024 01:10
@@ -0,0 +1,99 @@
from time import perf_counter
Member

@beroy is test_indexed_dtatye_perf.py a typo? What does dtatye mean?

Collaborator Author

It's a typo. It should say data_types.

from tiledbsoma.options._soma_tiledb_context import _validate_soma_tiledb_context

"""
Performance test evaluating the reindexer performance compared to pnadas.Index for different data types
Member

typo: pnadas -> pandas

@johnkerl
Copy link
Member

johnkerl commented Mar 1, 2024

@beroy there are several post-merge comments here -- let's get a follow-up PR going

@johnkerl
Copy link
Member

johnkerl commented Mar 1, 2024

@beroy I don't want to tag 1.7.3 until we have these questions resolved

Member

@nguyenv nguyenv left a comment

I don't think my comments need to be addressed immediately, but we should definitely correct the typos @johnkerl pointed out.

Comment on lines 152 to +161
.def(
"get_indexer",
[](IntIndexer& indexer, py::array_t<int64_t> lookups) {
auto input_buffer = lookups.request();
int64_t* input_ptr = static_cast<int64_t*>(input_buffer.ptr);
size_t size = input_buffer.shape[0];
auto results = py::array_t<int64_t>(size);
auto results_buffer = results.request();
size_t results_size = results_buffer.shape[0];

int64_t* results_ptr = static_cast<int64_t*>(
results_buffer.ptr);

indexer.lookup(input_ptr, results_ptr, size);
return results;
return get_indexer_general(indexer, lookups);
})
// Perform lookup for a large input array of keys and writes the looked
// up values into previously allocated array (works for the cases in
// which python and R pre-allocate the array)
.def(
"get_indexer",
[](IntIndexer& indexer,
py::array_t<int64_t> lookups,
py::array_t<int64_t>& results) {
auto input_buffer = lookups.request();
int64_t* input_ptr = static_cast<int64_t*>(input_buffer.ptr);
size_t size = input_buffer.shape[0];

auto results_buffer = results.request();
int64_t* results_ptr = static_cast<int64_t*>(
results_buffer.ptr);
size_t results_size = input_buffer.shape[0];
indexer.lookup(input_ptr, input_ptr, size);
});
// If the input is not arrow (does not have _export_to_c attribute),
// it will be handled using a general input method.
.def("get_indexer", [](IntIndexer& indexer, py::object py_arrow_array) {
return get_indexer_py_arrow(indexer, py_arrow_array);
});
Member

I think these should have distinct names; as we discussed, Pybind11 will not handle dispatching to the correct overload based on the argument types. You can also simplify the binding by just passing in the function name:

.def("get_indexer_general", get_indexer_general)
.def("get_indexer_py_arrow", get_indexer_py_arrow);

Instead of doing the type checking within the Pybind11 function, it should be done in the Python code, which would then call the correct get_indexer_general or get_indexer_py_arrow for that object.

Collaborator Author

This has huge implications for our entire indexer API! Changing the API is a huge change with ripple effects. I would strongly rather not change the API, especially given that our API intentionally mirrors the pandas one.

Member

@nguyenv nguyenv Mar 1, 2024

This should not have any effect on the API. You would implement it in Python by doing

def get_indexer(obj):
    if isinstance(obj, (pa.Array, pa.ChunkedArray)):
        return get_indexer_py_arrow(obj)
    else:
        return get_indexer_general(obj)

Member

Oh, I see the issue now. I think just keep it how you have it for now, but we should keep the clib stuff internal. In the future, have a class IntIndex that holds a member referencing clib.IntIndexer.

Comment on lines +89 to +95
// Check if it is not a pyarrow array or pyarrow chunked array
if (!py::hasattr(py_arrow_array, "_export_to_c") &&
!py::hasattr(py_arrow_array, "chunks") &&
!py::hasattr(py_arrow_array, "combine_chunks")) {
// Handle the general case (no py arrow objects)
return get_indexer_general(indexer, py_arrow_array);
}
Member

I know I said something different over DMs a few days ago, but I think it's better to remove this check from here and do it on the Python side.

Collaborator Author

Doing it in Python is not practical, as the API call goes directly to the indexer, meaning there's no Python layer between the Python get_indexer and this.

Collaborator Author

I will create one PR with minor changes needed for this and other PRs.
