[c++] Optimizing indexer for pandas and pyarrow #2159
Conversation
@beroy - do you have post-change benchmarking numbers for all cases? Just want to ensure removing this does not add any significant regression. I noted the post-change measurement of pandas types, but not the others (e.g., Arrow). Most concerned about:
I'd appreciate it if you could check these. Somewhat related: the pytests do not seem to test all types (e.g., Arrow types). It would be excellent if we had a unit test to ensure the expected types work in a basic manner.
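For illustration, a correctness test along those lines might look like the following sketch. The IntIndexer import path and constructor are assumptions rather than the actual API; map_locations is taken from the PR diff context.

import numpy as np
import pandas as pd
import pyarrow as pa
import pytest

from tiledbsoma import IntIndexer  # assumed import path

@pytest.mark.parametrize(
    "wrap",
    [
        lambda a: a,                                 # np.ndarray
        lambda a: pd.Series(a),                      # pandas
        lambda a: pa.array(a),                       # pyarrow Array
        lambda a: pa.chunked_array([a[:3], a[3:]]),  # pyarrow ChunkedArray
    ],
)
def test_get_indexer_basic_types(wrap):
    keys = np.arange(10, dtype=np.int64)
    lookups = np.array([9, 0, 3, 3, 7, 2], dtype=np.int64)
    indexer = IntIndexer()          # constructor arguments assumed
    indexer.map_locations(keys)
    expected = pd.Index(keys).get_indexer(lookups)
    np.testing.assert_array_equal(indexer.get_indexer(wrap(lookups)), expected)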
@bkmartinjr, I posted all the benchmarks that I performed in the issue for all cases. I will add correctness benchmarks for all of the cases you mentioned as well.
I don't see "after change" benchmarks for all cases. Are they identical to "before change" numbers? (I'm being explicit as I'm just looking for any potential regression)
Yes, for the rest of the cases the benchmark results are identical (I noted that in the comments). Actually, I only posted the one with a major regression. LMK if you need all the results there. The only remaining concern for me is the chunked array. I made my pa.chunked_array 3x larger than pa.array, but its overall runtime (both setup and lookup) is almost twice as much as pa.array's runtime, still in an acceptable range.
Updated all results
@bkmartinjr I published all the results plus the pandas results. Some takeaways:
Re: results - just took a look at the results in the original issue, excerpted here:
The ChunkedArray result doesn't make sense to me - it should be more or less identical if we handle it correctly, as it should be zero copy (and there are only three chunks, so the overhead from this should be minimal). The results imply it is not zero copy -- i.e., it is being converted to some other structure, i.e., copied. Given that ChunkedArrays are commonly used in our API, I think we need to chase this down as well. In principle, it should be very close to the same as np.ndarray or pa.array. I suspect you may need to provide a ChunkedArray-specific handler to do this.
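As a rough illustration of a chunk-aware path (not the PR's implementation; the get_indexer call is the existing indexer API, everything else is assumed), the lookup can walk the chunks and reuse each chunk's int64 buffer instead of calling combine_chunks(), which concatenates and therefore copies:

import numpy as np
import pyarrow as pa

def lookup_chunked(indexer, chunked: pa.ChunkedArray) -> np.ndarray:
    out = np.empty(len(chunked), dtype=np.int64)
    offset = 0
    for chunk in chunked.chunks:
        # Zero-copy view of the chunk's int64 buffer (valid when there are no nulls).
        keys = chunk.to_numpy(zero_copy_only=True)
        out[offset:offset + len(keys)] = indexer.get_indexer(keys)
        offset += len(keys)
    return out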
A couple of points:
I'll create a spreadsheet with all the results, with the arrays all of the same size, soon. It will be easier to discuss that way.
Not sure I understand, as the items I pasted are Arrow.
That completely escaped me - apologies! Results still should not scale sub-linearly, right? My rationale: setup and lookup do not inherently require a flattened vector or a copy - they should be approximately the same speed regardless of chunked or flat. Under the covers, they are all just 1-n_chunks flat vectors, so any overhead is caused by either:
I suspect the ChunkedArray issue is primarily the first. Given that this is the standard Arrow array type we receive from Table, etc., it is worth handling correctly.
@bkmartinjr you're absolutely right. When I changed the chunked array to the same size with 10 chunks, my results became consistent with np.array. Here is the spreadsheet with all the details. For me, right now one concern is, in general, our setup time vs. pandas.
Thanks for posting that, including the benchmark code. There is a bug in the benchmark which explains some of the odd results (e.g., time==0 for pandas setup) - sent you a Slack message with the details.
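For reference, a minimal perf_counter pattern that times setup and lookup separately might look like the sketch below (illustrative only, not the actual benchmark). One pandas-specific pitfall worth noting: pd.Index builds its hash table lazily, so timing only the constructor can report a near-zero setup time.

from time import perf_counter

import numpy as np
import pandas as pd

keys = np.arange(1_000_000, dtype=np.int64)
lookups = np.random.randint(0, 1_000_000, size=1_000_000)

t0 = perf_counter()
index = pd.Index(keys)                   # "setup" (hash table may still be deferred)
t1 = perf_counter()
positions = index.get_indexer(lookups)   # lookup
t2 = perf_counter()
print(f"setup: {t1 - t0:.4f}s, lookup: {t2 - t1:.4f}s")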
@bkmartinjr updated the doc with random large lookups. There are multiple major observations:
@beroy - much better results! I'd focus on pyarrow, numpy, and pandas keys, as they should all be roughly equivalent in speed (all are flat int64 arrays underneath, so any difference is due to the wrappers and/or excess copies).
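To make the "flat int64 underneath" point concrete, here is a small illustration (assuming int64 keys with no nulls, so the Arrow conversion can be zero-copy):

import numpy as np
import pandas as pd
import pyarrow as pa

keys = np.arange(1_000_000, dtype=np.int64)
as_numpy = keys                                          # flat int64 buffer
as_pandas = pd.Series(keys).to_numpy()                   # view over the same dtype
as_arrow = pa.array(keys).to_numpy(zero_copy_only=True)  # zero-copy view of the Arrow buffer
# Any speed difference between these key types in the indexer comes from the
# wrapper layers and copies, not from the underlying data layout.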
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2159      +/-   ##
==========================================
- Coverage   78.52%   72.08%    -6.45%
==========================================
  Files         136      102       -34
  Lines       10687     6881     -3806
  Branches      215      215
==========================================
- Hits         8392     4960     -3432
+ Misses       2196     1822      -374
  Partials       99       99

Flags with carried forward coverage won't be shown.
@@ -0,0 +1,99 @@
from time import perf_counter
Do we want to start adding perf tests to the pytest unit tests? Without some sort of pass/fail criteria, they don't seem very useful. Historically we have separated unit tests (function, correctness) from performance, and only the former were run in CI.
CC: @johnkerl for any thoughts he may have
@bkmartinjr I moved the perf benchmark outside of CI. I totally agree they should not be in CI. I also have a correctness version of those tests (with a smaller memory footprint) in CI.
Overall LGTM. Not sure the perf test belongs in our suite of unit tests without clear pass/fail criteria, but I'll leave that decision to John, et al.
- Removed the std::vector-based lookup function to speed up the pandas lookup
- Added a specialized lookup for pyarrow

Signed-off-by: Behnam Robatmili <[email protected]>
@@ -0,0 +1,99 @@
from time import perf_counter
@beroy is test_indexed_dtatye_perf.py a typo? What does dtatye mean?
It's a typo. It should say data_types
from tiledbsoma.options._soma_tiledb_context import _validate_soma_tiledb_context

"""
Performance test evaluating the reindexer performance compared to pnadas.Index for different data types
typo: pnadas -> pandas
@beroy there are several post-merge comments here -- let's get a follow-up PR going
@beroy I don't want to tag 1.7.3 until we have these questions resolved
I don't think my comments need to be addressed immediately, but we should definitely correct the typos @johnkerl pointed out.
    .def(
        "get_indexer",
        [](IntIndexer& indexer, py::array_t<int64_t> lookups) {
            auto input_buffer = lookups.request();
            int64_t* input_ptr = static_cast<int64_t*>(input_buffer.ptr);
            size_t size = input_buffer.shape[0];
            auto results = py::array_t<int64_t>(size);
            auto results_buffer = results.request();
            size_t results_size = results_buffer.shape[0];

            int64_t* results_ptr = static_cast<int64_t*>(results_buffer.ptr);

            indexer.lookup(input_ptr, results_ptr, size);
            return results;
            return get_indexer_general(indexer, lookups);
        })
    // Perform lookup for a large input array of keys and writes the looked
    // up values into previously allocated array (works for the cases in
    // which python and R pre-allocate the array)
    .def(
        "get_indexer",
        [](IntIndexer& indexer,
           py::array_t<int64_t> lookups,
           py::array_t<int64_t>& results) {
            auto input_buffer = lookups.request();
            int64_t* input_ptr = static_cast<int64_t*>(input_buffer.ptr);
            size_t size = input_buffer.shape[0];

            auto results_buffer = results.request();
            int64_t* results_ptr = static_cast<int64_t*>(results_buffer.ptr);
            size_t results_size = input_buffer.shape[0];
            indexer.lookup(input_ptr, input_ptr, size);
        });
    // If the input is not arrow (does not have _export_to_c attribute),
    // it will be handled using a general input method.
    .def("get_indexer", [](IntIndexer& indexer, py::object py_arrow_array) {
        return get_indexer_py_arrow(indexer, py_arrow_array);
    });
I think these should have distinct names; as we discussed, pybind11 will not handle dispatching to the correct overload for these arguments. You can also simplify the binding by just passing in the function name:
.def("get_indexer_general", get_indexer_general)
.def("get_indexer_py_arrow", get_indexer_py_arrow);
Instead of doing the type checking within the pybind11 function, it should be done in the Python code, which then calls the correct get_indexer_general or get_indexer_py_arrow for that object.
This has huge implications for our entire indexer API! Changing the API is a huge change with ripple effects. I would strongly rather not change the API, especially given that our API intentionally copies the pandas one.
This should not have any effect on the API. You would implement it in Python by doing something like:
def get_indexer(obj):
    if isinstance(obj, (pa.Array, pa.ChunkedArray)):
        return get_indexer_py_arrow(obj)
    else:
        return get_indexer_general(obj)
Oh, I see the issue now. I think just keep it how you have it for now, but we should keep the clib stuff internal. In the future, have a class IntIndex that holds a member to clib.IntIndexer.
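A rough sketch of that suggestion (the module path and constructor signature are assumptions, not the actual API): a thin Python class owns the clib.IntIndexer so the clib layer stays internal.

import numpy as np

from tiledbsoma import pytiledbsoma as clib  # assumed import path for the pybind11 module

class IntIndex:
    def __init__(self, keys: np.ndarray):
        self._indexer = clib.IntIndexer()   # constructor arguments assumed
        self._indexer.map_locations(keys)

    def get_indexer(self, lookups) -> np.ndarray:
        # Any pyarrow-vs-general dispatch could live here instead of in the bindings.
        return self._indexer.get_indexer(lookups)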
// Check if it is not a pyarrow array or pyarrow chunked array
if (!py::hasattr(py_arrow_array, "_export_to_c") &&
    !py::hasattr(py_arrow_array, "chunks") &&
    !py::hasattr(py_arrow_array, "combine_chunks")) {
    // Handle the general case (no py arrow objects)
    return get_indexer_general(indexer, py_arrow_array);
}
I know I said something different over DMs a few days ago, but I think it's better to remove this check from here and do these on the Python side.
Doing it in Python is not practical, as the API call goes directly to the indexer, meaning there's no Python layer between the Python get_index and this.
I will create one PR with minor changes needed for this and other PRs.
Issue and/or context:
#2099
Changes:
Notes for Reviewer: