[C++][Python] `value_counts` extremely slow for chunked DictionaryArray #37055

Comments
I assume this is with pyarrow 12?

Yes, 12.0.1.
Ah, I had done some research on this issue but forgot to post my findings. I think @rok's comment here and the discussion here explain it well. We can optimize it by first computing `value_counts` over each chunk and then hash-aggregating the results. However, I don't think we can directly call hash aggregate functions in compute kernels without having to depend on Acero? cc @westonpace Can you confirm?
I'm not entirely sure I understand the goal. The aggregate operations do have standalone python bindings. For example:
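A rough sketch of such a call, using a throwaway table whose column names ("key", "value") are purely illustrative:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Throwaway table; "key" and "value" are illustrative column names.
tbl = pa.table({"key": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})

# Scalar aggregate via a standalone compute binding.
total = pc.sum(tbl["value"])

# Hash aggregate (group-by) via the Table API.
counts = tbl.group_by("key").aggregate([("value", "sum")])

print(total)   # 10
print(counts)
```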
However, the individual parts (the partial aggregate func (Consume) and the final aggregate func (Finalize)) cannot be called from python individually. So, for example, it is not possible to create a streaming aggregator in python. However, in this case, you might be able to get away with something like this:
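A sketch of that per-chunk approach, assuming a `ChunkedArray` of dictionary-encoded strings; the helper name `chunked_value_counts` and the cast to plain strings (to keep the group-by key a simple type) are illustrative choices:

```python
import pyarrow as pa
import pyarrow.compute as pc

def chunked_value_counts(chunked: pa.ChunkedArray) -> pa.Table:
    """Count values per chunk, then hash-aggregate the partial counts."""
    partials = []
    for chunk in chunked.chunks:
        vc = pc.value_counts(chunk)  # StructArray with "values" and "counts" fields
        partials.append(
            pa.table({
                # Cast dictionary values to plain strings so the group-by key
                # is a simple type (sidesteps dictionary-keyed grouping).
                "values": vc.field("values").cast(pa.string()),
                "counts": vc.field("counts"),
            })
        )
    merged = pa.concat_tables(partials)
    return merged.group_by("values").aggregate([("counts", "sum")])
```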
I'm not sure if it will be faster or not.

Sorry I wasn't clear enough. As discussed here, there are two ways to implement the …

Hi @randolf-scholz, do you remember how many chunks are in your …
@js8544 The dataset in question was the table "hosp/labevents.csv" from the MIMIC-IV dataset: https://physionet.org/content/mimiciv/2.2/. I changed my own preprocessing, so it doesn't really affect me anymore, but I was able to reproduce it in pyarrow 13:
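A sketch of that kind of reproduction, assuming the CSV is read with auto-dictionary-encoding so string columns come back as `dictionary[int32,string]`; the column name "itemid" is a placeholder, and the timings will of course differ from the original report:

```python
import time
import pyarrow.csv as pacsv

col = "itemid"  # placeholder column name; substitute the real dictionary column

# auto_dict_encode makes string columns come back dictionary-encoded
table = pacsv.read_csv(
    "hosp/labevents.csv",
    convert_options=pacsv.ConvertOptions(auto_dict_encode=True),
)

start = time.perf_counter()
table[col].value_counts()
print("chunked value_counts:  ", time.perf_counter() - start)

start = time.perf_counter()
table[col].combine_chunks().value_counts()
print("combine_chunks first:  ", time.perf_counter() - start)
```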
The stats of the data are:
Thanks! Since the original file requires registration and some other verification processes, I downloaded a demo file with about 100K rows. Nevertheless I was able to optimize it:

```
# Before
1.04 ms ± 6.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # value_counts()
625 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)   # combine_chunks().value_counts()

# After
642 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)   # value_counts()
610 µs ± 2.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)   # combine_chunks().value_counts()
```

I'll write a formal C++ benchmark to further verify and send a PR shortly.
…38394)

### Rationale for this change

When merging dictionaries across chunks, the hash kernels unnecessarily unify the existing dictionary, dragging down the performance.

### What changes are included in this PR?

Reuse the dictionary unifier across chunks.

### Are these changes tested?

Yes, with a new benchmark for dictionary chunked arrays.

### Are there any user-facing changes?

No.

* Closes: #37055

Lead-authored-by: Jin Shang <[email protected]>
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Describe the bug, including details regarding any error messages, version, and platform.

I have a large dataset (>100M rows) with a `dictionary[int32,string]` column (`ChunkedArray`) and noticed that `compute.value_counts` is extremely slow for this column compared to other columns. `table[col].value_counts()` is 10x-100x slower than `table[col].combine_chunks().value_counts()` in this case.

Component(s)

C++, Python