Implement Run Length Encoding (RLE) / Run End Encoding (REE) support (Epic) #3520
Comments
One thing to perhaps give thought to is what kernels we need to facilitate this use-case. For example, a core pattern for dictionaries is to evaluate against the child values array and then
I would like to start on this by writing the REE array.
Would adding Parquet support be in scope? I'm not sure whether Parquet has RLE directly, but highly compressed (zstd, snappy, zlib) values could be converted to REE arrays instead of being inflated.
Dictionary indices are RLE encoded in Parquet, so we could theoretically preserve RLE encoding for dictionary arrays, on top of the existing logic to preserve the dictionary encoding. However, the cost and complexity of translating between the two representations is likely to negate much of the potential gains.
Do you have a link for this? I'm not sure how you would reliably use the codec's internal RLE coding, as it isn't guaranteed to be at a meaningful granularity for the encoded data.
@tustvold I agree that the reading performance might not improve, but downstream operations like filter/join/group by (if optimized for REE) would definitely make it worthwhile. Absolutely not a day 0 feature. The zlib/zstd idea is far-fetched, but: Both of these are very theoretical and need access to the compression building blocks, i.e. the low-level API (you don't want to ask the library to decompress the data; you want to go compressed -> REE directly to save big).
I am starting to write
My questions are below
I'm not sure you can implement ArrayAccessor for REEArray, as it doesn't know its value type. This is fine imo; we don't implement ArrayAccessor for DictionaryArray for much the same reason. Providing an iterator abstraction, similar to TypedDictionaryArray, that downcasts the values and uses ArrayAccessor to "decode" the runs makes sense to me to help ergonomics. However, I imagine most kernels will need custom logic to handle RunEncodedArrays efficiently, e.g. take will need to parse the runs array and compute a new set of runs along with the take indices to apply to the values array. Filter will need to do something similar. One thing to be extremely careful about is to avoid generic code typed on both the key type and the value type in our kernels. This explodes codegen and has caused a lot of pain with dictionaries.
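To make the take idea above concrete, here is a minimal sketch in plain Rust, assuming run ends are stored as a sorted `&[i32]` of exclusive logical end positions. The function name `take_runs` and its signature are hypothetical and are not part of arrow-rs. It never looks at the values, so it is not generic over the value type; the caller would apply the returned indices to the values child with an ordinary take kernel.

```rust
/// Hypothetical sketch of `take` over run-end encoded data (not the arrow-rs API).
/// Only the run ends are inspected. Returns `(new_run_ends, value_indices)`,
/// where `value_indices` are physical indices into the values child.
fn take_runs(run_ends: &[i32], indices: &[usize]) -> (Vec<i32>, Vec<usize>) {
    let mut new_run_ends: Vec<i32> = Vec::new();
    let mut value_indices: Vec<usize> = Vec::new();
    for (out_pos, &logical) in indices.iter().enumerate() {
        // Binary search: first run whose exclusive end is greater than `logical`.
        let physical = run_ends.partition_point(|&end| (end as usize) <= logical);
        assert!(physical < run_ends.len(), "take index out of bounds");
        if value_indices.last() == Some(&physical) {
            // Consecutive picks from the same input run: extend the output run.
            *new_run_ends.last_mut().unwrap() = (out_pos + 1) as i32;
        } else {
            // Pick from a different run: start a new output run.
            new_run_ends.push((out_pos + 1) as i32);
            value_indices.push(physical);
        }
    }
    (new_run_ends, value_indices)
}
```

For run ends `[2, 5, 6]` (logically `a a b b b c`), taking logical indices `[0, 1, 4, 2, 5, 0]` yields new run ends `[2, 4, 5, 6]` and value indices `[0, 1, 2, 0]`. The sketch only merges adjacent picks from the same input run; it does not compare values, so equal values in different runs stay separate.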
Yes, I am planning to model
Does it make sense to add a function to
I am planning to model after
Yes, wherever possible we should handle the indices and values separately, similarly to how we currently handle dictionaries. Otherwise you catastrophically explode codegen and build times; see #2596 and apache/datafusion#4999 for some more context. Non-scalar binary kernels are the only ones where I foresee this not being possible; we should gate those behind feature flags like we do for dictionaries. TypedDictionaryArray is useful for statically typed codepaths, but using it in kernels that must be generated for every combination of types ends up being problematic.
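As an illustration of handling the run ends separately from the values, a filter could be sketched as below. This is plain Rust under stated assumptions, not the arrow-rs implementation: `filter_runs` is a hypothetical name, run ends are a sorted `&[i32]` of exclusive logical ends, and the mask is a `&[bool]` at least as long as the logical array. The function is not generic over the value type at all; it only reports which runs survive.

```rust
/// Hypothetical sketch of `filter` over run ends only (not the arrow-rs API).
/// `mask[i]` says whether logical row `i` is kept. Returns
/// `(new_run_ends, kept_runs)`, where `kept_runs` are physical indices of the
/// runs that keep at least one row; the caller selects those from the values child.
fn filter_runs(run_ends: &[i32], mask: &[bool]) -> (Vec<i32>, Vec<usize>) {
    let mut new_run_ends: Vec<i32> = Vec::new();
    let mut kept_runs: Vec<usize> = Vec::new();
    let mut start = 0usize; // logical start of the current run
    let mut kept_total = 0i32; // logical rows kept so far
    for (run, &end) in run_ends.iter().enumerate() {
        let end = end as usize;
        // Count the surviving rows inside this run.
        let kept_in_run = mask[start..end].iter().filter(|&&keep| keep).count();
        if kept_in_run > 0 {
            kept_total += kept_in_run as i32;
            new_run_ends.push(kept_total);
            kept_runs.push(run);
        }
        start = end;
    }
    (new_run_ends, kept_runs)
}
```

With run ends `[2, 5, 6]` and mask `[true, false, false, false, false, true]`, the middle run is dropped entirely, giving new run ends `[1, 2]` and kept runs `[0, 2]`. Because only the run-end type varies here, the value-type dispatch can stay inside the existing kernel applied to the values child.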
Hi, I'm working on Raphtory, trying to get a query engine off the ground with DataFusion. One of the key ingredients would be REE array support, because hopping around graphs can be expressed efficiently with REE. Can I start looking into adding support for filtering, or is there ongoing work that's required first?
That sounds like an awesome project 🙏
That would be great. Some key features for arrays are (I am not sure offhand what REE has already, you would have to do some research)
Something else that might be worth looking at is the ability to read REE arrays directly from Parquet (and avoid decoding) if you are reading for Arrow. You might take a look at the list of items we made for
Looks like IPC and filtering are done, so we can tick those off (haven't checked/used casting yet with REE so can't say yet). Another thing we just ran into is parquet conversion:
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Arrow has added REE support (apache/arrow#14176). Like dictionary arrays, REE arrays allow repeated values to be encoded in a space-efficient manner that also allows fast processing.
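For readers unfamiliar with the layout, the following is a minimal sketch of run-end encoding in plain Rust. It assumes `i32` run ends and ignores nulls; the helper names `ree_encode` and `physical_index` are hypothetical and are not part of arrow-rs. Each run stores its value once, together with the exclusive logical index at which the run ends, so a logical lookup becomes a binary search over the run ends.

```rust
/// Run-end encode a slice (illustrative sketch, not the arrow-rs API).
/// Returns `(run_ends, values)`, where `run_ends[i]` is the exclusive
/// logical end index of run `i` and `values[i]` is that run's value.
fn ree_encode<T: PartialEq + Clone>(input: &[T]) -> (Vec<i32>, Vec<T>) {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<T> = Vec::new();
    for (i, v) in input.iter().enumerate() {
        if values.last() == Some(v) {
            // Same value as the current run: extend it by moving its end.
            *run_ends.last_mut().unwrap() = (i + 1) as i32;
        } else {
            // Value changed: start a new run that ends after this element.
            run_ends.push((i + 1) as i32);
            values.push(v.clone());
        }
    }
    (run_ends, values)
}

/// Map a logical index to the physical index of the run containing it.
fn physical_index(run_ends: &[i32], logical: usize) -> usize {
    // First run whose exclusive end is strictly greater than `logical`.
    run_ends.partition_point(|&end| (end as usize) <= logical)
}
```

Encoding `[1, 1, 2, 2, 2, 3]` gives run ends `[2, 5, 6]` and values `[1, 2, 3]`; logical index 4 then maps to physical index 1.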
Describe the solution you'd like
Implement REE in arrow-rs. Some likely candidates:
Describe alternatives you've considered
Additional context