Implement exact median, add AggregateState
#3009
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #3009      +/-   ##
==========================================
- Coverage   85.81%   85.81%   -0.01%
==========================================
  Files         282      286       +4
  Lines       51531    51790     +259
==========================================
+ Hits        44219    44441     +222
- Misses       7312     7349      +37
I plan to review this carefully later today
AggregateState
Thanks @andygrove -- I think this is great!
I think the way that other aggregates (like distinct) have handled the need to hold multiple values is to encode them in ScalarValue::List -- I suspect this approach will be much higher performance (and I can imagine adding other extension variants here like Box<dyn Any> or something to allow people to encode their aggregate state using whatever custom type they wanted 🤔).
I am still somewhat worried about the split in Grouping that we have: a Row and a Column one -- e.g. the Row accumulator does not support median. However, I think this is tracked in #2723 and I don't think anything new is needed for this PR
But may
@@ -44,3 +44,27 @@ pub trait Accumulator: Send + Sync + Debug {
    /// returns its value based on its current state.
    fn evaluate(&self) -> Result<ScalarValue>;
}

#[derive(Debug)]
pub enum AggregateState {
This is a very elegant idea. Can you please add docstrings to AggregateState explaining what is going on?
I think it would be worth updating the docstrings in the accumulator trait with some discussion / examples of how to use the Array state.
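As a hedged illustration (AggregateState::Array appears later in this diff; the Scalar variant and the doc wording are assumptions, only a suggestion for what the docstrings could say):

/// Intermediate state produced by an Accumulator, sent between partial
/// aggregations so that partial results can be merged into a final value.
#[derive(Debug)]
pub enum AggregateState {
    /// A single scalar value, e.g. a running count or sum.
    Scalar(ScalarValue),
    /// An array buffering every input value, for aggregates such as median
    /// that must see all values before they can produce a result.
    Array(ArrayRef),
}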
use std::sync::Arc;

/// MEDIAN aggregate expression. This uses a lot of memory because all values need to be
/// stored in memory before a result can be computed. If an approximation is sufficient
I wonder if it is worth (perhaps as a follow on PR) putting a cap on the number of values DataFusion will try to buffer to compute median and throw a runtime error if that number is exceeded 🤔 That way we could avoid OOM kills
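A minimal sketch of what such a cap could look like (the constant, function name, and error message are hypothetical, not part of this PR):

// Hypothetical upper bound on how many values median may buffer.
const MAX_MEDIAN_VALUES: usize = 10_000_000;

/// Fail fast with a runtime error instead of buffering unboundedly and
/// risking an OOM kill.
fn check_median_buffer(buffered: usize, incoming: usize) -> Result<(), String> {
    if buffered + incoming > MAX_MEDIAN_VALUES {
        Err(format!(
            "median would buffer more than {} values; consider approx_median",
            MAX_MEDIAN_VALUES
        ))
    } else {
        Ok(())
    }
}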
Co-authored-by: Andrew Lamb <[email protected]>
I think it looks good to me (this time I actually finished the review!)
I left some suggestions on how to make the implementation possibly better, but I think any of them could be done as a follow on.
The only thing I think we should really do prior to merging this is to add additional coverage in the sql_integration tests.
#[derive(Debug)]
struct MedianAccumulator {
    data_type: DataType,
    all_values: Vec<ArrayRef>,
I wonder if you would be better served here by using an ArrayBuilder (though I realize they are strongly typed so it might be more awkward -- though it is likely faster)
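Something like this sketch, perhaps (assuming f64 input and a recent arrow where append_value is infallible; the struct name is illustrative):

use std::sync::Arc;
use arrow::array::{ArrayRef, Float64Array, Float64Builder};

/// One strongly typed accumulator per primitive type: values go straight
/// into a builder instead of accumulating a Vec of intermediate ArrayRefs.
struct MedianAccumulatorF64 {
    builder: Float64Builder,
}

impl MedianAccumulatorF64 {
    fn new() -> Self {
        Self { builder: Float64Builder::new() }
    }

    fn update_batch(&mut self, values: &Float64Array) {
        // iter().flatten() skips nulls, so only concrete values are buffered
        for v in values.iter().flatten() {
            self.builder.append_value(v);
        }
    }

    fn finish(&mut self) -> ArrayRef {
        Arc::new(self.builder.finish())
    }
}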
    .map(|v| AggregateState::Array(v.clone()))
    .collect();
if vec.is_empty() {
    match self.data_type {
Is it correct to produce a single [0] element array? Wouldn't that mean that the 0 is now included in the median calculation even though it was not in the original data?
These arrays have length 0. I just pushed a refactor to clean this up and make it more obvious.
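To make that concrete (a hedged illustration, not code from the PR): a zero-length typed array is a valid empty state and contributes nothing to the median, unlike a one-element [0] array.

use std::sync::Arc;
use arrow::array::{ArrayRef, Int8Array};

// Zero-length Int8 array: an empty state holding no values at all.
fn empty_int8_state() -> ArrayRef {
    Arc::new(Int8Array::from(Vec::<i8>::new()))
}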
fn evaluate(&self) -> Result<ScalarValue> {
    match self.all_values[0].data_type() {
        DataType::Int8 => median!(self, arrow::datatypes::Int8Type, Int8, 2),
Instead of using a macro here, I wonder if you could use the concat and take kernels:
https://docs.rs/arrow/19.0.0/arrow/compute/kernels/concat/index.html
https://docs.rs/arrow/19.0.0/arrow/compute/kernels/take/index.html
Something like (untested):
let sorted = sort(concat(&self.all_values));
let len = sorted.len();
let mid = len / 2;
if len % 2 == 0 {
let indexes: UInt64Array = [mid-1, mid].into_iter().collect();
// 🤔 Not sure how to do an average:
let values = average(take(sorted, indexes));
ScalarValue::try_from_array(values, 0)
} else {
ScalarValue::try_from_array(sorted, mid)
}
But the need for an average stymies that -- though I guess we could implement an average kernel in datafusion and then put it back into arrow.
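One hedged way around that, assuming f64 input with no nulls (the function name is illustrative): extract the two middle values after sorting and average them in plain Rust, so no average kernel is needed.

use arrow::array::{Array, Float64Array};
use arrow::compute::kernels::concat::concat;
use arrow::compute::kernels::sort::sort;
use arrow::error::Result;

fn median_f64(arrays: &[&dyn Array]) -> Result<Option<f64>> {
    // Combine all buffered arrays, then sort the combined result.
    let combined = concat(arrays)?;
    let sorted = sort(&combined, None)?;
    let values = sorted
        .as_any()
        .downcast_ref::<Float64Array>()
        .expect("f64 input assumed");
    let len = values.len();
    if len == 0 {
        return Ok(None);
    }
    let mid = len / 2;
    if len % 2 == 0 {
        // Even count: average the two middle values directly.
        Ok(Some((values.value(mid - 1) + values.value(mid)) / 2.0))
    } else {
        Ok(Some(values.value(mid)))
    }
}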
}

/// Combine all non-null values from provided arrays into a single array
fn combine_arrays<T: ArrowPrimitiveType>(arrays: &[ArrayRef]) -> Result<ArrayRef> {
You might be able to do this with concat and take as well. Untested:
let final_array = concat(arrays);
let indexes = final_array.iter().enumerate().filter_map(|(i, v)| v.map(|_| i)).collect();
take(final_array, indexes)
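Another hedged option: combine with concat, then drop the nulls with the is_not_null predicate and the filter kernel (both in arrow's compute module), avoiding the index array that take would need.

use arrow::array::{Array, ArrayRef};
use arrow::compute::is_not_null;
use arrow::compute::kernels::concat::concat;
use arrow::compute::kernels::filter::filter;
use arrow::error::Result;

/// Combine all non-null values from the provided arrays into one array.
fn combine_non_null(arrays: &[ArrayRef]) -> Result<ArrayRef> {
    let refs: Vec<&dyn Array> = arrays.iter().map(|a| a.as_ref()).collect();
    let combined = concat(&refs)?;
    // Boolean mask: true exactly where the slot holds a value.
    let mask = is_not_null(combined.as_ref())?;
    filter(combined.as_ref(), &mask)
}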
// specific language governing permissions and limitations
// under the License.

//! Utilities used in aggregates
👍
@@ -221,7 +221,7 @@ async fn csv_query_stddev_6() -> Result<()> {
}

#[tokio::test]
async fn csv_query_median_1() -> Result<()> {
If possible, I would recommend adding a basic test in SQL for a median for all the different data types that are supported (not just on aggregate_test_100 but a dedicated test setup with known data, maybe integers 10, 9, 8, ... 0).
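A hedged sketch of what such a dedicated test could look like (written self-contained rather than with the suite's helpers; it assumes the median of Int64 input stays integral):

use std::sync::Arc;
use arrow::array::{Array, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::test]
async fn median_known_integers() -> Result<()> {
    // Known data: the integers 0..=10, whose median is 5.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from((0i64..=10).collect::<Vec<i64>>()))],
    )?;
    let ctx = SessionContext::new();
    ctx.register_table("t", Arc::new(MemTable::try_new(schema, vec![vec![batch]])?))?;
    let results = ctx.sql("SELECT median(v) FROM t").await?.collect().await?;
    let col = results[0]
        .column(0)
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("integral median assumed");
    assert_eq!(col.value(0), 5);
    Ok(())
}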
Added in ef1effd
Thanks for the review @alamb. I will work on addressing the feedback over the next couple of days.
The main concern that departure For the
Clever idea. I'm not sure that would work for floating-point types?
@alamb It isn't clear to me which approach you are referring to here. I assume you are saying that the approach in this PR of using Array rather than ScalarList::Vec is likely more performant?
I guess I was saying that making lots of small
It would probably work but likely take up more space 😆 In general @yjshen's approach would be good for low cardinality (relatively low numbers of distinct values) and not as good for high cardinality (relatively high numbers of distinct values) -- floating point data happens to often be quite high cardinality.
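For reference, a rough sketch of that value-to-count idea (names hypothetical): keep one count per distinct value and walk the sorted keys to the middle position, so space grows with the number of distinct values rather than the number of rows.

use std::collections::BTreeMap;

/// Median over a value -> count map; BTreeMap keys iterate in sorted order.
fn median_from_counts(counts: &BTreeMap<i64, usize>) -> Option<f64> {
    let n: usize = counts.values().sum();
    if n == 0 {
        return None;
    }
    // 0-based positions of the middle element(s)
    let (lo, hi) = if n % 2 == 0 { (n / 2 - 1, n / 2) } else { (n / 2, n / 2) };
    let (mut lo_val, mut hi_val) = (None, None);
    let mut seen = 0usize;
    for (&v, &c) in counts {
        let next = seen + c;
        if lo_val.is_none() && lo < next {
            lo_val = Some(v);
        }
        if hi_val.is_none() && hi < next {
            hi_val = Some(v);
            break;
        }
        seen = next;
    }
    Some((lo_val? as f64 + hi_val? as f64) / 2.0)
}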
@alamb I think this is ready for another look. I added tests and filed a couple of follow-on issues: |
}

#[tokio::test]
async fn median_u8() -> Result<()> {
nice
"approx_median", | ||
DataType::Float64, | ||
Arc::new(Float64Array::from(vec![1.1, f64::NAN, f64::NAN, f64::NAN])), | ||
"NaN", // probably not the desired behavior? - see https://github.com/apache/arrow-datafusion/issues/3039 |
testing for the win!
@@ -127,7 +127,13 @@ where
    l.as_ref().parse::<f64>().unwrap(),
    r.as_str().parse::<f64>().unwrap(),
);
assert!((l - r).abs() <= 2.0 * f64::EPSILON);
if l.is_nan() || r.is_nan() {
👍
Benchmark runs are scheduled for baseline = 581934d and contender = 245def0. 245def0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #2925
Rationale for this change
Needed for h2o benchmarks.
What changes are included in this PR?
- median aggregate
- unwrap
Are there any user-facing changes?
Yes, if implementing UDAFs.