Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added Scalar API #56

Merged
merged 3 commits into from
Aug 11, 2021
Merged

Added Scalar API #56

merged 3 commits into from
Aug 11, 2021

Conversation

jorgecarleitao
Copy link
Owner

@jorgecarleitao jorgecarleitao commented Apr 25, 2021

This PR adds support for scalar values, the dimension 0 version of an Array. See README with the design choices.

It does not enforce any in-memory specification, but their struct alignments are semantically equivalent to arrays.

The rational for this scalar is that it allows some of our compute operators to be written as generic functions. The two most relevant use-cases:

  1. aggregates, Fn(Array) -> Scalar
  2. arithmetic with scalars, Fn(Array, Scalar) -> Array

This PR demonstrates the usefulness on the aggregates.

@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Apr 25, 2021
@codecov
Copy link

codecov bot commented Apr 25, 2021

Codecov Report

Merging #56 (ffb73d2) into main (fa3c2ce) will decrease coverage by 0.57%.
The diff coverage is 47.30%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #56      +/-   ##
==========================================
- Coverage   76.81%   76.23%   -0.58%     
==========================================
  Files         229      238       +9     
  Lines       19617    19966     +349     
==========================================
+ Hits        15068    15222     +154     
- Misses       4549     4744     +195     
Impacted Files Coverage Δ
src/compute/aggregate/min_max.rs 59.16% <0.00%> (-27.10%) ⬇️
src/scalar/list.rs 0.00% <0.00%> (ø)
src/scalar/mod.rs 0.00% <0.00%> (ø)
src/scalar/null.rs 0.00% <0.00%> (ø)
src/scalar/struct_.rs 0.00% <0.00%> (ø)
src/scalar/equal.rs 15.90% <15.90%> (ø)
src/compute/aggregate/sum.rs 77.10% <52.94%> (-17.44%) ⬇️
src/scalar/boolean.rs 89.65% <89.65%> (ø)
src/scalar/binary.rs 94.28% <94.28%> (ø)
src/scalar/utf8.rs 94.28% <94.28%> (ø)
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fa3c2ce...ffb73d2. Read the comment docs.

@jorgecarleitao
Copy link
Owner Author

@elferherrera what do you think about this?

It is a bit different from DataFusions' current ScalarValue because it does not use an enum, and has one variation per physical, not logical, data type, to be aligned with arrow2's overall design.

The API would be along the lines of

let scalar = max(array);

match scalar.data_type() {
     DataType::Float64 => scalar.as_any().downcast_ref<PrimitiveScalar<f64>>(),
     DataType::Int64 | DataType::Date64 | DataType::Timestamp(_, _) | ... => scalar.as_any().downcast_ref<PrimitiveScalar<i64>>(),
     ...
}

which is the same idiom that arrow2 uses for arrays, where each physical type has a unique in-memory representation.

@nevi-me
Copy link
Contributor

nevi-me commented Apr 26, 2021

  • aggregates, Fn(Array) -> Scalar
  • arithmetic with scalars, Fn(Array, Scalar) -> Array

I like this approach, and using a trait instead of an enum offers flexibility with implementing a possible Datum like in C++, where kernels can take Fn(Datum, Datum) -> Array | Scalar | Datum depending on the kernel function

@elferherrera
Copy link
Contributor

@jorgecarleitao I also like this approach. It makes Scalar a more robust representation. I would guess that this Scalar trait object will have to be able to be added to a PrimitiveArray, right? We could use the arithmetic traits to help with this operation.

@jorgecarleitao
Copy link
Owner Author

@elferherrera , I think that we have the exact same problem as dyn Array: some kernels support, others do not.

@elferherrera
Copy link
Contributor

@elferherrera , I think that we have the exact same problem as dyn Array: some kernels support, others do not.

@jorgecarleitao Im lost on this one. Do you mean the same issue we had with take?

@sundy-li
Copy link
Collaborator

sundy-li commented Aug 4, 2021

In the columnar database system, Scalar represents one-row value. In datafusion, there is DataValue represents scalar value.

But I think to take number * 3 as multi(Scalar, Array) is not a good way, because there have different sizes(the length of dimension). We can introduce ConstantArray which implement the Array and iterator trait.

Thus the API could be much more cleaner, and everything in compute kernels is op(Array,Array)

@jorgecarleitao
Copy link
Owner Author

Hey,

Thanks for the feedback. I understand the idea. numpy uses the same approach with the shape.

I see three concerns with a "constant Array":

The first is that we are effectively extending the core trait with an out-of-spec encoding: const array is essentially an Array with an encoding compressing all values to the same value. Thus, I would feel more confortable working something like this to the arrow spec.

The second is that iter cannot be used for SIMD because Iterator<i32> (as opposed to &[i32]) is not guaranteed to be contiguous. So, there will be a lot of optimisations missed out in this case because we can no longer write things like add(&[i32], &[i32], &mut [i32]); instead, we, need to write add(&[i32], IntoIter, &mut [i32]) to accumulate to the constant array.

With scalars, we can write add(&[i32], &[i32], &mut [i32]) and add(&[i32], i32, &mut [i32]), both of which can be optimised to SIMD.

The third concern is that all our dynamic operators use match array.data_type() to downcast to concrete implementations. If we extend Array with a different encoding, we would need to an extra item in the match, e.g. match (is_scalar, array.data_type()). This would result in the same number of branches for most of our cases. There are also cases where an operation over a scalar does not make sense (e.g. filter, limit, shift, take), which would again make the API a bit less intuitive.

The scalar API proposed here is mostly used to simply consumers' code with an API like sum(&dyn array) -> Box<dyn Scalar>.

The second motivation is to make the implementation of the Union easier to use (as a Union's value's type changes row by row).

@jorgecarleitao jorgecarleitao merged commit 4b987fc into main Aug 11, 2021
@jorgecarleitao jorgecarleitao deleted the scalar branch August 11, 2021 05:14
@jorgecarleitao jorgecarleitao changed the title Scalar Added Scalar API Aug 11, 2021
@jorgecarleitao jorgecarleitao removed the enhancement An improvement to an existing feature label Aug 11, 2021
@jorgecarleitao jorgecarleitao added feature A new feature and removed backwards-incompatible labels Aug 11, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature A new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants