-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] List slice expression #2479
[FEAT] List slice expression #2479
Conversation
f71d007
to
1099dd0
Compare
1099dd0
to
446ee32
Compare
daft/series.py
Outdated
@@ -541,6 +541,8 @@ def sum(self) -> Series: | |||
assert self._series is not None | |||
return Series._from_pyseries(self._series.sum()) | |||
|
|||
# TODO: Is this needed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Mostly for testing only)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
src/daft-core/src/array/ops/list.rs
Outdated
// Given an i64 array that may have either 1 or `self.len()` elements, create an iterator with | ||
// `self.len()` elements. If there was originally 1 element, we repeat this element `self.len()` | ||
// times, otherwise we simply take the original array. | ||
fn create_iter<'a>(&'a self, arr: &'a Int64Array) -> Box<dyn Iterator<Item = i64> + '_> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's dedup this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/daft-core/src/array/ops/list.rs
Outdated
} | ||
} | ||
let total_capacity = *new_offsets.last().unwrap(); | ||
let default = Series::full_null("null", self.child_data_type(), 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think our growable allows you to add_null without needing a dummy array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this unnecessary default value
src/daft-core/src/array/ops/list.rs
Outdated
self.name(), | ||
self.child_data_type(), | ||
vec![&self.flat_child, &default], | ||
true, // TODO is this right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are very sure that you won't be adding any nulls, you can set this to false
for some performance gains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/daft-core/src/array/ops/list.rs
Outdated
total_capacity as usize, | ||
); | ||
for (i, start) in slicing_indexes.iter().enumerate() { | ||
if *start >= 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's follow python behavior instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, changed the API
start_iter: impl Iterator<Item = i64>, | ||
length_iter: impl Iterator<Item = i64>, | ||
) -> DaftResult<Series> { | ||
let list_size = self.fixed_element_len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably dedup a lot of the logic, just need to turn the offsets into an interable or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deduped
src/daft-core/src/series/ops/list.rs
Outdated
self.fixed_size_list()?.get_slices(start_arr, length_arr) | ||
} | ||
dt => Err(DaftError::TypeError(format!( | ||
"list splice not implemented for {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"list splice not implemented for {}", | |
"list slice not implemented for {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch! Fixed
table.eval_expression_list([col("col").list.slice()]) | ||
with pytest.raises(TypeError, match="takes 3 positional arguments but 4 were given"): | ||
table.eval_expression_list([col("col").list.slice(0, 0, 0)]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also other corner-case (completely empty MicroPartition). This comes up quite a lot, especially in distributed settings when a filter is applied and completely removes data from a MicroPartition (making it empty).
You can make a well-typed empty series like this:
pa.array([], type=pa.list_(...))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this test
tests/table/list/test_list_slice.py
Outdated
def test_list_slice_non_list_type(): | ||
table = MicroPartition.from_pydict( | ||
{ | ||
"mapcol": [{"a": 1}, {"b": 1}, {"c": 1}], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a struct 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😂 fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jaychia thanks for the review! I addressed your comments
table.eval_expression_list([col("col").list.slice()]) | ||
with pytest.raises(TypeError, match="takes 3 positional arguments but 4 were given"): | ||
table.eval_expression_list([col("col").list.slice(0, 0, 0)]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this test
daft/series.py
Outdated
@@ -541,6 +541,8 @@ def sum(self) -> Series: | |||
assert self._series is not None | |||
return Series._from_pyseries(self._series.sum()) | |||
|
|||
# TODO: Is this needed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
src/daft-core/src/array/ops/list.rs
Outdated
// Given an i64 array that may have either 1 or `self.len()` elements, create an iterator with | ||
// `self.len()` elements. If there was originally 1 element, we repeat this element `self.len()` | ||
// times, otherwise we simply take the original array. | ||
fn create_iter<'a>(&'a self, arr: &'a Int64Array) -> Box<dyn Iterator<Item = i64> + '_> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/daft-core/src/array/ops/list.rs
Outdated
} | ||
} | ||
let total_capacity = *new_offsets.last().unwrap(); | ||
let default = Series::full_null("null", self.child_data_type(), 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this unnecessary default value
src/daft-core/src/array/ops/list.rs
Outdated
self.name(), | ||
self.child_data_type(), | ||
vec![&self.flat_child, &default], | ||
true, // TODO is this right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/daft-core/src/array/ops/list.rs
Outdated
total_capacity as usize, | ||
); | ||
for (i, start) in slicing_indexes.iter().enumerate() { | ||
if *start >= 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, changed the API
start_iter: impl Iterator<Item = i64>, | ||
length_iter: impl Iterator<Item = i64>, | ||
) -> DaftResult<Series> { | ||
let list_size = self.fixed_element_len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deduped
src/daft-core/src/series/ops/list.rs
Outdated
self.fixed_size_list()?.get_slices(start_arr, length_arr) | ||
} | ||
dt => Err(DaftError::TypeError(format!( | ||
"list splice not implemented for {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch! Fixed
tests/table/list/test_list_slice.py
Outdated
def test_list_slice_non_list_type(): | ||
table = MicroPartition.from_pydict( | ||
{ | ||
"mapcol": [{"a": 1}, {"b": 1}, {"c": 1}], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😂 fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lookin good -- congrats on 1st PR!!!!!!!!!!!!
src/daft-core/src/array/ops/list.rs
Outdated
} | ||
|
||
pub fn get_slices_helper( | ||
parent_offsets: &arrow2::offset::OffsetsBuffer<i64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can actually make this an impl Iterator<Item = i64>
(since you mostly use it that way anyways!) and then the fixed size list case would be quite clean: std::iter::repeat(element_len).take(list_len)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks for the suggestion!
Adds an expression that returns a specified subset of elements from a list.
The API to use this expression is
daft.Expression.list.slice(start: int | Expression, end: int | Expression)
.start
is the 0-indexed position of the list to start retrieving elements. If this value is negative, then it's the position from the end of the array. i.e.-1
points to the last element of the arrayend
is the 0-indexed position of the list to stop retrieving elements. If this value is negative, then it's the position from the end of the array. i.e.-1
points to the last element of the array