-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve doc string for substring
kernel
#1529
Conversation
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #1529 +/- ##
==========================================
- Coverage 82.78% 82.77% -0.01%
==========================================
Files 190 190
Lines 54827 54827
==========================================
- Hits 45386 45384 -2
- Misses 9441 9443 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ thank you so much @HaoYang670
Looks very nice
/// let array = StringArray::from(vec![Some("E=mc²")]); | ||
/// let result = substring(&array, -1, &None).unwrap(); | ||
/// let result = result.as_any().downcast_ref::<StringArray>().unwrap(); | ||
/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 Is it worth a ticket to add utf8 validation to this kernel? It should be straightforward, but will likely slow it down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am afraid the time complexity will be changed. Currently, the time complexity is O(length of offset buffer)
. Adding validation might increase the time complexity to O(length of offset buffer + length of value buffer)
because we have to read every byte in the value buffer in the worst case:
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L1090-L1112
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, is it necessary to mark substring
unsafe?
Besides, we could implement a safe version of substring
and also substring_by_char
:
name | Safety | Performance |
---|---|---|
unsafe substring_unckecked | user should make sure valid utf8 substrings can be gotten with start and length |
fast |
substring_checked | None | slow |
substring_by_char | None | slow |
Users should always make sure all values of the input array are in valid utf-8 format.
This will improve the safety of the substring
kernel, but break the public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, Arrow2 has implemented substring
for binary array. I am not sure whether we need to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also add substring
support for ListArray
, FixedSizeListArray
and FixedSizeBinaryArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HaoYang670 I think these are all good ideas (making safe / unsafe versions, and adding substring for binary etc). I would be happy to create tickets for them if that would help
If we are going to change the substring kernel, I suggest we do the following to follow Rust's safe-by-default mantra:
name | Safety | Performance | Notes |
---|---|---|---|
unsafe substring_bytes_unchecked | unsafe | fast | what is currently called substring - user should make sure valid utf8 substrings can be gotten with start and length |
substring_bytes | safe | slow | Implement substring by bytes - throws error of invalid utf-8 sequence is created |
substring | safe | slow | Implement substring by char, ensuring all substrings are utf-8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Andrew Lamb <[email protected]>
substring
substring
kernel
Signed-off-by: remzi [email protected]
Which issue does this PR close?
Closes #1478.
Rationale for this change
Tell users the
pitfall
ofsubstring
function.Are there any user-facing changes?
Doc is updated.
Follow on work:
#1531