Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve doc string for substring kernel #1529

Merged
merged 3 commits into from
Apr 10, 2022

Conversation

HaoYang670
Copy link
Contributor

@HaoYang670 HaoYang670 commented Apr 8, 2022

Signed-off-by: remzi [email protected]

Which issue does this PR close?

Closes #1478.

Rationale for this change

Tell users the pitfall of substring function.

Are there any user-facing changes?

Doc is updated.

Follow on work:

#1531

Signed-off-by: remzi <[email protected]>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 8, 2022
Signed-off-by: remzi <[email protected]>
@codecov-commenter
Copy link

codecov-commenter commented Apr 8, 2022

Codecov Report

Merging #1529 (5cbc229) into master (252a983) will decrease coverage by 0.00%.
The diff coverage is n/a.

❗ Current head 5cbc229 differs from pull request most recent head ab62bbc. Consider uploading reports for the commit ab62bbc to get more accurate results

@@            Coverage Diff             @@
##           master    #1529      +/-   ##
==========================================
- Coverage   82.78%   82.77%   -0.01%     
==========================================
  Files         190      190              
  Lines       54827    54827              
==========================================
- Hits        45386    45384       -2     
- Misses       9441     9443       +2     
Impacted Files Coverage Δ
arrow/src/compute/kernels/substring.rs 98.31% <ø> (ø)
arrow/src/ffi.rs 87.52% <ø> (ø)
parquet/src/arrow/schema.rs 85.68% <ø> (ø)
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.69%) ⬇️
arrow/src/array/transform/mod.rs 86.46% <0.00%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 252a983...ab62bbc. Read the comment docs.

@alamb alamb added the documentation Improvements or additions to documentation label Apr 8, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ thank you so much @HaoYang670
Looks very nice

arrow/src/compute/kernels/substring.rs Outdated Show resolved Hide resolved
/// let array = StringArray::from(vec![Some("E=mc²")]);
/// let result = substring(&array, -1, &None).unwrap();
/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Is it worth a ticket to add utf8 validation to this kernel? It should be straightforward, but will likely slow it down.

Copy link
Contributor Author

@HaoYang670 HaoYang670 Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid the time complexity will be changed. Currently, the time complexity is O(length of offset buffer). Adding validation might increase the time complexity to O(length of offset buffer + length of value buffer) because we have to read every byte in the value buffer in the worst case:
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L1090-L1112

Copy link
Contributor Author

@HaoYang670 HaoYang670 Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, is it necessary to mark substring unsafe?
Besides, we could implement a safe version of substring and also substring_by_char:

name Safety Performance
unsafe substring_unckecked user should make sure valid utf8 substrings can be gotten with start and length fast
substring_checked None slow
substring_by_char None slow

Users should always make sure all values of the input array are in valid utf-8 format.

This will improve the safety of the substring kernel, but break the public API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, Arrow2 has implemented substring for binary array. I am not sure whether we need to do it.

Copy link
Contributor Author

@HaoYang670 HaoYang670 Apr 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also add substring support for ListArray, FixedSizeListArray and FixedSizeBinaryArray

Copy link
Contributor

@alamb alamb Apr 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HaoYang670 I think these are all good ideas (making safe / unsafe versions, and adding substring for binary etc). I would be happy to create tickets for them if that would help

If we are going to change the substring kernel, I suggest we do the following to follow Rust's safe-by-default mantra:

name Safety Performance Notes
unsafe substring_bytes_unchecked unsafe fast what is currently called substring - user should make sure valid utf8 substrings can be gotten with start and length
substring_bytes safe slow Implement substring by bytes - throws error of invalid utf-8 sequence is created
substring safe slow Implement substring by char, ensuring all substrings are utf-8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alamb . I have filed as epic issue: #1531 .
Please review.

@alamb alamb merged commit 08f386a into apache:master Apr 10, 2022
@alamb alamb changed the title update the doc of substring Improve doc string for substring kernel Apr 10, 2022
@HaoYang670 HaoYang670 deleted the update_substring_doc branch April 10, 2022 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The substring kernel panics when chars > U+0x007F
3 participants