Improve doc string for `substring` kernel #1529

HaoYang670 · 2022-04-08T12:15:55Z

Signed-off-by: remzi [email protected]

Which issue does this PR close?

Closes #1478.

Rationale for this change

Tell users the pitfall of substring function.

Are there any user-facing changes?

Doc is updated.

Follow on work:

#1531

Signed-off-by: remzi <[email protected]>

codecov-commenter · 2022-04-08T12:42:39Z

Codecov Report

Merging #1529 (5cbc229) into master (252a983) will decrease coverage by 0.00%.
The diff coverage is n/a.

❗ Current head 5cbc229 differs from pull request most recent head ab62bbc. Consider uploading reports for the commit ab62bbc to get more accurate results

@@            Coverage Diff             @@
##           master    #1529      +/-   ##
==========================================
- Coverage   82.78%   82.77%   -0.01%     
==========================================
  Files         190      190              
  Lines       54827    54827              
==========================================
- Hits        45386    45384       -2     
- Misses       9441     9443       +2

Impacted Files	Coverage Δ
arrow/src/compute/kernels/substring.rs	`98.31% <ø> (ø)`
arrow/src/ffi.rs	`87.52% <ø> (ø)`
parquet/src/arrow/schema.rs	`85.68% <ø> (ø)`
parquet_derive/src/parquet_field.rs	`65.75% <0.00%> (-0.69%)`	⬇️
arrow/src/array/transform/mod.rs	`86.46% <0.00%> (+0.11%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 252a983...ab62bbc. Read the comment docs.

alamb

❤️ thank you so much @HaoYang670
Looks very nice

arrow/src/compute/kernels/substring.rs

alamb · 2022-04-08T18:21:49Z

arrow/src/compute/kernels/substring.rs

+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let result = substring(&array, -1, &None).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format


🤔 Is it worth a ticket to add utf8 validation to this kernel? It should be straightforward, but will likely slow it down.

I am afraid the time complexity will be changed. Currently, the time complexity is O(length of offset buffer). Adding validation might increase the time complexity to O(length of offset buffer + length of value buffer) because we have to read every byte in the value buffer in the worst case:
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L1090-L1112

By the way, is it necessary to mark substring unsafe?
Besides, we could implement a safe version of substring and also substring_by_char:

name Safety Performance

unsafe substring_unckecked user should make sure valid utf8 substrings can be gotten with start and length fast

substring_checked None slow

substring_by_char None slow

Users should always make sure all values of the input array are in valid utf-8 format.

This will improve the safety of the substring kernel, but break the public API.

Also, Arrow2 has implemented substring for binary array. I am not sure whether we need to do it.

We could also add substring support for ListArray, FixedSizeListArray and FixedSizeBinaryArray

@HaoYang670 I think these are all good ideas (making safe / unsafe versions, and adding substring for binary etc). I would be happy to create tickets for them if that would help

If we are going to change the substring kernel, I suggest we do the following to follow Rust's safe-by-default mantra:

name Safety Performance Notes

unsafe substring_bytes_unchecked unsafe fast what is currently called substring - user should make sure valid utf8 substrings can be gotten with start and length

substring_bytes safe slow Implement substring by bytes - throws error of invalid utf-8 sequence is created

substring safe slow Implement substring by char, ensuring all substrings are utf-8

Hi @alamb . I have filed as epic issue: #1531 .
Please review.

Co-authored-by: Andrew Lamb <[email protected]>

update doc

70ee47c

Signed-off-by: remzi <[email protected]>

github-actions bot added the arrow Changes to the arrow crate label Apr 8, 2022

update doc

445ef41

Signed-off-by: remzi <[email protected]>

alamb added the documentation Improvements or additions to documentation label Apr 8, 2022

alamb approved these changes Apr 8, 2022

View reviewed changes

Update arrow/src/compute/kernels/substring.rs

ab62bbc

Co-authored-by: Andrew Lamb <[email protected]>

alamb merged commit 08f386a into apache:master Apr 10, 2022

alamb changed the title ~~update the doc of substring~~ Improve doc string for substring kernel Apr 10, 2022

HaoYang670 mentioned this pull request Apr 10, 2022

Epic: enhance the substring kernel #1531

Open

12 tasks

HaoYang670 deleted the update_substring_doc branch April 10, 2022 12:16

HaoYang670 mentioned this pull request Apr 12, 2022

Mark the current substring function as unsafe and rename it. #1541

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve doc string for `substring` kernel #1529

Improve doc string for `substring` kernel #1529

HaoYang670 commented Apr 8, 2022 •

edited by alamb

Loading

codecov-commenter commented Apr 8, 2022 •

edited

Loading

alamb left a comment

alamb Apr 8, 2022

HaoYang670 Apr 8, 2022 •

edited

Loading

HaoYang670 Apr 8, 2022 •

edited

Loading

HaoYang670 Apr 9, 2022

HaoYang670 Apr 10, 2022 •

edited

Loading

alamb Apr 10, 2022 •

edited

Loading

HaoYang670 Apr 10, 2022

name	Safety	Performance
unsafe substring_unckecked	user should make sure valid utf8 substrings can be gotten with `start` and `length`	fast
substring_checked	None	slow
substring_by_char	None	slow

name	Safety	Performance	Notes
unsafe substring_bytes_unchecked	unsafe	fast	what is currently called `substring` - user should make sure valid utf8 substrings can be gotten with `start` and `length`
substring_bytes	safe	slow	Implement substring by bytes - throws error of invalid utf-8 sequence is created
substring	safe	slow	Implement substring by char, ensuring all substrings are utf-8

Improve doc string for substring kernel #1529

Improve doc string for substring kernel #1529

Conversation

HaoYang670 commented Apr 8, 2022 • edited by alamb Loading

Which issue does this PR close?

Rationale for this change

Are there any user-facing changes?

Follow on work:

codecov-commenter commented Apr 8, 2022 • edited Loading

Codecov Report

alamb left a comment

Choose a reason for hiding this comment

alamb Apr 8, 2022

Choose a reason for hiding this comment

HaoYang670 Apr 8, 2022 • edited Loading

Choose a reason for hiding this comment

HaoYang670 Apr 8, 2022 • edited Loading

Choose a reason for hiding this comment

HaoYang670 Apr 9, 2022

Choose a reason for hiding this comment

HaoYang670 Apr 10, 2022 • edited Loading

Choose a reason for hiding this comment

alamb Apr 10, 2022 • edited Loading

Choose a reason for hiding this comment

HaoYang670 Apr 10, 2022

Choose a reason for hiding this comment

Improve doc string for `substring` kernel #1529

Improve doc string for `substring` kernel #1529

HaoYang670 commented Apr 8, 2022 •

edited by alamb

Loading

codecov-commenter commented Apr 8, 2022 •

edited

Loading

HaoYang670 Apr 8, 2022 •

edited

Loading

HaoYang670 Apr 8, 2022 •

edited

Loading

HaoYang670 Apr 10, 2022 •

edited

Loading

alamb Apr 10, 2022 •

edited

Loading