Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add StringArray::num_chars for calculating number of characters #1503

Merged
merged 3 commits into from
Mar 30, 2022

Conversation

HaoYang670
Copy link
Contributor

Which issue does this PR close?

Closes #1493 .

Rationale for this change

We need a method to calculate the number of chars for StringArray

What changes are included in this PR?

Provide a pub method num_chars.

delete unchecked fn
update doc

Signed-off-by: remzi <[email protected]>
/// This function has `O(n)` time complexity where `n` is the string length.
/// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
/// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
pub fn num_chars(&self, i: usize) -> usize {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find an elegant way to make the returned type as OffsetSize.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think returning the length as a usize is a good API and is what would be expected by rust programmers 👍

/// # Performance
/// This function has `O(n)` time complexity where `n` is the string length.
/// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
/// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length of string == number of chars when all chars are in 0000 ~ 007F

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you have considered using array.value(i).chars().len() to count utf8 codepoints, as described in
https://stackoverflow.com/questions/46290655/get-the-string-length-in-characters-in-rust?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you have considered using array.value(i).chars().len() to count utf8 codepoints, as described in https://stackoverflow.com/questions/46290655/get-the-string-length-in-characters-in-rust?

Thank you for your helpful suggestion, I will have a try!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 29, 2022
update doc and test

Signed-off-by: remzi <[email protected]>
@codecov-commenter
Copy link

Codecov Report

Merging #1503 (bc9ef92) into master (f1eba3c) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1503      +/-   ##
==========================================
+ Coverage   82.70%   82.73%   +0.03%     
==========================================
  Files         188      188              
  Lines       54403    54359      -44     
==========================================
- Hits        44993    44974      -19     
+ Misses       9410     9385      -25     
Impacted Files Coverage Δ
arrow/src/array/array_string.rs 97.70% <100.00%> (+0.04%) ⬆️
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
arrow/src/array/transform/mod.rs 86.46% <0.00%> (-0.12%) ⬇️
arrow/src/buffer/mutable.rs 90.53% <0.00%> (+0.02%) ⬆️
arrow/src/compute/kernels/take.rs 95.41% <0.00%> (+0.13%) ⬆️
parquet/src/arrow/array_reader.rs 86.70% <0.00%> (+2.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f1eba3c...bc9ef92. Read the comment docs.

@@ -377,9 +386,9 @@ mod tests {

#[test]
fn test_string_array_from_u8_slice() {
let values: Vec<&str> = vec!["hello", "", "parquet"];
let values: Vec<&str> = vec!["hello", "", "A£ऀ𖼚𝌆৩ƐZ"];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alamb
Copy link
Contributor

alamb commented Mar 30, 2022

Thanks @HaoYang670 and @liukun4515 for the review

@alamb alamb merged commit f9975c5 into apache:master Mar 30, 2022
@alamb alamb changed the title Support calculating number of chars for StringArray Add StringArray::num_chars for calculating number of characters Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support calculate length by chars for StringArray
4 participants