Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding ability to parse float from number with leading decimal #831

Merged
merged 6 commits into from
Nov 9, 2021

Conversation

brianrackle
Copy link
Contributor

Which issue does this PR close?

Closes #830

Rationale for this change

The issue this fixes results in real-world data sets being incorrectly inferred as strings when they are floats.

What changes are included in this PR?

Modifying the regular expression to make the digit proceeding the decimal optional

Are there any user-facing changes?

CSV inferred schemas which include floating-point data doesn't need a proceeding zero for floats between 0 and 1

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 15, 2021
@codecov-commenter
Copy link

codecov-commenter commented Oct 16, 2021

Codecov Report

Merging #831 (f8a2dc1) into master (05eb63f) will decrease coverage by 0.14%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #831      +/-   ##
==========================================
- Coverage   82.45%   82.31%   -0.15%     
==========================================
  Files         168      168              
  Lines       48175    48092      -83     
==========================================
- Hits        39723    39587     -136     
- Misses       8452     8505      +53     
Impacted Files Coverage Δ
arrow/src/csv/reader.rs 90.01% <100.00%> (-1.77%) ⬇️
arrow/src/array/transform/fixed_binary.rs 75.00% <0.00%> (-3.95%) ⬇️
parquet/src/schema/visitor.rs 67.10% <0.00%> (-2.64%) ⬇️
arrow/src/compute/kernels/cast_utils.rs 87.50% <0.00%> (-2.50%) ⬇️
parquet/src/encodings/decoding.rs 90.45% <0.00%> (-2.02%) ⬇️
arrow/src/datatypes/ffi.rs 70.85% <0.00%> (-2.02%) ⬇️
parquet/src/arrow/arrow_reader.rs 89.32% <0.00%> (-1.95%) ⬇️
arrow/src/ffi.rs 84.69% <0.00%> (-1.89%) ⬇️
arrow/src/util/data_gen.rs 75.44% <0.00%> (-1.66%) ⬇️
... and 70 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05eb63f...f8a2dc1. Read the comment docs.

@@ -1341,6 +1341,7 @@ mod tests {
assert_eq!(infer_field_schema("\"123\""), DataType::Utf8);
assert_eq!(infer_field_schema("10"), DataType::Int64);
assert_eq!(infer_field_schema("10.2"), DataType::Float64);
assert_eq!(infer_field_schema(".2"), DataType::Float64);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also try to cover cases like '2.'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, your suggesting that should also be inferred and parsed as a valid float as well right?

Copy link
Contributor

@alamb alamb Oct 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brianrackle can you please rebase this PR (it should then pass CI) and let us know if you will be able to add the test that @jimexist suggests?

I am starting to prepare for the 6.1 release (I plan to build the release candidate late next week) and would like to get this in if possible

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a look at the JSON parser in more detail and running some tests to see how the JSON parser deals with a case like '2.'. Reason being is that I think '2.' is a little less clear how to handle because that value can be represented by an Int, so the question is does the presence of a decimal automatically infer Float or should there be some laziness to upgrading from Int to Float and only done when necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it isn't clear what to do in this case, filed #929 to track this as a follow on work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brackle-lattice - here is a PR to your branch to pickup changes from master as well as add a test showing what happens with 2. (it is parsed as Utf8). brianrackle#1 (github shows the full diff which is somewhat silly, as it is a merge commit and then a 2 line change). I think what that change this PR will be ready to go

If you don't have time to merge it I can also create a new PR with the changes too to get it into arrow-rs

@alamb
Copy link
Contributor

alamb commented Oct 17, 2021

The tests were failing with

Run export CARGO_HOME="/github/home/.cargo"
    Updating crates.io index
error: failed to select a version for the requirement `comfy-table = "^4.0"`
candidate versions found which didn't match: 2.1.0, 2.0.0, 1.6.0, ...
location searched: crates.io index
required by package `arrow v7.0.0-SNAPSHOT (/__w/arrow-rs/arrow-rs/arrow)`
Error: Process completed with exit code 101.

I have triggered the tests to re-run so hopefully they will pass on the next run

@alamb
Copy link
Contributor

alamb commented Oct 18, 2021

Filed #838 for the CI failures

@alamb
Copy link
Contributor

alamb commented Oct 18, 2021

filed #838 for nightly test failures (there is a PR up so hopefully once that is merged, we can rebase this branch and have it pass too)

@brianrackle
Copy link
Contributor Author

It looks like workflows are still awaiting approval

@alamb
Copy link
Contributor

alamb commented Nov 3, 2021

Workflows started

@alamb
Copy link
Contributor

alamb commented Nov 3, 2021

The integration test failure is unrelated. @brianrackle can you clarify your plans for #831 (comment) ?

@@ -1341,6 +1341,7 @@ mod tests {
assert_eq!(infer_field_schema("\"123\""), DataType::Utf8);
assert_eq!(infer_field_schema("10"), DataType::Int64);
assert_eq!(infer_field_schema("10.2"), DataType::Float64);
assert_eq!(infer_field_schema(".2"), DataType::Float64);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brackle-lattice - here is a PR to your branch to pickup changes from master as well as add a test showing what happens with 2. (it is parsed as Utf8). brianrackle#1 (github shows the full diff which is somewhat silly, as it is a merge commit and then a 2 line change). I think what that change this PR will be ready to go

If you don't have time to merge it I can also create a new PR with the changes too to get it into arrow-rs

Update for JSON schema inference
@brianrackle
Copy link
Contributor Author

@alamb thanks, I believe I have the PR ready. Thanks for helping me get this change in. I will follow up on the other ticket with more info when I get a chance to look into it.

@alamb alamb merged commit ece8a7c into apache:master Nov 9, 2021
alamb added a commit that referenced this pull request Nov 20, 2021
* Adding ability to parse float from number with leading decimal

* Fixing deprecated std::usize::MAX constant per https://doc.rust-lang.org/core/usize/constant.MAX.html and making consistent with other usages

* Add test case for 2. and issue link

Co-authored-by: Andrew Lamb <[email protected]>
alamb added a commit that referenced this pull request Nov 22, 2021
…#962)

* Adding ability to parse float from number with leading decimal

* Fixing deprecated std::usize::MAX constant per https://doc.rust-lang.org/core/usize/constant.MAX.html and making consistent with other usages

* Add test case for 2. and issue link

Co-authored-by: Andrew Lamb <[email protected]>

Co-authored-by: Brian Rackle <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Floating-Points Without Leading Digit In CSV Are Inferred As Strings
6 participants