Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update DECIMAL_RE to allow scientific notation in auto inferred schemas #1216

Merged
merged 2 commits into from
Jan 22, 2022

Conversation

pjmore
Copy link
Contributor

@pjmore pjmore commented Jan 21, 2022

Which issue does this PR close?

Closes #1215.

Rationale for this change

Allow the csv schema inference to handle scientific notation in floating point columns.

What changes are included in this PR?

Update DECIMAL_RE to match on numbers in the form 2e2, -2e-3, 3.4e-5.
Update some csv tests.

Are there any user-facing changes?

Columns that would have been inferred to be utf8 might infer to Float64

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 21, 2022
@matthewmturner
Copy link
Contributor

@pjmore thank you for this. i tried out the regex and looks good.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @pjmore

@codecov-commenter
Copy link

codecov-commenter commented Jan 21, 2022

Codecov Report

Merging #1216 (aa5d946) into master (fcd37ee) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1216   +/-   ##
=======================================
  Coverage   82.70%   82.71%           
=======================================
  Files         175      175           
  Lines       51711    51712    +1     
=======================================
+ Hits        42769    42771    +2     
+ Misses       8942     8941    -1     
Impacted Files Coverage Δ
arrow/src/csv/reader.rs 88.12% <100.00%> (+0.01%) ⬆️
parquet/src/encodings/encoding.rs 93.52% <0.00%> (-0.20%) ⬇️
arrow/src/array/transform/mod.rs 85.69% <0.00%> (+0.13%) ⬆️
arrow/src/datatypes/field.rs 54.10% <0.00%> (+0.30%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fcd37ee...aa5d946. Read the comment docs.

@alamb
Copy link
Contributor

alamb commented Jan 21, 2022

@pjmore it appears there is a lint failure on this PR https://github.com/apache/arrow-rs/runs/4900531577?check_suite_focus=true

Can you please fix it?

@liukun4515
Copy link
Contributor

@pjmore please fix lint.
LGTM

@alamb alamb merged commit e72875e into apache:master Jan 22, 2022
@alamb
Copy link
Contributor

alamb commented Jan 22, 2022

Thanks @pjmore and @liukun4515

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend CSV schema inference to allow scientific notation for floating point types
5 participants