Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review / fix behavior of json schema inference for numbers with trailing decimal point (e.g. 2.) #929

Closed
alamb opened this issue Nov 9, 2021 · 4 comments · Fixed by #950
Labels
arrow Changes to the arrow crate bug

Comments

@alamb
Copy link
Contributor

alamb commented Nov 9, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is a follow on to this conversation #831 (comment) on #831 . Specifically, the question is what to do with numbers like 2.

Today the JSON parser schema inference treats numbers like 2. as Utf8 (aka as strings).

As @brianrackle and @jimexist noted, this could be treated as either Int or a Float and it is not clear exactly what the behavior should be

Describe the solution you'd like
Propose a behavior and implement any code / test changes necessary

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@alamb alamb added arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog labels Nov 9, 2021
@alamb alamb added bug and removed enhancement Any new improvement worthy of a entry in the changelog labels Nov 9, 2021
@alamb alamb changed the title Review / fix behavior of json parser for numbers with trailing decimal point (e.g. 2.) Review / fix behavior of json schema inference for numbers with trailing decimal point (e.g. 2.) Nov 9, 2021
@brianrackle
Copy link
Contributor

Investigated JSON further and 2. and .2 are not valid tokens in JSON. Since this isn't valid JSON I don't think these values need any special handling to interpret them as numbers. In these cases, serde rightfully returns an error.

json spec https://www.json.org/json-en.html

@brianrackle
Copy link
Contributor

But I think this issue should be about CSV handling of these values and was mistakenly worded to refer to JSON.

For CSV there is not a formal spec, so the expected behavior is undefined. My thoughts are that since 2. can be interpreted by an Int then it should be and the usual schema inference rules will upgrade from Int to Float only if there are other values in the column which can only be represented by Float.

@brianrackle
Copy link
Contributor

brianrackle commented Nov 13, 2021

But I think this issue should be about CSV handling of these values and was mistakenly worded to refer to JSON.

For CSV there is not a formal spec, so the expected behavior is undefined. My thoughts are that since 2. can be interpreted by an Int then it should be and the usual schema inference rules will upgrade from Int to Float only if there are other values in the column which can only be represented by Float.

After playing with the solution some more inferring as an Int would mean changing the native parse logic which currently does not support converting "2." to an Int. So I followed the path of least resistance and went ahead with doing a conversion which is already supported and infer "2." as a Float so that the existing native parsing logic can be used.

@alamb
Copy link
Contributor Author

alamb commented Nov 13, 2021

So I followed the path of least resistance and went ahead with doing a conversion which is already supported and infer "2." as a Float so that the existing native parsing logic can be used.

I think that makes sense 👍

alamb pushed a commit that referenced this issue Nov 13, 2021
alamb pushed a commit that referenced this issue Nov 22, 2021
* Inferring 2. as Float64 for issue #929

* Adding pretty print support for fixed size list array

* fixing linting errors

* adding null row to test
alamb pushed a commit that referenced this issue Nov 22, 2021
* Inferring 2. as Float64 for issue #929

* Adding pretty print support for fixed size list array

* fixing linting errors

* adding null row to test
alamb pushed a commit that referenced this issue Nov 23, 2021
* Inferring 2. as Float64 for issue #929

* Adding pretty print support for fixed size list array

* fixing linting errors

* adding null row to test
alamb added a commit that referenced this issue Nov 23, 2021
alamb added a commit that referenced this issue Nov 23, 2021
* Inferring 2. as Float64 for issue #929

* Adding pretty print support for fixed size list array

* fixing linting errors

* adding null row to test

Co-authored-by: Brian Rackle <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants