Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow parsing numbers w/ underscores (e.g. 1_000) from strings #868

Merged
merged 5 commits into from
Aug 10, 2023

Conversation

adriangb
Copy link
Member

@adriangb adriangb commented Aug 9, 2023

Fixes pydantic/pydantic#7053

Selected Reviewer: @samuelcolvin

Comment on lines 147 to 150
JsonInput::String(str) => match str_as_int(self, str) {
Ok(i) => Ok(i),
Err(_) => str_as_int(self, &str.replace('_', "")),
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to only eat the performance cost in the error case. This means that parsing 1_000 will be slower than 1000 but I think 1000 is more common so that should be okay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid duplication should we put this inside str_as_int ? It also has strip_decimal_zeros path, we could combine the two stripping operations for efficiency's sake.

@adriangb
Copy link
Member Author

adriangb commented Aug 9, 2023

please review

@codecov
Copy link

codecov bot commented Aug 9, 2023

Codecov Report

Merging #868 (8aece8a) into main (3f7df80) will increase coverage by 0.01%.
Report is 4 commits behind head on main.
The diff coverage is 100.00%.

❗ Current head 8aece8a differs from pull request most recent head fb7c5a6. Consider uploading reports for the commit fb7c5a6 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #868      +/-   ##
==========================================
+ Coverage   94.05%   94.06%   +0.01%     
==========================================
  Files         102      102              
  Lines       15028    15039      +11     
  Branches       25       25              
==========================================
+ Hits        14135    14147      +12     
+ Misses        887      886       -1     
  Partials        6        6              
Files Changed Coverage Δ
src/input/input_json.rs 91.55% <100.00%> (+0.22%) ⬆️
src/input/input_python.rs 98.14% <100.00%> (-0.01%) ⬇️
src/input/shared.rs 96.59% <100.00%> (+0.81%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f7df80...fb7c5a6. Read the comment docs.

Copy link
Contributor

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to see these fixes moved inside of str_as_int (and new one for float) and also some concerns on edge cases.

JsonInput::String(str) => str_as_int(self, str),
JsonInput::String(str) => match str_as_int(self, str) {
Ok(i) => Ok(i),
Err(_) => str_as_int(self, &str.replace('_', "")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some edge cases: trailing / leading underscores are not allowed in Python, nor is double-underscore.

>>> int("5__0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '5__0'
>>> int("5_0_")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '5_0_'
>>> int("_5_0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '_5_0'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh annoying

Comment on lines 147 to 150
JsonInput::String(str) => match str_as_int(self, str) {
Ok(i) => Ok(i),
Err(_) => str_as_int(self, &str.replace('_', "")),
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid duplication should we put this inside str_as_int ? It also has strip_decimal_zeros path, we could combine the two stripping operations for efficiency's sake.

Comment on lines 180 to 183
Err(_) => match str.replace('_', "").parse::<f64>() {
Ok(i) => Ok(EitherFloat::F64(i)),
Err(_) => Err(ValError::new(ErrorTypeDefaults::FloatParsing, self)),
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should introduce str_as_float to avoid similar duplication.

@codspeed-hq
Copy link

codspeed-hq bot commented Aug 9, 2023

CodSpeed Performance Report

Merging #868 will degrade performances by 48.65%

Comparing parse-numbers-underscores (fb7c5a6) with main (3f7df80)

🎉 Hooray! pytest-codspeed just leveled up to 2.1.0!

A heads-up, this is a breaking change and it might affect your current performance baseline a bit. But here's the exciting part - it's packed with new, cool features and promises improved result stability 🥳!
Curious about what's new? Visit our releases page to delve into all the awesome details about this new version.

Summary

❌ 2 regressions
✅ 133 untouched benchmarks

🆕 3 new benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main parse-numbers-underscores Change
test_core_string_lax_wrong 37.9 µs 63.7 µs -40.5%
test_tagged_union_int_keys_python 33 µs 64.2 µs -48.65%
🆕 test_decimal_from_string_pyd N/A 69.4 µs N/A
🆕 test_decimal_from_string_core N/A 69.9 µs N/A
🆕 test_decimal_from_string_limit N/A 23.6 µs N/A

@davidhewitt
Copy link
Contributor

For anyone wondering, the decision to add this is (as I understand it) because in V1 we used int(s) and float(s) to parse strings, which do accept these underscores. So this is a regression from V1.

@adriangb
Copy link
Member Author

please review

@adriangb
Copy link
Member Author

@davidhewitt are we ready to merge this?

Copy link
Contributor

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better! Just two tiny nits. Also I'll try to fix the integration tests shortly.

src/input/shared.rs Outdated Show resolved Hide resolved
/// and if it's not subsequent parsing will just fail
fn strip_underscores(s: &str) -> Option<String> {
if s.starts_with('_') || s.ends_with('_') || !s.contains('_') || s.contains("__") {
// no underscores to strip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be nice to add a comment here explaining why startswith / endswith / __ are rejected, because otherwise this comment is just a touch confusing.

@adriangb adriangb enabled auto-merge (squash) August 10, 2023 15:51
@adriangb adriangb merged commit 87b4789 into main Aug 10, 2023
28 of 29 checks passed
@adriangb adriangb deleted the parse-numbers-underscores branch August 10, 2023 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants