feat(python): Improved type-inference for read_excel and read_ods, use calamine engine for read_ods #15808

Merged on Apr 21, 2024 (1 commit)

@alexander-beedie (Collaborator) commented Apr 20, 2024

Closes #15596.
Closes #15748.
Closes #15763.
Closes #11274.

Taking care of a number of paper-cut issues here; four for the price of one... ;)

Updates

  • Exposes "infer_schema_length" parameter at the top-level of both read_excel and read_ods, automatically integrating it with engine-specific "read_options" (where available; otherwise it's passed down to DataFrame init).
  • Improves initial parse quality of the "calamine" (fastexcel) engine when dtypes are supplied via "schema_overrides", by translating the Polars dtypes to the dtype names expected by fastexcel's load_sheet_by_name method.
  • Improves docstring/clarity for read_options, making it a lot clearer which method the options will be passed to.
  • Updates the default engine for read_ods to "calamine"; this is not a breaking change as the "ezodf" engine was not configurable in any way, and read_ods has never had an "engine" parameter. Switching out internally for "calamine" removes a lot of custom parsing code, massively improves performance, allows for additional parsing options, and gets rid of an optional dependency on a package that hasn't had an update in about 10(!) years.
  • Marks the "pyxlsb" engine as deprecated, with a suggestion to move to "calamine"; this will allow for further cleanups later. As with "ezodf" this engine has no configurability and is 10x slower than the "calamine" (fastexcel) alternative.
  • Some minor drive-by improvements to the "source" parameter docstring (across all relevant methods).
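
The first two bullets describe option routing, which a small dependency-free sketch may make easier to picture. This is not the code from this PR: the function name `build_read_options`, the use of plain strings for Polars dtype names, the exact mapping table, and the choice to let an explicit `read_options` entry win are all assumptions made for illustration. The option names `schema_sample_rows` and `dtypes` (fastexcel) and `infer_schema_length` (`read_csv`) are believed to match those libraries, but should be checked against their documentation.

```python
from __future__ import annotations

from typing import Any

# Hypothetical lookup from Polars dtype names (as strings) to the dtype
# names fastexcel is believed to accept. Illustrative only.
_INTEGER_NAMES = (
    "Int8", "Int16", "Int32", "Int64",
    "UInt8", "UInt16", "UInt32", "UInt64",
)
FASTEXCEL_DTYPE_NAMES: dict[str, str] = {
    **{name: "int" for name in _INTEGER_NAMES},
    "Float32": "float",
    "Float64": "float",
    "Boolean": "boolean",
    "Date": "date",
    "Datetime": "datetime",
    "Duration": "duration",
    "String": "string",
}


def build_read_options(
    engine: str,
    *,
    infer_schema_length: int | None = 100,
    schema_overrides: dict[str, str] | None = None,
    read_options: dict[str, Any] | None = None,
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (engine read options, DataFrame-init kwargs) for one engine."""
    options = dict(read_options or {})  # copy, so the caller's dict is untouched
    frame_kwargs: dict[str, Any] = {}

    if engine == "calamine":
        # Assumption: an explicit read_options entry takes precedence.
        options.setdefault("schema_sample_rows", infer_schema_length)
        dtypes = dict(options.get("dtypes") or {})
        for column, dtype in (schema_overrides or {}).items():
            # Unknown dtypes fall back to "string" in this sketch.
            dtypes.setdefault(column, FASTEXCEL_DTYPE_NAMES.get(dtype, "string"))
        if dtypes:
            options["dtypes"] = dtypes
    elif engine == "xlsx2csv":
        # This engine's read options are forwarded to read_csv.
        options.setdefault("infer_schema_length", infer_schema_length)
    else:
        # No engine-level option available: hand it to DataFrame init instead.
        frame_kwargs["infer_schema_length"] = infer_schema_length

    return options, frame_kwargs
```

For example, `build_read_options("calamine", infer_schema_length=50, schema_overrides={"id": "Int64", "when": "Date"})` returns `({"schema_sample_rows": 50, "dtypes": {"id": "int", "when": "date"}}, {})`, while an engine without its own inference option gets the value back in the second dict.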

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Apr 20, 2024
@alexander-beedie alexander-beedie added the A-io-spreadsheet Area: reading/writing Excel/ODS files label Apr 20, 2024
…pe-inference pass-through and new defaults for "xlsb" and "ods"
@alexander-beedie alexander-beedie force-pushed the read-spreadsheet-updates branch from 0dab002 to 64a1ab9 Compare April 20, 2024 22:08
@alexander-beedie alexander-beedie changed the title feat(python): Improved read_excel and read_ods with additional type-inference options, use calamine for "xlsb" and "ods" feat(python): Improved read_excel and read_ods with additional type-inference options, use calamine engine for "ods" Apr 20, 2024
@alexander-beedie alexander-beedie changed the title feat(python): Improved read_excel and read_ods with additional type-inference options, use calamine engine for "ods" feat(python): Improved type-inference for read_excel and read_ods with , use calamine engine for "ods" Apr 20, 2024
@alexander-beedie alexander-beedie changed the title feat(python): Improved type-inference for read_excel and read_ods with , use calamine engine for "ods" feat(python): Improved type-inference for read_excel and read_ods, use calamine engine for read_ods Apr 20, 2024
codecov bot commented Apr 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.40%. Comparing base (0c37ead) to head (64a1ab9).
Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #15808   +/-   ##
=======================================
  Coverage   80.39%   80.40%           
=======================================
  Files        1264     1264           
  Lines      165421   165423    +2     
=======================================
+ Hits       132994   133008   +14     
+ Misses      32427    32415   -12     


@ritchie46 (Member) commented:

Cool stuff! Can we switch to calamine for excel as well?

@alexander-beedie (Collaborator, Author) commented Apr 21, 2024

> Cool stuff! Can we switch to calamine for excel as well?

Definitely, and I plan to do so; fastexcel has had a number of good updates since we initially integrated it 😎

I think it needs a bit more prep work than the two (more niche) engines I'm switching over here though, because "xlsx2csv" is currently the default and offers a lot of customisation through the read_csv params you can set via "read_options". I'm going to expose a few more options at the top level to make the common cases accessible to all engines (selecting specific columns to read, etc.), and once that's done I'll make a breaking change so that "calamine" becomes the new default engine (while still respecting explicit opt-in to the other engines, of course) ✌️
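
For readers unfamiliar with explicit engine opt-in, here is a minimal sketch of the two calling styles mentioned above. The helper `read_workbook`, the two preset dictionaries, and the file layout they imply (one junk row, "n/a" markers) are hypothetical; `engine`, `read_options` and `infer_schema_length` are `read_excel` parameters discussed in this PR, and `skip_rows` / `null_values` are `read_csv` parameters.

```python
from __future__ import annotations

from typing import Any

# Explicit opt-in to "xlsx2csv": read_options are forwarded to pl.read_csv.
XLSX2CSV_KWARGS: dict[str, Any] = {
    "engine": "xlsx2csv",
    "read_options": {"skip_rows": 1, "null_values": ["n/a"]},
}

# Explicit opt-in to "calamine", using the top-level inference parameter.
CALAMINE_KWARGS: dict[str, Any] = {
    "engine": "calamine",
    "infer_schema_length": 500,
}


def read_workbook(path: str, **kwargs: Any):
    """Read a sheet from `path` with explicitly chosen options."""
    # Imported lazily: requires polars plus the selected engine's package
    # (fastexcel for "calamine", xlsx2csv for "xlsx2csv").
    import polars as pl

    return pl.read_excel(path, **kwargs)
```

A call such as `read_workbook("report.xlsx", **CALAMINE_KWARGS)` (with a hypothetical file name) would keep working unchanged if the default engine is switched later, since the engine is named explicitly.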

@alexander-beedie alexander-beedie added the performance Performance issues or improvements label Apr 21, 2024
@ritchie46 (Member) commented:

All right. That sounds good. 👌😁

Labels: A-io-spreadsheet, enhancement, performance, python