Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

engine_options vs read_options in read_excel is confusing #15596

Closed
mcrumiller opened this issue Apr 11, 2024 · 4 comments · Fixed by #15808
Closed

engine_options vs read_options in read_excel is confusing #15596

mcrumiller opened this issue Apr 11, 2024 · 4 comments · Fixed by #15808
Labels
documentation Improvements or additions to documentation

Comments

@mcrumiller
Copy link
Contributor

mcrumiller commented Apr 11, 2024

Description

It is not clear from reading the docs what the functional difference between engine_options and read_options is:

  • engine_options is declared as "additional options passed to the underlying engine’s primary parsing constructor".
  • read_options is "Extra options passed to the function that reads the sheet data".

Is the function that reads the sheet data not the same as the engine that reads the sheet data? If not, what does the engine do, how does one find the underlying function, and where does one find the API for each of those?

Additionally, I think that we should directly link to the APIs of the underlying engines so that the user can know explicitly what can be passed. See #15486 where this caused confusion elsewhere. There are also many different fastexcel repositories.

@mcrumiller mcrumiller added the documentation Improvements or additions to documentation label Apr 11, 2024
@deanm0000
Copy link
Collaborator

It used to be that the only engine was xlsx2csv and so there was a parameter to pass options to xlsx2csv and then there was another parameter to pass options to pl.read_csv and the parameter names reflected that. The parameters got renamed since there are more options so they're less clear.

For the newer engines that have a workbook class and a load_sheet method, the workbook class gets the engine_options and the load_sheet method gets the read_options. For the old xlsx2csv, its engine_options go to xlsx2csv (including which sheet) and then read_options goes to pl.read_csv so unfortunately it can't be delineated as workbook and worksheet options. It seems like it might just need to be options1 and options2 with a description of what that means per engine since they're not consistent enough to have a more generic meaning. To be clear, I'm not arguing to do that, I'm just lamenting the lack of good options.

@mcrumiller
Copy link
Contributor Author

In this, case I think the clearest would be to have separate input argument names for each engine, which only apply when that engine is selected, e.g.:

pl.read_excel(file, engine="calamine", calamine_options = {...})
pl.read_excel(file, engine="xlsx2", xlsx2_options = {...})

We could then tailor each to the appropriate engine, and the user does not have to mentally separate engine and read options.

@deanm0000
Copy link
Collaborator

For that to work, polars has to take the user inputs in calamine_options and manually separate them into the ones that go to fastexcel.read_excel(filepath, **engine_options) and the ones that go to fastexcel.read_excel(filepath, **engine_options).load_sheet_by_name(sheetname, **read_options)

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Apr 13, 2024

In this, case I think the clearest would be to have separate input argument names for each engine // ...

That isn't really any more ergonomic. Imagine if we took the same approach to "execute_options" in read_database; we'd have about 150 parameters (one for every possible database & driver combination) 🤣 Splitting out options that are common between the different engines into dedicated single-purpose parameters is likely be a good move though.

Also, I'm not sure if you were looking at older code, but the "engine_options" parameter docstring does actually tell you exactly what it's going to pass those options to:

engine_options
    Additional options passed to the underlying engine's primary parsing
    constructor (given below), if supported:

    * "xlsx2csv": `Xlsx2csv`
    * "openpyxl": `load_workbook`
    * "pyxlsb": `open_workbook`
    * "calamine": `n/a`

However, "read_options" needs some updates to achieve the same level of clarity, and linking directly to the underlying engine's docs also sounds like a good idea 👌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
3 participants