`engine_options` vs `read_options` in `read_excel` is confusing #15596

mcrumiller · 2024-04-11T13:55:51Z

Description

It is not clear from reading the docs what the functional difference between engine_options and read_options is:

engine_options is declared as "additional options passed to the underlying engine’s primary parsing constructor".
read_options is "Extra options passed to the function that reads the sheet data".

Is the function that reads the sheet data not the same as the engine that reads the sheet data? If not, what does the engine do, how does one find the underlying function, and where does one find the API for each of those?

Additionally, I think that we should directly link to the APIs of the underlying engines so that the user can know explicitly what can be passed. See #15486 where this caused confusion elsewhere. There are also many different fastexcel repositories.

The text was updated successfully, but these errors were encountered:

deanm0000 · 2024-04-12T17:33:17Z

It used to be that the only engine was xlsx2csv and so there was a parameter to pass options to xlsx2csv and then there was another parameter to pass options to pl.read_csv and the parameter names reflected that. The parameters got renamed since there are more options so they're less clear.

For the newer engines that have a workbook class and a load_sheet method, the workbook class gets the engine_options and the load_sheet method gets the read_options. For the old xlsx2csv, its engine_options go to xlsx2csv (including which sheet) and then read_options goes to pl.read_csv so unfortunately it can't be delineated as workbook and worksheet options. It seems like it might just need to be options1 and options2 with a description of what that means per engine since they're not consistent enough to have a more generic meaning. To be clear, I'm not arguing to do that, I'm just lamenting the lack of good options.

mcrumiller · 2024-04-12T18:00:17Z

In this, case I think the clearest would be to have separate input argument names for each engine, which only apply when that engine is selected, e.g.:

pl.read_excel(file, engine="calamine", calamine_options = {...})
pl.read_excel(file, engine="xlsx2", xlsx2_options = {...})

We could then tailor each to the appropriate engine, and the user does not have to mentally separate engine and read options.

deanm0000 · 2024-04-12T19:34:34Z

For that to work, polars has to take the user inputs in calamine_options and manually separate them into the ones that go to fastexcel.read_excel(filepath, **engine_options) and the ones that go to fastexcel.read_excel(filepath, **engine_options).load_sheet_by_name(sheetname, **read_options)

alexander-beedie · 2024-04-13T15:05:58Z

In this, case I think the clearest would be to have separate input argument names for each engine // ...

That isn't really any more ergonomic. Imagine if we took the same approach to "execute_options" in read_database; we'd have about 150 parameters (one for every possible database & driver combination) 🤣 Splitting out options that are common between the different engines into dedicated single-purpose parameters is likely be a good move though.

Also, I'm not sure if you were looking at older code, but the "engine_options" parameter docstring does actually tell you exactly what it's going to pass those options to:

engine_options
    Additional options passed to the underlying engine's primary parsing
    constructor (given below), if supported:

    * "xlsx2csv": `Xlsx2csv`
    * "openpyxl": `load_workbook`
    * "pyxlsb": `open_workbook`
    * "calamine": `n/a`

However, "read_options" needs some updates to achieve the same level of clarity, and linking directly to the underlying engine's docs also sounds like a good idea 👌

mcrumiller added the documentation Improvements or additions to documentation label Apr 11, 2024

cmdlineluser mentioned this issue Apr 14, 2024

Add link to Fastexcel docs on pl.read_excel(...) page for easy reference of valid options in the read_options=... arg #15640

Closed

alexander-beedie mentioned this issue Apr 20, 2024

feat(python): Improved type-inference for read_excel and read_ods, use calamine engine for read_ods #15808

Merged

ritchie46 closed this as completed in #15808 Apr 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`engine_options` vs `read_options` in `read_excel` is confusing #15596

`engine_options` vs `read_options` in `read_excel` is confusing #15596

mcrumiller commented Apr 11, 2024 •

edited

Loading

deanm0000 commented Apr 12, 2024

mcrumiller commented Apr 12, 2024

deanm0000 commented Apr 12, 2024

alexander-beedie commented Apr 13, 2024 •

edited

Loading

engine_options vs read_options in read_excel is confusing #15596

engine_options vs read_options in read_excel is confusing #15596

Comments

mcrumiller commented Apr 11, 2024 • edited Loading

Description

deanm0000 commented Apr 12, 2024

mcrumiller commented Apr 12, 2024

deanm0000 commented Apr 12, 2024

alexander-beedie commented Apr 13, 2024 • edited Loading

`engine_options` vs `read_options` in `read_excel` is confusing #15596

`engine_options` vs `read_options` in `read_excel` is confusing #15596

mcrumiller commented Apr 11, 2024 •

edited

Loading

alexander-beedie commented Apr 13, 2024 •

edited

Loading