feat: Add XLSXToDocument converter #8522

Open
wants to merge 17 commits into
base: main
sjrl
Contributor

@sjrl sjrl commented Nov 8, 2024

Related Issues

Proposed Changes:

Draft of the Excel to Document converter

How did you test it?

Added tests

Notes for the reviewer

Checklist

@github-actions github-actions bot added the topic:tests and type:documentation labels Nov 8, 2024
@sjrl sjrl requested a review from bglearning November 8, 2024 09:48
@sjrl
Contributor Author

sjrl commented Nov 8, 2024

Hey @bglearning, I'd like to discuss with you how you got references to work (e.g. pointing to the correct row in the original Excel table), to make sure we accommodate that properly in this component.

@coveralls
Collaborator

coveralls commented Nov 8, 2024

Pull Request Test Coverage Report for Build 12298066419

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 44 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.06%) to 90.408%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/converters/pypdf.py | 2 | 95.95% |
| components/converters/txt.py | 2 | 91.18% |
| components/converters/tika.py | 3 | 93.22% |
| components/converters/json.py | 7 | 84.88% |
| components/converters/markdown.py | 12 | 38.64% |
| components/converters/azure.py | 18 | 89.73% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 12258896395 | 0.06% |
| Covered Lines | 8096 |
| Relevant Lines | 8955 |

💛 - Coveralls

@bglearning
Contributor

btw @sjrl, it looks like to_csv also accepts a list of strings for header:

header: bool or list of str, default True

So out_header (or its equivalent) can have the same type for both formats (List[str]), defaulting to the DataFrame's columns (if excel_column_names is optional).

On the flip side, to_markdown eventually calls the tabulate library, which takes a headers kwarg:

- `headers` can be an explicit list of column headers
- if `headers="firstrow"`, then the first row of data is used
- if `headers="keys"`, then dictionary keys or column indices are used
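The point above could be sketched roughly as follows. This is only an illustration, assuming pandas (plus tabulate for the markdown path); `render_table`, `table_format`, and the exact handling of `out_header` are hypothetical names and choices, not the converter's actual API.

```python
from typing import List, Optional

import pandas as pd


def render_table(
    df: pd.DataFrame,
    table_format: str = "csv",
    out_header: Optional[List[str]] = None,  # hypothetical parameter name
) -> str:
    """Render a DataFrame as CSV or markdown using one shared header list."""
    # Default to the DataFrame's own column labels, as strings.
    header = list(out_header) if out_header is not None else [str(c) for c in df.columns]
    if len(header) != len(df.columns):
        raise ValueError("out_header must have one entry per column")

    if table_format == "csv":
        # to_csv treats a list of strings as aliases for the column names.
        return df.to_csv(header=header, index=False)
    if table_format == "markdown":
        # to_markdown forwards extra kwargs such as `headers` to tabulate.
        return df.to_markdown(headers=header, index=False)
    raise ValueError(f"Unsupported table_format: {table_format}")


df = pd.DataFrame({0: [1, 2], 1: [3, 4]})
csv_text = render_table(df, "csv", out_header=["A", "B"])
```

With a single `List[str]` header, both output formats can share one code path for validating and defaulting the column names.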

@sjrl sjrl marked this pull request as ready for review December 11, 2024 14:21
@sjrl sjrl requested review from a team as code owners December 11, 2024 14:21
@sjrl sjrl requested review from dfokina and anakin87 and removed request for a team December 11, 2024 14:21
@sjrl
Contributor Author

sjrl commented Dec 11, 2024

Just realized I'd like to do a bit more testing, especially on what happens when a complicated table is converted, e.g. one with merged cells.

So a TODO:

  • Add a test with a realistic table (i.e. like one from a client)
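As a rough illustration of why merged cells deserve a test: when a sheet is read into pandas, typically only the top-left cell of a merged range carries the value and the remaining cells come back empty (NaN). The DataFrame below is hand-built to simulate that outcome rather than read from a real file, and forward-filling is just one possible way to handle it, not something this PR is known to do.

```python
import numpy as np
import pandas as pd

# Simulated result of reading a sheet in which each "Region" cell
# was merged across two rows (assumption: non-anchor cells read as NaN).
raw = pd.DataFrame(
    {
        "Region": ["North", np.nan, "South", np.nan],
        "Sales": [10, 20, 30, 40],
    }
)

# One possible normalization: propagate the merged value downward.
filled = raw.assign(Region=raw["Region"].ffill())
```

A realistic test table would ideally include cases like this so the converted Document content can be checked against the expected rows.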

Member

@anakin87 anakin87 left a comment

I have some initial comments.
Overall, the PR looks good.

Review threads (4) on haystack/components/converters/xlsx.py, three of them marked Outdated.
@sjrl sjrl requested a review from anakin87 December 12, 2024 14:08
Member

@anakin87 anakin87 left a comment

Thanks.

Feel free to add a test with a realistic table (#8522 (comment)), then it's ready to go.

4 participants