feat: add DocxToDocument converter #7838

CarlosFerLo · 2024-06-10T20:32:21Z

Related Issues

fixes feat: Port (and upgrade) DocxToDocument converter from Haystack v1 #7797

Proposed Changes:

This PR introduces the DocxFileToDocument converter component. It uses python-docx and follows an implementation similar to the one in v1.x.

How did you test it?

I have added a new test file that checks the converter works as expected. The tests are modeled on those for the PyPDFToDocument converter.

Notes for the reviewer

Currently, we have three open issues:

I do not know how to add the 'python-docx' package to Haystack, nor what to write in the lazy import.
I have found no way to add page breaks to the resulting document, which makes one test fail.
A plain ByteStream created from a byte string and metadata seems to make the python-docx library fail, because it expects an IO byte stream that corresponds to an actual document. I do not know how to proceed.
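The ByteStream issue likely comes down to a .docx file being a ZIP archive: arbitrary bytes such as b"some text" are not a valid document, and readers expect a seekable file-like object rather than raw bytes. The sketch below illustrates this with the standard library only, not python-docx. The helpers `make_minimal_docx` and `read_paragraphs` are hypothetical names, and the archive they build holds only `word/document.xml`, so it is not a complete .docx file. With python-docx, the equivalent step would be passing `io.BytesIO(data)` to `docx.Document`, which accepts a path or a file-like object.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_minimal_docx(paragraphs):
    """Build a tiny in-memory archive holding only word/document.xml (hypothetical test helper)."""
    body = "".join(f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


def read_paragraphs(data: bytes):
    """Wrap raw bytes in BytesIO and return paragraph texts, or None if the bytes are not a ZIP archive."""
    stream = io.BytesIO(data)  # file-like wrapper around the raw bytes
    if not zipfile.is_zipfile(stream):
        return None
    with zipfile.ZipFile(stream) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    # Join the text runs (<w:t>) of each paragraph (<w:p>).
    return [
        "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        for p in root.iter(f"{{{W_NS}}}p")
    ]
```

In a test, this suggests building the ByteStream from the bytes of a real .docx fixture file rather than from an arbitrary byte string.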

Checklist

I have read the contributors guidelines and the code of conduct ✅
I have updated the related issue with new insights and changes ✅
I added unit tests and updated the docstrings ✅
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:. ✅
I documented my code ✅
I ran pre-commit hooks and fixed any issue ✅

sjrl · 2024-06-11T07:51:02Z

Thanks for the quick work on this @CarlosFerLo! Most of the comments are minor, except for the page breaks. If you are willing to look into it, that would be great! But given its complexity, I think leaving out page break counting is okay.
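Part of the complexity is that WordprocessingML only stores explicit page breaks, as `<w:br w:type="page"/>` elements inside runs. Page breaks that result from layout are not stored as such (at most as optional `w:lastRenderedPageBreak` hints written by the last application that rendered the file). The sketch below counts explicit breaks only, using the standard library on a raw `word/document.xml` string. The function name `count_explicit_page_breaks` is hypothetical and this is not the approach taken in this PR.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_explicit_page_breaks(document_xml: str) -> int:
    """Count <w:br w:type="page"/> elements in a word/document.xml string."""
    root = ET.fromstring(document_xml)
    return sum(
        1
        for br in root.iter(f"{{{W_NS}}}br")
        # A <w:br/> without w:type="page" is a plain line break and is skipped.
        if br.get(f"{{{W_NS}}}type") == "page"
    )


# Illustrative document body: one explicit page break and one plain line break.
sample = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Page one</w:t></w:r></w:p>"
    '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'
    "<w:p><w:r><w:t>Page two</w:t><w:br/></w:r></w:p>"
    "</w:body></w:document>"
)
print(count_explicit_page_breaks(sample))
```

Counting explicit breaks would undercount pages in documents that rely on automatic pagination, which is one reason to defer the feature.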

sjrl · 2024-06-11T10:37:07Z

Hey @CarlosFerLo a few more requests:

Update your base branch with main
Add python-docx to the extra dependencies in the pyproject.toml to get tests to pass. See here
Add docx to the modules in this docs list so your API docs will show up on the website.
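Because python-docx is an optional dependency, the converter should fail with a helpful message when the package is missing instead of breaking at import time. Haystack has its own lazy import helper for this, which is not reproduced here. The following is a minimal standard-library sketch of the same idea, where `require_optional` is a hypothetical name.

```python
import importlib
import importlib.util


def require_optional(module_name: str, install_hint: str):
    """Import an optional top-level module, or raise ImportError with an install hint."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(f"Could not import '{module_name}'. {install_hint}")
    return importlib.import_module(module_name)


# Usage: call the guard when the component is created, not at module import time,
# so the rest of the package still works without the extra dependency.
# The import name of the python-docx package is `docx`.
# docx = require_optional("docx", "Run 'pip install python-docx'")
```

Deferring the check to component creation keeps the package importable for users who never use the converter.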

CarlosFerLo · 2024-06-11T11:51:57Z

@sjrl I believe everything is set now.

CarlosFerLo · 2024-06-11T11:58:06Z

This way of evaluating strings in warnings seems to be causing some tests to fail. I don't know why, since it is used in other components and works fine there.

sjrl · 2024-06-11T12:39:40Z

test/components/converters/test_docx_file_to_document.py

+        assert len(docs) == 1
+        assert "History" in docs[0].content
+
+    @pytest.mark.skip("For now, DocxToDocument does not preserve page brakes.")


Let's go ahead and delete this test. If we want page break support, we can open a feature request for it instead.

CarlosFerLo · 2024-06-11

Okay, I will create an issue about it once this PR is resolved.

sjrl · 2024-06-12

Thanks! And in the meantime, can we delete this test?

coveralls · 2024-06-11T13:22:55Z

Pull Request Test Coverage Report for Build 9466297399

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.05%) to 89.757%

Totals
Change from base Build 9451690597:	-0.05%
Covered Lines:	6879
Relevant Lines:	7664

💛 - Coveralls

coveralls · 2024-06-11T13:23:07Z

Pull Request Test Coverage Report for Build 9466302370

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.05%) to 89.757%

Totals
Change from base Build 9451690597:	-0.05%
Covered Lines:	6879
Relevant Lines:	7664

💛 - Coveralls

coveralls · 2024-06-11T19:49:49Z

Pull Request Test Coverage Report for Build 9471734026

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
51 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.03%) to 89.775%

Files with Coverage Reduction	New Missed Lines	%
core/pipeline/pipeline.py	51	65.48%

Totals
Change from base Build 9451690597:	-0.03%
Covered Lines:	6892
Relevant Lines:	7677

💛 - Coveralls

coveralls · 2024-06-12T08:07:32Z

Pull Request Test Coverage Report for Build 9478875947

Warning: This coverage report may be inaccurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
51 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.04%) to 89.763%

Files with Coverage Reduction	New Missed Lines	%
core/pipeline/pipeline.py	51	65.48%

Totals
Change from base Build 9451690597:	-0.04%
Covered Lines:	6892
Relevant Lines:	7678

💛 - Coveralls

coveralls · 2024-06-12T08:07:56Z

Pull Request Test Coverage Report for Build 9478877690

Warning: This coverage report may be inaccurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
51 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.04%) to 89.763%

Files with Coverage Reduction	New Missed Lines	%
core/pipeline/pipeline.py	51	65.48%

Totals
Change from base Build 9451690597:	-0.04%
Covered Lines:	6892
Relevant Lines:	7678

💛 - Coveralls

coveralls · 2024-06-12T08:09:48Z

Pull Request Test Coverage Report for Build 9478934650

Warning: This coverage report may be inaccurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
51 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.04%) to 89.763%

Files with Coverage Reduction	New Missed Lines	%
core/pipeline/pipeline.py	51	65.48%

Totals
Change from base Build 9451690597:	-0.04%
Covered Lines:	6892
Relevant Lines:	7678

💛 - Coveralls

sjrl

Thanks @CarlosFerLo this looks good!

CarlosFerLo added 2 commits June 10, 2024 22:23

first fucntioning DocxFileToDocument

69fda9e

fix lazy import message

2d3d617

CarlosFerLo requested a review from a team as a code owner June 10, 2024 20:32

CarlosFerLo requested review from Amnah199 and removed request for a team June 10, 2024 20:32

github-actions bot added the topic:tests and type:documentation labels Jun 10, 2024

add reno

8984c48

CarlosFerLo requested a review from a team as a code owner June 10, 2024 20:37

CarlosFerLo requested review from dfokina and removed request for a team June 10, 2024 20:37

sjrl reviewed Jun 11, 2024

haystack/components/converters/docx.py (resolved)

sjrl reviewed Jun 11, 2024

haystack/components/converters/docx.py (outdated, resolved)

sjrl reviewed Jun 11, 2024

haystack/components/converters/docx.py (resolved)

sjrl reviewed Jun 11, 2024

haystack/components/converters/docx.py (outdated, resolved)

sjrl reviewed Jun 11, 2024

test/components/converters/test_docx_file_to_document.py (outdated, resolved)

sjrl reviewed Jun 11, 2024

haystack/components/converters/docx.py (outdated, resolved)

CarlosFerLo and others added 5 commits June 11, 2024 10:59

Add license headder

d76d4cf

Co-authored-by: Sebastian Husch Lee <[email protected]>

change DocxFileToDocument to DocxToDocument

0f41730

Merge branch 'issue-7797' of https://github.com/carlosFerLo/haystack into issue-7797

a5f2055

Update library install to the maintained version

9173433

Co-authored-by: Sebastian Husch Lee <[email protected]>

clan try-exvept to only take non haystack errors into account

f391948

CarlosFerLo requested a review from sjrl June 11, 2024 09:14

sjrl changed the title from "feat: add DocxFIleToDocument converter" to "feat: add DocxToDocument converter" Jun 11, 2024

Add wanring on docstring of component ignoring page brakes, mark test as skip

de44301

sjrl reviewed Jun 11, 2024

haystack/components/converters/docx.py (outdated, resolved)

sjrl reviewed Jun 11, 2024

haystack/components/converters/docx.py (outdated, resolved)

sjrl reviewed Jun 11, 2024

haystack/components/converters/docx.py (outdated, resolved)

add 'python-docx' dependency and docs

5eddd2a

github-actions bot added the topic:build/distribution label Jun 11, 2024

CarlosFerLo requested a review from sjrl June 11, 2024 11:52

sjrl reviewed Jun 11, 2024

haystack/components/converters/docx.py (outdated, resolved)

sjrl reviewed Jun 11, 2024

test/components/converters/test_docx_file_to_document.py (outdated, resolved)

sjrl reviewed Jun 11, 2024

haystack/components/converters/docx.py (outdated, resolved)

CarlosFerLo and others added 2 commits June 11, 2024 15:16

Change logging import

10c796a

Co-authored-by: Sebastian Husch Lee <[email protected]>

Fix typo

a35b91a

Co-authored-by: Sebastian Husch Lee <[email protected]>

CarlosFerLo added 2 commits June 11, 2024 15:40

remake metadata extraction for docx

c3bd356

solve merge issues

3400c7d

CarlosFerLo requested a review from sjrl June 11, 2024 19:34

solve bug regarding _get_docx_metadata method

9a849c9

sjrl reviewed Jun 12, 2024

haystack/components/converters/docx.py (resolved)

sjrl reviewed Jun 12, 2024

haystack/components/converters/docx.py (outdated, resolved)

CarlosFerLo and others added 3 commits June 12, 2024 09:59

Update haystack/components/converters/docx.py

ed37423

Co-authored-by: Sebastian Husch Lee <[email protected]>

Update haystack/components/converters/docx.py

c3b06fa

Co-authored-by: Sebastian Husch Lee <[email protected]>

Delete unused test

aa33ff4

sjrl approved these changes Jun 12, 2024

sjrl merged commit c1c3399 into deepset-ai:main Jun 12, 2024
24 checks passed

sjrl mentioned this pull request Jun 12, 2024

docs: Add docs for DocxToDocument converter #7847

Closed

CarlosFerLo deleted the issue-7797 branch June 12, 2024 14:00
