-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add DocxToDocument converter #7838
Conversation
Thanks for the quick work on this @CarlosFerLo! Most of the comments are minor except for the page breaks. If you are willing to take a look into it that would be great! But given it's complexity I think leaving out page break counting is okay. |
Co-authored-by: Sebastian Husch Lee <[email protected]>
Co-authored-by: Sebastian Husch Lee <[email protected]>
Hey @CarlosFerLo a few more requests: |
@sjrl I believe everything is set now. |
I don't know why, but this way to evaluate strings in warnings seems to be the cause for some tests to fail. I don't know why, as it is used in other components, and it works all right. |
assert len(docs) == 1 | ||
assert "History" in docs[0].content | ||
|
||
@pytest.mark.skip("For now, DocxToDocument does not preserve page brakes.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go ahead and delete this test, and instead, we can open a feature request if we like for adding page break support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okey, I will create an issue about it once this PR is resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! And in the mean time can we delete this test?
Co-authored-by: Sebastian Husch Lee <[email protected]>
Co-authored-by: Sebastian Husch Lee <[email protected]>
Pull Request Test Coverage Report for Build 9466297399Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 9466302370Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 9471734026Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
Co-authored-by: Sebastian Husch Lee <[email protected]>
Co-authored-by: Sebastian Husch Lee <[email protected]>
Pull Request Test Coverage Report for Build 9478875947Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 9478877690Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 9478934650Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CarlosFerLo this looks good!
Related Issues
Proposed Changes:
Introducing the
DocxFileToDocument
converter component. It works usingpython-docx
and a similar implementation to the one inv1.x
.How did you test it?
I have added a new test file containing tests to check it is functioning ok, I was inspired in the tests for
PyPDFToDocument
converter.Notes for the reviewer
Currently, we have two issues:
ByteStream
, declared from a b-string and metadata, seems to make thepython-docx
library fail, as it only expects IO byte stream corresponding to a document, do not know how to proceed.Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
. ✅