Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle large docx files #1272

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

JoelGotsch
Copy link

fix issue were parser throws "AttValue length too long" error message for large docx files

Without that, the parser will throw Error AttValue length too long
allow parser to work with large docx files
@scanny
Copy link
Contributor

scanny commented Oct 20, 2023

Hi @JoelGotsch this is interesting. Can you say more about where you're encountering a docx file that is "too big" and what possible downside consequences there might be for choosing this setting?

@scanny
Copy link
Contributor

scanny commented Oct 20, 2023

Hmm, looks like this setting disables security restrictions, so probably not a good idea to enable by default :)

@JoelGotsch
Copy link
Author

JoelGotsch commented Oct 21, 2023

Hi @scanny!
Thanks for the quick reply!
We have word files that were converted from pdfs (annual reports of companies). The issue starts occurring with documents that are around 100MB.
But in retrospect and given the discussion at TAXIIProject/libtaxii#18 (comment) I agree that it should probably be a setting that’s False by default.
I would maybe let the user optionally pass a custom xml parser.
Shall I adapt the PR accordingly?

@scanny
Copy link
Contributor

scanny commented Oct 22, 2023

Let's discuss how we might approach this.

docx.oxml.parser.oxml_parser is going to be initialized quite early, probably right on from docx import Document, so there's not going to be a way I can see to configure its original value, we will have to "re-initialize" its value.

So maybe a function like docx.oxml.parser.disable_parser_security() or something like that, so it's clear you're taking on risk, and looking something like this:

def disable_parser_security() -> None:
    """Disable XML exploit detection to allow handling of large DOCX files (e.g. > 100MB)."""
    global oxml_parser
    oxml_parser = etree.XMLParser(remove_blank_text=True, resolve_entities=False, huge_tree=True)
    oxml_parser.set_element_class_lookup(element_class_lookup)

We're going to want a better name than that, something more specific, like disable_parser_exploit_protection() or something, let's give that some more thought. You can suggest something based on what that article says is the reason they put that limitation in there.

You can try this from your own code to see if it works, something like this:

import docx.oxml.parser as parser

parser.oxml_parser = etree.XMLParser(remove_blank_text=True, resolve_entities=False, huge_tree=True)
parser.oxml_parser.set_element_class_lookup(parser.element_class_lookup)

document = Document("huge_docx.docx")

If that works I think that gives us pretty high confidence that such an approach would work.

We'd need to work out where in the documentation to mention this. I don't think there's a section there for the Oxml parser because it doesn't have any actual interface yet as far as I know.

@JoelGotsch
Copy link
Author

I just tested it and it works.
The implementation also looks OK to me and given the current way how the parser is instantiated it is the wisest choice.
To me as a user it would probably feel the most natural to use it like Document("huge_doc.docx", huge_tree=True). But that would probably mean that the parser would need to be instantiated for each Document that is loaded instead just once when the module is loaded. And I guess that was the reason why it's instantiated as a global variable in the first place?
In any case, I could adapt the PR if you want to.
Regarding the naming I am quite flexbile, maybe just disable_xml_exploit_protection?
And regarding documentation, how about in user/documents?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants