Handle large docx files #1272

JoelGotsch · 2023-10-20T19:08:23Z

fix issue where parser throws "AttValue length too long" error message for large docx files

Without that, the parser will throw Error AttValue length too long

allow parser to work with large docx files

scanny · 2023-10-20T22:18:10Z

Hi @JoelGotsch this is interesting. Can you say more about where you're encountering a docx file that is "too big" and what possible downside consequences there might be for choosing this setting?

scanny · 2023-10-20T22:20:09Z

Hmm, looks like this setting disables security restrictions, so probably not a good idea to enable by default :)

JoelGotsch · 2023-10-21T03:40:38Z

Hi @scanny!
Thanks for the quick reply!
We have word files that were converted from pdfs (annual reports of companies). The issue starts occurring with documents that are around 100MB.
But in retrospect and given the discussion at TAXIIProject/libtaxii#18 (comment) I agree that it should probably be a setting that’s False by default.
I would maybe let the user optionally pass a custom xml parser.
Shall I adapt the PR accordingly?

scanny · 2023-10-22T00:56:54Z

Let's discuss how we might approach this.

docx.oxml.parser.oxml_parser is going to be initialized quite early, probably as soon as from docx import Document runs, so I can't see a way to configure its original value; we will have to "re-initialize" it.

So maybe a function like docx.oxml.parser.disable_parser_security() or something like that, so it's clear you're taking on risk, and looking something like this:

def disable_parser_security() -> None:
    """Disable XML exploit detection to allow handling of large DOCX files (e.g. > 100MB)."""
    global oxml_parser
    oxml_parser = etree.XMLParser(remove_blank_text=True, resolve_entities=False, huge_tree=True)
    oxml_parser.set_element_class_lookup(element_class_lookup)

We're going to want a better name than that, something more specific, like disable_parser_exploit_protection() or something, let's give that some more thought. You can suggest something based on what that article says is the reason they put that limitation in there.

You can try this from your own code to see if it works, something like this:

from lxml import etree

import docx.oxml.parser as parser
from docx import Document

# Replace the module-level parser with one that has huge_tree enabled
parser.oxml_parser = etree.XMLParser(remove_blank_text=True, resolve_entities=False, huge_tree=True)
parser.oxml_parser.set_element_class_lookup(parser.element_class_lookup)

document = Document("huge_docx.docx")

If that works I think that gives us pretty high confidence that such an approach would work.

We'd need to work out where in the documentation to mention this. I don't think there's a section there for the Oxml parser because it doesn't have any actual interface yet as far as I know.

JoelGotsch · 2023-10-22T18:29:28Z

I just tested it and it works.
The implementation also looks OK to me, and given how the parser is currently instantiated, it is the wisest choice.
To me as a user, it would probably feel most natural to use it like Document("huge_doc.docx", huge_tree=True). But that would probably mean the parser would need to be instantiated for each Document that is loaded instead of just once when the module is loaded. And I guess that was the reason it's instantiated as a global variable in the first place?
In any case, I could adapt the PR if you want to.
Regarding the naming I am quite flexible, maybe just disable_xml_exploit_protection?
And regarding documentation, how about in user/documents?
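
One possible middle ground between the global switch and a per-Document argument is to swap the module-level parser only for the duration of a single load and restore it afterwards. The sketch below shows that pattern in isolation. The helper name swapped_attr is hypothetical (it is not part of python-docx), and a SimpleNamespace stands in for docx.oxml.parser so the example runs without python-docx or lxml installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def swapped_attr(module, name, replacement):
    """Temporarily rebind `module.<name>` to `replacement`, restoring it on exit.

    Hypothetical helper for illustration; not part of python-docx.
    """
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield replacement
    finally:
        # Restore the original value even if the body of the `with` raises.
        setattr(module, name, original)


# Stand-in for the docx.oxml.parser module (assumption, for a runnable demo).
fake_parser_module = SimpleNamespace(oxml_parser="default-parser")

with swapped_attr(fake_parser_module, "oxml_parser", "huge-tree-parser"):
    # Anything that reads the module attribute here sees the replacement.
    inside = fake_parser_module.oxml_parser

# After the block, the original parser is back in place.
after = fake_parser_module.oxml_parser
```

With python-docx, the same helper would presumably be called with docx.oxml.parser as the module and a huge_tree=True parser built as in the snippet above, wrapping only the Document("huge_doc.docx") call. Because it rebinds a module global, it is not safe when other threads load documents at the same time.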

JoelGotsch added 2 commits October 20, 2023 21:06

allow parser to work with large docx files

3c8b6e3

Without that, the parser will throw Error AttValue length too long

Merge pull request #1 from JoelGotsch/feature/handle-large-files

927f5e9

allow parser to work with large docx files
