Handle large docx files #1272
base: master
Without this setting, the parser throws the error "AttValue length too long".
Allow the parser to work with large docx files.
Hi @JoelGotsch, this is interesting. Can you say more about where you're encountering a docx file that is "too big" and what downsides there might be to choosing this setting?
Hmm, looks like this setting disables security restrictions, so probably not a good idea to enable by default :)
Hi @scanny!
Let's discuss how we might approach this.
So maybe a function like:

    def disable_parser_security() -> None:
        """Disable XML exploit detection to allow handling of large DOCX files (e.g. > 100MB)."""
        global oxml_parser
        oxml_parser = etree.XMLParser(remove_blank_text=True, resolve_entities=False, huge_tree=True)
        oxml_parser.set_element_class_lookup(element_class_lookup)

We're going to want a better name than that, something more specific.

You can try this from your own code to see if it works, something like this:

    from docx import Document
    from lxml import etree
    import docx.oxml.parser as parser

    # Replace the module-level parser with one that has huge_tree enabled
    parser.oxml_parser = etree.XMLParser(remove_blank_text=True, resolve_entities=False, huge_tree=True)
    parser.oxml_parser.set_element_class_lookup(parser.element_class_lookup)
    document = Document("huge_docx.docx")

If that works, I think that gives us pretty high confidence that such an approach would work. We'd need to work out where in the documentation to mention this. I don't think there's a section there for the Oxml parser because it doesn't have any actual interface yet as far as I know.
I just tested it and it works.
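Since the concern in this thread is that the relaxed parser should not be on by default, one way to keep the override narrow is to swap the module-level parser only while a large file is being loaded and restore it afterwards. The following is a minimal sketch of that idea, not something python-docx provides: `temporary_attr` is a hypothetical helper name, and `fake_parser_module` is a stand-in object so the example runs without python-docx or lxml installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attr(obj, name, value):
    """Set obj.<name> to value inside the with-block, then restore the original."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield value
    finally:
        # Runs on normal exit and on exceptions, so the default always comes back.
        setattr(obj, name, original)


# Hypothetical stand-in for the docx.oxml.parser module (assumption for this demo).
fake_parser_module = SimpleNamespace(oxml_parser="default parser")

with temporary_attr(fake_parser_module, "oxml_parser", "huge_tree parser"):
    inside = fake_parser_module.oxml_parser  # override is active here
after = fake_parser_module.oxml_parser  # original value is restored here
```

In real use, the first argument would presumably be the `docx.oxml.parser` module, the value would be an `etree.XMLParser` built with `huge_tree=True` and the element class lookup set as in the snippet above, and the `Document("huge_docx.docx")` call would go inside the `with` block. Whether any parsing happens after `Document()` returns has not been checked here, so treat that scoping as an assumption.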
Fix issue where the parser throws an "AttValue length too long" error for large docx files