add script to render html from unstructured elements #3799
This is definitely a bug that additional pages are missing!
Since this is not intended to be used as a module, let's move it to a new directory under
I think @cragwolfe would be happy if we move it from the html directory to the htmlv2 directory ❤️
There are some improvements possible, but for ad-hoc type of scripts I think it is good to go :D
LGTM! 🚀
@@ -0,0 +1,146 @@
# pyright: reportPrivateUsage=false
It would be nice to add example usage in docstrings.
I personally tested it this way:
python3 scripts/html/rendered_html_from_elements.py Breast_Cancer1-5.pdf.json --outdir .
cat Breast_Cancer1-5.pdf.json | PROCESS_FROM_STDIN=true python3 scripts/html/rendered_html_from_elements.py
    if filepath is None and text is None:
        logger.error("Either filepath or text must be provided.")
        raise ValueError("Either filepath or text must be provided.")
    if filepath is not None and text is not None:
        logger.error("Both filepath and text cannot be provided.")
        raise ValueError("Both filepath and text cannot be provided.")
    if filepath is not None:
        logger.info("Rendering HTML from file: %s", filepath)
    else:
        logger.info("Rendering HTML from text.")
I am not a huge fan of lots of ifs that can easily be avoided.
From STDIN we already have text.
From a file path we can read the text and pass it in here.
So we could simply always expect text.
I do see that a method expecting 'stringified' JSON is kind of unusual; alternatively, we could always expect a filename (a temp dir would have to be used with STDIN).
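A minimal sketch of the "always expect text" idea, in case it helps. The names `load_elements_text` and `render_from_text` are hypothetical and not taken from the PR, and the rendering step is replaced by a plain `json.loads` stand-in; the point is only that the file-vs-STDIN decision happens once, at the edge, so the renderer itself needs no branching.

```python
import json
import sys
from pathlib import Path
from typing import Optional


def load_elements_text(filepath: Optional[str] = None, stdin=sys.stdin) -> str:
    """Return the raw JSON text: from the file if a path is given, else from stdin.

    Hypothetical helper, not part of the PR.
    """
    if filepath is not None:
        return Path(filepath).read_text(encoding="utf-8")
    return stdin.read()


def render_from_text(text: str) -> list:
    """Hypothetical stand-in for the rendering step.

    It only parses the JSON; the real script would build HTML from the elements.
    """
    return json.loads(text)
```

With this shape, the caller would do something like `render_from_text(load_elements_text(path_or_none))`, and the "either/both" validation in the renderer is no longer needed.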
    return html_document


def group_elements_by_page(
Hint for the future https://docs.python.org/3/library/itertools.html#itertools.groupby
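For reference, a rough sketch of what a `groupby`-based version could look like. It assumes the elements are plain dicts loaded from the JSON output with a `metadata.page_number` field, and that a missing page number should fall back to page 1; both are assumptions, and `group_by_page_sketch` is a hypothetical name, not the function from the PR. Note that `itertools.groupby` only groups consecutive items, so the input is sorted by the same key first.

```python
from itertools import groupby
from typing import Any, Dict, List

Element = Dict[str, Any]


def page_of(element: Element) -> int:
    # Assumption: elements without a page number belong to page 1.
    return element.get("metadata", {}).get("page_number", 1)


def group_by_page_sketch(elements: List[Element]) -> List[List[Element]]:
    """Group element dicts into one list per page, ordered by page number."""
    # sorted() is stable, so elements keep their relative order within a page.
    ordered = sorted(elements, key=page_of)
    return [list(group) for _, group in groupby(ordered, key=page_of)]
```

If the elements are already guaranteed to arrive in page order, the `sorted` call could be dropped.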
Script to render HTML from unstructured elements.
NOTE: This script is not intended to be used as a module.
NOTE: This script is only intended to be used with outputs that have non-empty metadata.text_as_html.
TODO: It was noted that the unstructured_elements_to_ontology function always returns a single page. This script uses helper functions to handle multiple pages. I am not sure whether this was intended or is a bug. If it is a bug, it would require somewhat longer debugging, so I used workarounds to make the script usable quickly.
Usage: test with any outputs that have non-empty metadata.text_as_html. Example files attached.
[Example-Bill-of-Lading-Waste.docx.pdf.json](https://github.com/user-attachments/files/17922898/Example-Bill-of-Lading-Waste.docx.pdf.json)
Breast_Cancer1-5.pdf.json