add script to render html from unstructured elements #3799

Merged: 6 commits, Dec 5, 2024

mariannaparzych (Contributor) commented Nov 26, 2024

Script to render HTML from unstructured elements.

NOTE: This script is not intended to be used as a module.
NOTE: This script is only intended to be used with outputs with non-empty metadata.text_as_html.
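The second note can be turned into an up-front check. This is a minimal sketch, assuming the input is a list of element dicts whose HTML lives under `metadata.text_as_html`; the helper names `has_text_as_html` and `elements_missing_html` are hypothetical and not part of the script.

```python
from typing import Any


def has_text_as_html(element: dict[str, Any]) -> bool:
    """Return True when the element carries a non-empty metadata.text_as_html."""
    html = element.get("metadata", {}).get("text_as_html")
    return isinstance(html, str) and html.strip() != ""


def elements_missing_html(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical helper: collect the elements the script could not render."""
    return [element for element in elements if not has_text_as_html(element)]
```

If `elements_missing_html` returns anything, the output is probably not a good fit for this script.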

TODO: The unstructured_elements_to_ontology function appears to always return a single page, so this script uses helper functions to handle multiple pages. I am not sure whether that behavior is intended or a bug. If it is a bug, fixing it would require longer debugging, so I used workarounds to make the script usable quickly.
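For illustration, a per-page workaround could look roughly like the sketch below. It assumes each element is a dict with an integer `metadata.page_number`, and it treats elements without one as belonging to page 1. The name `group_by_page_sketch` is hypothetical; this is not the helper from the PR.

```python
from collections import defaultdict
from typing import Any


def group_by_page_sketch(elements: list[dict[str, Any]]) -> list[list[dict[str, Any]]]:
    """Group element dicts by metadata.page_number, ordered by page."""
    pages: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for element in elements:
        # Assumption: a missing or empty page number means page 1.
        page_number = element.get("metadata", {}).get("page_number") or 1
        pages[page_number].append(element)
    # Element order within a page is preserved; pages are sorted by number.
    return [pages[number] for number in sorted(pages)]
```

Each inner list could then be converted on its own, so a single-page result per call is no longer a limitation.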

Usage: test with any output that has non-empty metadata.text_as_html.
Example files are attached.
[Example-Bill-of-Lading-Waste.docx.pdf.json](https://github.com/user-attachments/files/17922898/Example-Bill-of-Lading-Waste.docx.pdf.json)

Breast_Cancer1-5.pdf.json

plutasnyy (Contributor) commented Nov 26, 2024

The missing additional pages are definitely a bug!
Thanks for the quick workaround.

cragwolfe (Contributor) commented

Since this is not intended to be used as a module, let's move it to a new directory under /scripts:

/scripts/htmlv2/

plutasnyy (Contributor) left a comment

I think @cragwolfe would be happy if we moved it from the html directory to the htmlv2 directory ❤️
Some improvements are possible, but for an ad-hoc script I think it is good to go :D
LGTM! 🚀

@@ -0,0 +1,146 @@
# pyright: reportPrivateUsage=false

It would be nice to add example usage to the docstrings.
I personally tested it this way:

python3 scripts/html/rendered_html_from_elements.py Breast_Cancer1-5.pdf.json --outdir .
cat Breast_Cancer1-5.pdf.json | PROCESS_FROM_STDIN=true python3 scripts/html/rendered_html_from_elements.py
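One way to surface those commands is to put them in the help text. The sketch below assumes an `argparse`-based entry point with an optional positional file and an `--outdir` option, inferred only from the two commands above. The script's real argument handling may differ, and `build_parser` is a hypothetical name.

```python
import argparse

# Usage text copied from the commands quoted in the review comment.
USAGE = """\
Example usage:
  python3 scripts/html/rendered_html_from_elements.py Breast_Cancer1-5.pdf.json --outdir .
  cat Breast_Cancer1-5.pdf.json | PROCESS_FROM_STDIN=true python3 scripts/html/rendered_html_from_elements.py
"""


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser whose --help output includes the usage examples."""
    parser = argparse.ArgumentParser(
        description="Render HTML from unstructured elements.",
        epilog=USAGE,
        # Keep the line breaks in the epilog exactly as written.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "filepath",
        nargs="?",
        default=None,
        help="Path to a JSON file of elements (omit when reading from STDIN).",
    )
    parser.add_argument("--outdir", default=None, help="Directory for the rendered HTML.")
    return parser
```

Running the script with `--help` would then print both invocations verbatim.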

Comment on lines +82 to +91
if filepath is None and text is None:
    logger.error("Either filepath or text must be provided.")
    raise ValueError("Either filepath or text must be provided.")
if filepath is not None and text is not None:
    logger.error("Both filepath and text cannot be provided.")
    raise ValueError("Both filepath and text cannot be provided.")
if filepath is not None:
    logger.info("Rendering HTML from file: %s", filepath)
else:
    logger.info("Rendering HTML from text.")
plutasnyy (Contributor) commented Nov 28, 2024

I am not a huge fan of lots of ifs that can easily be avoided.
From STDIN we already have text.
From a file path we can read the text and pass it in here.
So we could simply always expect text.

I realize that a method expecting 'stringified' JSON is somewhat unusual; alternatively, we could always expect a filename (a temp dir would then be needed for STDIN).
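A rough sketch of the "always expect text" variant: the input source is resolved once at the edge, and everything downstream takes a string. `load_text` and `parse_elements` are hypothetical names and signatures, not taken from the PR, and the rendering step itself is left out.

```python
import json
import sys
from pathlib import Path
from typing import Any, Optional, TextIO


def load_text(filepath: Optional[str], stdin: Optional[TextIO] = None) -> str:
    """Resolve the input to text: read the file if a path is given, else STDIN."""
    if filepath is not None:
        return Path(filepath).read_text(encoding="utf-8")
    return (stdin or sys.stdin).read()


def parse_elements(text: str) -> list[dict[str, Any]]:
    """Parse the stringified JSON; downstream code never sees a file path."""
    elements = json.loads(text)
    if not isinstance(elements, list):
        raise ValueError("Expected a JSON array of elements.")
    return elements
```

With this split, the filepath/text validation branches collapse into the single `if` inside `load_text`.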

return html_document


def group_elements_by_page(

@mariannaparzych mariannaparzych added this pull request to the merge queue Dec 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 3, 2024
@cragwolfe cragwolfe merged commit 4140f62 into main Dec 5, 2024
41 checks passed
@cragwolfe cragwolfe deleted the ml_577 branch December 5, 2024 03:46
3 participants