Switch from "HTML-like" to hOCR, an open-standard for OCR results #62

Merged
merged 21 commits into from
Sep 24, 2020

@HiromuHota HiromuHota commented Sep 21, 2020

To help HazyResearch/fonduer#509, this PR introduces two major changes:

  1. Switch from "HTML-like" to hOCR
  2. Currently, a leaf element in the output HTML can contain multiple words; after this change, it will contain only a single word (i.e., <span class="ocrx_word">). This change is needed to comply with hOCR.
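
As an illustration of the second change, here is a minimal sketch of a one-word-per-leaf line in hOCR style. It is not pdftotree's actual implementation: the function name `words_to_hocr_line`, the input format, and the coordinates are hypothetical. Only the `ocr_line` and `ocrx_word` class names and the `bbox` property in the `title` attribute come from the hOCR conventions.

```python
from html import escape


def words_to_hocr_line(words, line_bbox):
    """Build one hOCR-style line whose leaf elements each hold a single word.

    `words` is a list of (text, (x0, y0, x1, y1)) tuples. This input format
    is an assumption made for the sketch, not pdftotree's internal model.
    """

    def bbox_title(bbox):
        # hOCR stores the bounding box in the title attribute: "bbox x0 y0 x1 y1"
        return "bbox " + " ".join(str(int(v)) for v in bbox)

    word_spans = [
        f'<span class="ocrx_word" title="{bbox_title(bbox)}">{escape(text)}</span>'
        for text, bbox in words
    ]
    return (
        f'<span class="ocr_line" title="{bbox_title(line_bbox)}">'
        + " ".join(word_spans)
        + "</span>"
    )


# Hypothetical coordinates, for illustration only.
line = words_to_hocr_line(
    [("Hello", (10, 10, 50, 22)), ("world", (55, 10, 95, 22))],
    (10, 10, 95, 22),
)
print(line)
```

Each word gets its own leaf span with its own bounding box, rather than one leaf holding the whole phrase.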

In addition, I'm wondering whether pdftotree should provide rich information like <section_header/> or <table_caption/>, which can be extracted nicely from certain types of documents but not from others. Instead, I would take a step back and provide only basic but more reliable information, the kind that OCR could produce.

@HiromuHota
Contributor Author

README says

Our goal in this project is to develop a tool that extracts text, figures and tables in a pdf document and maintains the structure of the document using a tree data structure.

Providing only basic information, rather than rich information that captures the document structure such as "section_header", would change pdftotree's original goal. I might have to fork the project and rename it "pdftohocr" (or "pdf2hocr") instead of merging this PR into pdftotree.

@HiromuHota HiromuHota marked this pull request as ready for review September 23, 2020 23:06
@@ -66,8 +65,8 @@ def parse(
log.info("Tree structure built, creating html...")
pdf_html = extractor.get_html_tree()
log.info("HTML created.")
# Check html_path exists, create if not
pdf_html = re.sub(r"[\x00-\x1F]+", "", pdf_html)
# TODO: what is the following substitution for and is it required?
Contributor


If I recall, we had some filenames with special characters that caused issues, so this was taking the first 32 ASCII characters (the control characters \x00 through \x1F) and turning them into nothing.

I don't know if it's important to do anymore though, and it seems fine to leave it out until someone runs into it again (if someone runs into it again).

Contributor Author


Filenames? That is a bit strange, because this line of code substitutes characters in the content.
Anyway, I'll keep this line commented out until this causes an issue again.
Thanks for recalling that!
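
For reference, a minimal sketch of what the substitution discussed in this thread does to a string. The pattern is the one from the diff; the helper name `strip_control_chars` and the sample string are hypothetical. The character class `[\x00-\x1F]` covers the 32 ASCII control characters, which includes tab, newline, and carriage return, so line breaks in the content are removed along with bytes such as NUL.

```python
import re

# Same pattern as in the diff: one or more ASCII control characters (0x00-0x1F).
CONTROL_CHARS = re.compile(r"[\x00-\x1F]+")


def strip_control_chars(text):
    """Remove every ASCII control character, including \\t, \\n, and \\r."""
    return CONTROL_CHARS.sub("", text)


# Hypothetical content, for illustration only.
sample = "<p>line one\n\tline\x00 two\x1f</p>"
cleaned = strip_control_chars(sample)
print(cleaned)  # <p>line oneline two</p>
```

The space character (0x20) is outside the range and is kept, but the newline between "one" and "line" is dropped, so adjacent lines of content are joined.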

@lukehsiao
Contributor

LGTM. I'll let you take care of merging whenever you're ready :).

@HiromuHota HiromuHota merged commit 54065ad into HazyResearch:master Sep 24, 2020
@HiromuHota HiromuHota deleted the hocr branch September 24, 2020 19:59