Switch from "HTML-like" to hOCR, an open-standard for OCR results #62
README says
Providing only basic information, rather than rich information that captures the document structure like "section_header", would change pdftotree's original goal. I might have to fork and rename it to "pdftohocr" (or "pdf2hocr") instead of merging this PR into pdftotree.
LTTextBox in PDFMiner represents a group of text chunks contained in a "geometric" rectangular area, but it does not necessarily represent a "logical" boundary of the text. https://pdfminer-docs.readthedocs.io/programming.html?highlight=lttextline#performing-layout-analysis
@@ -66,8 +65,8 @@ def parse(
log.info("Tree structure built, creating html...")
pdf_html = extractor.get_html_tree()
log.info("HTML created.")
# Check html_path exists, create if not
pdf_html = re.sub(r"[\x00-\x1F]+", "", pdf_html)
# TODO: what is the following substition for and is it required?
If I recall, we had some filenames with special characters that caused issues, so this was taking the ASCII control characters (0x00 through 0x1F) and turning them into nothing.
I don't know if it's important to do anymore though, and it seems fine to leave it out until someone runs into it again (if someone runs into it again).
Filenames? That's a bit strange, because this line of code substitutes characters in the content, not in a filename.
Anyway, I'll keep this line commented out until it causes an issue again.
Thanks for recalling that!
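For reference, here is a minimal sketch of what the substitution in the diff does when applied to content. The pattern is copied from the diff; the helper name `strip_control_chars` and the sample string are made up for illustration and are not part of pdftotree. Note that the range 0x00 through 0x1F also covers tab, newline, and carriage return, so those are removed along with the less common control characters.

```python
import re

# Same pattern as in the diff: one or more characters in the range 0x00-0x1F.
CONTROL_CHARS = re.compile(r"[\x00-\x1F]+")


def strip_control_chars(text: str) -> str:
    """Remove every ASCII control character (0x00-0x1F) from text.

    Hypothetical helper for illustration only; pdftotree calls re.sub inline.
    """
    return CONTROL_CHARS.sub("", text)


# Illustrative input: a newline, a tab, and a form feed inside some markup.
sample = "<p>line one\nline\ttwo\x0c</p>"
cleaned = strip_control_chars(sample)
print(repr(cleaned))
```

Spaces (0x20) and all printable characters are left alone, so the main visible effect on HTML content would be that line breaks and tabs disappear.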
LGTM. I'll let you take care of merging whenever you're ready :).
To help HazyResearch/fonduer#509, this PR introduces two major changes:
<span class="ocrx_word"). This change is to comply with hOCR.
In addition, I'm wondering whether pdftotree should provide rich information like <section_header/> or <table_caption/>, which can be extracted nicely from certain types of documents but may not be from others. Instead, I would take a step back and provide only basic but more reliable information that OCR could produce.
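To make the hOCR target concrete, here is a rough sketch of emitting a single word as an `ocrx_word` span. The class name and the `bbox x0 y0 x1 y1` entry in the `title` attribute follow the hOCR convention; the function name `hocr_word` and the rounding of coordinates to integers are assumptions made for this example, not what this PR implements.

```python
from html import escape


def hocr_word(text, bbox):
    """Render one word as an hOCR-style ocrx_word span.

    Hypothetical helper: bbox is (x0, y0, x1, y1); values are rounded to
    integers here, which is an assumption of this sketch.
    """
    x0, y0, x1, y1 = (int(round(v)) for v in bbox)
    title = f"bbox {x0} {y0} {x1} {y1}"
    # Escape the text so characters like "<" and "&" stay valid markup.
    return f'<span class="ocrx_word" title="{title}">{escape(text)}</span>'


print(hocr_word("Hello", (10, 20, 50, 40)))
```

A span like this carries only the text and its position, which matches the "basic but more reliable" output described above; it says nothing about whether the word belongs to a section header or a table caption.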