bug/html parsing incorrectly categorizing text #3666

Closed
bhoppeadoy opened this issue Sep 26, 2024 · 1 comment · Fixed by #3841
Labels
bug Something isn't working html

bhoppeadoy commented Sep 26, 2024

Describe the bug
The HTML parser assigns different text types to a set of similar divs. Three consecutive divs are categorized as UncategorizedText, Title, and NarrativeText. They should all receive the same text category, and none of them should be a Title.

To Reproduce
Parse the attached HTML file and examine the elements created for these three div ids: SBOS510440, SBOS5102933, SBOS5105314.

<div id="SBOS510440" class="textnote"><span class="section-label">1. </span>V<sub>IN</sub> = +V<sub>S</sub> + 500 mV.</div>

<div id="SBOS5102933" class="textnote"><span class="section-label">2. </span>TVS: +V<sub>S(max)</sub> > V<sub>TVSBR (Min)</sub> > +V<sub>S</sub></div>

<div id="SBOS5105314" class="textnote"><span class="section-label">3. </span>Suggested value is approximately 5 kΩ in overvoltage conditions.</div>

from unstructured.partition.html import partition_html
elements = partition_html(text=html_text)
electrical-overstress-sbos8126148.zip
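
For readers without the attachment, a minimal sketch of the reproduction using only the three divs quoted above might look like the following. It assumes `unstructured` is installed and that the three divs alone trigger the same behavior as the full file, which has not been verified here. The markup is tidied (no whitespace inside tags, `>` escaped, subscript tags balanced), and `HTML_TEXT`, `summarize`, and `main` are hypothetical names used only for illustration.

```python
# Hypothetical inline copy of the three divs from this issue (tidied markup).
HTML_TEXT = """
<div id="SBOS510440" class="textnote"><span class="section-label">1. </span>V<sub>IN</sub> = +V<sub>S</sub> + 500 mV.</div>
<div id="SBOS5102933" class="textnote"><span class="section-label">2. </span>TVS: +V<sub>S(max)</sub> &gt; V<sub>TVSBR (Min)</sub> &gt; +V<sub>S</sub></div>
<div id="SBOS5105314" class="textnote"><span class="section-label">3. </span>Suggested value is approximately 5 kΩ in overvoltage conditions.</div>
"""


def summarize(elements):
    """Return (category, text) pairs for objects exposing .category and .text."""
    return [(el.category, el.text) for el in elements]


def main():
    # Imported lazily so the helper above stays usable without the library.
    try:
        from unstructured.partition.html import partition_html
    except ImportError:
        print("unstructured is not installed; nothing to reproduce")
        return

    # Print one line per element so differing categories are easy to spot.
    for category, text in summarize(partition_html(text=HTML_TEXT)):
        print(f"{category:20} {text!r}")


if __name__ == "__main__":
    main()
```

If the behavior reproduces, the printed categories for the three notes would differ from one another, as described above.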

Expected behavior
Expect these three divs to have the same text category, and not a title category.
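
The expectation above could be written as a small check. This is only a sketch: `FakeElement` is a stand-in with a `category` attribute (real parsed elements expose the same attribute), `check_footnote_categories` is a hypothetical helper, and `NarrativeText` in the passing case is just one example of a shared category, since the issue does not say which category is the right one.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Stand-in for a parsed element; only `category` is inspected."""
    category: str


def check_footnote_categories(elements):
    """Return a list of problems; an empty list means the expectation holds."""
    categories = [el.category for el in elements]
    problems = []
    if len(set(categories)) > 1:
        problems.append(f"mixed categories: {categories}")
    if "Title" in categories:
        problems.append("at least one element is categorized as Title")
    return problems


# Categories reported in this issue for the three divs.
reported = [
    FakeElement("UncategorizedText"),
    FakeElement("Title"),
    FakeElement("NarrativeText"),
]
print(check_footnote_categories(reported))

# One outcome that would satisfy the expectation (category chosen as an example).
expected = [FakeElement("NarrativeText")] * 3
print(check_footnote_categories(expected))
```

The same helper could be pointed at the elements returned by `partition_html` once they are filtered down to the three notes.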

Screenshots
Here's the list of elements for the three divs:

[Screenshot: elements]

Environment Info
OS version: Linux-6.1.83-4.ph5-x86_64-with-glibc2.28
Python version: 3.10.14
unstructured version: 0.15.13
unstructured-inference is not installed
pytesseract is not installed
Torch is not installed
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.33
magic file from /etc/magic:/usr/share/misc/magic

@bhoppeadoy bhoppeadoy added the bug Something isn't working label Sep 26, 2024
@scanny scanny added the html label Dec 16, 2024
scanny (Collaborator) commented Dec 16, 2024

Yep, good call. I'll sort this out.

@scanny scanny self-assigned this Dec 16, 2024
scanny added a commit that referenced this issue Dec 18, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 18, 2024
Fixes #3666

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>