Describe the bug
The HTML parser assigns different text categories to three consecutive divs that share the same structure. The divs sit one right after another, yet one is categorized as UncategorizedText, one as Title, and one as NarrativeText. All three should receive the same text category, and none of them should be a Title.
To Reproduce
Parse the attached HTML file, then examine the elements created for these three div ids: SBOS510440, SBOS5102933, and SBOS5105314.
<div id="SBOS510440" class="textnote"><span class="section-label">1. </span>V<sub>IN</sub> = +V<sub>S</sub> + 500 mV.</div>
<div id="SBOS5102933" class="textnote"><span class="section-label">2. </span>TVS: +V<sub>S(max)</sub> > V<sub>TVSBR (Min)</sub> > +V<sub>S</sub> </div>
<div id="SBOS5105314" class="textnote"><span class="section-label">3. </span>Suggested value is approximately 5 kΩ in overvoltage conditions.</div>
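from unstructured.partition.html import partition_html
elements = partition_html(text=html_text)  # html_text holds the contents of the attached file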
electrical-overstress-sbos8126148.zip
Expected behavior
Expect these three divs to be assigned the same text category, and that category should not be Title.
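For anyone triaging this, here is a minimal sketch of how the expectation could be checked against the three divs alone. The inline `HTML_TEXT` is transcribed from the snippet in this report (the attached file is the authoritative input, and the isolated snippet may not trigger the same categorization as the full document). The helper names `summarize`, `meets_expectation`, and `run_repro` are hypothetical, and the sketch assumes each returned element exposes `category` and `text` attributes.

```python
# Three divs transcribed from this report; the attached file is the real input.
HTML_TEXT = """
<div id="SBOS510440" class="textnote"><span class="section-label">1. </span>V<sub>IN</sub> = +V<sub>S</sub> + 500 mV.</div>
<div id="SBOS5102933" class="textnote"><span class="section-label">2. </span>TVS: +V<sub>S(max)</sub> > V<sub>TVSBR (Min)</sub> > +V<sub>S</sub> </div>
<div id="SBOS5105314" class="textnote"><span class="section-label">3. </span>Suggested value is approximately 5 kΩ in overvoltage conditions.</div>
"""


def summarize(elements):
    """Return (category, text) pairs for each parsed element.

    Assumes every element has `category` and `text` attributes.
    """
    return [(el.category, el.text) for el in elements]


def meets_expectation(summary):
    """True when all elements share one category and that category is not Title."""
    categories = {category for category, _ in summary}
    return len(categories) == 1 and "Title" not in categories


def run_repro():
    """Parse the inline divs and print each element's category and text."""
    # Imported here so the helpers above work without unstructured installed.
    from unstructured.partition.html import partition_html

    summary = summarize(partition_html(text=HTML_TEXT))
    for category, text in summary:
        print(f"{category:20} {text}")
    print("meets expectation:", meets_expectation(summary))
```

Calling `run_repro()` in an environment with unstructured installed prints one line per element. With the behavior described above, the final line would read `meets expectation: False`.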
Screenshots
Here's the list of elements for the three divs:
Environment Info
OS version: Linux-6.1.83-4.ph5-x86_64-with-glibc2.28
Python version: 3.10.14
unstructured version: 0.15.13
unstructured-inference is not installed
pytesseract is not installed
Torch is not installed
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.33
magic file from /etc/magic:/usr/share/misc/magic