feat: method to catch and classify overlapping bounding boxes #1803
Includes catching nested elements with a pixel error tolerance (defaults to 5px).
This PR isn't marked draft but it's missing the normal code quality features, like type hints and tests.
unstructured/utils.py
def catch_overlapping_bboxes(
    elements,
) -> bool:
This function is over 150 lines long, which is way over the rule of thumb limit of 5-20 lines. Breaking it up into logical subtasks that get their own functions would really increase the readability.
I renamed catch_overlapping_bboxes to catch_overlapping_and_nested_bboxes and created two more methods to break it up: identify_overlapping_or_nesting_case and identify_overlapping_case. A short description can be found in the PR description. Is it better? I can try to split it further, but most of the conditions that classify the overlapping case belong in one method, which makes it large.
Could you add some context to the PR description about what this is in service of? Like once we have a robust way of classifying the overlaps, what will the next step be?
Similarly to Alan, I am wondering what the end game of this PR is. In which scenario is this going to be used, e.g. to support processing of overlapping bounding boxes? I would like this to be clear so that I understand what the code needs to do. Thanks!
Also, I think the intention of the n-gram calculation is to identify the longest substring common to the strings of two bounding boxes. This is achieved using n-gram calculation, with a recursive function used to find the largest n, which seems kind of expensive, even though the strings might not be long enough for this to be a problem. What about the following example function instead, to calculate the longest substring common to two strings?
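The function attached to this comment is not preserved here. As a stand-in, and not the reviewer's actual code, a character-level longest-common-substring helper could be sketched roughly as follows. The name longest_common_substring and the dynamic-programming approach are assumptions made for illustration only.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of characters that appears contiguously in both a and b.

    Hypothetical helper for illustration; not taken from the PR or the review.
    Uses the classic O(len(a) * len(b)) dynamic-programming table, kept to two rows.
    """
    best_len = 0  # length of the best match found so far
    best_end = 0  # index in `a` just past the end of that match
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len : best_end]
```

Because it compares characters, a helper like this can return a match that starts or ends in the middle of a word, which matters for the word-level comparison discussed in the replies.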
I added a more detailed description. Hopefully that covers the purpose of the PR. Talking with @pravin-unstructured, who created this ticket, we thought it made sense to put this in
@ajjimeno . I compared your proposed solution against mine. Here are the results for a small pair of strings:
First, what I intended to get is the largest common substring made of consecutive words, so the output "This is text" from your method is wrong for that purpose. My method was faster too. Perhaps this example will help clarify. Here are two paragraphs A =
B =
The results for
The results for your
I don't think I can do much of what I wanted with the result of your method, and it takes longer to execute. However, we could add this metric to the metadata if you find it useful. I hope this is enough to clarify. Let me know your thoughts.
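To make the word-level idea concrete, here is a minimal sketch of the n-gram approach as the PR description words it: build n-grams, measure the share of string A's n-grams that also occur in string B, and step n down from the largest possible value until that share is above zero. The function names below are hypothetical and the code is not the PR's implementation. Splitting on whitespace to get word n-grams is an assumption based on the "consecutive words" remark.

```python
from typing import List, Tuple


def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """All runs of n consecutive whitespace-separated words (assumed tokenization)."""
    words = text.split()
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]


def shared_ngram_percentage(string_a: str, string_b: str, n: int) -> float:
    """Percentage of string_a's n-grams that also occur in string_b."""
    grams_a = word_ngrams(string_a, n)
    if not grams_a:
        return 0.0
    grams_b = set(word_ngrams(string_b, n))
    shared = sum(1 for gram in grams_a if gram in grams_b)
    return 100.0 * shared / len(grams_a)


def largest_shared_ngram(string_a: str, string_b: str) -> Tuple[int, float]:
    """Largest n with a non-zero shared percentage, plus that percentage.

    Written as a loop from the largest candidate n downward; returns (0, 0.0)
    when the two strings share no words at all.
    """
    max_n = min(len(string_a.split()), len(string_b.split()))
    for n in range(max_n, 0, -1):
        percentage = shared_ngram_percentage(string_a, string_b, n)
        if percentage > 0.0:
            return n, percentage
    return 0, 0.0
```

For "the quick brown fox" against "a quick brown dog", this sketch stops at n = 2, since "quick brown" is the longest run of consecutive words the two strings share.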
Typing and tests added. The test covers a case with overlap and a case without.
    box2: Union[List, Tuple],
    intersection_ratio_method: str = "total",
):
    """Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right].
Would it be difficult to change the function to work with our current format (x_top_left, y_top_left, x_bottom_right, y_bottom_right)?
This works with the coordinate format present in the element list output, so no transform is necessary. Just pass the list of elements exactly as the partition method gives them to you. The variables x_bottom_left, y_bottom_left, x_top_right, and y_top_right are named with reference to a coordinate system whose y axis increases upward rather than downward (↓). But this should not affect the results. For instance, in the bbox:
'coordinates': {'points': ((374.0989074707031,
257.9061279296875),
(374.0989074707031, 282.63995361328125),
(613.169189453125, 282.63995361328125),
(613.169189453125, 257.9061279296875)),
The first point would be the bottom_left corner in the described reference system.
Inside the methods I work with the bottom_left and top_right corners. These methods are only going to be called when catch_overlapping_and_nested_bboxes is called, so changing to the other corners won't make any difference in practice. I am not sure what you mean by "our current format" if the input to catch_overlapping_and_nested_bboxes is a list of elements with the coords as I showed. The method calculate_overlap_percentage is NOT intended to be run outside catch_overlapping_and_nested_bboxes, so I don't see the point. Let me know if you consider the changes necessary.
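As an illustration of the point about corner naming, the four-point tuple shown above can be reduced to a two-corner box by taking the minimum and maximum of each axis, which gives the same result whichever way the y axis is drawn. This is a minimal sketch; points_to_box is a hypothetical helper name and not a function from the PR.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def points_to_box(points: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Collapse corner points into (x_min, y_min, x_max, y_max).

    In the naming used in this thread, (x_min, y_min) plays the role of
    bottom_left and (x_max, y_max) the role of top_right.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# The coordinates quoted in the comment above.
points = (
    (374.0989074707031, 257.9061279296875),
    (374.0989074707031, 282.63995361328125),
    (613.169189453125, 282.63995361328125),
    (613.169189453125, 257.9061279296875),
)
box = points_to_box(points)
```

Here the first point of the tuple ends up as the first corner of the box, which matches the observation that it acts as the bottom_left corner in the described reference system.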
Thanks @LaverdeS for adding content and the evaluation of the functions. No need to change it then based on your example.
LGTM!
I used the code below to test it.
from unstructured.partition.auto import partition
from unstructured.utils import catch_overlapping_and_nested_bboxes
target = "example-docs/layout-parser-paper-fast.pdf"
model_name = "yolox_quantized"
elements = partition(filename=target, strategy='hi_res', model_name=model_name)
overlapping_flag, overlapping_cases = catch_overlapping_and_nested_bboxes(elements)
for element in elements:
    print(element.__dict__)
for case in overlapping_cases:
    print(case, "\n")
We have established that overlapping bounding boxes do not have a one-size-fits-all solution, so different cases need to be handled differently to avoid information loss. We have manually identified the cases/categories of overlapping. Now we need a method to programmatically classify overlapping-bbox cases within the detected elements of a document and return a report about it (a list of cases with metadata). This serves two purposes:
partition_pdf, to handle the calls to post-processing methods to fix overlapping.
Tested on ~331 documents, the worst time per page is around 5 ms. For a document such as layout-parser-paper.pdf it takes 4.46 ms.
Introduces functionality to take a list of unstructured elements (which contain bounding boxes) and identify pairs of bounding boxes which overlap and which case is pertinent to the pairing. This PR includes the following methods in utils.py:
- ngrams(s, n): Generate n-grams from a string.
- calculate_shared_ngram_percentage(string_A, string_B, n): Calculate the percentage of common_ngrams between string_A and string_B with reference to the total number of ngrams in string_A.
- calculate_largest_ngram_percentage(string_A, string_B): Iteratively call calculate_shared_ngram_percentage starting from the biggest ngram possible until the shared percentage is > 0.0%.
- is_parent_box(parent_target, child_target, add=0): True if the child_target bounding box is nested in the parent_target. Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right]. The parameter 'add' is the pixel error tolerance for extra pixels outside the parent region.
- calculate_overlap_percentage(box1, box2, intersection_ratio_method="total"): Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right]. Calculates the percentage of the overlapped region with reference to the biggest element-region (intersection_ratio_method="parent"), the smallest element-region (intersection_ratio_method="partial"), or the disjunctive union region (intersection_ratio_method="total").
- identify_overlapping_or_nesting_case: Identify if there are nested or overlapping elements. If overlapping is present, it identifies the case by calling the method identify_overlapping_case.
- identify_overlapping_case: Classifies the overlapping case for an element_pair input into one of 5 categories of overlapping.
- catch_overlapping_and_nested_bboxes: Catch overlapping and nested bounding box cases across a list of elements. The params nested_error_tolerance_px and sm_overlap_threshold help control the separation of the cases.
The overlapping/nested element cases that are being caught are:
calculate_largest_ngram_percentage(...)}% of the text
Here is a snippet to test it:
Here is a screenshot of a JSON built with the output list overlapping_cases: [screenshot]
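For readers who want a concrete picture of the two box helpers described in the method list, the following is a minimal sketch under stated assumptions, not the code from this PR. The names is_nested and overlap_percentage are hypothetical stand-ins for is_parent_box and calculate_overlap_percentage. Treating "total" as the area of the union of both boxes is an interpretation of "disjunctive union region" and may differ from the actual implementation.

```python
from typing import Sequence

# Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right]
Box = Sequence[float]


def box_area(box: Box) -> float:
    """Area of a box, clamped at zero for degenerate input."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def is_nested(parent: Box, child: Box, tolerance: float = 0.0) -> bool:
    """True if child lies inside parent, allowing `tolerance` pixels of spill."""
    return (
        child[0] >= parent[0] - tolerance
        and child[1] >= parent[1] - tolerance
        and child[2] <= parent[2] + tolerance
        and child[3] <= parent[3] + tolerance
    )


def overlap_percentage(box1: Box, box2: Box, method: str = "total") -> float:
    """Intersection area as a percentage of a reference area chosen by `method`."""
    inter_w = min(box1[2], box2[2]) - max(box1[0], box2[0])
    inter_h = min(box1[3], box2[3]) - max(box1[1], box2[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    intersection = inter_w * inter_h
    area1, area2 = box_area(box1), box_area(box2)
    if method == "parent":
        reference = max(area1, area2)  # biggest element region
    elif method == "partial":
        reference = min(area1, area2)  # smallest element region
    elif method == "total":
        reference = area1 + area2 - intersection  # union area (assumed meaning)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return 100.0 * intersection / reference
```

With a 5px tolerance, a child box that pokes 2px past the right edge of its parent still counts as nested in this sketch, which is the kind of case the pixel error tolerance in the PR is meant to absorb.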