feat: method to catch and classify overlapping bounding boxes #1803

LaverdeS · 2023-10-19T19:03:04Z

We have established that overlapping bounding boxes does not have a one-fits-all solution, so different cases need to be handled differently to avoid information loss. We have manually identified the cases/categories of overlapping. Now we need a method to programmatically classify overlapping-bboxes cases within detected elements in a document, and return a report about it (list of cases with metadata). This fits two purposes:

Evaluation: We can have a pipeline using the DVC data registry that assess the performance of a detection model against a set of documents (PDF/Images), by analysing the overlapping-bboxes cases it has. The metadata in the output can be used for generating metrics for this.
Scope overlapping cases: Manual inspection give us a clue about currently present cases of overlapping bboxes. We need to propose solutions to fix those on code. This method generates a report by analysing several aspects of two overlapping regions. This data can be used to profile and specify the necessary changes that will fix each case.
Fix overlapping cases: We could introduce this functionality in the flow of a partition method (such as partition_pdf, to handle the calls to post-processing methods to fix overlapping. Tested on ~331 documents, the worst time per page is around 5ms. For a document such as layout-parser-paper.pdf it takes 4.46 ms.

Introduces functionality to take a list of unstructured elements (which contain bounding boxes) and identify pairs of bounding boxes which overlap and which case is pertinent to the pairing. This PR includes the following methods in utils.py:

ngrams(s, n): Generate n-grams from a string
calculate_shared_ngram_percentage(string_A, string_B, n): Calculate the percentage of common_ngrams between string_A and string_B with reference to the total number of ngrams in string_A.
calculate_largest_ngram_percentage(string_A, string_B): Iteratively call calculate_shared_ngram_percentage starting from the biggest ngram possible until the shared percentage is >0.0%
is_parent_box(parent_target, child_target, add=0): True if the child_target bounding box is nested in the parent_target Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right]. The parameter 'add' is the pixel error tolerance for extra pixels outside the parent region
calculate_overlap_percentage(box1, box2, intersection_ratio_method="total"): Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right]. Calculates the percentage of overlapped region with reference to biggest element-region (intersection_ratio_method="parent"), the smallest element-region (intersection_ratio_method="partial"), or to the disjunctive union region (intersection_ratio_method="total").
identify_overlapping_or_nesting_case: Identify if there are nested or overlapping elements. If overlapping is present,
it identifies the case calling the method identify_overlapping_case.
identify_overlapping_case: Classifies the overlapping case for an element_pair input in one of 5 categories of overlapping.
catch_overlapping_and_nested_bboxes: Catch overlapping and nested bounding boxes cases across a list of elements. The params nested_error_tolerance_px and sm_overlap_threshold help controling the separation of the cases.

The overlapping/nested elements cases that are being caught are:

Nested elements
Small partial overlap
Partial overlap with empty content
Partial overlap with duplicate text (sharing 100% of the text)
Partial overlap without sharing text
Partial overlap sharing {calculate_largest_ngram_percentage(...)}% of the text

Here is a snippet to test it:

from unstructured.partition.auto import partition

model_name = "yolox_quantized"
target = "sample-docs/layout-parser-paper-fast.pdf"
elements = partition(filename=file_path_i, strategy='hi_res', model_name=model_name)
overlapping_flag, overlapping_cases = catch_overlapping_bboxes(elements)
for case in overlapping_cases:
    print(case, "\n")

Here is a screenshot of a json built with the output list overlapping_cases:

Includes catching nesting elements with an pixel error tolerance (default to 5px)

qued · 2023-10-21T17:01:42Z

This PR isn't marked draft but it's missing the normal code quality features, like type hints and tests.

qued · 2023-10-21T17:09:32Z

unstructured/utils.py

+def catch_overlapping_bboxes(
+    elements,
+) -> bool:


This function is over 150 lines long, which is way over the rule of thumb limit of 5-20 lines. Breaking it up into logical subtasks that get their own functions would really increase the readability.

I renamed catch_overlapping_bboxes to catch_overlapping_and_nested_bboxesand create two more methods to break it up: identify_overlapping_or_nesting_case and identify_overlapping_case. A short description can be found in the PR description. Is it better? I can try to split it further but most of the conditions to classify the overlapping-case shall be in one method, making it large.

qued · 2023-10-21T17:11:58Z

Could you add some context to the PR description about what this is in service of? Like once we have a robust way of classifying the overlaps, what will the next step be?

ajjimeno · 2023-10-22T22:39:16Z

Similarly to Alan, I am wondering what is the end game of this PR. In which scenario is this going to be used? e.g. support processing of overlapping bounding boxes. I would like to have this clear to understand what the code needs to do. Thanks!

As well, I think the intention of n-gram calculation is to identify the longest substring common in the strings of two bounding boxes. This is achieved using n-gram calculation and a recursive function is used to find the largest n, which seems kind of expensive, even though the strings might not be long enough for this to be a problem. What about the following example function instead to calculate the longest substring common to two strings?

def largest_common_substring(str1, str2):

    n, m = len(str1), len(str2)
    dp = [[0 for _ in range(m + 1)] for _ in range(n + 1)]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    lcs = ""
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j]:
            i -= 1
        elif dp[i][j] == dp[i][j - 1]:
            j -= 1
        else:
            lcs += str1[i - 1]
            i -= 1
            j -= 1

    return lcs[::-1]

str1 = "This is text 1"
str2 = "This is another text 2"

lcs = largest_common_substring(str1, str2)

print(lcs)

LaverdeS · 2023-10-23T10:39:23Z

Could you add some context to the PR description about what this is in service of? Like once we have a robust way of classifying the overlaps, what will the next step be?
@allan-unstructured

Similarly to Alan, I am wondering what is the end game of this PR. In which scenario is this going to be used? e.g. support processing of overlapping bounding boxes. I would like to have this clear to understand what the code needs to do. Thanks!
@ajjimeno

I added a more detailed description. Hopefully that covers the purpose of the PR. Talking with @pravin-unstructured__who created this ticket__we thought it was making sense to put this in unstructured, but let me know if you consider this should be part of a specific private repo instead.

LaverdeS · 2023-10-23T11:07:09Z

As well, I think the intention of n-gram calculation is to identify the longest substring common in the strings of two bounding boxes. This is achieved using n-gram calculation and a recursive function is used to find the largest n, which seems kind of expensive, even though the strings might not be long enough for this to be a problem. What about the following example function instead to calculate the longest substring common to two strings?
def largest_common_substring(str1, str2):

   ...
str1 = "This is text 1"
str2 = "This is another text 2"

lcs = largest_common_substring(str1, str2)

print(lcs)

@ajjimeno . I compared your proposed solution against mine. Here are the results for a small pair of strings:

str1 = "This is text 1"
str2 = "This is another text 2"
%time subs= get_largest_substring(str1, str2)
print(subs)
%time percent, shared, n_ = calculate_largest_ngram_percentage(str1, str2)
print(shared)

CPU times: user 231 µs, sys: 1e+03 ns, total: 232 µs
Wall time: 240 µs
This is text
CPU times: user 46 µs, sys: 2 µs, total: 48 µs
Wall time: 52.9 µs
{('This', 'is')}

First, what I intended to get is the largest common substring but with consecutive words. The output "This is text" from your method is wrong then. My method was faster too.

Perhaps this example would help clarifying. Here are two paragraphs A and B with little shared content, and I inserted a subtring "I AM THE LARGEST NGRAM IN THE TEXT" in both of them. The method I want should be able to quickly extract it. Here is the experiment:

A =

Typically for the 1960s, the UK concerts were arranged in a package-tour format, with multiple acts on the bill and two performances held each day.[12] The Kinks played six days a week,[13] a gruelling schedule intended to sharpen their abilities before their first US tour a month later.[14] On their off days, the band sometimes recorded in the studio.[15] The Kinks' management initially planned for the tour to be with Manfred Mann, but both bands rejected the idea in March after neither was willing to accept second billing.[8] The tour was instead the Kinks' first in which they were the featured headliners.[12]

Including support acts, each performance lasted around two hours[16] and was compèred by Bob Bain.[13] The Kinks were accompanied on tour by newly hired road manager Sam Curtis.[12] The support acts on the program were Mickey Finn, Jeff & Jon, the Yardbirds, the Riot Squad, Val McKenna and the American band Goldie and the Gingerbreads.[13] One of the first all-female rock groups,[17][18] the Gingerbreads drew extra press attention during the tour and performed immediately before the Kinks.[18][nb 2] Single appearances were made by Unit Four Plus Two, who performed on 7 May, and the Rockin' Berries, who played in place of the Yardbirds on 5 May.[15][20]

The Yardbirds closed the first half of each show.[21] Like the Kinks, the Yardbirds were part of the I AM THE LARGEST NGRAM IN THE TEXT British rhythm and blues movement; though they were initially not as commercially successful as the Kinks, they were quickly propelled to pop star status when their single "For Your Love" reached number one on NME's chart in April, three weeks before the tour began.[22][23] Decades later, Curtis recalled Ray throwing things onto the stage during one of the Yardbirds' sets,[24] an outburst biographer Johnny Rogan suggests stemmed from resentment on Ray's part at the success of a musical act comparable to the Kinks.[25] On 13 April, two weeks before the tour began,[26][27] both bands independently attempted recording some of the earliest pop songs with Indian influences, a genre later termed raga rock;[28] the Yardbirds re-recorded their song "Heart Full of Soul" on 20 April,[29] and the Kinks returned to the studio on 3 May – their first day off from the UK tour – to re-do "See My Friends".[15] Author Peter Lavezzoli writes that the tight timeline between the bands' pioneering recordings and their time together makes it interesting to speculate whether they discussed their work with one another during the tour.[30][nb 3]

The Kinks received positive reviews in the British press for the second show of the tour's debut, held on 30 April in Slough. Norman Jopling of Melody Maker characterised the band's music as powerful, adding that their energetic stage act led to an all-round entertaining show.[12] In NME, Norrie Drummond similarly highlighted the show's theatrical elements, particularly the opening, which had the Kinks begin "You Really Got Me" on a darkened stage before a spotlight expanded to reveal the group.[12][21]

B =

Background and fight in Taunton
By early 1965, the Kinks had developed a reputation for violence and aggression,[34][35] both on and off the stage.[36] The band sometimes broke into physical altercations during rehearsals, recording sessions and concerts, with infighting common between brothers Ray and Dave and between Dave and drummer Mick Avory.[37] Avory was often the target of taunting from the Davies brothers, who antagonised Avory for their own enjoyment.[38] Bassist Pete Quaife remained on agreeable terms with all of his bandmates and often worked to calm situations within the group.[39][nb 4] Further exacerbating tensions between Avory and Dave,[42] the two began sharing a rental flat in London in March 1965.[43] Decades later, Avory reflected: "[W]orking together and living together wasn't a good idea. That's when the bust-up came, and it all fell apart."[34]

A major fight between Avory and Dave ensued on 18 May, following a concert in Taunton, Somerset.[11] After returning from a post-show party,[11][44] a drunk and high Dave got into separate arguments with Ray, Curtis and the hotel's night porter.[45] Dave asked Avory to weigh in on his argument with Ray, but Avory's refusal to engage further angered Dave.[45][46] Quaife recalled that as he and Avory attempted to get away, Dave struck Avory with a suitcase.[47] An enraged Avory began fighting Dave, violently punching him in the head and also placing him in a headlock. The fight left both with cuts and bruises, as well as a pair of black eyes for Dave.[46][48] The band's management broke-up the fight,[46] and Curtis later recalled blood running along the staircase where the fight took place.[48]

Cardiff incident
To prevent further fighting, Curtis kept the Kinks separated before their 19 May show in Cardiff, Wales.[11][48] Ray and Dave travelled in I AM THE LARGEST NGRAM IN THE TEXT separate cars from Quaife and Avory. The pairs had different dressing rooms and did not interact until the concert began, entering from opposite ends of the stage.[48] Dave, who wore sunglasses at the show to obscure his injuries from the previous day's fight,[11][44] verbally insulted Avory during the show's opening number, "You Really Got Me".[44] He continued insulting Avory after the song,[15] denigrating his drumming abilities and saying it would sound better if he played "with his cock".[46][49] He next kicked Avory's drum set across the stage,[15][50] eliciting laughter from the audience who presumed the act was prearranged.[51] After a few moments on his knees,[51] Avory emerged from the wreckage, took his hi-hat stand and struck Dave over the back of the head with the cymbal end.[50][nb 5] Dave collapsed and lay motionless.[46] Watching from the edge of the stage, Chris Dreja of the Yardbirds recalled:
[Avory] delivered what to me was like an execution – a beheading. Seriously. It was such a violent act. He hit [Dave] over the back of the head. I was absolutely stunned. I remember shaking, then thinking, "He's killed him." He just ran off stage and out of the theatre.[51]

Worried he had unintentionally killed his bandmate,[15][53] Avory fled the theatre and hid in a nearby cafe.[54][55] The police were called after Dave fell unconscious,[56] and they arrived at the theatre and the Kinks' hotel looking to question Avory.[57] Ray later said that the incident "horrified him",[56] and both Genya "Goldie" Zelkowitz of the Gingerbreads and Quaife recalled Ray screaming hysterically onstage.[40][51] Bain announced to the crowd that the Kinks' second show had been cancelled,[58] and the Yardbirds instead returned to perform another set.[59] Dave was hospitalised at Cardiff Royal Infirmary, where he received 16 stitches to his head.[15] He later said that his only memory on the incident was kicking over the drums before waking up in the theatre's dressing room covered in blood.[56][60]

The results for calculate_largest_ngram_percentage(A, B):

CPU times: user 333 ms, sys: 6.38 ms, total: 340 ms
Wall time: 339 ms
('I', 'AM', 'THE', 'LARGEST', 'NGRAM', 'IN', 'THE', 'TEXT')

The results for your get_largest_substring:

CPU times: user 2.07 s, sys: 30.7 ms, total: 2.1 s
Wall time: 2.1 s
"\nca fo e 196, the Kns eeed a pato for ile ason th  and of he e.[] The ns e i ys aea ring heleinendets th iiti bee thers   a nt aer.[] n ther off as the anoie rore n et.[] e i manageet itall annd fore to to with nre ann bt oth bans ret t id in March ater etering toet nd ling her wasntad e 's wh the  e ad aler.[]\n\n or t e orane sed on 1 an a cor  o .[1] e in roma otory n hie o nae am Curtis he ots nt porere key  e  on h arith Rt ual ena the era aie ad th as e o te t eae rck rus[7] enged r ea  aeion uing the  and o imiael efe th isnbie as we a  i o ls o h eme o  , and ti erreled in lo the airs o a.[]\n\nardi coet frth fh s ke the Kinks eardbr er a of te I AM THE LARGEST NGRAM IN THE TEXT rts rm and es eent g r ni not ercil cce a enteri ropeed o  sta a wh e sngle o oure h nuer e o 's ht in Ari the ws oe e ou ea.[] ecte uti rae thong tings n te sag dun e f he adis [4] n outrst igaher oh auee o resetet as para te  f mome o h ns[5] Ar ee fo the reg, oh and ndt ae er e of the ea with ndan inles a re te ter ra o the Yardbirds reeorded t o ea l  uon  a hein ero t suion a  hit a or the  o t edoey ed. reer ai te th e iht tene etee he ad inein edis andter e tether a i inere  pce were ced ter   on anthe ri the trn he Kinks ei ostive si the ithre othe od ow of the rs de eld n ril ng. an on o eod hat the n' sc sowel, ad thair nstea ed to ro nter s.[]   oie ronimary hhied the hs hea eles atilry the nin wa ikin oe te drs before a pthteade o vee o.[][]\n"

I don't think I can do much for what I wanted with the result of your method, and it takes longer to execute. However, we could add this metric to the metadata if you find it useful, I hope this is enough to clarify. Let me know your thoughts.

LaverdeS · 2023-10-23T15:35:42Z

This PR isn't marked draft but it's missing the normal code quality features, like type hints and tests.

Typing and tests added. The test covers a case with overlap and a case without.

benjats07 · 2023-10-23T21:58:27Z

unstructured/utils.py

+    box2: Union[List, Tuple],
+    intersection_ratio_method: str = "total",
+):
+    """Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right].


Would be difficult to change the function to work with our current format (x_topleft,y_topleft, x_bottom_right,y_bottom_right)?

This is working with the coordinate format present in the element list output. No necessary transform. Just pass the list of elements exactly how the partition method is giving them to you. The variables x_bottom_left, y_bottom_left, x_top_right, and y_top_right, are named with reference coordinate system with a y axis incrementing positive in the upper direction and not incrementing ↓. But this should not affect the results. For instance in the bbox:

'coordinates': {'points': ((374.0989074707031, 257.9061279296875), (374.0989074707031, 282.63995361328125), (613.169189453125, 282.63995361328125), (613.169189453125, 257.9061279296875)),

The first point would be a bottom_left corner in the described reference system.

And Inside the methods I work with the bottom_left and top_right corners. These methods are only going to be called when catch_overlapping_and_nested_bboxes is called. So, changing to the other corners won't make any difference in practice... I am not sure what you mean with our current format if what we have as input to catch_overlapping_and_nested_bboxes is a list of elements with the coords as I showed. The method calculate_overlap_percentage is NOT intended to be run outside catch_overlapping_and_nested_bboxes, so I don't see the point. Let me know if you consider the changes are necessary.

ajjimeno · 2023-10-24T02:55:54Z

Thanks @LaverdeS for adding content and the evaluation of the functions. No need to change it then based on your example.

ajjimeno

LGTM!

I used the code below to test it.

from unstructured.partition.auto import partition
from unstructured.utils import catch_overlapping_and_nested_bboxes

target = "example-docs/layout-parser-paper-fast.pdf"
model_name = "yolox_quantized"
elements = partition(filename=target, strategy='hi_res', model_name=model_name)

overlapping_flag, overlapping_cases = catch_overlapping_and_nested_bboxes(elements)

for element in elements:
    print(element.__dict__)

for case in overlapping_cases:
    print(case, "\n")

LaverdeS added 3 commits October 19, 2023 12:02

feat: method to catch and classify overlapping

174d983

Merge branch 'main' into sebastian/catch_overlapping_bboxes

1ff28cd

feat: catching 6 cases of overlapping bboxes

37a1077

Includes catching nesting elements with an pixel error tolerance (default to 5px)

LaverdeS requested review from ajjimeno, benjats07, qued and pravin-unstructured October 20, 2023 14:28

LaverdeS self-assigned this Oct 20, 2023

LaverdeS marked this pull request as ready for review October 20, 2023 14:29

Merge branch 'main' into sebastian/catch_overlapping_bboxes

cd97282

qued reviewed Oct 21, 2023

View reviewed changes

chore: tidy

ac406d2

LaverdeS added 2 commits October 23, 2023 11:55

chore: shorten catch_overlapping_and_nested_bboxes method

c7b36cc

chore: typo in docstrings

dac2e42

fix: max ngram n in calculate_largest_ngram_percentage

ca07bf3

LaverdeS added 3 commits October 23, 2023 04:13

Merge branch 'main' into sebastian/catch_overlapping_bboxes

85fd0fc

chore: add typing

a580ab1

chore: add testing for case with and without overlap

d8c4f44

LaverdeS added 6 commits October 23, 2023 17:44

chore: update CHANGELOG

b85225d

chore: tidy

d95fa0b

chore: better logic for is_parent_box

78057ae

chore: tidier lines

7d69d9b

chore: tidy

6c6317a

chore: tidy

d234fd3

LaverdeS added 3 commits October 23, 2023 10:13

Merge branch 'main' into sebastian/catch_overlapping_bboxes

a700881

chore: refactor ngrams() method

a74d158

chore: tidy

26cf06c

benjats07 reviewed Oct 23, 2023

View reviewed changes

LaverdeS added 6 commits October 24, 2023 05:28

Merge branch 'main' into sebastian/catch_overlapping_bboxes

32dcc41

Merge branch 'main' into sebastian/catch_overlapping_bboxes

c820900

feat: add nested_error_tolerance_px and sm_overlap_threshold params

6f6eca7

Merge branch 'main' into sebastian/catch_overlapping_bboxes

0dba67f

Merge branch 'main' into sebastian/catch_overlapping_bboxes

861e496

chore: more tests to cover all cases and use params

2abfcc1

LaverdeS requested review from benjats07 and qued October 24, 2023 19:56

benjats07 approved these changes Oct 24, 2023

View reviewed changes

ajjimeno approved these changes Oct 24, 2023

View reviewed changes

Merge branch 'main' into sebastian/catch_overlapping_bboxes

5c0c6d5

LaverdeS added this pull request to the merge queue Oct 25, 2023

Merged via the queue into main with commit c11a2ff Oct 25, 2023

LaverdeS deleted the sebastian/catch_overlapping_bboxes branch October 25, 2023 13:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: method to catch and classify overlapping bounding boxes #1803

feat: method to catch and classify overlapping bounding boxes #1803

LaverdeS commented Oct 19, 2023 •

edited

Loading

qued commented Oct 21, 2023 •

edited

Loading

qued Oct 21, 2023

LaverdeS Oct 23, 2023

qued commented Oct 21, 2023

ajjimeno commented Oct 22, 2023

LaverdeS commented Oct 23, 2023 •

edited

Loading

LaverdeS commented Oct 23, 2023 •

edited

Loading

LaverdeS commented Oct 23, 2023

benjats07 Oct 23, 2023

LaverdeS Oct 24, 2023 •

edited

Loading

ajjimeno commented Oct 24, 2023

ajjimeno left a comment

feat: method to catch and classify overlapping bounding boxes #1803

feat: method to catch and classify overlapping bounding boxes #1803

Conversation

LaverdeS commented Oct 19, 2023 • edited Loading

qued commented Oct 21, 2023 • edited Loading

qued Oct 21, 2023

Choose a reason for hiding this comment

LaverdeS Oct 23, 2023

Choose a reason for hiding this comment

qued commented Oct 21, 2023

ajjimeno commented Oct 22, 2023

LaverdeS commented Oct 23, 2023 • edited Loading

LaverdeS commented Oct 23, 2023 • edited Loading

LaverdeS commented Oct 23, 2023

benjats07 Oct 23, 2023

Choose a reason for hiding this comment

LaverdeS Oct 24, 2023 • edited Loading

Choose a reason for hiding this comment

ajjimeno commented Oct 24, 2023

ajjimeno left a comment

Choose a reason for hiding this comment

LaverdeS commented Oct 19, 2023 •

edited

Loading

qued commented Oct 21, 2023 •

edited

Loading

LaverdeS commented Oct 23, 2023 •

edited

Loading

LaverdeS commented Oct 23, 2023 •

edited

Loading

LaverdeS Oct 24, 2023 •

edited

Loading