Feat: Native hierarchies for elements from pptx documents #1616

Merged
newelh merged 16 commits into main from newelh/hierarchy-fast/pptx
Oct 5, 2023

newelh
Contributor

@newelh newelh commented Oct 2, 2023

Summary

Improve title detection in pptx documents: the default title text boxes on a pptx slide are now categorized as titles.
Improve hierarchy detection in pptx documents: list items and other slide text are now nested under the slide title. This enables better chunking of pptx documents.

Hierarchy detection is improved by determining category depth via the following:

  • Check whether the paragraph has a level attribute via the python-pptx paragraph object. If so, use the paragraph level as the category_depth.
  • If the shape being checked is a title shape and the item is not a bullet or an email address, the element is set as a Title with a depth equal to the paragraph's index within the shape (e.g. the first line of the title shape is depth 0, the second is depth 1, etc.).
  • If the shape is not a title shape but the paragraph is detected as a title, the depth is set to level + 1, so that all paragraph titles have a depth of at least 1 and sit below the slide title element.
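The three rules above can be sketched as a small pure function. This is only an illustration and not the code in unstructured/partition/pptx.py: the function name, its parameters, the order in which the checks run, and the handling of email addresses outside a title shape are all assumptions made for the example.

```python
from typing import Tuple


def sketch_category_depth(
    paragraph_level: int,
    paragraph_index: int,
    in_title_shape: bool,
    is_bullet: bool = False,
    is_email: bool = False,
    looks_like_title: bool = False,
) -> Tuple[str, int]:
    """Return a (category, depth) pair following the rules described above.

    Hypothetical helper: every name here is invented for illustration.
    """
    if in_title_shape and not is_bullet and not is_email:
        # Title shape: each paragraph is a Title whose depth is its index
        # within the shape (first line 0, second line 1, ...).
        return "Title", paragraph_index
    if is_bullet:
        # Bullets keep the indentation level reported for the paragraph.
        return "ListItem", paragraph_level
    if is_email:
        # Assumption: email addresses keep the paragraph level as well.
        return "EmailAddress", paragraph_level
    if looks_like_title:
        # Title-like text outside the title shape is pushed to level + 1
        # so that it always sits below the slide title (depth >= 1).
        return "Title", paragraph_level + 1
    # Everything else falls back to the paragraph level.
    return "NarrativeText", paragraph_level
```

For instance, the second paragraph of a title shape would map to ("Title", 1), while a title-like paragraph at level 0 in a body shape would also map to ("Title", 1) rather than competing with the slide title at depth 0.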

Testing

This PR adds an additional pptx file to example-docs: sample-presentation.pptx. This document illustrates slide titles and bullet hierarchies in a pptx file.

Try it yourself:

from unstructured.partition.pptx import partition_pptx

# Partition the sample deck and print each element's place in the hierarchy
document = partition_pptx("example-docs/sample-presentation.pptx")
for element in document:
    print(
        f"Depth : {element.metadata.category_depth}\n"
        f"Parent: {element.metadata.parent_id}\n"
        f"ID    : {element.id}\n"
        f"Text  : {element.text}\n"
        f"Type  : {element.category}\n\n"
    )

If you have graphviz and pygraphviz installed, you can visualize the hierarchy with:

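import textwrap

import pygraphviz as pgv

# `document` is the element list returned by partition_pptx above
G = pgv.AGraph(strict=False, directed=True, dpi=300, rankdir='TB')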

def wrap_text(text, width=50):
    return textwrap.fill(text, width=width)

# Adding nodes and edges
for element in document:
    wrapped_text = wrap_text(element.text)
    G.add_node(element.id, label=f"{element.category}\n{wrapped_text}", shape='box')
    if element.metadata.parent_id is not None:
        G.add_edge(element.metadata.parent_id, element.id)

# Render the graph
G.layout(prog='dot')
G.draw('hierarchy.png')
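If graphviz is not available, the same parent/child links can be printed as an indented text outline. The sketch below is an assumption-laden illustration: `Node` is a hypothetical stand-in that carries only the fields read from each element (id, text, category, and metadata.parent_id), and `render_outline` is not part of the library.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for the fields read from each element."""

    id: str
    text: str
    category: str
    parent_id: Optional[str] = None


def render_outline(nodes: List[Node]) -> str:
    """Render nodes as an indented outline based on their parent_id links."""
    known_ids = {n.id for n in nodes}
    children: Dict[Optional[str], List[Node]] = defaultdict(list)
    for n in nodes:
        # Treat elements with no parent, or an unknown parent, as roots.
        parent = n.parent_id if n.parent_id in known_ids else None
        children[parent].append(n)

    lines: List[str] = []

    def walk(parent: Optional[str], indent: int) -> None:
        # Depth-first traversal that preserves document order among siblings.
        for n in children.get(parent, []):
            lines.append(f"{'  ' * indent}{n.category}: {n.text}")
            walk(n.id, indent + 1)

    walk(None, 0)
    return "\n".join(lines)
```

With real output, the nodes could be built as `Node(e.id, e.text, e.category, e.metadata.parent_id)` for each element `e` in `document`, and the result passed to `print`.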

Unit Tests

  1. test_partition_pptx_title_shape_detection: Tests title element detection for text in title shapes. It also checks that additional paragraphs within the title shape's text frame are interpreted as Titles with incrementing depth.
  2. test_partition_pptx_level_detection: Tests that the indentation level read from the pptx XML via python-pptx is correctly set as the category depth.
  3. test_partition_pptx_hierarchy_sample_document: Tests that the hierarchy in the sample file is parsed correctly.
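As a rough illustration of the kind of property a hierarchy test might check (this is not taken from the PR's test suite), the sketch below flags elements whose parent_id does not refer to an element seen earlier in document order. The helper name and the assumption that parents always precede their children are both hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def find_dangling_parents(
    records: Iterable[Tuple[str, Optional[str]]],
) -> List[str]:
    """Return ids of elements whose parent_id was not seen earlier.

    `records` yields (element_id, parent_id) pairs in document order.
    Assumes a parent always appears before its children.
    """
    seen = set()
    problems: List[str] = []
    for element_id, parent_id in records:
        if parent_id is not None and parent_id not in seen:
            problems.append(element_id)
        seen.add(element_id)
    return problems
```

Against real output, the pairs could be produced with `((e.id, e.metadata.parent_id) for e in document)`, and an empty result would indicate that every parent link resolves.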

Example

Sample Slides
[Image: screenshot of the sample slides]

Before

The sample PowerPoint yields no hierarchy relationships.

After

[Image: document_structure_hierarchy]

@newelh newelh marked this pull request as ready for review October 2, 2023 19:24
@newelh newelh requested a review from scanny October 2, 2023 19:26
Collaborator

@scanny scanny left a comment

Couple comments, happy to discuss in a quick meet if you like.

Approving in advance of whatever changes you decide to make there, this looks like it works fine for the intended behavior :)

Collaborator

@scanny scanny left a comment

LGTM :)

@newelh newelh merged commit e34396b into main Oct 5, 2023
@newelh newelh deleted the newelh/hierarchy-fast/pptx branch October 5, 2023 16:55