Feat: Native hierarchies for elements from pptx documents #1616
…into newelh/hierarchy-fast/pptx
paragraphs in the title shape title
Couple comments, happy to discuss in a quick meet if you like.
Approving in advance of whatever changes you decide to make there, this looks like it works fine for the intended behavior :)
Preserves shape order (title shapes are not first ordered)
title shape is used for body text
LGTM :)
Summary
Improve title detection in pptx documents: the default title textboxes on a pptx slide are now categorized as titles.
Improve hierarchy detection in pptx documents: list items and other slide text are now nested under the slide title, which enables better chunking of pptx documents.
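To make the nesting idea concrete, here is a minimal sketch of how a depth value could be turned into parent links. It is not the code from this PR: the `Element` class, the `assign_parents` helper, and the ranking rule (titles outrank all other text, and a lower depth outranks a higher one) are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (not the library's class)."""
    id: str
    category: str                  # e.g. "Title", "ListItem", "NarrativeText"
    text: str
    depth: int = 0                 # plays the role of the category depth
    parent_id: Optional[str] = None

    @property
    def rank(self) -> Tuple[int, int]:
        # Assumed rule: titles sort before other text, then by depth.
        return (0 if self.category == "Title" else 1, self.depth)


def assign_parents(elements: List[Element]) -> List[Element]:
    """Link each element to the nearest preceding element that outranks it."""
    stack: List[Element] = []      # currently open ancestors, outermost first
    for el in elements:
        # Close every ancestor that does not strictly outrank this element.
        while stack and stack[-1].rank >= el.rank:
            stack.pop()
        el.parent_id = stack[-1].id if stack else None
        stack.append(el)
    return elements


# Two slides: a title with a nested bullet list, then a second title.
slide = [
    Element("t1", "Title", "Slide one"),
    Element("a", "ListItem", "First bullet", depth=0),
    Element("a1", "ListItem", "Nested bullet", depth=1),
    Element("b", "ListItem", "Second bullet", depth=0),
    Element("t2", "Title", "Slide two"),
]
assign_parents(slide)
for el in slide:
    print(el.id, el.category, el.depth, el.parent_id)
```

In this sketch the top-level bullets point at the slide title, the nested bullet points at the bullet above it, and each new top-level title starts a fresh scope. A chunker could then group elements by walking these parent links.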
Hierarchy detection is improved by determining category depth via the following:
Testing
This PR adds a pptx file to example-docs: presentation-sample.pptx. The document illustrates slide titles and bullet hierarchies in a pptx file.
Try it yourself:
If you have graphviz and pygraphviz installed, you can visualize the hierarchy with:
Unit Tests
test_partition_pptx_title_shape_detection: Tests title element detection for text in title shapes. It also checks that additional paragraphs within the title shape's text frame are interpreted as Titles with incremental depth.
test_partition_pptx_level_detection: Tests that the indentation level read from the pptx XML via python-pptx is set as the category depth.
test_partition_pptx_hierarchy_sample_document: Tests that the hierarchy in the sample file is parsed correctly.
Example
Sample Slides
Before
The sample PowerPoint file yields no hierarchy relationships.
After