[DO NOT MERGE] Feat: add support for start_index in html links extraction #2876

Closed
wants to merge 41 commits into from
41 commits
724cdb5
Refactor threshold to annotation_threshold and make it an optional pa…
Feb 9, 2024
643e67e
Merge branch 'main' into main
MiXiBo Feb 13, 2024
34c78de
Merge branch 'main' into main
MiXiBo Feb 26, 2024
6c34d9f
Merge branch 'Unstructured-IO:main' into main
MiXiBo Feb 29, 2024
f4a18b5
add support for start_index to html link extraction
Feb 29, 2024
ff2f3bf
Merge branch 'Unstructured-IO:main' into improved_pdf_html_links_support
MiXiBo Mar 5, 2024
4f99d67
Revert "Refactor threshold to annotation_threshold and make it an opt…
Mar 6, 2024
0c1e2c1
Merge branch 'main' into improved_pdf_html_links_support
MiXiBo Mar 6, 2024
8c12ca7
Merge branch 'main' into improved_pdf_html_links_support
MiXiBo Mar 7, 2024
99f3545
Merge branch 'main' into mixibo/improved_pdf_html_links_support
christinestraub Mar 7, 2024
e75b43d
chore: update changelog & version
christinestraub Mar 7, 2024
3be9452
feat: fix TypeError: object of type 'NoneType' has no len()
christinestraub Mar 7, 2024
9c13c98
test: add unit test
christinestraub Mar 7, 2024
70f2a87
refactor: rename link_start_indexs -> link_start_indexes
christinestraub Mar 7, 2024
8737981
[DO NOT MERGE] feat: add support for start_index in html links extrac…
ryannikolaidis Mar 8, 2024
f7623e1
test: add unit test to test partition_html() with links
christinestraub Mar 8, 2024
62f15a2
test: fix lint error
christinestraub Mar 8, 2024
2079213
Merge branch 'main' into mixibo/improved_pdf_html_links_support
christinestraub Mar 8, 2024
83b5118
Merge branch 'main' into mixibo/improved_pdf_html_links_support
christinestraub Mar 14, 2024
d8e8ac1
feat: set consolidation-strategy for `link_start_indexes metadata` fi…
christinestraub Mar 14, 2024
9aa5651
Merge branch 'main' into improved_pdf_html_links_support
ron-unstructured Mar 14, 2024
0359f3a
Merge branch 'main' into mixibo/improved_pdf_html_links_support
christinestraub Mar 15, 2024
c22e3f6
feat: remove leading extra tags when calculating link start index
christinestraub Mar 18, 2024
a5d245c
Merge branch 'main' into mixibo/improved_pdf_html_links_support
christinestraub Mar 18, 2024
5e05202
Merge branch 'main' into mixibo/improved_pdf_html_links_support
christinestraub Mar 18, 2024
9483427
chore: bump version
christinestraub Mar 18, 2024
fcee08a
reviewed start_index handling
MiXiBo Mar 27, 2024
de90ce3
add support to corner-case with tags surrounded by href
MiXiBo Mar 28, 2024
e8ff218
Merge branch 'main' into mixibo/improved_pdf_html_links_support
christinestraub Apr 8, 2024
86a95f1
feat: set link text same as element text if start_index is -1
christinestraub Apr 9, 2024
076492e
test: refactor unit test
christinestraub Apr 9, 2024
e300a70
test: fix lint error
christinestraub Apr 9, 2024
391cac0
refactor: fix missing code
christinestraub Apr 9, 2024
9b06533
feat: exclude tail text from link text when start_index is -1
christinestraub Apr 9, 2024
8ff23af
feat: include links with urls but no text
christinestraub Apr 9, 2024
a11c968
update ingest test fixtures update ci
christinestraub Apr 11, 2024
5b9dee3
Merge branch 'main' into feat/2625-html-support-link-start-index
christinestraub Apr 11, 2024
d7b6aff
ci: revert ingest test fixtures update ci
christinestraub Apr 11, 2024
8895e5a
[DO NOT MERGE] Feat: add support for `start_index` in html `links` ex…
ryannikolaidis Apr 11, 2024
b4ec676
Merge branch 'main' into feat/2625-html-support-link-start-index
christinestraub Apr 11, 2024
1765496
chore: update version
christinestraub Apr 11, 2024
Merge branch 'main' into mixibo/improved_pdf_html_links_support
# Conflicts:
#	CHANGELOG.md
#	unstructured/__version__.py
christinestraub committed Mar 8, 2024
commit 2079213d08b7116c7fc0bab1324b194bb760b17a
12 changes: 11 additions & 1 deletion CHANGELOG.md
@@ -1,13 +1,23 @@
-## 0.12.6-dev9
+## 0.12.7-dev0

### Enhancements

* **Add support for `start_index` in `html` links extraction**

### Features

### Fixes

## 0.12.6

### Enhancements

* **Improve ability to capture embedded links in `partition_pdf()` for `fast` strategy** Previously, a threshold value that affects the capture of embedded links was set to a fixed value by default. This allows users to specify the threshold value for better capturing.
* **Refactor `add_chunking_strategy` decorator to dispatch by name.** Add `chunk()` function to be used by the `add_chunking_strategy` decorator to dispatch chunking call based on a chunking-strategy name (that can be dynamic at runtime). This decouples chunking dispatch from only those chunkers known at "compile" time and enables runtime registration of custom chunkers.
* **Redefine `table_level_acc` metric for table evaluation.** `table_level_acc` now is an average of individual predicted table's accuracy. A predicted table's accuracy is defined as the sequence matching ratio between itself and its corresponding ground truth table.

### Features

* **Added Unstructured Platform Documentation** The Unstructured Platform is currently in beta. The documentation provides how-to guides for setting up workflow automation, job scheduling, and configuring source and destination connectors.

### Fixes
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.12.6-dev9" # pragma: no cover
+__version__ = "0.12.7-dev0" # pragma: no cover
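
For readers unfamiliar with the changelog entry "Add support for `start_index` in `html` links extraction", the sketch below illustrates what such an index could mean: the character offset of a link's text within its element's text. It is not the code from this PR. The helper `locate_links`, the `Link` type, and the example URLs are hypothetical. The `-1` sentinel follows the commit messages above ("if start_index is -1"). Returning `-1` for links with a URL but no text is an assumption.

```python
from typing import TypedDict


class Link(TypedDict):
    text: str
    url: str
    start_index: int


def locate_links(element_text: str, links: list[tuple[str, str]]) -> list[Link]:
    """Attach a start_index to each (link_text, url) pair.

    Hypothetical helper for illustration; not the library's implementation.
    """
    located: list[Link] = []
    # Search from the end of the previous match so that repeated
    # link text is resolved in document order.
    cursor = 0
    for link_text, url in links:
        # Assumption: a link with no text, or text that cannot be
        # found in the element text, gets the sentinel -1.
        start = element_text.find(link_text, cursor) if link_text else -1
        if start >= 0:
            cursor = start + len(link_text)
        located.append({"text": link_text, "url": url, "start_index": start})
    return located


if __name__ == "__main__":
    text = "See the docs and the repo."
    pairs = [
        ("docs", "https://example.com/docs"),
        ("repo", "https://example.com/repo"),
    ]
    for link in locate_links(text, pairs):
        print(link)
```

Searching forward from a moving cursor, rather than from the start of the string each time, keeps two links with identical text from resolving to the same offset.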