[REVIEW] Add Resiliparse option for text extraction #128

sarahyurick · 2024-06-24T22:55:20Z

Duplicate of #90 with successful DCO check.

Right now, we only support Common Crawl text extraction with jusText. Resiliparse is known to be a faster text extraction algorithm which may also produce better tokens.

This PR adds optional support for the Resiliparse algorithm while still keeping jusText as the default.

sarahyurick

Hi @ryantwolf @jojennin sorry for the confusion. I'm reopening #90 to fix the commit signoff issues I was dealing with; the PRs are identical otherwise. Thanks!

nemo_curator/download/commoncrawl.py

ryantwolf

Overall looks great! I just have some design comments on how I think we can make it easier to swap between algorithms and customize them.

nemo_curator/download/commoncrawl.py

setup.py

sarahyurick · 2024-06-27T23:50:16Z

Hi @ryantwolf this is ready for another review.

The only question I have is what you think the best way to go about adding the unit tests is? Locally I'm testing it with download_common_crawl.py, but that seems a bit heavy for CI. Even a single snapshot with url_limit = 10 takes at least a couple of minutes to run.

Edit: Perhaps it may be sufficient to add examples of JusTextExtraction() and ResiliparseExtraction() to download_common_crawl.py?

ryantwolf · 2024-06-28T18:56:39Z

> The only question I have is what you think the best way to go about adding the unit tests is?

Good question. I wouldn't add a unit test for download_and_extract. I would just target each algorithm's extract_text method. You could draft one or two simple HTML pages and write a unit test for each algorithm. Just something to make sure the behavior stays consistent.
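A minimal sketch of that kind of test, assuming nothing about the real classes beyond an extract_text method: ToyParagraphExtraction is a hypothetical stand-in built on the standard library so the example runs on its own, and check_extraction, SIMPLE_PAGE, and EXPECTED are illustrative names. A real test would pass JusTextExtraction() or ResiliparseExtraction() instead, whose extract_text may take additional arguments and return a different shape.

```python
from html.parser import HTMLParser


class _ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace so the expectation is easy to pin.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


class ToyParagraphExtraction:
    """Hypothetical stand-in for an extraction algorithm (not a NeMo Curator class)."""

    def extract_text(self, html):
        parser = _ParagraphCollector()
        parser.feed(html)
        parser.close()
        return parser.paragraphs


# A small hand-written page: boilerplate in <nav>/<footer>, content in <p>.
SIMPLE_PAGE = """
<html>
  <head><title>Test page</title></head>
  <body>
    <nav>Home | About</nav>
    <p>The quick brown fox jumps over the lazy dog.</p>
    <p>A second paragraph with   extra   whitespace.</p>
    <footer>Copyright notice</footer>
  </body>
</html>
"""

EXPECTED = [
    "The quick brown fox jumps over the lazy dog.",
    "A second paragraph with extra whitespace.",
]


def check_extraction(algorithm, html, expected):
    """Assert that the algorithm's extract_text output matches the pinned expectation."""
    result = algorithm.extract_text(html)
    assert result == expected, f"unexpected extraction: {result!r}"


def test_toy_extraction():
    check_extraction(ToyParagraphExtraction(), SIMPLE_PAGE, EXPECTED)
```

Because the page is a short inline string, a test like this needs no network access or snapshot download, which keeps it light enough for CI.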

> Perhaps it may be sufficient to add examples of JusTextExtraction() and ResiliparseExtraction() to download_common_crawl.py?

It isn't a bad idea to showcase how users can change the algorithm. Do you mind updating the download.rst docs instead? I'm thinking that right after we explain what output_type="jsonl" does, you could add a snippet about the algorithm parameter. Or you could add another small code block like:

from nemo_curator.download import (
    download_common_crawl,
    ResiliparseExtraction,
)

# Change the extraction algorithm
extraction_algorithm = ResiliparseExtraction()
common_crawl = download_common_crawl(
    "/extracted/output/folder", 
    "2020-50",
    "2021-04",
    output_type="jsonl",
    algorithm=extraction_algorithm,
)
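For readers curious how an algorithm parameter with a default can be wired up, here is a rough, self-contained sketch of the general pattern. It is not NeMo Curator's implementation: ExtractionAlgorithm, TagStrippingExtraction, MinLengthExtraction, and extract_pages are all hypothetical names, and the regex-based extraction is a toy.

```python
import re
from abc import ABC, abstractmethod


class ExtractionAlgorithm(ABC):
    """Hypothetical base class: one method that turns HTML into text."""

    @abstractmethod
    def extract_text(self, html):
        ...


class TagStrippingExtraction(ExtractionAlgorithm):
    """Toy default: drops every tag and collapses whitespace."""

    def extract_text(self, html):
        text = re.sub(r"<[^>]+>", " ", html)
        return " ".join(text.split())


class MinLengthExtraction(ExtractionAlgorithm):
    """Toy alternative whose behavior is customized through its constructor."""

    def __init__(self, min_words=3):
        self.min_words = min_words

    def extract_text(self, html):
        # Split on a few block-level tags, strip remaining tags,
        # and keep only chunks with at least min_words words.
        chunks = re.split(r"</?(?:p|div|nav|footer)\b[^>]*>", html)
        kept = []
        for chunk in chunks:
            text = " ".join(re.sub(r"<[^>]+>", " ", chunk).split())
            if len(text.split()) >= self.min_words:
                kept.append(text)
        return " ".join(kept)


def extract_pages(pages, algorithm=None):
    """Run the chosen algorithm over each page, using the default when none is given."""
    if algorithm is None:
        algorithm = TagStrippingExtraction()
    return [algorithm.extract_text(page) for page in pages]


pages = ["<nav>Home</nav><p>One two three four.</p>"]
default_output = extract_pages(pages)
custom_output = extract_pages(pages, algorithm=MinLengthExtraction(min_words=3))
```

Callers that pass nothing keep the default behavior, while callers who want a different or tuned extractor construct it themselves and hand it in.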

sarahyurick · 2024-06-28T20:31:02Z

Thanks @ryantwolf ! Should be ready now.

ryantwolf

Great, glad to have this in. Thanks again!

Signed-off-by: Sarah Yurick <[email protected]>

sarahyurick mentioned this pull request Jun 24, 2024

Add Resiliparse option for text extraction #90

Closed

sarahyurick changed the title ~~[RE-OPEN] Add Resiliparse option for text extraction~~ [REVIEW] Add Resiliparse option for text extraction Jun 24, 2024

ryantwolf reviewed Jun 25, 2024

ryantwolf approved these changes Jun 28, 2024

sign

4f90c28

Signed-off-by: Sarah Yurick <[email protected]>

sarahyurick force-pushed the resiliparse_dco branch from 86d6a03 to 4f90c28 Compare July 1, 2024 17:10

sarahyurick and others added 2 commits July 1, 2024 10:12

Merge branch 'main' into resiliparse_dco

abb1dc1

Signed-off-by: Sarah Yurick <[email protected]>

remove extra paragraph

e640fe2

Signed-off-by: Sarah Yurick <[email protected]>

ryantwolf merged commit 0213439 into NVIDIA:main Jul 1, 2024
3 checks passed

sarahyurick deleted the resiliparse_dco branch October 25, 2024 20:46
