[REVIEW] Add Resiliparse option for text extraction #128

Merged: 3 commits, Jul 1, 2024

sarahyurick
Collaborator

Duplicate of #90 with successful DCO check.

Right now, we only support Common Crawl text extraction with jusText. Resiliparse is known to be a faster text extraction algorithm which may also produce better tokens.

This PR adds optional support for the Resiliparse algorithm while still keeping jusText as the default.

Collaborator Author

@sarahyurick sarahyurick left a comment

Hi @ryantwolf @jojennin sorry for the confusion. I'm reopening #90 to fix the commit signoff issues I was dealing with; the PRs are identical otherwise. Thanks!

nemo_curator/download/commoncrawl.py (outdated, resolved)
nemo_curator/download/commoncrawl.py (resolved)
@sarahyurick sarahyurick changed the title [RE-OPEN] Add Resiliparse option for text extraction [REVIEW] Add Resiliparse option for text extraction Jun 24, 2024
Collaborator

@ryantwolf ryantwolf left a comment

Overall looks great! I just have some design comments on how I think we can make it easier to swap between algorithms and customize them.

nemo_curator/download/commoncrawl.py (resolved)
nemo_curator/download/commoncrawl.py (resolved)
nemo_curator/download/commoncrawl.py (outdated, resolved)
nemo_curator/download/commoncrawl.py (outdated, resolved)
nemo_curator/download/commoncrawl.py (outdated, resolved)
setup.py (resolved)
Collaborator Author

sarahyurick commented Jun 27, 2024

Hi @ryantwolf this is ready for another review.

The only question I have is what you think the best way to go about adding the unit tests is? Locally I'm testing it with download_common_crawl.py, but that seems a bit heavy for CI. Even a single snapshot with url_limit = 10 takes at least a couple of minutes per run.

Edit: Perhaps it may be sufficient to add examples of JusTextExtraction() and ResiliparseExtraction() to download_common_crawl.py?

Collaborator

ryantwolf commented Jun 28, 2024

> The only question I have is what you think the best way to go about adding the unit tests is?

Good question. I wouldn't add a unit test for download_and_extract. I would just target each algorithm's extract_text method. You could draft one or two simple HTML pages and write a unit test for each algorithm. Just something to make sure the behavior stays consistent.
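A minimal sketch of the kind of test described above, using only the standard library so it runs without downloading anything. `TagStripExtraction` is a hypothetical stand-in, not a NeMo Curator class, and its `extract_text(html)` signature is an assumption; a real test would presumably instantiate `JusTextExtraction()` and `ResiliparseExtraction()` and call their actual `extract_text` methods with whatever arguments they require.

```python
from html.parser import HTMLParser


class TagStripExtraction(HTMLParser):
    """Hypothetical stand-in extractor: keeps visible text, skips <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def extract_text(self, html):
        # Reset parser state so one instance can be reused across pages.
        self.reset()
        self._skip_depth = 0
        self._chunks = []
        self.feed(html)
        self.close()
        return list(self._chunks)


# A small hand-written page: two real paragraphs plus script/style noise.
SIMPLE_PAGE = """
<html>
  <head><title>Example</title><style>p { color: red; }</style></head>
  <body>
    <p>First paragraph of real content.</p>
    <script>var tracking = true;</script>
    <p>Second paragraph of real content.</p>
  </body>
</html>
"""


def test_extract_text_keeps_paragraphs():
    result = TagStripExtraction().extract_text(SIMPLE_PAGE)
    # The paragraphs survive; script and style contents do not.
    assert "First paragraph of real content." in result
    assert "Second paragraph of real content." in result
    assert all("tracking" not in chunk for chunk in result)
    assert all("color" not in chunk for chunk in result)
```

Pinning the expected output for a fixed page like this is cheap enough for CI and catches behavior drift in an extractor without touching Common Crawl.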

> Perhaps it may be sufficient to add examples of JusTextExtraction() and ResiliparseExtraction() to download_common_crawl.py?

It isn't a bad idea to showcase how users can change the algorithm. Do you mind updating the download.rst docs instead? I'm thinking that right after we explain what output_type="jsonl" does, you could add a snippet about the algorithm parameter. Or, you could add another small code block like:

from nemo_curator.download import (
    download_common_crawl,
    ResiliparseExtraction,
)

# Change the extraction algorithm
extraction_algorithm = ResiliparseExtraction()
common_crawl = download_common_crawl(
    "/extracted/output/folder", 
    "2020-50",
    "2021-04",
    output_type="jsonl",
    algorithm=extraction_algorithm,
)
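The snippet above passes an algorithm object through the `algorithm` parameter. A rough sketch of how such a pluggable interface could be structured is shown here. Every name in it (`ExtractionAlgorithm`, `ParagraphOnlyExtraction`, `AllTextExtraction`, `extract_pages`) is hypothetical and chosen for illustration, and the regex-based tag stripping is a toy that does not reflect how jusText or Resiliparse work internally.

```python
import re
from abc import ABC, abstractmethod
from html import unescape


class ExtractionAlgorithm(ABC):
    """Hypothetical base class: one method turns raw HTML into text chunks."""

    @abstractmethod
    def extract_text(self, html):
        ...


class ParagraphOnlyExtraction(ExtractionAlgorithm):
    """Toy stand-in for a stricter extractor: keeps only <p> contents."""

    def extract_text(self, html):
        paragraphs = re.findall(r"<p\b[^>]*>(.*?)</p>", html, flags=re.S | re.I)
        return [unescape(re.sub(r"<[^>]+>", "", p)).strip() for p in paragraphs]


class AllTextExtraction(ExtractionAlgorithm):
    """Toy stand-in for a more permissive extractor: strips every tag."""

    def extract_text(self, html):
        text = unescape(re.sub(r"<[^>]+>", "\n", html))
        return [line.strip() for line in text.splitlines() if line.strip()]


def extract_pages(pages, algorithm=None):
    """Apply the chosen algorithm to each page, falling back to a default."""
    if algorithm is None:
        # Assumed default, mirroring how the PR keeps jusText as the default.
        algorithm = ParagraphOnlyExtraction()
    return [algorithm.extract_text(page) for page in pages]


pages = ["<h1>Title</h1><p>Hello &amp; welcome.</p>"]
default_result = extract_pages(pages)
swapped_result = extract_pages(pages, algorithm=AllTextExtraction())
```

Because the caller hands in an object rather than a string name, each algorithm can carry its own constructor options, which is one way to make the algorithms both swappable and customizable.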

@sarahyurick
Collaborator Author

Thanks @ryantwolf ! Should be ready now.

Collaborator

@ryantwolf ryantwolf left a comment

Great, glad to have this in. Thanks again!

Signed-off-by: Sarah Yurick <[email protected]>
sarahyurick and others added 2 commits July 1, 2024 10:12
@ryantwolf ryantwolf merged commit 0213439 into NVIDIA:main Jul 1, 2024
3 checks passed
@sarahyurick sarahyurick deleted the resiliparse_dco branch October 25, 2024 20:46