Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Commoncrawl pipeline] Add load from commoncrawl component #269

Conversation

shayorshay
Copy link
Contributor

This component is the first part of the commoncrawl pipeline. Given an index, this component loads the corresponding index file from the AWS Public Data Sets (S3 bucket) and returns a list of its WARC segment file paths as a dataframe.

@shayorshay shayorshay changed the title Add component load_from_commoncrawl [Commoncrawl pipeline] Add component load_from_commoncrawl Jul 5, 2023
@shayorshay shayorshay changed the title [Commoncrawl pipeline] Add component load_from_commoncrawl [Commoncrawl pipeline] Add load from commoncrawl component Jul 5, 2023
Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shayorshay ! Looks like a very clean component ;)
Left a few minor comments but should be good to go

@@ -0,0 +1,3 @@
boto3==1.26.158
fondant
pyarrow>=7.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think pyarrow is installed with Fondant so no need to include it here


logger = logging.getLogger(__name__)

S3_BASE_URL = "s3://commoncrawl/crawl-data"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not being used anywhere

S3_COMMONCRAWL_BUCKET = "commoncrawl"


def fetch_warc_file_from_s3(s3_bucket: str, s3_key) -> dd.DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a docstring for the argumens.

Would also prefer to change s3_key to bucket_path. Feels more intuitive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s3_key is the key of the object in the bucket not the bucket path. Would object_key be a better name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, just read on it and it seems to be an AWS specific notation (still used to GCP). It's fine to leave it as is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great!

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shayorshay, one minor comment since our latest version of the build script expects fondant dependencies to be defined from github.

@@ -0,0 +1,2 @@
boto3==1.26.158
fondant
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fondant
git+https://github.com/ml6team/fondant@main

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done :)

@RobbeSneyders RobbeSneyders merged commit ab1e6de into ml6team:main Jul 5, 2023
@shayorshay shayorshay deleted the feature/commoncrawl-load-from-commoncrawl branch September 5, 2023 09:08
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This component is the first part of the commoncrawl pipeline. Given an
index, this component loads the corresponding index file from the AWS
Public Data Sets (S3 bucket) and returns a list of its WARC segment file
paths as a dataframe.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants