[Commoncrawl pipeline] Add load from commoncrawl component #269

shayorshay · 2023-07-05T09:45:05Z

This component is the first part of the commoncrawl pipeline. Given an index, this component loads the corresponding index file from the AWS Public Data Sets (S3 bucket) and returns a list of its WARC segment file paths as a dataframe.
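The flow described above can be sketched roughly as follows. This is a minimal illustration, not the component's actual code: the helper names (`build_paths_key`, `parse_warc_paths`, `load_warc_paths`) are hypothetical, the key layout `crawl-data/<index>/warc.paths.gz` is assumed from Common Crawl's public layout, and the sketch returns a plain list rather than the dataframe the component produces.

```python
import gzip
from typing import List

# Bucket name as it appears in this PR's diff.
S3_COMMONCRAWL_BUCKET = "commoncrawl"


def build_paths_key(index_name: str) -> str:
    """Return the object key of the WARC paths listing for a crawl index.

    Assumption: the listing lives at crawl-data/<index>/warc.paths.gz.
    """
    return f"crawl-data/{index_name}/warc.paths.gz"


def parse_warc_paths(compressed: bytes) -> List[str]:
    """Decompress a gzipped paths listing into a list of WARC file paths."""
    text = gzip.decompress(compressed).decode("utf-8")
    # One path per line; skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_warc_paths(index_name: str) -> List[str]:
    """Download and parse the listing (hypothetical helper).

    Requires boto3 and working AWS credentials, both assumptions here.
    """
    import boto3  # imported lazily so the parsing helpers work without it

    client = boto3.client("s3")
    response = client.get_object(
        Bucket=S3_COMMONCRAWL_BUCKET, Key=build_paths_key(index_name)
    )
    return parse_warc_paths(response["Body"].read())
```

Something like `load_warc_paths("CC-MAIN-2023-14")` would then yield the segment paths, which could be wrapped in a dataframe before being returned.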

PhilippeMoussalli

Thanks @shayorshay! Looks like a very clean component ;)
Left a few minor comments, but it should be good to go.

PhilippeMoussalli · 2023-07-05T11:03:01Z

examples/pipelines/commoncrawl/components/load_from_commoncrawl/requirements.txt

@@ -0,0 +1,3 @@
+boto3==1.26.158
+fondant
+pyarrow>=7.0


I think pyarrow is installed with Fondant so no need to include it here

PhilippeMoussalli · 2023-07-05T11:04:54Z

examples/pipelines/commoncrawl/components/load_from_commoncrawl/src/main.py

+
+logger = logging.getLogger(__name__)
+
+S3_BASE_URL = "s3://commoncrawl/crawl-data"


This is not being used anywhere

PhilippeMoussalli · 2023-07-05T11:08:42Z

examples/pipelines/commoncrawl/components/load_from_commoncrawl/src/main.py

+S3_COMMONCRAWL_BUCKET = "commoncrawl"
+
+
+def fetch_warc_file_from_s3(s3_bucket: str, s3_key) -> dd.DataFrame:


Can you add a docstring for the arguments?

Would also prefer to change s3_key to bucket_path. Feels more intuitive

s3_key is the key of the object in the bucket not the bucket path. Would object_key be a better name?

My bad, I just read up on it and it seems to be AWS-specific notation (I'm still used to GCP). It's fine to leave it as is.
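To make the bucket/key distinction from this thread concrete, here is a small sketch that splits an S3 URI into its two parts. The helper name `split_s3_uri` is hypothetical and not part of the PR; the example URI only assumes Common Crawl's public layout.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, object key).

    Args:
        uri: A URI of the form s3://<bucket>/<key>.

    Returns:
        The bucket name and the key of the object inside that bucket.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    # The key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://commoncrawl/crawl-data/CC-MAIN-2023-14/warc.paths.gz"
)
```

Here `bucket` is `"commoncrawl"` and `key` is `"crawl-data/CC-MAIN-2023-14/warc.paths.gz"`, so the key identifies the object within the bucket rather than a path to the bucket itself.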

RobbeSneyders

Thanks @shayorshay. One minor comment: the latest version of our build script expects the fondant dependency to be installed from GitHub.

RobbeSneyders · 2023-07-05T12:34:05Z

examples/pipelines/commoncrawl/components/load_from_commoncrawl/requirements.txt

@@ -0,0 +1,2 @@
+boto3==1.26.158
+fondant


Suggested change

-fondant
+git+https://github.com/ml6team/fondant@main

Add component load_from_commoncrawl

fa326ee

shayorshay requested review from RobbeSneyders and PhilippeMoussalli July 5, 2023 09:45

shayorshay changed the title ~~Add component load_from_commoncrawl~~ [Commoncrawl pipeline] Add component load_from_commoncrawl Jul 5, 2023

shayorshay changed the title ~~[Commoncrawl pipeline] Add component load_from_commoncrawl~~ [Commoncrawl pipeline] Add load from commoncrawl component Jul 5, 2023

PhilippeMoussalli approved these changes Jul 5, 2023

shayorshay and others added 2 commits July 5, 2023 14:18

Add component load_from_commoncrawl

367ae86

Merge branch 'ml6team:main' into feature/commoncrawl-load-from-common…

7cd5e57

…crawl

RobbeSneyders approved these changes Jul 5, 2023

Add component load_from_commoncrawl

64e88ae

RobbeSneyders merged commit ab1e6de into ml6team:main Jul 5, 2023

shayorshay deleted the feature/commoncrawl-load-from-commoncrawl branch September 5, 2023 09:08
