Github scraper initial #38
base: main
Conversation
for letter in letters_to_scrape:
    directory = f"output/json/first_letter={letter}"
    if os.path.exists(directory):
        for file_name in os.listdir(directory):
            if file_name.endswith(".parquet"):
                file_path = os.path.join(directory, file_name)
                df = pq.read_table(file_path).to_pandas()
is this from pypi? we probably should update the output name to output/pypi/json to be more clear
also you should use hive partitioning here
pd.read_parquet(
    directory,
    filters=[("first_letter", "==", letters_to_scrape)],
)
You will have to update the filters though; "==" will not work with the list of letters
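For example, something along these lines should work (a sketch with assumed paths; it relies on the pyarrow engine's filter syntax, where "in" accepts a list):

import pandas as pd

# Read the whole hive-partitioned dataset in one call; pyarrow exposes
# first_letter as a column and prunes partitions via the filter.
letters_to_scrape = ["a", "b", "c"]  # example values
df = pd.read_parquet(
    "output/json",
    filters=[("first_letter", "in", letters_to_scrape)],
)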
a few more thoughts.
The source list of github urls will not be exclusive to pypi; what if there is a conda package that is not in pypi?
we may want to do a separate step to create an output like github_urls.parquet
as a side note this can also be done in duckdb like so:
-- Main query to process the data and return GitHub URLs with first_letter filter in package_data CTE
WITH pypi_package_data AS (
    SELECT
        first_letter,
        project_urls,
        home_page
    FROM read_parquet('output/json/first_letter=*/**.parquet')
    WHERE first_letter IN ('a', 'b', 'c') -- Replace with your desired letters
),
pypi_github_urls AS (
    SELECT
        COALESCE(
            json_extract(project_urls, '$.Source'),
            json_extract(project_urls, '$.Homepage'),
            home_page
        ) AS source_url
    FROM pypi_package_data
)
SELECT DISTINCT source_url
FROM pypi_github_urls
WHERE source_url LIKE '%github.com%'
ORDER BY source_url;
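If it helps, here is a rough sketch of running that query from Python and writing the suggested github_urls.parquet (paths and the letter list are placeholders; hive_partitioning is enabled explicitly so first_letter is available as a column):

import duckdb

duckdb.execute("""
COPY (
    WITH pypi_package_data AS (
        SELECT first_letter, project_urls, home_page
        FROM read_parquet('output/json/first_letter=*/**.parquet', hive_partitioning = true)
        WHERE first_letter IN ('a', 'b', 'c')  -- replace with your desired letters
    ),
    pypi_github_urls AS (
        SELECT COALESCE(
            json_extract(project_urls, '$.Source'),
            json_extract(project_urls, '$.Homepage'),
            home_page
        ) AS source_url
        FROM pypi_package_data
    )
    SELECT DISTINCT source_url
    FROM pypi_github_urls
    WHERE source_url LIKE '%github.com%'
    ORDER BY source_url
) TO 'output/github_urls.parquet' (FORMAT PARQUET)
""")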
score/data_retrieval/json_scraper.py
Outdated
@@ -87,6 +87,7 @@ def process_packages_by_letter(letter, package_names, output_dir):
    all_package_data = []
    for package_name in tqdm(letter_package_names, desc=f"Processing letter {letter}"):
        package_data = get_package_data(package_name)
        df = pd.json_normalize(package_data)
what is this doing?
score/cli.py
Outdated
    callback=validate_input,
    help="Enter the ending letter or number to scrape (e.g., 'c' or '9').",
)
def github_aggregate(start, end, input, output):
should this one have a start/end? this will not affect performance, right?
score/utils/github_aggregator.py
Outdated
for letter in letters_to_scrape:
    dir_path = os.path.join(input_dir, f"first_letter={letter}")
    if os.path.exists(dir_path):
this can be done natively with pandas or duckdb. I would use duckdb here because it is simple and I would expect it to perform much better than pandas
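Something roughly like this (a sketch with assumed paths) would replace the per-letter loop:

import duckdb

# Let duckdb expand the hive partitions and filter on the partition column,
# instead of looping over first_letter directories in Python.
df = duckdb.execute("""
    SELECT *
    FROM read_parquet('output/json/first_letter=*/*.parquet', hive_partitioning = true)
    WHERE first_letter IN ('a', 'b', 'c')  -- the letters_to_scrape values
""").df()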
score/utils/github_aggregator.py
Outdated
dir_path = os.path.join(input_dir, f"first_letter={letter}")
if os.path.exists(dir_path):
    df = pd.read_parquet(dir_path)
    df["first_letter"] = letter  # Add the first_letter column manually
you should not need this if you read as a hive partition
score/cli.py
Outdated
def scrape_github(partition, output):
    click.echo(f"Scraping GitHub data for partition {partition}.")

    input_dir = OUTPUT_ROOT / "output" / "github-urls"
let's make this the same as github_aggregate and pass input_dir as an argument
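A minimal sketch of that (option names here are assumptions, not the project's actual CLI):

import click

@click.command()
@click.option("--input", "input_dir", required=True,
              help="Directory containing the aggregated GitHub URL parquet files.")
@click.option("--output", required=True, help="Where to write the scraped GitHub data.")
def scrape_github(input_dir, output):
    click.echo(f"Scraping GitHub data from {input_dir}.")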
"collaborators_url": "collaborators_url", | ||
"contributors_url": "contributors_url", |
why store this url? can we fetch a list of collaborators instead?
if response.status_code == 404:
    log.debug(f"Skipping repository not found for URL {repo_url}")
    return None
Do we want to return None? We probably want to include this in the score, like:
- health: unknown
- notes: the included source URL does not exist
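E.g., a sketch of what that 404 branch could look like (the field names are placeholders, not the project's actual schema):

if response.status_code == 404:
    log.debug(f"Repository not found for URL {repo_url}")
    return {
        "source_url": repo_url,
        "health": "unknown",
        "notes": "the included source URL does not exist",
    }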
can you comment on how we will record this data?
score/utils/github_aggregator.py
Outdated
dir_path = os.path.join(input_dir, f"partition={partition}")
if os.path.exists(dir_path):
You need to use duckdb or native pandas hive partitioning
@karamba228 can you use the output from the agg 'gs://openteams-score-data/2024-08-08.3/source-urls.parquet' as the input to the github scraper?
Thanks @karamba228! I've left a few comments
score/cli.py
Outdated
def scrape_github(input, output):
    click.echo("Scraping GitHub data.")

    df = scrape_github_data(input_file=input)
please add a partition on the source_url and do the file read/writes in the cli.py
if response.status_code == 404:
    log.debug(f"Skipping repository not found for URL {repo_url}")
    return None
can you comment on how we will record this data?
Do you have a plan to get around the request quota?
score/utils/common.py
Outdated
@@ -0,0 +1,24 @@
def extract_and_map_fields(data: dict, map: dict) -> dict:
Please rename this file to extract_and_map_fields.py or collections.py.
Having a /util/common.py usually results in this file becoming very large over time.
score/cli.py
Outdated
click.echo("Scraping GitHub data.") | ||
|
||
# Read the input Parquet file using pandas | ||
df = pd.read_parquet(input) |
Can you please partition the data? this would have 30K urls in it. does this work locally?
if df.empty:
    click.echo("No valid GitHub URLs found in the input file.")
    return

total_rows = len(df)
partition_size = total_rows // num_partitions
start_index = partition * partition_size
end_index = (
    total_rows if partition == num_partitions - 1 else start_index + partition_size
)

df_partition = df.iloc[start_index:end_index]
click.echo(
    f"Processing {len(df_partition)} URLs in partition {partition + 1} of {num_partitions}"
)
@karamba228 can you please partition this in a manner consistent with pypi? also move that logic into its own file to be consistent with other cli functions, see https://github.com/openteamsinc/Score/blob/main/score/conda/get_conda_package_names.py
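For instance (a hypothetical helper, mirroring the letter-based partitioning used on the pypi side; the module name, function name, and column handling are assumptions):

# score/github/get_github_urls.py (hypothetical module)
import pandas as pd

def get_github_urls(input_file: str, partition: str) -> pd.DataFrame:
    """Return only the source_urls whose repo name starts with `partition`,
    so the github scraper is partitioned the same way as the pypi scraper."""
    df = pd.read_parquet(input_file)
    repo_names = df["source_url"].str.rstrip("/").str.split("/").str[-1]
    return df[repo_names.str.lower().str.startswith(partition.lower())]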
if df.empty:
    log.debug("No valid GitHub URLs found in the input file")
    return pd.DataFrame()
this should be an error
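i.e., something like:

if df.empty:
    # Fail loudly instead of silently returning an empty frame.
    raise ValueError("No valid GitHub URLs found in the input file")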
        dict: A dictionary containing the extracted data fields or an indication that the URL is broken.
    """
    repo_name = "/".join(repo_url.split("/")[-2:])
    response = requests.get(GITHUB_API_URL + repo_name, headers=AUTH_HEADER)
use the get_session; that will add a retry to the request for 500 errors
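For reference, a typical get_session looks roughly like this (the project's actual helper may differ; this just illustrates the retry-on-5xx idea):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def get_session() -> requests.Session:
    # Retry transient 5xx responses with exponential backoff.
    session = requests.Session()
    retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    session.mount("https://", HTTPAdapter(max_retries=retries))
    return session

# then in the scraper:
# response = get_session().get(GITHUB_API_URL + repo_name, headers=AUTH_HEADER)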
contributors_response = requests.get(contributors_url, headers=AUTH_HEADER)
if contributors_response.status_code == 200:
    contributors = contributors_response.json()
    extracted_data["contributors"] = contributors
what is the contributors data structure? do we need to store this information to create the score?
Changes: