
Github scraper initial #38

Open · wants to merge 16 commits into main

Conversation

karamba228 (Contributor):

Changes:

  • Added the GitHub scraper
  • Cleaned up utils.common
  • Updated the JSON scraper to normalize dictionaries before storing them in Parquet

@karamba228 karamba228 requested a review from srossross August 5, 2024 12:42
@karamba228 karamba228 self-assigned this Aug 5, 2024
Comment on lines 75 to 81
for letter in letters_to_scrape:
directory = f"output/json/first_letter={letter}"
if os.path.exists(directory):
for file_name in os.listdir(directory):
if file_name.endswith(".parquet"):
file_path = os.path.join(directory, file_name)
df = pq.read_table(file_path).to_pandas()
Contributor:

Is this from PyPI? We should probably update the output name to output/pypi/json to make that clearer.

Contributor:

Also, you should use hive partitioning here:

pd.read_parquet(
    directory,
    filters=[("first_letter", "==",  letters_to_scrape)],
)

You will have to update the filters, though; "==" will not work with a list of letters.
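
A minimal sketch of that hive-partitioned read with a list of letters (illustrative only; it assumes the output/json layout above and pandas' pyarrow engine, whose filters accept "in" for membership tests):

import pandas as pd

letters_to_scrape = ["a", "b", "c"]  # example values; in the PR this comes from the surrounding code

# Reads only the matching first_letter=<x> partitions instead of looping
# over the directories and reading each parquet file by hand.
df = pd.read_parquet(
    "output/json",
    filters=[("first_letter", "in", letters_to_scrape)],
)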

Contributor:

A few more thoughts.

The source list of GitHub URLs will not be exclusive to PyPI; what if there is a conda package that is not on PyPI? We may want to add a separate step that creates an output like github_urls.parquet.

As a side note, this can also be done in DuckDB, like so:

-- Main query to process the data and return GitHub URLs with first_letter filter in package_data CTE
WITH pypi_package_data AS (
    SELECT 
        first_letter,
        project_urls,
        home_page
    FROM read_parquet('output/json/first_letter=*/**.parquet')
    WHERE first_letter IN ('a', 'b', 'c')  -- Replace with your desired letters
),
pypi_github_urls AS (
    SELECT 
        COALESCE(
            json_extract(project_urls, '$.Source'),
            json_extract(project_urls, '$.Homepage'),
            home_page
        ) AS source_url
    FROM pypi_package_data
)
SELECT DISTINCT source_url
FROM pypi_github_urls
WHERE source_url LIKE '%github.com%'
ORDER BY source_url;
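
For reference, a rough sketch of running that query from Python with the duckdb client (assumes the duckdb package is installed, the output/json layout above, and the github_urls.parquet output name suggested earlier):

import duckdb

query = """
WITH pypi_package_data AS (
    SELECT first_letter, project_urls, home_page
    FROM read_parquet('output/json/first_letter=*/**.parquet')
    WHERE first_letter IN ('a', 'b', 'c')
),
pypi_github_urls AS (
    SELECT COALESCE(
        json_extract(project_urls, '$.Source'),
        json_extract(project_urls, '$.Homepage'),
        home_page
    ) AS source_url
    FROM pypi_package_data
)
SELECT DISTINCT source_url
FROM pypi_github_urls
WHERE source_url LIKE '%github.com%'
ORDER BY source_url
"""

# Materialize the result as a pandas DataFrame and persist it for the scraper.
github_urls = duckdb.sql(query).df()
github_urls.to_parquet("output/github_urls.parquet", index=False)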

score/data_retrieval/github_scraper.py: outdated comment (resolved)
@karamba228 karamba228 requested a review from srossross August 7, 2024 13:27
@@ -87,6 +87,7 @@ def process_packages_by_letter(letter, package_names, output_dir):
all_package_data = []
for package_name in tqdm(letter_package_names, desc=f"Processing letter {letter}"):
package_data = get_package_data(package_name)
df = pd.json_normalize(package_data)
Contributor:

what is this doing?

score/cli.py Outdated
callback=validate_input,
help="Enter the ending letter or number to scrape (e.g., 'c' or '9').",
)
def github_aggregate(start, end, input, output):
Contributor:

Should this one have a start/end? It will not affect performance, right?

Comment on lines 12 to 14
for letter in letters_to_scrape:
dir_path = os.path.join(input_dir, f"first_letter={letter}")
if os.path.exists(dir_path):
Contributor:

This can be done natively with pandas or DuckDB. I would use DuckDB here because it is simple, and I would expect it to perform much better than pandas.

dir_path = os.path.join(input_dir, f"first_letter={letter}")
if os.path.exists(dir_path):
df = pd.read_parquet(dir_path)
df["first_letter"] = letter # Add the first_letter column manually
Contributor:

You should not need this if you read the data as a hive partition.

score/cli.py Outdated
def scrape_github(partition, output):
click.echo(f"Scraping GitHub data for partition {partition}.")

input_dir = OUTPUT_ROOT / "output" / "github-urls"
Contributor:

Let's make this the same as github_aggregate and pass input_dir as an argument.
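
A loose sketch of what that could look like (the option names and default are illustrative assumptions, not the PR's actual code):

import click

@click.command()
@click.option("--partition", type=int, required=True)
@click.option(
    "--input-dir",
    "input_dir",
    default="output/github-urls",  # assumed default, mirroring the hard-coded path above
    help="Directory containing the GitHub URL parquet files.",
)
@click.option("--output", required=True)
def scrape_github(partition, input_dir, output):
    click.echo(f"Scraping GitHub data for partition {partition} from {input_dir}.")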

Comment on lines 25 to 26
"collaborators_url": "collaborators_url",
"contributors_url": "contributors_url",
Contributor:

Why store this URL? Can we fetch a list of collaborators instead?
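
A rough sketch of fetching the contributor list directly instead of storing the URL (the helper name, the trimmed fields, and the placeholder auth header are assumptions; the endpoint is GitHub's public /repos/{owner}/{repo}/contributors API):

import requests

AUTH_HEADER = {"Authorization": "Bearer <github-token>"}  # placeholder; the scraper's real header would be used

def fetch_contributors(repo_name: str) -> list:
    """Return a trimmed contributor list for an 'owner/repo' string."""
    url = f"https://api.github.com/repos/{repo_name}/contributors"
    response = requests.get(url, headers=AUTH_HEADER, params={"per_page": 100})
    response.raise_for_status()
    # Each contributor record includes 'login' and a 'contributions' count;
    # keeping only those fields avoids storing the whole API payload.
    return [{"login": c["login"], "contributions": c["contributions"]} for c in response.json()]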

Comment on lines 43 to 45
if response.status_code == 404:
log.debug(f"Skipping repository not found for URL {repo_url}")
return None
Contributor:

Do we want to return None? We probably want to include this in the score (see the sketch below), e.g.:

  • health: unknown
  • notes: the included source URL does not exist
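
A minimal sketch of recording that in the 404 branch (the field names follow the bullets above; everything else is illustrative):

if response.status_code == 404:
    log.debug(f"Repository not found for URL {repo_url}")
    # Keep a record for the package instead of silently dropping it.
    return {
        "source_url": repo_url,
        "health": "unknown",
        "notes": "the included source URL does not exist",
    }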

Contributor:

can you comment on how we will record this data?

Comment on lines 12 to 13
dir_path = os.path.join(input_dir, f"partition={partition}")
if os.path.exists(dir_path):
Contributor:

You need to use DuckDB or pandas' native hive partitioning here.

@srossross (Contributor):

@karamba228 can you use the output from the aggregation, 'gs://openteams-score-data/2024-08-08.3/source-urls.parquet', as the input to the GitHub scraper?
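
A minimal sketch of reading that aggregate as the scraper input (assumes pandas plus the gcsfs package for gs:// paths, and credentials that can read the bucket):

import pandas as pd

# pandas hands gs:// URLs to fsspec/gcsfs, so gcsfs must be installed.
source_urls = pd.read_parquet(
    "gs://openteams-score-data/2024-08-08.3/source-urls.parquet"
)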

@karamba228 karamba228 requested a review from srossross August 9, 2024 14:28
@srossross (Contributor) left a comment:

Thanks @karamba228! I've left a few comments

score/cli.py: outdated comment (resolved)
score/cli.py Outdated
def scrape_github(input, output):
click.echo("Scraping GitHub data.")

df = scrape_github_data(input_file=input)
Contributor:

Please add a partition on the source_url and do the file reads/writes in cli.py.
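
A loose sketch of what that could look like in cli.py (the partition key, the column names, and the scrape_github_data signature are assumptions for illustration only):

import click
import pandas as pd
from score.data_retrieval.github_scraper import scrape_github_data

@click.command()
@click.option("--input", required=True)
@click.option("--output", required=True)
def scrape_github(input, output):
    click.echo("Scraping GitHub data.")
    # File reads/writes stay in the CLI; the scraper only transforms data.
    df = pd.read_parquet(input)
    scraped = scrape_github_data(df)  # assumed to accept and return a DataFrame
    # Hypothetical partition key: first letter of the repository owner.
    scraped["partition"] = scraped["source_url"].str.split("/").str[3].str[:1].str.lower()
    scraped.to_parquet(output, partition_cols=["partition"], index=False)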

score/data_retrieval/github_scraper.py: outdated comment (resolved)
Comment on lines 43 to 45
if response.status_code == 404:
log.debug(f"Skipping repository not found for URL {repo_url}")
return None
Contributor:

can you comment on how we will record this data?

score/data_retrieval/github_scraper.py: outdated comments (resolved)
@karamba228 karamba228 requested a review from srossross August 9, 2024 17:10
@srossross (Contributor) left a comment:

Do you have a plan to get around the request quota?
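
Not something the PR answers yet, but one common approach, sketched here only as an illustration: authenticate every request (which raises the quota to 5,000 calls per hour) and pause when GitHub's rate-limit headers say the window is exhausted.

import time
import requests

def github_get(url: str, headers: dict) -> requests.Response:
    """GET that sleeps until the GitHub rate-limit window resets when exhausted."""
    response = requests.get(url, headers=headers)
    remaining = int(response.headers.get("X-RateLimit-Remaining", "1"))
    if remaining == 0:
        reset_at = int(response.headers.get("X-RateLimit-Reset", time.time() + 60))
        # Sleep until the reported reset time, then let the caller retry.
        time.sleep(max(reset_at - time.time(), 0) + 1)
    return response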

@@ -0,0 +1,24 @@
def extract_and_map_fields(data: dict, map: dict) -> dict:
Contributor:

Please rename this file to extract_and_map_fields.py or collections.py; having a /util/common.py usually results in that file becoming very large over time.

score/cli.py Outdated
click.echo("Scraping GitHub data.")

# Read the input Parquet file using pandas
df = pd.read_parquet(input)
Contributor:

Can you please partition the data? This would have 30K URLs in it. Does this work locally?

Comment on lines +106 to +120
if df.empty:
click.echo("No valid GitHub URLs found in the input file.")
return

total_rows = len(df)
partition_size = total_rows // num_partitions
start_index = partition * partition_size
end_index = (
total_rows if partition == num_partitions - 1 else start_index + partition_size
)

df_partition = df.iloc[start_index:end_index]
click.echo(
f"Processing {len(df_partition)} URLs in partition {partition + 1} of {num_partitions}"
)
Contributor:

@karamba228 can you please partition this in a manner consistent with PyPI? Also, move that logic into its own file to be consistent with the other CLI functions; see https://github.com/openteamsinc/Score/blob/main/score/conda/get_conda_package_names.py
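
A minimal sketch of that slicing logic pulled into its own helper, mirroring the quoted block above (the module path and function name are assumptions):

# e.g. score/data_retrieval/get_github_urls.py (hypothetical location)
import pandas as pd

def get_partition(df: pd.DataFrame, partition: int, num_partitions: int) -> pd.DataFrame:
    """Return the 0-based `partition` slice of `df`, matching the CLI logic above."""
    total_rows = len(df)
    partition_size = total_rows // num_partitions
    start_index = partition * partition_size
    end_index = (
        total_rows if partition == num_partitions - 1 else start_index + partition_size
    )
    return df.iloc[start_index:end_index]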

Comment on lines +84 to +86
if df.empty:
log.debug("No valid GitHub URLs found in the input file")
return pd.DataFrame()
Contributor:

This should be an error rather than returning an empty DataFrame.
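
For example, something along these lines in place of the quoted return (the exact exception type is a judgment call):

if df.empty:
    raise ValueError("No valid GitHub URLs found in the input file")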

dict: A dictionary containing the extracted data fields or an indication that the URL is broken.
"""
repo_name = "/".join(repo_url.split("/")[-2:])
response = requests.get(GITHUB_API_URL + repo_name, headers=AUTH_HEADER)
Contributor:

Use get_session, which adds retries to the request for 500 errors.
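
For reference, the retry pattern typically looks roughly like this (a generic sketch of the idea, not the project's actual get_session implementation):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def get_session() -> requests.Session:
    """Session that retries transient 5xx responses with exponential backoff."""
    session = requests.Session()
    retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    session.mount("https://", HTTPAdapter(max_retries=retries))
    return session

# e.g. response = get_session().get(GITHUB_API_URL + repo_name, headers=AUTH_HEADER)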

contributors_response = requests.get(contributors_url, headers=AUTH_HEADER)
if contributors_response.status_code == 200:
contributors = contributors_response.json()
extracted_data["contributors"] = contributors
Contributor:

what is the contributors data structure? do we need to store this information to create the score?
