Github scraper initial #38
base: main
Conversation
for letter in letters_to_scrape:
    directory = f"output/json/first_letter={letter}"
    if os.path.exists(directory):
        for file_name in os.listdir(directory):
            if file_name.endswith(".parquet"):
                file_path = os.path.join(directory, file_name)
                df = pq.read_table(file_path).to_pandas()
is this from pypi? we probably should update the output name to output/pypi/json to be more clear
also you should use hive partitioning here
pd.read_parquet(
    directory,
    filters=[("first_letter", "==", letters_to_scrape)],
)
You will have to update the filters though; "==" will not work with the list of letters
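For example, something along these lines should work (a sketch with assumed paths; it relies on the pyarrow engine's filter syntax, where "in" accepts a list):

import pandas as pd

# Read the whole hive-partitioned dataset in one call; pyarrow exposes
# first_letter as a column and prunes partitions via the filter.
letters_to_scrape = ["a", "b", "c"]  # example values
df = pd.read_parquet(
    "output/json",
    filters=[("first_letter", "in", letters_to_scrape)],
)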
a few more thoughts.
The source list of github urls will not be exclusive to pypi; what if there is a conda package that is not in pypi?
we may want to do a separate step to create an output like github_urls.parquet
as a side note this can also be done in duckdb like so:
-- Main query to process the data and return GitHub URLs with first_letter filter in package_data CTE
WITH pypi_package_data AS (
    SELECT
        first_letter,
        project_urls,
        home_page
    FROM read_parquet('output/json/first_letter=*/**.parquet')
    WHERE first_letter IN ('a', 'b', 'c') -- Replace with your desired letters
),
pypi_github_urls AS (
    SELECT
        COALESCE(
            json_extract(project_urls, '$.Source'),
            json_extract(project_urls, '$.Homepage'),
            home_page
        ) AS source_url
    FROM pypi_package_data
)
SELECT DISTINCT source_url
FROM pypi_github_urls
WHERE source_url LIKE '%github.com%'
ORDER BY source_url;
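If it helps, here is a rough sketch of running that query from Python and writing the suggested github_urls.parquet (paths and the letter list are placeholders; hive_partitioning is enabled explicitly so first_letter is available as a column):

import duckdb

duckdb.execute("""
COPY (
    WITH pypi_package_data AS (
        SELECT first_letter, project_urls, home_page
        FROM read_parquet('output/json/first_letter=*/**.parquet', hive_partitioning = true)
        WHERE first_letter IN ('a', 'b', 'c')  -- replace with your desired letters
    ),
    pypi_github_urls AS (
        SELECT COALESCE(
            json_extract(project_urls, '$.Source'),
            json_extract(project_urls, '$.Homepage'),
            home_page
        ) AS source_url
        FROM pypi_package_data
    )
    SELECT DISTINCT source_url
    FROM pypi_github_urls
    WHERE source_url LIKE '%github.com%'
    ORDER BY source_url
) TO 'output/github_urls.parquet' (FORMAT PARQUET)
""")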
score/data_retrieval/json_scraper.py
Outdated
@@ -87,6 +87,7 @@ def process_packages_by_letter(letter, package_names, output_dir):
    all_package_data = []
    for package_name in tqdm(letter_package_names, desc=f"Processing letter {letter}"):
        package_data = get_package_data(package_name)
        df = pd.json_normalize(package_data)
what is this doing?
score/cli.py
Outdated
    callback=validate_input,
    help="Enter the ending letter or number to scrape (e.g., 'c' or '9').",
)
def github_aggregate(start, end, input, output):
should this one have a start/end? this will not affect performance, right?
score/utils/github_aggregator.py
Outdated
for letter in letters_to_scrape:
    dir_path = os.path.join(input_dir, f"first_letter={letter}")
    if os.path.exists(dir_path):
this can be done natively with pandas or duckdb. I would use duckdb here because it is simple and I would expect it to perform much better than pandas
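Something roughly like this (a sketch with assumed paths) would replace the per-letter loop:

import duckdb

# Let duckdb expand the hive partitions and filter on the partition column,
# instead of looping over first_letter directories in Python.
df = duckdb.execute("""
    SELECT *
    FROM read_parquet('output/json/first_letter=*/*.parquet', hive_partitioning = true)
    WHERE first_letter IN ('a', 'b', 'c')  -- the letters_to_scrape values
""").df()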
score/utils/github_aggregator.py
Outdated
dir_path = os.path.join(input_dir, f"first_letter={letter}")
if os.path.exists(dir_path):
    df = pd.read_parquet(dir_path)
    df["first_letter"] = letter  # Add the first_letter column manually
you should not need this if you read as a hive partition
score/cli.py
Outdated
def scrape_github(partition, output):
    click.echo(f"Scraping GitHub data for partition {partition}.")

    input_dir = OUTPUT_ROOT / "output" / "github-urls"
let's make this the same as github_aggregate and pass input_dir as an argument
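A minimal sketch of that (option names here are assumptions, not the project's actual CLI):

import click

@click.command()
@click.option("--input", "input_dir", required=True,
              help="Directory containing the aggregated GitHub URL parquet files.")
@click.option("--output", required=True, help="Where to write the scraped GitHub data.")
def scrape_github(input_dir, output):
    click.echo(f"Scraping GitHub data from {input_dir}.")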
"collaborators_url": "collaborators_url", | ||
"contributors_url": "contributors_url", |
why store this url? can we fetch a list of collaborators instead?
if response.status_code == 404:
    log.debug(f"Skipping repository not found for URL {repo_url}")
    return None
Do we want to return None? We probably want to include this in the score, like:
- health: unknown
- notes: the included source URL does not exist
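E.g., a sketch of what that 404 branch could look like (the field names are placeholders, not the project's actual schema):

if response.status_code == 404:
    log.debug(f"Repository not found for URL {repo_url}")
    return {
        "source_url": repo_url,
        "health": "unknown",
        "notes": "the included source URL does not exist",
    }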
can you comment on how we will record this data?
score/utils/github_aggregator.py
Outdated
dir_path = os.path.join(input_dir, f"partition={partition}")
if os.path.exists(dir_path):
You need to use duckdb or native pandas hive partitioning
@karamba228 can you use the output from the agg 'gs://openteams-score-data/2024-08-08.3/source-urls.parquet' as the input to the github scraper?
Thanks @karamba228! I've left a few comments
score/cli.py
Outdated
def scrape_github(input, output):
    click.echo("Scraping GitHub data.")

    df = scrape_github_data(input_file=input)
please add a partition on the source_url and do the file read/writes in the cli.py
if response.status_code == 404:
    log.debug(f"Skipping repository not found for URL {repo_url}")
    return None
can you comment on how we will record this data?
Do you have a plan to get around the request quota?
score/utils/common.py
Outdated
@@ -0,0 +1,24 @@
def extract_and_map_fields(data: dict, map: dict) -> dict:
Please rename this file to extract_and_map_fields.py or collections.py.
Having a /util/common.py usually results in this file becoming very large over time.
score/cli.py
Outdated
click.echo("Scraping GitHub data.") | ||
|
||
# Read the input Parquet file using pandas | ||
df = pd.read_parquet(input) |
Can you please partition the data? this would have 30K urls in it. does this work locally?
if df.empty:
    click.echo("No valid GitHub URLs found in the input file.")
    return

total_rows = len(df)
partition_size = total_rows // num_partitions
start_index = partition * partition_size
end_index = (
    total_rows if partition == num_partitions - 1 else start_index + partition_size
)

df_partition = df.iloc[start_index:end_index]
click.echo(
    f"Processing {len(df_partition)} URLs in partition {partition + 1} of {num_partitions}"
)
@karamba228 can you please partition this in a manner consistent with pypi? also move that logic into its own file to be consistent with other cli functions, see https://github.com/openteamsinc/Score/blob/main/score/conda/get_conda_package_names.py
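For instance (a hypothetical helper, mirroring the letter-based partitioning used on the pypi side; the module name, function name, and column handling are assumptions):

# score/github/get_github_urls.py (hypothetical module)
import pandas as pd

def get_github_urls(input_file: str, partition: str) -> pd.DataFrame:
    """Return only the source_urls whose repo name starts with `partition`,
    so the github scraper is partitioned the same way as the pypi scraper."""
    df = pd.read_parquet(input_file)
    repo_names = df["source_url"].str.rstrip("/").str.split("/").str[-1]
    return df[repo_names.str.lower().str.startswith(partition.lower())]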
if df.empty:
    log.debug("No valid GitHub URLs found in the input file")
    return pd.DataFrame()
this should be an error
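i.e., something like:

if df.empty:
    # Fail loudly instead of silently returning an empty frame.
    raise ValueError("No valid GitHub URLs found in the input file")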
        dict: A dictionary containing the extracted data fields or an indication that the URL is broken.
    """
    repo_name = "/".join(repo_url.split("/")[-2:])
    response = requests.get(GITHUB_API_URL + repo_name, headers=AUTH_HEADER)
use the get_session; that will add a retry to the request for 500 errors
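For reference, a typical get_session looks roughly like this (the project's actual helper may differ; this just illustrates the retry-on-5xx idea):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def get_session() -> requests.Session:
    # Retry transient 5xx responses with exponential backoff.
    session = requests.Session()
    retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    session.mount("https://", HTTPAdapter(max_retries=retries))
    return session

# then in the scraper:
# response = get_session().get(GITHUB_API_URL + repo_name, headers=AUTH_HEADER)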
contributors_response = requests.get(contributors_url, headers=AUTH_HEADER)
if contributors_response.status_code == 200:
    contributors = contributors_response.json()
    extracted_data["contributors"] = contributors
what is the contributors data structure? do we need to store this information to create the score?
Changes: