
Scraper #316

Closed · wants to merge 15 commits

Conversation

RyEggGit (Collaborator)

Scraper Update 🕷️

Sorry for the big PR; I got carried away converting the code from one repo to the other. For simplicity, the scraper runs asynchronously via a cronjob that calls the Flask command `scrape-v2`. It iteratively crawls the 50-a website, gathers as much information as it can (agencies are still a work in progress), and downloads the CSVs from NYPD. Each record is checked against our scraping cache; if it isn't there, we add it to the database. The incident schemas are a bit rough and may create duplicate rows depending on whether the officer is a perpetrator, participant, etc., but they should get the job done for now.
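For context on the moving parts, here is a minimal sketch of the cron-driven Flask CLI pattern described above. Everything in it is illustrative: the `crawl_50a` helper, the in-memory `seen` set standing in for the scraping cache, and the omitted database insert are assumptions, not the code in this PR.

```python
# Hedged sketch of the cron + Flask CLI flow described above. The crawl
# helper, the in-memory cache, and the skipped DB insert are stand-ins,
# not the actual code in this PR.
import click
from flask import Flask

app = Flask(__name__)
seen: set[str] = set()  # stand-in for the PR's scraping cache

def crawl_50a():
    """Hypothetical crawler; the real one walks 50-a.org page by page."""
    yield {"officer_id": "abc123", "name": "Example Officer"}

@app.cli.command("scrape-v2")
def scrape_v2():
    """Entry point a cronjob can invoke as `flask scrape-v2`."""
    for record in crawl_50a():
        if record["officer_id"] in seen:
            continue  # already cached; skip the duplicate
        seen.add(record["officer_id"])
        # ...insert the record into the database here...
        click.echo(f"added {record['officer_id']}")
```

A cron entry would then run it on a schedule with something like `flask scrape-v2`.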

RyEggGit added the enhancement and backend labels on Jan 11, 2024
RyEggGit marked this pull request as draft on Jan 17, 2024
```diff
@@ -25,7 +25,7 @@ services:
   api:
     build:
       context: .
-      dockerfile: ./backend/Dockerfile
+      dockerfile: ./backend/Dockerfile.cloud
```
Collaborator
This change should be reverted; it breaks `docker-compose build` on ARM Macs.

@DMalone87 (Collaborator)

Something to consider: we'll need to add some logic to handle 429 responses. Right now the scraper just keeps making requests; ideally it should suspend the process and pick it back up after a delay. Even better, it could track the occurrences of these responses and collect data on them for review.
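To make the suggestion concrete, here is one possible shape for that logic, a hedged sketch rather than anything in this PR: honor the server's `Retry-After` header when present, back off exponentially otherwise, and keep a log of throttled URLs for later review.

```python
# Hedged sketch, not this PR's code: honor Retry-After on 429s, otherwise
# back off exponentially, and log throttled URLs so they can be reviewed.
import time
import requests

throttle_log: list[str] = []  # URLs that returned 429, for later review

def fetch(url: str, max_retries: int = 5) -> requests.Response:
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(url, timeout=30)
        if resp.status_code != 429:
            return resp
        throttle_log.append(url)
        # Prefer the server's hint (assuming a seconds value, not an
        # HTTP date); fall back to our own growing delay.
        wait = float(resp.headers.get("Retry-After", delay))
        time.sleep(wait)
        delay *= 2
    raise RuntimeError(f"still throttled after {max_retries} retries: {url}")
```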

Collaborator

Hi folks, I'm a new contributor to this repo. This work looks nice. Here are my two cents:

  1. We may want to move scraping-related work outside this repo. We already have quite a bit going on here, and a separate repo would be easier in the long run, for example for deployments and requirements management.
  2. I think we can eliminate the Redis portion. Once the scrapers generate data files, we can load the new data using upserts (see the sketch after this list).
  3. There are Python packages dedicated to scraping. One of the great ones is Scrapy: a mature package with all kinds of conveniences, such as handling retries, user agents, etc.
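On point 2, here is a rough sketch of what an upsert-based load could look like with SQLAlchemy's PostgreSQL dialect; the `officers` table, its columns, and the DSN are made-up placeholders, not this project's schema:

```python
# Hedged sketch of point 2: an idempotent load via a Postgres upsert, so no
# separate cache is needed to dedupe. Table, columns, and DSN are made up.
from sqlalchemy import create_engine, MetaData, Table, Column, String
from sqlalchemy.dialects.postgresql import insert

engine = create_engine("postgresql:///example")  # placeholder DSN
meta = MetaData()
officers = Table(
    "officers", meta,
    Column("officer_id", String, primary_key=True),
    Column("name", String),
)

def upsert_officer(record: dict) -> None:
    stmt = insert(officers).values(**record)
    # On a repeat scrape, update the existing row instead of erroring
    # on the primary-key conflict.
    stmt = stmt.on_conflict_do_update(
        index_elements=["officer_id"],
        set_={"name": stmt.excluded.name},
    )
    with engine.begin() as conn:
        conn.execute(stmt)
```

Because the upsert makes repeat loads idempotent, re-running the scraper is safe without any external "have we seen this?" cache.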

Collaborator

Here is an example scraper I built with Scrapy before: https://github.com/aliavni/drugbank-scraper
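For anyone who hasn't used Scrapy, a minimal spider looks roughly like this. The start URL, selectors, and settings below are illustrative assumptions, not taken from the linked repo or this PR; note that the built-in retry middleware and AutoThrottle extension also address the 429 concern raised earlier.

```python
# Minimal Scrapy spider sketch; the start URL, selectors, and settings are
# illustrative assumptions, not taken from the linked repo or this PR.
import scrapy

class FiftyASpider(scrapy.Spider):
    name = "fifty_a"
    start_urls = ["https://www.50-a.org/"]  # assumed entry point
    custom_settings = {
        "RETRY_TIMES": 3,              # built-in retry middleware
        "AUTOTHROTTLE_ENABLED": True,  # adaptive delays; helps with 429s
    }

    def parse(self, response):
        # Follow every link on the page; Scrapy dedupes visited URLs itself.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```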

DMalone87 closed this on Jun 11, 2024
Labels: backend, enhancement
4 participants