Scraper #316
Conversation
@@ -25,7 +25,7 @@ services:
   api:
     build:
       context: .
-      dockerfile: ./backend/Dockerfile
+      dockerfile: ./backend/Dockerfile.cloud
This change should be reverted; it breaks the docker-compose build on ARM Macs.
Something to consider... we'll need to add some logic to handle 429 responses. Right now, it just continues making requests, but ideally it should suspend the process and pick it back up after a delay. Even better, it could track the occurrences of these responses and collect data on them for review.
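A minimal sketch of what that could look like, assuming a plain requests-based fetch loop (fetch_page, RATE_LIMIT_LOG, and the retry bounds are placeholders, not existing code in this PR): back off when a 429 arrives, honor Retry-After when the server sends it, and record each occurrence for later review.

```python
# Hypothetical sketch, not the actual scraper code.
import time
import requests

RATE_LIMIT_LOG = []  # collected 429 events for later analysis


def fetch_page(session: requests.Session, url: str, max_retries: int = 5) -> requests.Response:
    delay = 1.0
    for attempt in range(max_retries):
        response = session.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Record the occurrence so rate-limit patterns can be reviewed later.
        RATE_LIMIT_LOG.append({"url": url, "attempt": attempt, "time": time.time()})
        # Prefer the server's Retry-After header; otherwise back off exponentially.
        retry_after = response.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else delay * 2
        time.sleep(delay)
    raise RuntimeError(f"Gave up on {url} after {max_retries} rate-limited attempts")
```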
Force-pushed from 1736fc6 to 90f9bb6
Hi folks, I am a new contributor to this repo. This work looks nice. Here are my five cents:
- We may want to keep scraping-related work outside this repo. We already have quite a bit going on here, and a separate repo would be easier in the long run, for example for deployments and requirements management.
- I think we can eliminate the Redis portion. Once the scrapers generate data files, we can load the new data using upserts.
- There are Python packages dedicated to scraping. One of the great ones is Scrapy. It is a pretty mature package with all kinds of conveniences such as handling retries, user agents, etc. (see the sketch below).
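For illustration only, here is a rough sketch of the kind of Scrapy settings that cover retries, rate-limit backoff, and user agents out of the box; the bot name and values are assumptions, not a recommendation for this project.

```python
# Hypothetical Scrapy settings.py excerpt (all values are illustrative).
BOT_NAME = "fifty_a_scraper"          # hypothetical project name
USER_AGENT = "police-data-trust-bot"  # identify the crawler politely

# Retry transient failures, including 429 rate limits, automatically.
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# AutoThrottle backs off based on observed server latency instead of a fixed delay.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 60.0
DOWNLOAD_DELAY = 0.5
```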
Here is an example scraper I built with Scrapy before: https://github.com/aliavni/drugbank-scraper
Scraper Update 🕷️
Sorry for the big PR; I got carried away converting the code from one repo to another. For simplicity, this scraper code runs asynchronously via a cronjob that calls the flask command sracpe-v2. It iteratively crawls the 50-a website, gathers as much information as it can (still working on agencies), and downloads the CSVs from the NYPD. I then check whether the information is already in our scraping cache; if not, we add it to the database. The incident schemas are a bit whack and may create duplicated information depending on whether the officer is a perpetrator, participant, etc., but they should get the job done for now.
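To make the cache check concrete, here is a rough sketch under assumed names (record_key, scrape_cache, and save_incident are placeholders, not the actual helpers in this PR): fingerprint each scraped record and only persist it when that fingerprint has not been seen before.

```python
# Hypothetical sketch of the cache-then-insert flow; not the actual project code.
import hashlib
import json


def record_key(record: dict) -> str:
    """Stable fingerprint of a scraped record, used as the cache key."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()


def ingest(record: dict, scrape_cache: set, save_incident) -> None:
    """Persist the record only if it was not ingested on a previous run."""
    key = record_key(record)
    if key in scrape_cache:
        return  # already in the database from an earlier scrape
    scrape_cache.add(key)
    save_incident(record)  # hypothetical persistence callback (e.g. a SQLAlchemy insert)
```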