Scraper #316
Conversation
@@ -25,7 +25,7 @@ services:
   api:
     build:
       context: .
-      dockerfile: ./backend/Dockerfile
+      dockerfile: ./backend/Dockerfile.cloud
This change should be reverted; it breaks the docker-compose build on ARM Macs.
Something to consider... we'll need to add some logic to handle 429 responses. Right now, it just continues making requests, but ideally it should suspend the process and pick it back up after a delay. Even better, it could track the occurrences of these responses and collect data on them for review.
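A minimal sketch of what that could look like, assuming a plain requests-based fetch loop (fetch_page, RATE_LIMIT_LOG, and the retry bounds are placeholders, not existing code in this PR): back off when a 429 arrives, honor Retry-After when the server sends it, and record each occurrence for later review.

```python
# Hypothetical sketch, not the actual scraper code.
import time
import requests

RATE_LIMIT_LOG = []  # collected 429 events for later analysis


def fetch_page(session: requests.Session, url: str, max_retries: int = 5) -> requests.Response:
    delay = 1.0
    for attempt in range(max_retries):
        response = session.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Record the occurrence so rate-limit patterns can be reviewed later.
        RATE_LIMIT_LOG.append({"url": url, "attempt": attempt, "time": time.time()})
        # Prefer the server's Retry-After header; otherwise back off exponentially.
        retry_after = response.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else delay * 2
        time.sleep(delay)
    raise RuntimeError(f"Gave up on {url} after {max_retries} rate-limited attempts")
```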
Force-pushed from 1736fc6 to 90f9bb6
Hi folks, I am a new contributor to this repo. This work looks nice. Here are my five cents:
- We may want to keep scraping-related work outside this repo. We already have quite a bit going on here, and a separate repo would be easier in the long run, for example for deployments and requirements management.
- I think we can eliminate the Redis portion. Once the scrapers generate data files, we can load the new data using upserts.
- There are Python packages dedicated to scraping. One of the great ones is Scrapy. It is a pretty mature package with all kinds of conveniences such as handling retries, user agents, etc. (see the sketch below).
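For illustration only, here is a rough sketch of the kind of Scrapy settings that cover retries, rate-limit backoff, and user agents out of the box; the bot name and values are assumptions, not a recommendation for this project.

```python
# Hypothetical Scrapy settings.py excerpt (all values are illustrative).
BOT_NAME = "fifty_a_scraper"          # hypothetical project name
USER_AGENT = "police-data-trust-bot"  # identify the crawler politely

# Retry transient failures, including 429 rate limits, automatically.
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# AutoThrottle backs off based on observed server latency instead of a fixed delay.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 60.0
DOWNLOAD_DELAY = 0.5
```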
Here is an example scraper I built with Scrapy before: https://github.com/aliavni/drugbank-scraper
Scraper Update 🕷️
Sorry for the big PR; I got carried away converting the code from one repo to another. For simplicity, this scraper code runs asynchronously via a cronjob that calls the flask command sracpe-v2. It iteratively crawls the 50-a website, gathers as much information as it can (still working on agencies), and downloads the CSVs from the NYPD. I then check whether the information is already in our scraping cache; if not, we add it to the database. The incident schemas are a bit whack and may create duplicated information depending on whether the officer is a perpetrator, participant, etc., but they should get the job done for now.
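To make the cache check concrete, here is a rough sketch under assumed names (record_key, scrape_cache, and save_incident are placeholders, not the actual helpers in this PR): fingerprint each scraped record and only persist it when that fingerprint has not been seen before.

```python
# Hypothetical sketch of the cache-then-insert flow; not the actual project code.
import hashlib
import json


def record_key(record: dict) -> str:
    """Stable fingerprint of a scraped record, used as the cache key."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()


def ingest(record: dict, scrape_cache: set, save_incident) -> None:
    """Persist the record only if it was not ingested on a previous run."""
    key = record_key(record)
    if key in scrape_cache:
        return  # already in the database from an earlier scrape
    scrape_cache.add(key)
    save_incident(record)  # hypothetical persistence callback (e.g. a SQLAlchemy insert)
```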