Scrape "FEC Record" pages #457
Comments
There's an improved, work-in-progress display that they've been putting together here, which will be way easier to scrape. One thing I'm seeing, though, is that the taxonomy is different from what we originally designed the models for. It looks like, in addition to the category, we'll also need a keywords field that can hold multiple values.
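To make the taxonomy point concrete, here is a hypothetical sketch of what that model change could look like, assuming a Django-based CMS; the model and field names are illustrative, not the project's actual schema.

```python
# Hypothetical sketch of the taxonomy change discussed above, assuming a
# Django-based CMS. All names here are illustrative.
from django.db import models


class Keyword(models.Model):
    # A single keyword from the FEC Record taxonomy.
    name = models.CharField(max_length=255, unique=True)


class RecordArticle(models.Model):
    title = models.CharField(max_length=255)
    # The original single-valued category field.
    category = models.CharField(max_length=255)
    # New: an article can carry multiple keywords, so a many-to-many
    # relation rather than a second single-valued field.
    keywords = models.ManyToManyField(Keyword, blank=True)
```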
Not quite sure where to check this in, so I've put it in a gist for the moment. This Python 3 script https://gist.github.com/tadhg-ohiggins/6f62923b0674a5eb915e405e0c252bc8 appears to successfully scrape all of the 1638 articles referred to at http://www.fec.gov/pages/fecrecord_redesign/fecrecord.shtml
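The gist is the authoritative script; as a rough illustration of the general approach (fetch the index page, follow the article links, collect each article's HTML), here is a minimal sketch. The link handling is an assumption for illustration; the real script deals with the page's actual structure.

```python
# Minimal sketch of the scraping approach (not the gist itself).
# Assumption: article links appear as plain <a href> tags on the index page.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

INDEX_URL = "http://www.fec.gov/pages/fecrecord_redesign/fecrecord.shtml"


def scrape_index(index_url=INDEX_URL):
    """Return a list of {'url': ..., 'html': ...} dicts for linked articles."""
    resp = requests.get(index_url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Resolve relative hrefs against the index URL.
    links = {urljoin(index_url, a["href"]) for a in soup.find_all("a", href=True)}
    articles = []
    for url in sorted(links):
        page = requests.get(url)
        if page.ok:
            articles.append({"url": url, "html": page.text})
    return articles
```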
Nice! Yeah @LindsayYoung, where should these live? Should we just set up a separate repo that contains all the scraper scripts?
Recapping some conversations that took place in Slack: we concluded that these scripts should live within the CMS, since they're scrapers intended to pull content into the CMS.
Just made a PR to add it. I ran it once and everything seemed to work as expected. Can we close this issue and move to the next step of importing? @ccostino, are you still good to tackle that?
Yes, I should be able to spend some time on that tomorrow (today?) to keep the ball moving forward! I merged the PR and will continue from there.