Scrape "FEC Record" pages #457

Closed
Tracked by #60
noahmanger opened this issue Aug 27, 2016 · 7 comments
Closed
Tracked by #60

Scrape "FEC Record" pages #457

noahmanger opened this issue Aug 27, 2016 · 7 comments
Assignees
Milestone

Comments

@noahmanger

So that we can import the content from the FEC Record quickly, scrape all of the pages into a database.

Hopefully this work can build on the scraper built for the press releases in https://github.com/18F/openFEC/issues/1895.

@LindsayYoung (Contributor)

@noahmanger (Author)

There's an improved, work-in-progress display that they've been putting together here, which will be much easier to scrape.

One thing I'm seeing, though, is that the taxonomy is different from what we originally designed the models for. It looks like, in addition to the category, we'll also need a keywords field that can hold multiple values.
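For illustration, here's a minimal sketch of that model change, assuming Django-style models; `Keyword`, `RecordArticle`, and the field names are hypothetical, not the actual fec-cms models:

```python
from django.db import models


class Keyword(models.Model):
    """A single keyword from the FEC Record taxonomy (hypothetical model)."""
    name = models.CharField(max_length=255, unique=True)

    def __str__(self):
        return self.name


class RecordArticle(models.Model):
    """An FEC Record article: one category, many keywords (hypothetical model)."""
    title = models.CharField(max_length=255)
    category = models.CharField(max_length=255)
    # A ManyToManyField lets a single article carry multiple keyword values,
    # which a single CharField category cannot.
    keywords = models.ManyToManyField(Keyword, blank=True)
```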

@tadhg-ohiggins (Contributor)

Not quite sure where to check this in, so I've put it in a gist for the moment. This Python 3 script, https://gist.github.com/tadhg-ohiggins/6f62923b0674a5eb915e405e0c252bc8, appears to successfully scrape all 1,638 articles referred to at http://www.fec.gov/pages/fecrecord_redesign/fecrecord.shtml
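For anyone skimming, a stripped-down sketch of that kind of listing scrape, using requests and BeautifulSoup. The function name and the take-every-link approach are illustrative assumptions; the gist above is the authoritative script:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

LISTING_URL = "http://www.fec.gov/pages/fecrecord_redesign/fecrecord.shtml"


def scrape_article_links(url=LISTING_URL):
    """Fetch the listing page and return absolute URLs for every link on it."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Resolve each href against the page URL; the real scraper would then
    # filter this list down to actual FEC Record article pages and parse them.
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    links = scrape_article_links()
    print(f"Found {len(links)} links")
```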

@noahmanger
Copy link
Author

Nice! Yeah, @LindsayYoung, where should these live? Should we just set up a separate repo that contains all the scraper scripts?

ccostino self-assigned this Dec 14, 2016
@ccostino (Contributor)

Recapping some conversations that took place in Slack: we concluded that the scrapers should live within the CMS, since they exist to pull in content destined for the CMS.

@noahmanger (Author)

Just made a PR to add it. I ran it once and everything seemed to work as expected. Can we close this issue and move on to the next step of importing? @ccostino, are you still good to tackle that?

@ccostino (Contributor)

Yes, I should be able to spend some time on that tomorrow (today?) to keep the ball moving forward! I merged the PR, so we can continue on elsewhere.
