Scrape "FEC Record" pages #457

Closed
Tracked by #60
noahmanger opened this issue Aug 27, 2016 · 7 comments
Closed
Tracked by #60

Scrape "FEC Record" pages #457

noahmanger opened this issue Aug 27, 2016 · 7 comments
Assignees
Milestone

Comments

@noahmanger

So that we can import the content from the FEC Record quickly, scrape all of the pages into a database.

Hopefully this work can build on the scraper built for the press releases in https://github.com/18F/openFEC/issues/1895.

@LindsayYoung (Contributor)

@noahmanger (Author)

There's an improved, work-in-progress display that they've been putting together here, which will be much easier to scrape.

One thing I'm seeing, though, is that the taxonomy is different from what we originally designed the models for. It looks like, in addition to the category, we'll also need a keywords field that can hold multiple values.
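For illustration, here's a minimal sketch of that model change, assuming Django-style models; `Keyword`, `RecordArticle`, and the field names are hypothetical, not the actual fec-cms models:

```python
from django.db import models


class Keyword(models.Model):
    """A single keyword from the FEC Record taxonomy (hypothetical model)."""
    name = models.CharField(max_length=255, unique=True)

    def __str__(self):
        return self.name


class RecordArticle(models.Model):
    """An FEC Record article: one category, many keywords (hypothetical model)."""
    title = models.CharField(max_length=255)
    category = models.CharField(max_length=255)
    # A ManyToManyField lets a single article carry multiple keyword values,
    # which a single CharField category cannot.
    keywords = models.ManyToManyField(Keyword, blank=True)
```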

@tadhg-ohiggins (Contributor)

Not quite sure where to check this in, so I've put it in a gist for the moment. This Python 3 script, https://gist.github.com/tadhg-ohiggins/6f62923b0674a5eb915e405e0c252bc8, appears to successfully scrape all 1,638 articles referred to at http://www.fec.gov/pages/fecrecord_redesign/fecrecord.shtml
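For anyone skimming, a stripped-down sketch of that kind of listing scrape, using requests and BeautifulSoup. The function name and the take-every-link approach are illustrative assumptions; the gist above is the authoritative script:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

LISTING_URL = "http://www.fec.gov/pages/fecrecord_redesign/fecrecord.shtml"


def scrape_article_links(url=LISTING_URL):
    """Fetch the listing page and return absolute URLs for every link on it."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Resolve each href against the page URL; the real scraper would then
    # filter this list down to actual FEC Record article pages and parse them.
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    links = scrape_article_links()
    print(f"Found {len(links)} links")
```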

@noahmanger
Copy link
Author

Nice! Yeah, @LindsayYoung, where should these live? Should we just set up a separate repo that contains all the scraper scripts?

ccostino self-assigned this Dec 14, 2016
@ccostino (Contributor)

Recapping some conversations that took place in Slack: we concluded that the scrapers should live within the CMS, since they exist to pull in content destined for the CMS.

@noahmanger (Author)

Just made a PR to add it. I ran it once and everything seemed to work as expected. Can we close this issue and move on to the next step of importing? @ccostino, are you still good to tackle that?

@ccostino (Contributor)

Yes, I should be able to spend some time on that tomorrow (today?) to keep the ball moving forward! I merged the PR, so we can continue on elsewhere.
