First pass at Indiana state scraping #20

Open
wants to merge 3 commits into base: master


marks commented Mar 4, 2014

This is my first pass, loosely based on the IL scraper's JavaScript. There is more work to do, but this is a start.

Scraper output:

mba62:openrfps-scrapers mark$ bin/openrfps run scrapers/in/rfps.js 
[ 'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=300-14-63265&desc=Contract+for+Services+for+Invasive+Plant+Control&method=NEGOTIATED BID&code=T',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-058&desc=Centralized+Production+and+Direct+Distribution+of+License+Plates+and+Registration+Documents&method=RFP&code=7',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFQ_ISF21&desc=Misc.+Furniture&method=NOTICE TO BIDDERS&code=P',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFQ_ISF20&desc=End+Tables&method=NOTICE TO BIDDERS&code=P',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFQ_ISF19&desc=Sofas&method=NOTICE TO BIDDERS&code=P',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=300-14-63398&desc=Outright+Purchase+of+Airboat+and+Trailer+for+IDNR&method=NEGOTIATED BID&code=T',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFI-14-73&desc=Electronic+Media+Destruction+and+Shredding+Services&method=RFI&code=K',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=ASA-14-079&desc=QPA+for+Snack+Products+for+Pen+Products&method=NEGOTIATED BID&code=T',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-75&desc=Hard+Copy+Book+Collections&method=RFP&code=B',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFI-14-081&desc=Development,+Operation+and+Maintenance+of+an+Inn+and+Related+Facilities+at+Potato+Creek+State+Park&method=RFI&code=T',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-074&desc=Paint+and+Paint+Supplies&method=RFP&code=W',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-049&desc=Sustained+Statewide+Public+Relations+and+Marketing+Campaign&method=RFP&code=7',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=ASA-14-076&desc=Lab+Supplies&method=NEGOTIATED BID&code=V',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=ASA-14-089&desc=QPA+for+Cosmetic+Grade+Soap+Products+for+Pen+Products&method=NEGOTIATED BID&code=T',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=570-14-24746&desc=Wireless+Paging+System&method=NEGOTIATED BID&code=8',
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-83&desc=DNA+Sample+Collection+Services&method=RFP&code=K' ]
Done scraping!
Cached results to scrapers/in/rfps.json
[
  {
    "html_url": "http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=300-14-63265&desc=Contract+for+Services+for+Invasive+Plant+Control&method=NEGOTIATED BID&code=T",
    "id": "300-14-63265",
    "type": "NEGOTIATED BID",
    "title": "Contract for Services for Invasive Plant Control",
    "responses_open_at": "3/4/2014",
    "contact_name": "Deaton, Teresa"
  },
  {
    "html_url": "http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFP-14-058&desc=Centralized+Production+and+Direct+Distribution+of+License+Plates+and+Registration+Documents&method=RFP&code=7",
    "id": "RFP-14-058",
    "type": "RFP",
    "title": "Centralized Production and Direct Distribution of License Plates and Registration Documents",
    "responses_open_at": "3/4/2014",
    "contact_name": "Thiemann, Adam"
  },
  {
    "html_url": "http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=RFQ_ISF21&desc=Misc.+Furniture&method=NOTICE TO BIDDERS&code=P",
    "id": "RFQ_ISF21",
    "type": "NOTICE TO BIDDERS",
    "title": "Misc. Furniture",
    "responses_open_at": "3/5/2014",
    "contact_name": "Archer, Mary Beth"
  }
]
13 RFPs not printed for length considerations
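
In the output above, the id, type, and title of each item match the spec, method, and desc parameters of its html_url. As a minimal sketch of that mapping (not the scraper's actual code, which is not shown in this thread), the helper below pulls those fields out of a bidad.pl URL. The name parseBidUrl is hypothetical, and the sketch assumes no parameter value contains a literal "&".

```javascript
// Sketch only: derive id/type/title from a bidad.pl URL's query string.
// parseBidUrl is a hypothetical helper, not part of this PR.
function parseBidUrl(htmlUrl) {
  var query = htmlUrl.split('?')[1] || '';
  var params = {};
  query.split('&').forEach(function (pair) {
    var idx = pair.indexOf('=');
    if (idx === -1) return; // skip fragments without a key=value shape
    var key = pair.slice(0, idx);
    // '+' stands for a space in query strings; the method value in the
    // scraped URLs also contains raw spaces, which pass through unchanged.
    var raw = pair.slice(idx + 1).replace(/\+/g, ' ');
    try {
      params[key] = decodeURIComponent(raw);
    } catch (e) {
      params[key] = raw; // keep the raw text if percent-decoding fails
    }
  });
  return {
    html_url: htmlUrl,
    id: params.spec,
    type: params.method,
    title: params.desc
  };
}

var example = parseBidUrl(
  'http://www.in.gov/cgi-bin/idoa/cgi-bin/bidad.pl?spec=300-14-63265' +
  '&desc=Contract+for+Services+for+Invasive+Plant+Control' +
  '&method=NEGOTIATED BID&code=T'
);
console.log(example.id);    // 300-14-63265
console.log(example.type);  // NEGOTIATED BID
console.log(example.title); // Contract for Services for Invasive Plant Control
```

Fields such as responses_open_at and contact_name do not appear in the URL, so they would still have to come from the page itself.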

Test output:

mba62:openrfps-scrapers mark$ bin/openrfps test scrapers/in/rfps.js 
The scraper returns at least one result: OK
item.id is returned for all items: OK
item.type is valid for all items: Not OK
item.contact_email is a proper address (or blank): OK
download URLs are valid (or blank): OK
item.id is unique for each item: OK
item.title is returned for all items: OK
NIGP codes are digits: OK
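
The only failing check is item.type. The thread does not say which type values the test accepts, so as a first debugging step one could tally the distinct type values the scraper returns. The sketch below does that for a small sample copied from the scraper output above; countTypes is a hypothetical helper, and the same function could be run over the array cached in scrapers/in/rfps.json.

```javascript
// Sketch only: count how often each item.type value occurs, to see
// which values might be tripping the "item.type is valid" check.
// countTypes is a hypothetical helper, not part of openrfps.
function countTypes(items) {
  // Null-prototype object so a type named like a built-in property
  // (e.g. "constructor") cannot collide with inherited keys.
  var counts = Object.create(null);
  items.forEach(function (item) {
    var type = item.type || '(missing)';
    counts[type] = (counts[type] || 0) + 1;
  });
  return counts;
}

// Sample items taken from the scraper output in this PR.
var sample = [
  { id: '300-14-63265', type: 'NEGOTIATED BID' },
  { id: 'RFP-14-058', type: 'RFP' },
  { id: 'RFQ_ISF21', type: 'NOTICE TO BIDDERS' }
];

var typeCounts = countTypes(sample);
Object.keys(typeCounts).forEach(function (type) {
  console.log(type + ': ' + typeCounts[type]);
});
```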

Contributor

ajb commented Mar 4, 2014

Hey Mark, looks awesome. I'll just leave this open for you to add to?

ajb added the wip label Mar 4, 2014
Author

marks commented Mar 4, 2014

@adamjacobbecker - thanks. In the spirit of having others add to it (it's at a place where it is useful but could be more useful), I'd prefer you merge it so others know it's there and they can add to it before I get to it. Thoughts on that approach?

Contributor

ajb commented Mar 4, 2014

Not sure if we have a good workflow for that right now.

I might suggest updating the wiki page to remove the link to Indiana from the first list, and adding a link to this PR in the "In Progress" section. That make sense?

Author

marks commented Mar 4, 2014

OK - I removed it from the first list (I read too fast and thought that was a complete list) and it is now in the in-progress list. I see IL is in the in-progress list but is also in the master branch, which is a little confusing to a newcomer.

If this were my project, I'd put anything in progress (and working, of course) in the master branch. NBD either way though. Hope to have time to add additional data but working with their HTML gave me enough fun for one night ;)

Thanks
