GitHub - bradyemerson/Open-Textbooks: A library for scraping nearly every college bookstore in the US -- from TextYard

Installation/Configuration

Installing is as simple as downloading the code to your webserver, and importing the structure+data .sql file. This file has over 1,000 working schools for you to begin with. Note that the Follett bookstore schools will require some extra work to start getting complete data from (see Documentation). You can add any of the missing 2-3k schools with online bookstores by following the steps in the documentation Wiki.

There are two necessary config files for the code to run, db_config.php and proxy_config.php. Both are derived from their respective *_template.php files. db_config just requires the credentials for logging into your MySQL dataqbase. proxy_config requires proxies to use for scraping the Follett and Barnes and Nobles, respectively.

You'll see in the content/ folder that I included api.php. That's a REST API that you can use to verify that the scraping is working correctly. See the comments at the top of that file for the GET parameters it takes. I also included runthrough.php, which runs through the api and outputs any schools where the scrapers are non-functional.

Documentation

See the Wiki which explains clearly how all the scrapers work, and how they avoid the anti-scraping techniques employed by the 6 major bookstore website technologies. Additionally, it explains how to add new schools for every bookstore type. I've also added a section on the legality of the scrapers.

Open-Source Volunteering Opportunities

There's a lot of ways this project could be improved. Some of these tasks don't even require coding skill.

Add the schools that are missing from the initial database
Translate this library into another language, esp. Python or Ruby
Help keep the scrapers up to date, as the bookstore software changes every couple of years.
Find any bugs in the code. I'm sure there are some, and there are definitely lots of places where the code can be simplified/improved.

Posting bugs/issues

Please try and provide this info:

Verification that your proxy config is working.
A complete description of when the bug happens. Does it happen with all schools, just schools of this type, or does it only happen for a particular set of schools? Does the issue depend on whether you access the data with the browser vs. my code? Does the issue depend on whether your proxy is enabled?
Complete cURL HTTP request/response logs and side-by-side complete browser HTTP request/response logs (from Chrome Dev Tools or Firebug).

Additional Resources

Getchabooks has open-sourced their code base, which includes a complete textbook comparison site with scraping of Barnes and Nobles and Follett stores. https://github.com/getchabooks/getchabooks

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
includes		includes
tmp		tmp
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
api.php		api.php
book.php		book.php
buyback.php		buyback.php
checkActive.php		checkActive.php
cron.php		cron.php
fullett.php		fullett.php
runthrough.php		runthrough.php
schema_only.sql		schema_only.sql
schema_with_data.sql		schema_with_data.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Installation/Configuration

Documentation

Open-Source Volunteering Opportunities

Posting bugs/issues

Additional Resources

About

Releases

Packages

Languages

License

bradyemerson/Open-Textbooks

Folders and files

Latest commit

History

Repository files navigation

Installation/Configuration

Documentation

Open-Source Volunteering Opportunities

Posting bugs/issues

Additional Resources

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages