
Scraper Documentation

bsgreenb edited this page Feb 13, 2012 · 10 revisions

We know which books are required for which classes based on scraping of the schools’ bookstore websites. These websites let students enter their course details and then provide them with a list of associated books. The book information provided is typically ISBN, Title, Author, Edition, Publisher, and Year. These sites also indicate the Necessity of each book, for example “Required”, or “Recommended.” We also scrape the Bookstore Prices (which can be specified as New, Used, or Rentals) from every bookstore system.

Course details most commonly follow the pattern: Campus -> Term -> Department -> Course -> Section, but there are frequently more complex cases. Follett bookstores have a hierarchy which can extend as far as: Campus -> Program -> Term -> Division -> Department -> Course -> Section. We have built the scraping system with consistency in mind; when a bookstore does not have Divisions (which is usually the case), we place a dummy Division in the database nonetheless to maintain simplicity.

We are able to scrape bookstore data from nearly every bookstore in the country (> 99%), because nearly every bookstore uses one of 6 software packages on their websites:

  • Follett Virtual Bookstore
  • ePOS
  • MBS Direct
  • CampusHub
  • bncollege
  • Neebo

We have scrapers for each of these 6 major systems, which completely emulate user traffic. To the bookstore, our requests look just like a user traversing the Term/Department/Course/Section dropdowns, and eventually viewing their required books. All scraping is performed live via PHP’s libcurl functions. Our system caches bookstore responses for 1 week. You can modify the cache length in bookstore_functions.php.
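As a rough illustration of the one-week cache rule, the freshness check could look like the sketch below. The constant and function names are hypothetical and are not taken from bookstore_functions.php; only the one-week length comes from the text above.

```php
<?php
// Hypothetical constant mirroring the 1-week cache length described above.
const CACHE_TTL_SECONDS = 7 * 24 * 60 * 60;

// Returns true while a cached bookstore response is still younger than the TTL.
// $cachedAt and $now are Unix timestamps.
function cache_is_fresh(int $cachedAt, int $now, int $ttl = CACHE_TTL_SECONDS): bool
{
    return ($now - $cachedAt) < $ttl;
}

// Usage: only hit the bookstore again when the cached copy has expired.
$cachedAt = time() - 3600; // cached one hour ago
if (!cache_is_fresh($cachedAt, time())) {
    // re-scrape here
}
```

Passing the current time in as an argument, rather than calling time() inside the function, keeps the check easy to test.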

IP blocks are avoided using proxies. For low-traffic uses, we recommend Proxymesh, Amazon EC2, or Rackspace. Proxymesh works by giving you a rotating Rackspace IP address. For higher-traffic uses, we recommend either TrustedProxies or HideMyAss. Both give you thousands of IPs, and in our experience you can scrape large volumes of data with little risk of getting IP blocked. For extra measure, the scraping code also randomizes the user-agent (see url_functions.php).
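A minimal sketch of how a proxy and a user-agent might be rotated per request is shown below. The function names and the example pools are assumptions made for illustration; the actual logic in url_functions.php may differ. CURLOPT_PROXY and CURLOPT_USERAGENT are standard libcurl options.

```php
<?php
// Picks one proxy and one user-agent at random from caller-supplied pools.
// Both pools are hypothetical; real values come from your proxy provider.
function pick_request_identity(array $proxies, array $userAgents): array
{
    return array(
        'proxy'      => $proxies[array_rand($proxies)],
        'user_agent' => $userAgents[array_rand($userAgents)],
    );
}

// Builds a curl option array that routes the request through the chosen proxy.
function curl_options_for(string $url, array $identity): array
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_PROXY          => $identity['proxy'],      // "host:port"
        CURLOPT_USERAGENT      => $identity['user_agent'],
        CURLOPT_RETURNTRANSFER => true,                    // return the body as a string
    );
}
```

The resulting array can be handed to curl_setopt_array() on a handle created with curl_init().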

In the following sections, we'll describe how we scrape each of the major systems. My goal here is that you could quickly and easily implement the scraping technology in the language of your choice. You can see how I implemented it in bookstore_functions.php, from which the following code snippets are derived.

##Follett Virtual Bookstore

Before you can start fetching class data from a Follett bookstore, you must initialize a session on the storefront’s main page, as a regular user would. If you skip this step, your requests will be throttled.

Once a session has been initialized, it’s time to start scraping. This is what each of the subsequent class data scraping requests looks like:

  • Getting Terms --
'http://www.bkstr.com/webapp/wcs/stores/servlet/LocateCourseMaterialsServlet?demoKey=d&programId='. urlencode($valuesArr['Program_Value']) . '&requestType=TERMS&storeId=' . urlencode($valuesArr['Store_Value']);
  • Getting Divisions --
'http://www.bkstr.com/webapp/wcs/stores/servlet/LocateCourseMaterialsServlet?requestType=DIVISIONS&storeId='. urlencode($valuesArr['Store_Value']) . '&campusId='. urlencode($valuesArr['Campus_Value']) .'&demoKey=d&programId='. urlencode($valuesArr['Program_Value']) .'&termId='. urlencode($valuesArr['Term_Value']);
  • Getting Departments --
'http://www.bkstr.com/webapp/wcs/stores/servlet/LocateCourseMaterialsServlet?demoKey=d&divisionName='. urlencode($valuesArr['Division_Value']) .'&campusId='. urlencode($valuesArr['Campus_Value']) .'&programId='. urlencode($valuesArr['Program_Value']) .'&requestType=DEPARTMENTS&storeId='. urlencode($valuesArr['Store_Value']) .'&termId='. urlencode($valuesArr['Term_Value']);
  • Getting Courses --
'http://www.bkstr.com/webapp/wcs/stores/servlet/LocateCourseMaterialsServlet?demoKey=d&divisionName='. urlencode($valuesArr['Division_Value']).'&campusId='. urlencode($valuesArr['Campus_Value']) .'&programId='. urlencode($valuesArr['Program_Value']) .'&requestType=COURSES&storeId='. urlencode($valuesArr['Store_Value']) .'&termId='. urlencode($valuesArr['Term_Value']) .'&departmentName='. urlencode($valuesArr['Department_Code']). '&_=';
  • Getting Sections --
'http://www.bkstr.com/webapp/wcs/stores/servlet/LocateCourseMaterialsServlet?demoKey=d&divisionName='. urlencode($valuesArr['Division_Value']) .'&programId='. urlencode($valuesArr['Program_Value']) .'&requestType=SECTIONS&storeId='. urlencode($valuesArr['Store_Value']) .'&termId='. urlencode($valuesArr['Term_Value']) .'&departmentName='. urlencode($valuesArr['Department_Code']). '&courseName='. urlencode($valuesArr['Course_Code']) .'&_=';

The responses are JSON scripts which contain the next set of dropdown options.
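Since the five requests above differ only in requestType and in which IDs they carry, they could be produced by a single builder. The sketch below is one possible shape, not the code in bookstore_functions.php; the function and constant names are hypothetical, while the parameter names are taken from the URLs above.

```php
<?php
const FOLLETT_SERVLET = 'http://www.bkstr.com/webapp/wcs/stores/servlet/LocateCourseMaterialsServlet';

// Builds a LocateCourseMaterialsServlet URL for one level of the hierarchy.
// $params maps query parameter names (storeId, termId, ...) to raw values.
function follett_locate_url(string $requestType, array $params): string
{
    $query = array_merge(
        array('demoKey' => 'd', 'requestType' => $requestType),
        // Drop parameters that do not apply at this level.
        array_filter($params, function ($v) { return $v !== null && $v !== ''; })
    );
    // http_build_query() URL-encodes each value, like urlencode() above.
    return FOLLETT_SERVLET . '?' . http_build_query($query);
}

echo follett_locate_url('TERMS', array('programId' => '123', 'storeId' => '10001'));
```

The trailing `&_=` that appears on the Courses and Sections URLs is left out of this sketch.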

Next comes the part that eluded TextYard's competitors-- getting the class-items. This is scraped from a separate page which Follett created for schools to link to, but does not link to directly from their website:

'http://www.bkstr.com/webapp/wcs/stores/servlet/booklookServlet?bookstore_id-1='. urlencode($valuesArr['Follett_HEOA_Store_Value']) .'&term_id-1='. urlencode($valuesArr['Follett_HEOA_Term_Value']) .'&div-1='. urlencode($valuesArr['Division_Value']) . '&dept-1='. urlencode($valuesArr['Department_Value']) . '&course-1='. urlencode($valuesArr['Course_Value']) .'&section-1='. urlencode($valuesArr['Class_Value']);

The Follett_HEOA_Store_Value is the store_id associated with the given bookstore. You can extract it from the image URL of the logo on that bookstore’s site. The Follett_HEOA_Term_Value, though, is the hardest part of adding a Follett school to the database. It’s set individually by each school, so there are several ways to discover what it is:

  • View the source of the school’s class schedule page; the term values used there often match.
  • Use Google. If someone has publicly linked to the booklookServlet before, then its Follett_HEOA_Term_Value will be in that link.
  • Try generic/frequent values like “Fall+2011”.
  • Many schools use integer values, so you can use wget or curl to iterate through a range like 1-10,000.

We use each of these methods to identify the bookstore’s Follett_HEOA_Term_Value. Once we have identified it the first time, subsequent values will usually follow the same naming scheme. For example, if they use Fall+2011 the first time, they will almost surely use Spring+2012 for the next.

We test that a Follett_HEOA_Term_Value is correct by testing it out on the booklook url. To illustrate, the URL with the incorrect Follett_HEOA_Term_Value at

http://www.bkstr.com/webapp/wcs/stores/servlet/booklookServlet?bookstore_id-1=303&term_id-1=INCORRECT_VALUE

says **** Unable to find the requested term ****. But when we enter a valid Follett_HEOA_Term_Value it does not give that message:

http://www.bkstr.com/webapp/wcs/stores/servlet/booklookServlet?bookstore_id-1=303&term_id-1=Spring%202011
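The discovery and validation steps above could be sketched roughly as follows. All three function names are hypothetical, and the candidate list only covers the season-name and integer patterns mentioned earlier; it is an illustration of the approach, not the author's tooling.

```php
<?php
const FOLLETT_BOOKLOOK = 'http://www.bkstr.com/webapp/wcs/stores/servlet/booklookServlet';

// URL used to probe a single candidate term value for a store.
function booklook_probe_url(string $storeId, string $termValue): string
{
    return FOLLETT_BOOKLOOK . '?bookstore_id-1=' . urlencode($storeId)
        . '&term_id-1=' . urlencode($termValue);
}

// A candidate is treated as valid when the error message quoted above is absent.
function term_value_looks_valid(string $responseBody): bool
{
    return stripos($responseBody, 'Unable to find the requested term') === false;
}

// Season names for a given year first, then a caller-chosen integer range.
function candidate_term_values(int $year, int $from, int $to): array
{
    $candidates = array('Fall ' . $year, 'Spring ' . $year);
    for ($i = $from; $i <= $to; $i++) {
        $candidates[] = (string) $i;
    }
    return $candidates;
}
```

Note that urlencode() writes the space as `+` rather than `%20`; both decode to the same value. An empty or blocked response would also pass this check, so it is worth confirming that the body actually contains book data before storing the value.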

Moving on, as was previously mentioned, Follett schools have the most complex way of organizing course data. We currently address this problem by hardcoding Program_Value into the database, as it is not subject to change. Division is loaded like any other dropdown, and other systems are forced to store a dummy value for it.

One last thing worth noting is that Follett has the most stringent IP-based blocks. See the beginning of this Wiki for some recommended proxy services.

##ePOS

ePOS is relatively straightforward. It’s worth noting that we use the HTML-only version of their site to simplify our scraping. Also, because the user-agent is sent as a GET parameter in addition to the headers, we make a point of varying our user-agent. Here is what our requests look like:

  • Getting terms --
'bookstoreurl.com/ePOS?form=shared3%2ftextbooks%2fno_jscript%2fmain.html&agent='. $user_agent;
  • Getting departments --
'bookstoreurl.com/ePOS?wpd=1&step=2&listtype=begin&form=shared3%2Ftextbooks%2Fno_jscript%2Fmain.html&agent='. $user_agent .'&TERM='. urlencode($valuesArr['Term_Value']) .'&Go=Go';
  • Getting courses --
'bookstoreurl.com/ePOS?wpd=1&step=3&listtype=begin&form=shared3%2Ftextbooks%2Fno_jscript%2Fmain.html&agent='. $user_agent .'&TERM='. urlencode($valuesArr['Term_Value']) .'&department='. urlencode($valuesArr['Department_Value']) .'&Go=Go';
  • Getting sections --
'bookstoreurl.com/ePOS?wpd=1&step=4&listtype=begin&form=shared3%2Ftextbooks%2Fno_jscript%2Fmain.html&agent='. $user_agent .'&TERM='. urlencode($valuesArr['Term_Value']) .'&department='. urlencode($valuesArr['Department_Value']) .'&course='. urlencode($valuesArr['Course_Value']) .'&Go=Go';
  • Getting class-books --
'bookstoreurl.com/ePOS?wpd=1&step=5&listtype=begin&form=shared3%2Ftextbooks%2Fno_jscript%2Fmain.html&agent='. $user_agent .'&TERM='. urlencode($valuesArr['Term_Value']) .'&department='. urlencode($valuesArr['Department_Value']) .'&course='. urlencode($valuesArr['Course_Value']) .'&section='. urlencode($valuesArr['Class_Value']) .'&Go=Go'

The responses are in pure HTML. We process this HTML (and all other markup) with a combination of XPath and built-in PHP DOM functions.
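As an illustration of the XPath approach, the sketch below pulls the value/label pairs out of a dropdown. The select name and the sample markup are assumptions; real ePOS pages will need their own XPath expression.

```php
<?php
// Returns array(value => label) for the options of the named <select>.
function select_options(string $html, string $selectName): array
{
    $doc = new DOMDocument();
    // Bookstore HTML is rarely valid; collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $options = array();
    foreach ($xpath->query('//select[@name="' . $selectName . '"]/option') as $option) {
        $value = $option->getAttribute('value');
        if ($value !== '') { // skip placeholder entries such as "Select..."
            $options[$value] = trim($option->textContent);
        }
    }
    return $options;
}

// Hypothetical sample markup, for illustration only.
$sample = '<form><select name="department"><option value="">Select</option>'
        . '<option value="ACCT">Accounting</option>'
        . '<option value="BIOL">Biology</option></select></form>';
print_r(select_options($sample, 'department'));
```

Be aware that PHP converts purely numeric string keys to integers, so numeric option values come back as integer keys.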

##MBS Direct

When we scrape MBS Direct sites we save ourselves a lot of difficulty by scraping from their mobile sites, which can be found at bookstoreurl.com/[mobile/]textbooks.aspx. Every MBS website has a mobile page built in. To completely emulate student traffic, we use an iPhone user agent on these pages.

The most difficult thing about MBS is that they track and enforce state by correlating sessions with hidden input values and GET parameters. That is to say, requests must occur in the order that a user would make them. You cannot begin by requesting the courses for Department X. You must first request the ancestral terms and departments. So our code always makes such sequential requests, making sure to pass the state-dependent values from one request to the next.

One variation that some MBS Direct mobile sites have that others don’t is a button confirmation page. When our initial request returns a button, we “click” it.

This is what all of the requests for MBS look like (as PHP curl_request options):

$options = array(CURLOPT_URL => 'http://bookstoreurl.com/mobile/textbooks.aspx', CURLOPT_POST => true, CURLOPT_POSTFIELDS => '__VIEWSTATE='. urlencode($mbs_viewstate) . '&btnRegular=Browse+Course+Listing', CURLOPT_USERAGENT => $useragent);
  • Getting departments --
$options = array(CURLOPT_URL => $mbs_url, CURLOPT_POST => true, CURLOPT_POSTFIELDS => '__VIEWSTATE=' . urlencode($mbs_viewstate) .'&__EVENTTARGET='. $mbs_term_name .'&__EVENTARGUMENT=&'. $mbs_term_name .'='. urlencode($valuesArr['Term_Value']) . '&'. $mbs_dept_name .'=0&'. $mbs_course_name .'=0&'. $mbs_section_name .'=0', CURLOPT_USERAGENT => $useragent);
  • Getting courses --
$options = array(CURLOPT_URL => $mbs_url, CURLOPT_POST => true, CURLOPT_POSTFIELDS => $mbs_term_name .'='. urlencode($valuesArr['Term_Value']) . '&'. $mbs_dept_name .'='. urlencode($valuesArr['Department_Value']) .'&'. $mbs_course_name .'=0&'. $mbs_section_name .'=0&__VIEWSTATE=' . urlencode($mbs_viewstate), CURLOPT_USERAGENT => $useragent);
  • Getting sections --
$options = array(CURLOPT_URL => $mbs_url, CURLOPT_POST => true, CURLOPT_POSTFIELDS => $mbs_term_name .'='. urlencode($valuesArr['Term_Value']) . '&'. $mbs_dept_name .'='. urlencode($valuesArr['Department_Value']) .'&'. $mbs_course_name .'='. urlencode($valuesArr['Course_Value']) .'&'. $mbs_section_name .'=0&__VIEWSTATE=' . urlencode($mbs_viewstate), CURLOPT_USERAGENT => $useragent);
  • Getting class-books --
$options = array(CURLOPT_URL => $mbs_url, CURLOPT_POST => true, CURLOPT_POSTFIELDS => $mbs_term_name .'='. urlencode($valuesArr['Term_Value']) . '&'. $mbs_dept_name .'='. urlencode($valuesArr['Department_Value']) .'&'. $mbs_course_name .'='. urlencode($valuesArr['Course_Value']) .'&'. $mbs_section_name .'='. urlencode($valuesArr['Class_Value']) .'&__VIEWSTATE=' . urlencode($mbs_viewstate), CURLOPT_USERAGENT => $useragent);

MBS gives HTML responses, which we parse using XPath.
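The state passing described above comes down to reading the hidden __VIEWSTATE input from each response and sending it back with the next request. A minimal sketch follows; the function names are hypothetical, and the dropdown field names vary per site (they are the $mbs_*_name variables above).

```php
<?php
// Reads the hidden __VIEWSTATE value from an MBS response, or null if absent.
function extract_viewstate(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//input[@name="__VIEWSTATE"]/@value');
    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}

// Builds the POST body for the next request from the current dropdown
// selections plus the viewstate carried over from the previous response.
function mbs_post_fields(string $viewstate, array $dropdowns): string
{
    return http_build_query(array_merge($dropdowns, array('__VIEWSTATE' => $viewstate)));
}
```

Each response's viewstate replaces the previous one, so the value is re-extracted after every request in the chain.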

##CampusHub

CampusHub is relatively straightforward to scrape. Everything comes from its textbooks_xml.asp XHR page. The requests look like:

  • Initial request to get terms --
'http://bookstoreurl.com/textbooks_xml.asp'
  • Getting departments --
'http://bookstoreurl.com/textbooks_xml.asp?control=campus&campus='. $campus . '&term='. $term .'&t='. time()
  • Getting courses --
'http://bookstoreurl.com/textbooks_xml.asp?control=department&dept='. $valuesArr['Department_Value'] . '&term='. $term .'&t='. time()
  • Getting sections --
'http://bookstoreurl.com/textbooks_xml.asp?control=course&course='. $valuesArr['Course_Value'] . '&term='. $term .'&t='. time()
  • Getting class-books --
'http://bookstoreurl.com/textbooks_xml.asp?control=section&section='. $valuesArr['Class_Value'] . '&t='. time()

We get a <select> response from the CampusHub XHR which we can parse using PHP DOM functions.
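A small sketch of reading such a response using only PHP's DOM functions is given below. The sample markup is an assumption about what the fragment looks like, and the function name is hypothetical.

```php
<?php
// Returns a list of array('value' => ..., 'label' => ...) entries, one per
// non-empty <option> in the fragment.
function campushub_options(string $fragment): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate fragment-style markup
    $doc->loadHTML($fragment);

    $result = array();
    foreach ($doc->getElementsByTagName('option') as $option) {
        $value = $option->getAttribute('value');
        if ($value === '') {
            continue; // placeholder row
        }
        $result[] = array('value' => $value, 'label' => trim($option->textContent));
    }
    return $result;
}

// Hypothetical response fragment, for illustration only.
print_r(campushub_options(
    '<select name="s"><option value="">Pick</option><option value="101">ENGL 101</option></select>'
));
```

Returning a list of pairs, rather than keying by value, keeps numeric option values as strings.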

##bncollege

bncollege requires that you initialize a session before you request the desired field. The requests for class fields are XHR, while the request for class-items is a POST. The requests look like this:

  • Initializing a session and getting terms --
'http://bookstoreurl.com/webapp/wcs/stores/servlet/TBWizardView?catalogId=10001&storeId='. $valuesArr['Store_Value'] .'&langId=-1'
  • Getting departments --
'http://bookstoreurl.com/webapp/wcs/stores/servlet/TextBookProcessDropdownsCmd?campusId='. $valuesArr['Campus_Value'] .'&termId='. $valuesArr['Term_Value'] .'&deptId=&courseId=&sectionId=&storeId='. $valuesArr['Store_Value'] .'&catalogId=10001&langId=-1&dojo.transport=xmlhttp&dojo.preventCache='. time()
  • Getting courses --
'http://bookstoreurl.com/webapp/wcs/stores/servlet/TextBookProcessDropdownsCmd?campusId='. $valuesArr['Campus_Value'] .'&termId='. $valuesArr['Term_Value'] .'&deptId='. $valuesArr['Department_Value'] .'&courseId=&sectionId=&storeId='. $valuesArr['Store_Value'] . '&catalogId=10001&langId=-1&dojo.transport=xmlhttp&dojo.preventCache='. time()
  • Getting sections --
'http://bookstoreurl.com/webapp/wcs/stores/servlet/TextBookProcessDropdownsCmd?campusId='. $valuesArr['Campus_Value'] .'&termId='. $valuesArr['Term_Value'] .'&deptId='. $valuesArr['Department_Value'] .'&courseId='. $valuesArr['Course_Value'] .'&sectionId=&storeId='. $valuesArr['Store_Value'] . '&catalogId=10001&langId=-1&dojo.transport=xmlhttp&dojo.preventCache='. time()
  • Getting class-books --
$options = array(CURLOPT_URL => $url .'TBListView', CURLOPT_REFERER => $referer, CURLOPT_POST => true, CURLOPT_POSTFIELDS => 'storeId='. $valuesArr['Store_Value'] .'&langId=-1&catalogId=10001&savedListAdded=true&clearAll=&viewName=TBWizardView&removeSectionId=&mcEnabled=N&section_1='. $valuesArr['Class_Value'] .'&numberOfCourseAlready=0&viewTextbooks.x='. $x .'&viewTextbooks.y='. $y .'&sectionList=newSectionNumber')
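The departments, courses, and sections requests above share one parameter set; each deeper level simply fills in one more ID and leaves the rest blank. A sketch of a single builder for all three follows. The function name is hypothetical, and $baseUrl stands in for the store's servlet path.

```php
<?php
// Builds a TextBookProcessDropdownsCmd URL. Leave $deptId and $courseId empty
// for departments, pass $deptId for courses, and pass both for sections.
function bncollege_dropdown_url(string $baseUrl, array $v, string $deptId = '', string $courseId = ''): string
{
    $query = array(
        'campusId'          => $v['Campus_Value'],
        'termId'            => $v['Term_Value'],
        'deptId'            => $deptId,
        'courseId'          => $courseId,
        'sectionId'         => '',
        'storeId'           => $v['Store_Value'],
        'catalogId'         => '10001',
        'langId'            => '-1',
        'dojo.transport'    => 'xmlhttp',
        'dojo.preventCache' => time(), // cache-buster, as in the URLs above
    );
    // Empty strings are kept by http_build_query(), giving "deptId=&courseId=".
    return $baseUrl . 'TextBookProcessDropdownsCmd?' . http_build_query($query);
}
```

Unlike the original string concatenation, http_build_query() URL-encodes every value.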

It's worth noting that bncollege employs IP-based blocks. Avoid them by using proxies. See the beginning of this Wiki for some recommended proxy services.

##Neebo

Neebo returns HTML at each successive step, which we can parse with XPath.

  • Initializes the session, and is also used to get terms:
http://www.neebo.com/storeurl
  • Get the Departments
'http://www.neebo.com/Course/GetDepartments?termId=' . urlencode($valuesArr['Term_Value']);
  • Get the Courses AND Sections -- Note how this differs from the previous bookstores, where these were retrieved separately:
'http://www.neebo.com/Course/GetCourses?departmentId=' .urlencode($valuesArr['Department_Value']);
  • Get class-books:
'http://www.neebo.com/CourseMaterials/AddSection?sectionId=' .urlencode($valuesArr['Class_Value']);

##Contact us with any questions

We've been through every possible problem in getting this data. Feel free to shoot me a line at [email protected] if you need help with anything.
