example usps
We are going to build a script that will take a list of US Zip-codes and tell us what cities they belong to.
To do this, we need to do a bit of detective work. I am using Google Chrome, so I open "Developer Tools" from the menu and select the "Network" tab. This lets me see what my web browser is sending and receiving as I navigate the web. I'm going to check the box that says "Preserve log" so that my logged data will not go away between page views. There will usually be a LOT of information in there, but for now we don't care about images, scripts, stylesheets, etc., so let's select the sub-tab that filters to show only "Documents". Documents are the core web requests that send and receive the HTML, JSON, or other content, and they are the part we are most interested in automating. Now that our browser is set up to record, let's do a transaction.
- Start by going to the USPS site http://www.usps.com.
- On the home page, I see a link in the left-hand menu "Look Up a ZIP Code".
- Clicking on that link takes me to https://tools.usps.com/go/ZipLookupAction_input and there is a form in the middle of the page to fill out.
- The third tab on the form is "Cities by ZIP Code", so let's click that.
- In the form, I see one field labeled "ZIP Code"
- Entering my ZIP code and submitting that form, I see the result I'm looking for, my city is displayed on the page.
Now if we go to the list of requests in our Chrome Developer Tools, I see that there were 3 documents, ignoring any third-party advertising requests. When writing a script, you can take a few different approaches. If your goal is to test the performance of the other system, then you want to reproduce as much of the original session as possible (maybe even including images, scripts, and stylesheets). In this case, our goal is to capture some data, so we want to do as little as possible to get at our data.
Here are the 3 requests that I see:
- GET - https://www.usps.com/
- GET - https://tools.usps.com/go/ZipLookupAction_input
- GET - https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=2&companyName=&address1=&address2=&city=&state=Select&urbanCode=&postalCode=MY_ZIPCODE_HERE&zip=
Examining our requests here, I see that the first request is just for the home page, so we can ignore that. The second request takes us to the page that shows us the form. The third request is the one that actually performs the search and contains the result we are looking for. The URL for the third request has the ZIP Code that I entered embedded in its query string and uses the "GET" HTTP method, so it seems like we can just jump straight to that URL. Let's try it out. I'm going to copy that URL, close my browser (to make sure I don't have any cookies), and paste in the same URL with a different ZIP Code. HEY! Good news, I can see the new results that I expected.
NOTE: If the website or application that you are targeting uses session state, it may be important for you to make one or both of those first requests to set up your cookies. In other cases, the URLs of the website may be dynamic and change on each visit. In those cases, you need to request the first page(s), capture the URLs that you need to "click" on, and then navigate to the next step.
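To make the session-state case concrete, here is a minimal Python sketch of "make the earlier requests first so the cookies get set". It is an illustration outside of hyperpotamus, not a description of how hyperpotamus handles cookies, and the helper names `make_session` and `fetch_in_order` are made up for this example.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Build an opener that remembers cookies between requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_in_order(opener, urls, timeout=30):
    """Request each URL in order with the same opener; return the last body.

    Cookies set by earlier responses are sent with the later requests.
    """
    body = b""
    for url in urls:
        with opener.open(url, timeout=timeout) as response:
            body = response.read()
    return body


opener, jar = make_session()
# Nothing is requested until fetch_in_order is called, for example with the
# form page first and the results URL second.
```

For the USPS lookup in this walkthrough none of this is needed, since the results URL works on its own.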
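OK, so now we are almost ready to write our script, but before we do that, I notice that there are a lot of empty parameters in that URL. I wonder if we can remove some of those and still have it work? Let's try it, keeping only resultMode and postalCode: https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=2&postalCode=MY_ZIPCODE_HERE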
Sure enough, that gives me the same result. Now I think we are ready to roll.
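If you would rather not trim a long URL by hand, the empty parameters can be dropped mechanically. The following is a small Python sketch using only the standard library; the function name `drop_empty_params` is my own, and 90210 stands in for the ZIP Code placeholder. Note that it only removes parameters with empty values, so `state=Select` survives; the script in this walkthrough goes one step further and keeps only `resultMode` and `postalCode`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_empty_params(url):
    """Return the URL with every empty-valued query parameter removed."""
    parts = urlsplit(url)
    # keep_blank_values=True so that empty parameters are visible to the filter
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs if value != ""]
    return urlunsplit(parts._replace(query=urlencode(kept)))


captured = (
    "https://tools.usps.com/go/ZipLookupResultsAction!input.action"
    "?resultMode=2&companyName=&address1=&address2=&city=&state=Select"
    "&urbanCode=&postalCode=90210&zip="
)
trimmed = drop_empty_params(captured)
print(trimmed)
# https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=2&state=Select&postalCode=90210
```

Whether a non-empty parameter such as `state=Select` can also go is something you still have to test in the browser.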
Let's start by creating our input file, which has the list of the ZIP Codes that we want to find. We will make a .csv file with a single column and save it as zipcodes_input.csv.
ZIP_Code
10001
11109
79936
89049
89109
90210
95834
Great! Now let's start on our YAML script (call it usps.yml). In the URL, I'm going to replace the actual ZIP Code with an interpolation token to plug in the ZIP Code from our input file.
request: https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=2&postalCode=<%+ ZIP_Code %>
So now we have a basic script, but it doesn't do anything with the response. Let's fire up the CLI (command-line interface) and get it working first to see how it does. Let's run it with hyperpotamus usps.yml --csv zipcodes_input.csv -vvv. The --csv option tells hyperpotamus to run the script once for each record in the .csv file, initializing the session data from the field values. The -vvv option tells hyperpotamus to use triple-verbose output (to show the responses that we receive). When I run this, I see a lot of HTML coming back, one page's worth for each of those records. In the contents of that HTML, I can see the results, so we know we are sending the right requests. Now let's work on extracting the data.
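Conceptually, running a script once per CSV record looks something like the Python sketch below. This is only an illustration of the idea, not how hyperpotamus is implemented; `URL_TEMPLATE` and `urls_from_csv` are names invented for the example, and percent-encoding each value is my own precaution rather than a claim about what the interpolation token does.

```python
import csv
import io
from urllib.parse import quote

# Same request URL as the script, with a Python placeholder for the column name.
URL_TEMPLATE = (
    "https://tools.usps.com/go/ZipLookupResultsAction!input.action"
    "?resultMode=2&postalCode={ZIP_Code}"
)


def urls_from_csv(csv_file):
    """Yield one request URL per CSV record, keyed by the header row."""
    for row in csv.DictReader(csv_file):
        # Each column becomes a named value, much like per-record session data.
        encoded = {name: quote(value, safe="") for name, value in row.items()}
        yield URL_TEMPLATE.format(**encoded)


# A two-record stand-in for zipcodes_input.csv.
sample = io.StringIO("ZIP_Code\n10001\n90210\n")
urls = list(urls_from_csv(sample))
for url in urls:
    print(url)
```

To run it against the real file, pass `open("zipcodes_input.csv", newline="")` instead of the in-memory sample.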
There are many ways to extract contents from a page. Named regex matching and jQuery are two of my favorites. In this case, we are going to use jQuery. Normally I would use my Chrome inspector to try jQuery selectors on the live page, but in this instance, it looks like USPS is one of the few sites that doesn't already have jQuery included on the page. That's OK, we can eyeball it.
In reviewing the HTML from one of the responses, I see a div with an id of result-cities, and inside of there is an element with a class of std-address. Let's use that to find our data. The answer we are looking for is the text of the std-address element.
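To see what a selector like "#result-cities .std-address" picks out, here is a rough standard-library Python sketch that collects the text of every std-address element nested inside the element whose id is result-cities. It is not how hyperpotamus evaluates selectors, the class name `CityExtractor` is made up, and the sample markup is a hypothetical stand-in rather than the real USPS page.

```python
from html.parser import HTMLParser


class CityExtractor(HTMLParser):
    """Collect text of .std-address elements that sit inside #result-cities."""

    def __init__(self):
        super().__init__()
        self.open = []    # (tag, inside_container, inside_target) per open element
        self.cities = []  # one text buffer per matched element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent_container = bool(self.open) and self.open[-1][1]
        parent_target = bool(self.open) and self.open[-1][2]
        inside_container = parent_container or attrs.get("id") == "result-cities"
        classes = (attrs.get("class") or "").split()
        # A match must be a descendant of the container, not the container itself.
        starts_target = (
            parent_container and not parent_target and "std-address" in classes
        )
        if starts_target:
            self.cities.append("")
        self.open.append((tag, inside_container, parent_target or starts_target))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; this also discards
        # unclosed void elements such as <br> that were pushed in between.
        for i in range(len(self.open) - 1, -1, -1):
            if self.open[i][0] == tag:
                del self.open[i:]
                break

    def handle_data(self, data):
        if self.open and self.open[-1][2]:
            self.cities[-1] += data


def extract_cities(html):
    parser = CityExtractor()
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.cities]


# Hypothetical markup, only shaped like the structure described above.
sample_html = """
<p class="std-address">OUTSIDE THE CONTAINER</p>
<div id="result-cities">
  <p class="std-address">BEVERLY HILLS CA</p>
</div>
"""
cities = extract_cities(sample_html)
print(cities)
```

The first paragraph is skipped because it is not inside the result-cities element, which is the same distinction the descendant selector makes.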
Finally, let's add an "emit" action to output the contents, which is a great way to build a data file with your results, and then re-run the script. Here is our modified usps.yml script:
request: https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=2&postalCode=<%+ ZIP_Code %>
response:
  - jquery: "#result-cities .std-address"
    capture:
      city: text
  - emit: "<% ZIP_Code %>|<%? city|Error %>"
NOTE: In YAML, the values do not have to be quoted, but # has a special meaning, so it had to be quoted in this instance.
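As I read the emit template in the script above, each output line is the ZIP Code, a pipe, and then either the captured city or the fallback text "Error" when no city was captured. The Python sketch below spells out that reading; it is my interpretation of the template rather than documented behavior, and `emit_line` is a name invented for the example.

```python
def emit_line(session, missing="Error"):
    """Format one pipe-delimited output line from per-record values."""
    zip_code = session["ZIP_Code"]
    # Fall back to the placeholder when no city was captured (or it is empty).
    city = session.get("city") or missing
    return f"{zip_code}|{city}"


found = emit_line({"ZIP_Code": "90210", "city": "BEVERLY HILLS CA"})
not_found = emit_line({"ZIP_Code": "00000"})
print(found)      # 90210|BEVERLY HILLS CA
print(not_found)  # 00000|Error
```

A pipe-delimited line per record is easy to load back into a spreadsheet or split with standard text tools.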
hyperpotamus usps.yml --csv zipcodes_input.csv
And you can see the results (as long as the USPS site is behaving).