Getting CNAM from a Web Page

lgaetz edited this page Jul 25, 2012 · 42 revisions

Getting Caller ID Names (CNAM) from the internet is a pain in the butt and, worse, frequently requires a fair bit of maintenance as web pages change their output formats. Enough people ask how to create, debug and understand web lookup sources that this page was created to give them a push in the right direction. These notes are valid for Superfecta 2.x only, though the concepts will work with other versions with minor differences in the details. I develop sources by editing files directly on the server and testing numbers via the debug feature in Superfecta. No reloads are necessary between edits; just save the file and trigger the lookup.

Big Picture

Individual lookup source modules for Superfecta 2.x are stored as individual PHP files in the folder /<webroot>/admin/modules/superfecta/bin/, which on CentOS is /var/www/html/admin/modules/superfecta/bin/. Even with only a vague understanding of programming languages, looking through one of these files will give you a sense of what is being accomplished. For web lookups there are two main parts: getting the contents of a URL, and searching through those contents for the CNAM. The phone number is available in the PHP code as the variable $thenumber, and when the name is found you assign it to the variable $caller_id.

get_url_contents

Looking at one of the existing web lookup sources, somewhere in the code you will see a line similar to this:

$url = "<URL structured to return CNAM results>";

A simple example would be a site that permits direct searching with a URL of the form:

$url = "http://numberlooker.com/search?$thenumber";

Here is a complex example of a URL searching the Australian yellow pages with Google (because the site doesn't have reverse lookup itself) and where the phone number is broken into smaller parts with variables $num1, $num2 and $num3:

$url = "http://www.google.com/search?num=30&hl=en&lr=&safe=off&as_qdr=all&q=%22".$num1."+".$num2."+".$num3."%22+site%3Awww.yellowpages.com.au+-site%3Awww.yellowpages.com.au%2Fsearch&btnG=Search";
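The page does not say how $num1, $num2 and $num3 are derived, so here is a minimal sketch of one way it might be done. The sample number and the 2-4-4 digit split are assumptions for illustration only, and the URL is shortened to the parts that matter:

```php
<?php
// Hypothetical sample number; in a real source Superfecta supplies $thenumber.
$thenumber = "0298765432";

// Assumed split: 2-digit area code followed by two 4-digit groups.
$num1 = substr($thenumber, 0, 2);
$num2 = substr($thenumber, 2, 4);
$num3 = substr($thenumber, 6, 4);

// %22 is a URL-encoded double quote and + is a URL-encoded space,
// so Google sees the quoted phrase "02 9876 5432".
$query = "%22" . $num1 . "+" . $num2 . "+" . $num3 . "%22";
$url = "http://www.google.com/search?q=" . $query . "+site%3Awww.yellowpages.com.au";
```

Searching for the number as a quoted phrase makes Google match the digit groups in that order, which is why the number is broken up the way the target site prints it.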

The actual URL capture then follows, done by the function get_url_contents, which looks like this: $value = get_url_contents($url); Superfecta version 2.2.5 and earlier has no logging ability, so the developer has to do it themselves. During development, I add the following lines immediately after the get_url_contents line to dump the URL contents to a text file I can then examine:

$myFile = "/var/www/html/admin/modules/superfecta/bin/superfecta.txt";
$fh = fopen($myFile, 'w') or die("can't open file");
fwrite($fh, $value);
fclose($fh);
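A variation on the lines above, as a sketch only: the same dump done with file_put_contents, written so that a failed write is logged rather than stopping the lookup with die(). The helper name superfecta_debug_dump is hypothetical and not part of Superfecta, and the temp directory path is used here just so the example runs anywhere:

```php
<?php
// Hypothetical helper (not part of Superfecta): dump text to a file for inspection.
function superfecta_debug_dump($path, $contents)
{
    // file_put_contents returns the number of bytes written, or false on failure.
    $written = @file_put_contents($path, $contents);
    if ($written === false) {
        // Log the problem instead of aborting the caller ID lookup.
        error_log("superfecta debug: can't write to $path");
        return false;
    }
    return true;
}

// Stand-in for the result of get_url_contents($url).
$value = "<html>example page body</html>";

// Assumed location for this example; on a real server you would use the bin/ path above.
$myFile = sys_get_temp_dir() . "/superfecta.txt";
$ok = superfecta_debug_dump($myFile, $value);
```

Remember to remove any debug dump before putting a source into production, since it rewrites the file on every call.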

Formerly I would open the URL in a web browser and view the page source in order to see the URL contents. This usually works, but not always: the get_url_contents function may return slightly different characters than what the browser source displays. The online regex tester myregextester.com also allows you to get the HTML output directly from a URL, which is handy, but you need to confirm it is seeing the same thing get_url_contents is, so be sure to compare its output to the superfecta.txt output. Use the contents of the /bin/superfecta.txt file to hone the URL and make sure you are returning text that is actually useful. For the narrative that follows, I will assume that get_url_contents is returning a block of text that includes the following:

<big long string of characters>...www.truelocal.com.au/business/adHJKhHlkjHJ/business_name/asdfaskldjfasdjf/<business_number>/ ...<another big long string of characters>

So from the possibly huge block of returned characters, we want to identify a unique string that contains the CNAM. In the above example, we want to extract the business name between the two / characters.

Searching through the URL contents

There are several ways to search through a block of text for a specific string of characters, but most of the later web sources do this by means of a regular expression or regex. Regexes are very complex constructions which take more than a bit of time to master. Once you get the hang of it though, it is really the only way to go. Here are a few resources you can use to get introduced and up to speed on regexes:

Once you are ready to actually construct a regex, an online tester is a must. There are many, but this one is my favorite: http://myregextester.com. With this site you can toggle between preg_match and preg_match_all, which I found handy, as well as toggle an explanation of your regex. It also displays the text properly, which I discovered many online testers don't, and it will load the contents of a URL as discussed above, which streamlines things a bit.

Here is a sample regex that might be used to search the returned URL contents displayed above:

$pattern = "/www\.truelocal\.com\.au\/business\/.*?\/(.*?)\/.*?$thenumber/i";

This assigns the regex between the quotes to the variable $pattern. Strings between double quotes in PHP automatically get $variable names substituted with the variable's value. In this example the regex starts and ends with the / delimiter character (the i after the closing / is a modifier), and you will notice numerous backslash \ escape characters. In a regex there are metacharacters which have special meanings; an example is the . (period). To match a literal . in the text, as in the URL www.truelocal.com.au, it is necessary to escape the metacharacter by preceding it with a backslash. The same goes for any literal / inside the pattern, because / is being used as the delimiter.

The other critical part is the part between the ( ). This is a sub regex (also called a capture group), and for our purposes it defines the area in the text where we would find the CNAM.

There is another tip that is rather important for constructing regexes for this purpose, and that is the use of the .* combination. The . is the metacharacter that matches any character (except a newline, by default), and the * means repeat the preceding item zero or more times. Used together, they stand in for a series of any characters of unknown length, which comes in very handy when searching through HTML. The problem is that the * will take as many characters as it can and still get a valid match, which usually gives unpredictable results for this application. The solution is to modify the * with a ?, which tells it to use as few characters as possible. I have gotten into the habit of using .*? (dot, star, question) for matching a series of unknown characters of unknown length.

The regex above is looking for a series of characters starting with www.truelocal.com.au/business/, followed by a string of unknown characters of unknown length, followed by a /, followed by another string of unknown characters of unknown length, this time in ( ), which indicates the name we are looking for, followed by a /, then some more random characters, then the phone number we are searching for. The trailing i makes the search case insensitive.

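To make the difference between .* and .*? concrete, here is a small sketch run against a made-up string. The sample text, business name and phone number are all invented for illustration:

```php
<?php
// Invented sample text shaped like the truelocal example.
$text = "/business/abc123/Joes_Pizza/xyz789/0298765432/";

// Greedy: each .* takes as much as it can, so the captured group
// ends up being the LAST slash-delimited field, not the name.
preg_match('/\/business\/.*\/(.*)\//', $text, $greedy);

// Lazy: each .*? takes as little as it can, so the captured group
// is the field right after the first unknown field.
preg_match('/\/business\/.*?\/(.*?)\//', $text, $lazy);

echo $greedy[1] . "\n"; // prints 0298765432
echo $lazy[1] . "\n";   // prints Joes_Pizza
```

Both patterns are valid and both match; only the lazy version lands on the field we actually want.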
Once the regex is defined, you do the search with: preg_match($pattern, $value, $match); This fills in the array variable $match, and if you structured the regex properly with a regex and a single sub regex, $match will be an array of two values, such as:

$match[0] = "www.truelocal.com.au/business/adHJKhHlkjHJ/business_name/asdfaskldjfasdjf/<business_number>"
$match[1] = "business_name"
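Putting the pieces together, a minimal sketch of the search step might look like this. The sample text, name and number are invented, and wrapping $thenumber in preg_quote is my own precaution (not something the existing sources are known to do) in case the number ever contains a regex metacharacter such as +:

```php
<?php
// Invented inputs; in a real source Superfecta supplies $thenumber
// and $value comes from get_url_contents($url).
$thenumber = "0298765432";
$value = "junk...www.truelocal.com.au/business/adHJKhHlkjHJ/Joes_Pizza/"
       . "asdfaskldjfasdjf/0298765432/ ...more junk";

// preg_quote escapes regex metacharacters (and the / delimiter) in the number.
$pattern = "/www\.truelocal\.com\.au\/business\/.*?\/(.*?)\/.*?"
         . preg_quote($thenumber, "/") . "/i";

$caller_id = "";
// preg_match returns 1 on a match, 0 on no match, and false on error,
// so only read $match when it reports exactly 1.
if (preg_match($pattern, $value, $match) === 1) {
    $caller_id = $match[1]; // the sub regex, i.e. the business name
}
```

Checking the return value first avoids reading $match[1] when nothing matched, which would otherwise raise an undefined index notice.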

Suppose you manage to structure a regex that is perfect, but it matches several places in the URL results, and you know that you always want to ignore the first match and take the second. In this case you want to use preg_match_all, like this: preg_match_all($pattern, $value, $match); Now the $match variable will be a 2D array. With the default ordering, the first index selects the whole match (0) or the sub regex (1), and the second index selects which match, counting from 0. For the second match, the contents might look something like this:

$match[0][1] = "www.truelocal.com.au/business/adHJKhHlkjHJ/business_name/asdfaskldjfasdjf/<business_number>"
$match[1][1] = "business_name"
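Here is a short sketch of taking the second match with preg_match_all, using its default ordering, where $match[1] holds the sub regex text of every match in turn. The sample text, with a sponsored listing ahead of the real one, is invented:

```php
<?php
// Invented inputs: the same number appears in two listings on the page.
$thenumber = "0298765432";
$value = "www.truelocal.com.au/business/aaa/Sponsored_Ad/bbb/0298765432/ ... "
       . "www.truelocal.com.au/business/ccc/Joes_Pizza/ddd/0298765432/";

$pattern = "/www\.truelocal\.com\.au\/business\/.*?\/(.*?)\/.*?$thenumber/i";

// preg_match_all returns the number of full matches found.
$count = preg_match_all($pattern, $value, $match);

$caller_id = "";
if ($count >= 2) {
    // First index 1 = the sub regex, second index 1 = the second match.
    $caller_id = $match[1][1];
}
```

Testing $count before indexing keeps the source from failing on pages that return only one listing.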

From this point it is just a matter of cleaning up the desired value of $match[ ] with:

  • strip_tags
  • trim
  • urldecode <-- This is important because phones don't display the CNAM the same as browsers.

and assigning a value: $caller_id = urldecode(trim(strip_tags($match[ ])));
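As a quick illustration of what each cleanup step does, here is the same chain applied to an invented captured value that still carries HTML tags, padding and URL encoding:

```php
<?php
// Invented example of a raw captured value.
$match = array("...whole match...", " <b>Joe%27s+Pizza</b> ");

$no_tags = strip_tags($match[1]); // " Joe%27s+Pizza "
$trimmed = trim($no_tags);        // "Joe%27s+Pizza"
$caller_id = urldecode($trimmed); // %27 becomes ' and + becomes a space

echo $caller_id . "\n"; // prints Joe's Pizza
```

Note that urldecode also turns every + into a space, so it should only be applied to text that really is URL encoded.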