Implemented Crawlbot, this closes #4 and closes #1
Swader committed May 20, 2015
1 parent 4b28488 commit 45cb6db
Showing 40 changed files with 2,767 additions and 88 deletions.
2 changes: 1 addition & 1 deletion .scrutinizer.yml
@@ -18,4 +18,4 @@ checks:
tools:
external_code_coverage:
timeout: 600
runs: 3
runs: 1
2 changes: 0 additions & 2 deletions .travis.yml
@@ -1,8 +1,6 @@
language: php

php:
- 5.4
- 5.5
- 5.6
- 7.0
- hhvm
23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,29 @@
# Changelog
All notable changes will be documented in this file.

## 0.3 - May 17th, 2015

### Internal changes

- [Internal] DiffbotAware trait now responsible for registering Diffbot parent in children
- [BC Break, Internal] PHP 5.6 is now required (`...` operator)
- [Internal] Updated all API calls to HTTPS

### Features

- [Feature] Implemented Crawlbot API, added usage example to README
- [Feature] Added `Job` abstract entity with `JobCrawl` and `JobBulk` derivations. A `Job` is either a [Bulk API job](https://www.diffbot.com/dev/docs/bulk) or a [Crawl job](https://www.diffbot.com/dev/docs/crawl). A collection of jobs is the result of a Crawl or Bulk API call. When a job name is provided, the collection contains at most one item.

### Bugs

- [Bug] Fixed [#1](https://github.com/Swader/diffbot-php-client/issues/1)

### Meta

- [Repository] Added TODOs as issues in repo, linked to relevant ones in [TODO file](TODO.md).
- [CI] Stopped testing for 5.4 and 5.5, updated Travis and Scrutinizer file to take this into account
- [Tests] Fully tested Crawlbot implementation

## 0.2 - May 2nd, 2015

- added Discussion API
105 changes: 102 additions & 3 deletions README.md
@@ -10,7 +10,7 @@ Right now it only supports Analyze, Product, Image, Discussion and Article APIs,

## Requirements

Minimum PHP 5.4 because Guzzle needs it.
Minimum PHP 5.6 is required. When installed via Composer, the library pulls in Guzzle 5 as well, so having cURL installed is recommended, though not required.

## Install

@@ -59,7 +59,7 @@ Currently available [*automatic*](http://www.diffbot.com/products/automatic/) AP
- [discussion](http://www.diffbot.com/products/automatic/discussion/) (fetches discussion / review / comment threads - can be embedded in the Product or Article return data, too, if those contain any comments or discussions)
- [analyze](http://www.diffbot.com/products/automatic/analyze/) (combines all the above in that it automatically determines the right API for the URL and applies it)

Video is coming soon.
Video is coming soon. See below for instructions on Crawlbot, Search and Bulk API.

There is also a [Custom API](http://www.diffbot.com/products/custom/) like [this one](http://www.sitepoint.com/analyze-sitepoint-author-portfolios-diffbot/) - unless otherwise configured, custom APIs return instances of the Wildcard entity.

@@ -200,7 +200,7 @@ Used just like all others. There are only two differences:
The following is a usage example of my own custom API for author profiles at SitePoint:

```php
$diffbot = new Diffbot('brunoskvorc');
$diffbot = new Diffbot('my_token');
$customApi = $diffbot->createCustomAPI('http://sitepoint.com/author/bskvorc', 'authorFolioNew');

$return = $customApi->call();
@@ -213,6 +213,105 @@ foreach ($return as $wildcard) {

Of course, you can easily extend the basic Custom API class and make your own, as well as add your own Entities that perfectly correspond to the returned data. This will all be covered in a tutorial in the near future.
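For illustration only, here is a standalone sketch of that extension pattern. The `Custom` class below is a stand-in that merely mimics the library's `Swader\Diffbot\Api\Custom` (the real constructor signature may differ), and `AuthorFolio` is a hypothetical subclass pre-configuring the API name from the example above:

```php
<?php

// Stand-in mimicking the library's Custom API class (illustrative only;
// real code would extend Swader\Diffbot\Api\Custom instead)
class Custom
{
    protected $url;
    protected $name;

    public function __construct($url, $name)
    {
        $this->url  = $url;
        $this->name = $name;
    }

    public function getName()
    {
        return $this->name;
    }
}

// Hypothetical subclass that hardcodes the custom API's name,
// so callers only need to supply the URL to process
class AuthorFolio extends Custom
{
    public function __construct($url)
    {
        parent::__construct($url, 'authorFolioNew');
    }
}

$api = new AuthorFolio('http://sitepoint.com/author/bskvorc');
echo $api->getName(); // authorFolioNew
```

A real subclass would also typically pair the API with a matching custom Entity for the returned data.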

## Crawlbot and Bulk API

Basic Crawlbot support has been added to the library.
To find out more about Crawlbot and how it works, see [here](https://www.diffbot.com/dev/docs/crawl/).
I also recommend reading the [Crawlbot API docs](https://www.diffbot.com/dev/docs/crawl/api.jsp) and the [Crawlbot support topics](http://support.diffbot.com/topics/crawlbot/) so you can dive right in without being confused by the code below.

In a nutshell, Crawlbot crawls a set of seed URLs for links (even if a subdomain is passed as a seed URL, it still looks through the entire main domain and all other subdomains it can find) and then processes all the pages it finds using the API you define (defaulting to the Analyze API in automatic mode).

### List of all crawl / bulk jobs

A joint list of all your crawl / bulk jobs can be fetched via:

```php
$diffbot = new Diffbot('my_token');
$jobs = $diffbot->crawl()->call();
```

This returns a collection of all crawl and bulk jobs. Each type is represented by its own class: `JobCrawl` and `JobBulk`. It's important to note that jobs only contain information about the job - not the data. To get a job's data, use the `getDownloadUrl` method to get the URL to the dataset:

```php
$url = $job->getDownloadUrl("json");
```
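Once you have the dataset URL, fetching the results is plain PHP. A minimal sketch, assuming a JSON dataset URL obtained as above (the URL below is a made-up placeholder, and the available fields depend on the API used for processing):

```php
<?php

// Hypothetical dataset URL - in practice this comes from the job entity as shown above
$url = 'https://api.diffbot.com/v3/crawl/download/my_token-sitepoint_01_data.json';

$json = file_get_contents($url);   // fetch the full dataset
$data = json_decode($json, true);  // decode into an array of results

foreach ($data as $result) {
    // Each element is one processed page
    echo isset($result['title']) ? $result['title'] : '(no title)', PHP_EOL;
}
```

Note that the dataset constantly grows while a crawl is running, so for large crawls you may prefer to stream the download rather than load it all into memory.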

### Crawl jobs: Creating a Crawl Job

See the inline comments for a step-by-step explanation:

```php
// Create new diffbot as usual
$diffbot = new Diffbot('my_token');
// The crawlbot needs to be told which API to use to process crawled pages. This is optional - if omitted, it will be told to use the Analyze API with mode set to auto.
// The "crawl" url is a flag to tell APIs to prepare for consumption with Crawlbot, letting them know they won't be used directly.
$url = 'crawl';
$articleApi = $diffbot->createArticleAPI($url)->setDiscussion(false);
// Make a new crawl job. Optionally, pass in API instance
$crawl = $diffbot->crawl('sitepoint_01', $articleApi);
// Set seeds - these are the URLs to crawl. By default, passing a subdomain into the crawl will also crawl other subdomains on the main domain, including www.
$crawl->setSeeds(['http://sitepoint.com']);
// Call as usual - an EntityIterator collection of results is returned. When creating a job, the collection always contains exactly one job entity.
$job = $crawl->call();
// See JobCrawl class to find out which getters are available
dump($job->getDownloadUrl("json")); // outputs download URL to JSON dataset of the job's result
```

### Crawl jobs: Inspecting an existing Crawl Job

To get data about a job (this will be the data it was configured with - its flags - and not the results!), use the exact same approach as if creating a new one, only without the API and seeds:

```php
$diffbot = new Diffbot('my_token');
$crawl = $diffbot->crawl('sitepoint_01');
$job = $crawl->call();
dump($job->getDownloadUrl("json")); // outputs download URL to JSON dataset of the job's result
```

### Crawl jobs: Modifying an existing Crawl Job

While there is no way to alter a crawl job's configuration after creation, you can still perform some operations on it.

Provided you fetched a `$crawl` instance as in the above section on inspecting, you can do the following:

```php
// Force start of a new crawl round manually
$crawl->roundStart();
// Pause or unpause (0) a job
$crawl->pause();
$crawl->pause(0);
// Restart removes all crawled data but keeps the job (and settings)
$crawl->restart();
// Delete a job and all related data
$crawl->delete();
```

Note that it is not necessary to issue a `call()` after these methods.

If you would like to extract the generated API call URL for these instant-call actions, pass in the parameter `false`, like so:

```php
$crawl->delete(false);
```

You can then save the URL for your convenience and call `call` when ready to execute (if at all).

```php
$url = $crawl->buildUrl();
$url->call();
```

## Testing

Just run PHPUnit in the root folder of the cloned project.
13 changes: 9 additions & 4 deletions TODO.md
@@ -4,16 +4,21 @@ Active todos, ordered by priority

## High

- implement Crawlbot
- implement Search API
- [implement Bulk Processing Support](https://github.com/Swader/diffbot-php-client/issues/3)
- [implement Search API](https://github.com/Swader/diffbot-php-client/issues/2)

## Medium

- add streaming to Crawlbot - make it stream the result (it constantly grows)
- implement Video API (currently beta)
- [add streaming to Crawlbot](https://github.com/Swader/diffbot-php-client/issues/5)
- [implement Video API](https://github.com/Swader/diffbot-php-client/issues/6) (currently beta)
- [implement Webhook](https://github.com/Swader/diffbot-php-client/issues/7) for Bulk / Crawlbot completion
- look into adding async support via Guzzle
- consider alternative solution to 'crawl' setting in Api abstract ([#8](https://github.com/Swader/diffbot-php-client/issues/8)).
- API docs needed ([#9](https://github.com/Swader/diffbot-php-client/issues/3))

## Low

- see what can be done with the [URL report](https://www.diffbot.com/dev/docs/crawl/) - some implementation options?
- add more usage examples
- work on PhpDoc consistency ($param type vs type $param)
- get more mock responses and test against them
63 changes: 27 additions & 36 deletions src/Abstracts/Api.php
Expand Up @@ -3,6 +3,7 @@
namespace Swader\Diffbot\Abstracts;

use Swader\Diffbot\Diffbot;
use Swader\Diffbot\Traits\DiffbotAware;

/**
* Class Api
@@ -28,26 +28,29 @@ abstract class Api implements \Swader\Diffbot\Interfaces\Api
/** @var Diffbot The parent class which spawned this one */
protected $diffbot;

use DiffbotAware;

public function __construct($url)
{
$url = trim((string)$url);
if (strlen($url) < 4) {
throw new \InvalidArgumentException(
'URL must be a string of at least four characters in length'
);
}

$url = (isset(parse_url($url)['scheme'])) ? $url : "http://$url";

$filtered_url = filter_var($url, FILTER_VALIDATE_URL);
if (!$filtered_url) {
throw new \InvalidArgumentException(
'You provided an invalid URL: ' . $url
);
if (strcmp($url, 'crawl') !== 0) {
$url = trim((string)$url);
if (strlen($url) < 4) {
throw new \InvalidArgumentException(
'URL must be a string of at least four characters in length'
);
}

$url = (isset(parse_url($url)['scheme'])) ? $url : "http://$url";

$filtered_url = filter_var($url, FILTER_VALIDATE_URL);
if (!$filtered_url) {
throw new \InvalidArgumentException(
'You provided an invalid URL: ' . $url
);
}
$url = $filtered_url;
}

$this->url = $filtered_url;
$this->url = $url;
}

/**
@@ -91,14 +91,15 @@ public function call()

public function buildUrl()
{
$url = rtrim($this->apiUrl, '/');
$url = rtrim($this->apiUrl, '/').'?';

// Add Token
$url .= '?token=' . $this->diffbot->getToken();

// Add URL
$url .= '&url=' . urlencode($this->url);
if (strcmp($this->url, 'crawl') !== 0) {
// Add Token
$url .= 'token=' . $this->diffbot->getToken();

// Add URL
$url .= '&url=' . urlencode($this->url);
}

// Add Custom Fields
$fields = $this->fieldSettings;
@@ -118,18 +123,4 @@ public function buildUrl()

return $url;
}

/**
* Sets the Diffbot instance on the child class
* Used to later fetch the token, HTTP client, EntityFactory, etc
* @param Diffbot $d
* @return $this
*/
public function registerDiffbot(Diffbot $d)
{
$this->diffbot = $d;

return $this;
}

}
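For clarity, here is a standalone sketch of the query-string assembly the reworked `buildUrl()` performs for a non-crawl API call. The endpoint and token values are illustrative only:

```php
<?php

// Illustrative values - a real call uses the configured API endpoint and your token
$apiUrl = 'https://api.diffbot.com/v3/article/';
$token  = 'my_token';
$target = 'http://sitepoint.com';

// Same assembly as buildUrl(): trim trailing slash, append '?',
// then the token and the urlencoded target URL
$url = rtrim($apiUrl, '/') . '?';
$url .= 'token=' . $token;
$url .= '&url=' . urlencode($target);

echo $url;
// https://api.diffbot.com/v3/article?token=my_token&url=http%3A%2F%2Fsitepoint.com
```

When the URL is the special `'crawl'` flag, this whole token-and-url step is skipped, since the Crawlbot job supplies those parameters itself.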