Skip to content

Headless chrome scraper API based on puppeteer. Improved upon l0co's work.

Notifications You must be signed in to change notification settings

mortee/docker-puppeteer-api

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Puppeteer API

Headless chrome scraper API based on puppeteer.

This image contains:

  1. Google Chrome working in headless mode.
  2. ExpressJS application exposing single endpoint allowing to get contents of the website using puppeteer and headless chrome.
  3. Alternative command line interface for the above (CLI).

Features:

  1. Fetching content from URL generated by real browser on load event.
  2. Fetching content from URL generated by real browser on element appearance in DOM. This is especially useful when page you want to get loads some dynamic data through Ajax and pushes it into DOM after load event.
  3. Simple REST API endpoint with hash-based security.
  4. CLI to get data with single command.

How to use

With REST API

Start docker container using following command:

docker run -d -p 8000:8000 --restart unless-stopped --name puppeteer-api -e "SALT=abcdef" l0coful/puppeteer-api

Where:

  1. The container port 8000 is bound to 8000 host port making API accessible at http://localhost:8000 (also through the network).
  2. The SALT environment variable provided should be set to some random string and will be used for requests security.
  3. Alternatively, you can specify a file location to read the salt value from in the SALT_FILE environment variable.
  4. If you want to specify the User-Agent string to use, you can provide it in the USER_AGENT environment variable.

You can check if everything is OK tracing container logs:

$ docker logs puppeteer-api
[...]
Using 'abcdef' as salt
[local] Scraper API version 2.0.0 is listening on port: 8000

Fetching URL content

TL;DR

POST following JSON to http://localhost:8000/scrape:

{
	"url": "http://example.com",
	"selector": "h1",
	"hash": "129f2756eac7b62b5b7f428175e5a4e3"
}

Where:

  1. url is the URL to fetch.
  2. selector is an optional selector. If provided the content will be returned only after this selector returns non-empty elements array. Otherwise the content will be returned on page onload event.
  3. hash is the request signature done by md5(`${url}:${SALT}`).
On page load

By default you can fetch DOM untouched by javascript using CURL:

$ curl -s http://example.com | grep "h1"
<h1>Example Domain</h1>

To do the same but using this API you first need to calculate you URL hash using previously used SALT and URL you want to fetch:

$ echo -n "http://example.com:abcdef" | md5sum
129f2756eac7b62b5b7f428175e5a4e3 -

Having this signature you can now use the /fetch API endopint for the URL:

$ curl \
	-s \
	-X POST \
	-H "Content-Type: application/json" \
	-d '{"url": "http://example.com","hash":"129f2756eac7b62b5b7f428175e5a4e3"}' \
	http://localhost:8000/fetch \
| grep "h1"

<h1>Example Domain</h1>

In logs you can see that HTML has been returned after load event:

$ docker logs puppeteer-api
[...]
[68825600] requesting from: ::ffff:172.17.0.1 to fetch: http://example.com
[68825600] starting chrome browser
[68825600] going to: http://example.com
[68825600] page loaded, resolving content immediately
[68825600] closing chrome browser
[68825600] sending data with: 1262 bytes
On element appearance in DOM

In the second mode the API returns scraped content content only after element with given CSS selector appears in the DOM. This can be done this way using the /scrape API endopint (for h1 selector):

$ curl \
	-s \
	-X POST \
	-H "Content-Type: application/json" \
	-d '{"url": "http://example.com","selector":"h1","hash":"129f2756eac7b62b5b7f428175e5a4e3"}' \
	http://localhost:8000/scrape 

<h1>Example Domain</h1>

Related logs:

$ docker logs puppeteer-api
[51b45bd7] requesting from: ::ffff:172.17.0.1 to fetch: http://example.com
[51b45bd7] starting chrome browser
[51b45bd7] going to: http://example.com
[51b45bd7] page loaded, setting 1000 ms refresh interval
[51b45bd7] element with selector: 'h1' appeared, resolving content
[51b45bd7] clearing refresh interval
[51b45bd7] closing chrome browser
[51b45bd7] sending data with: 1262 bytes

With command line

The same options as above are executable from the command line in the host OS using temporary docker container.

On page load:

$ docker run --rm -it --entrypoint "/bin/bash" l0coful/puppeteer-api puppeteer fetch http://example.com | grep "h1"
<h1>Example Domain</h1>

On h1 element appearance in DOM:

$ docker run --rm -it --entrypoint "/bin/bash" l0coful/puppeteer-api puppeteer scrape -s "h1" http://example.com
<h1>Example Domain</h1>

Go Client Library

https://sneak.berlin/go/puppeteerapiclient

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	_ "github.com/joho/godotenv/autoload"
	"sneak.berlin/go/puppeteerapiclient"
)

const url = "https://news.ycombinator.com"
const selector = ".athing .title .titleline"

func main() {
	apiURL := os.Getenv("API_URL")
	apiSalt := os.Getenv("API_SALT")

	client := puppeteerapiclient.NewClient(apiURL, apiSalt)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)

	defer cancel()

	response, err := client.Scrape(ctx, url, selector)

	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(response.Content)
}

How to build

docker build --tag l0coful/puppeteer-api .

About

Headless chrome scraper API based on puppeteer. Improved upon l0co's work.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • JavaScript 73.5%
  • Shell 18.6%
  • Dockerfile 7.9%