Skip to content

jw81/pup-scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 

Repository files navigation

pup-scraper

Dockerized pup used for scraping HTML

Usage

If you want to build the Docker image yourself and run it ::

  • run docker build -t my-pup-scraper . to build the Docker image
  • run docker run --rm -e URL='http://www.google.com' -e FILTER='body' my-pup-scraper to create a Docker container and run pup
    • --rm
      • removes/deletes the container after it finishes running
    • -e URL='http://www.google.com
      • sets an environment variable named URL to the value http://www.google.com (change this to whatever url you need to scrape)
    • -e FILTER='body'
      • sets an environment variable named FILTER to the value body (change this to whatever HTML/CSS selectors you need to scrape)

Or if you want to just run the Docker image stored in DockerHub ::

  • run docker run --rm -e URL='http://www.google.com' -e FILTER='body' jeffreywallace81/pup-scraper

Notes

  • If you want to ignore all of the HTML tags and just extract the raw text, you can run the command like this ::
    • docker run --rm -e URL='http://www.google.com' -e FILTER='body text{}' my-pup-scraper
  • For an example of a more complex HTML/CSS selector, go look at the default value for the FILTER environment variable in the Dockerfile. REF

About

Dockerized pup used for scraping HTML

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published