HtmlDumper is a PHP library which downloads a copy of an HTML page and its assets into a target directory.
- Downloads HTML source code and transforms all URIs into relative paths, creating an updated index.html file.
- Parses HTML and fetches relevant resources
  - Stylesheets, scripts, images, videos
  - Also works with assets located within CSS files.
- Removes anchor links to external pages.
- Does not crawl pages beyond the initial URL.
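
To make the asset discovery described above more concrete, here is a minimal sketch of how asset URIs could be collected from an HTML page and from a stylesheet. It is an illustration only, not HtmlDumper's actual implementation: the function names `collectHtmlAssetUris` and `collectCssAssetUris` are hypothetical, and the XPath query and regular expression cover only the common cases.

```php
<?php

/**
 * Hypothetical helper, not part of HtmlDumper: returns the URIs of
 * stylesheets, scripts, images and videos referenced by an HTML page.
 */
function collectHtmlAssetUris(string $html): array
{
    $document = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml errors instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    // Select the attributes that point at downloadable assets.
    // Anchor links (<a href>) are deliberately left out.
    $query = '//link[@rel="stylesheet"]/@href | //script/@src'
        . ' | //img/@src | //video/@src | //source/@src';

    $uris = [];
    foreach ($xpath->query($query) as $attribute) {
        $uris[] = $attribute->value;
    }

    return array_values(array_unique($uris));
}

/**
 * Hypothetical helper, not part of HtmlDumper: returns the URIs found in
 * url(...) references inside a CSS source, skipping inline data: URIs.
 */
function collectCssAssetUris(string $css): array
{
    // Matches url(x), url("x") and url('x'); group 2 holds the URI.
    preg_match_all('/url\(\s*(["\']?)(.*?)\1\s*\)/i', $css, $matches);

    $uris = array_filter($matches[2], function (string $uri): bool {
        return $uri !== '' && strpos($uri, 'data:') !== 0;
    });

    return array_values(array_unique($uris));
}
```

A downloader built on such helpers would still have to resolve each URI against the page URL, fetch it, and rewrite the reference to a relative path; those steps are omitted here.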
$url = "https://example.com";
$targetDirectory = "/tmp/htmldump";
$downloader = new \LanguageWire\HtmlDumper\Service\PageDownloader();
if ($downloader->download($url, $targetDirectory)) {
    echo "Successfully downloaded $url to $targetDirectory";
}
- PHP 7.2+
- PHP DOM Extension
- Composer
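
The PHP-side requirements above can be verified with a short script before installing. The following is a sketch under assumptions: `findMissingRequirements` is a hypothetical helper, not something shipped with HtmlDumper, and it checks only the PHP version and the DOM extension, not whether Composer is installed.

```php
<?php

/**
 * Hypothetical pre-flight check, not part of HtmlDumper.
 * Returns a list of human-readable problems; an empty list means the
 * PHP version and DOM extension requirements are met.
 */
function findMissingRequirements(string $phpVersion, array $loadedExtensions): array
{
    $problems = [];

    if (version_compare($phpVersion, '7.2.0', '<')) {
        $problems[] = "PHP 7.2+ is required, found $phpVersion";
    }

    // Compare extension names case-insensitively.
    if (!in_array('dom', array_map('strtolower', $loadedExtensions), true)) {
        $problems[] = 'The PHP DOM extension is not loaded';
    }

    return $problems;
}

// Check the interpreter that is running this script.
foreach (findMissingRequirements(PHP_VERSION, get_loaded_extensions()) as $problem) {
    echo $problem, PHP_EOL;
}
```

Passing the version and extension list as arguments keeps the check easy to exercise with values other than those of the current interpreter.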
The recommended way to install HtmlDumper is through Composer.
composer require languagewire/html-dumper
The build/ folder contains a Dockerfile that sets up all dependencies needed for local development and runs the unit tests and linters.
Customize build/.env like this:
cd build
cp .env.template .env
nano .env
Then run ./build.sh within the build/ folder:
cd build
./build.sh
HtmlDumper is made available under the MIT License (MIT). Please see the LICENSE file for more information.