-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Docker
Running ArchiveBox with Docker allows you to manage it in a container without exposing it to the rest of your system. ArchiveBox generally works the same in Docker as it does outside Docker. You can even use pip
-installed ArchiveBox and Docker ArchiveBox in tandem, as they both share the same data directory format.
- Overview
- Docker Compose βοΈ (recommended)
- Plain Docker
Official Docker Hub image: hub.docker.com/r/archivebox/archivebox
docker pull archivebox/archivebox:latest
Published Docker tags:
-
:latest
,:stable
(latest stable release, the default) -
:x.x
and:x.x.x
for specific versions (e.g.:0.7
or:0.7.2
) -
:dev
for unstable alpha builds (breaks often, only for developers and willing beta testers) -
:sha-xxxxxxx
for builds of specific git commits (to test or pin specific PRs or commits)
Important
Make sure Docker is installed and up-to-date before following any instructions below! β‘οΈ
To check installed version, run: docker --version
(must be >=17.04.0
)
A full docker-compose.yml
file is provided with all the extras included.
You can uncomment sections within it to enable extra features, or run the basic version as-is.
# create a folder to store your data (can be anywhere)
mkdir -p ~/archivebox/data && cd ~/archivebox
# download the compose file into the directory
curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml
# (shortcut for getting https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/stable/docker-compose.yml)
# initialize your collection and create an admin user for the Web UI (or set ADMIN_USERNAME/ADMIN_PASSWORD env vars)
docker compose run archivebox init
docker compose run archivebox manage createsuperuser
To use Sonic for improved full-text search, download this config & uncomment the sonic service in docker-compose.yml
:
# download the sonic config file into your data folder (e.g. ~/archivebox)
curl -fsSL 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/etc/sonic.cfg' > sonic.cfg
# then uncomment the sonic-related sections in docker-compose.yml
nano docker-compose.yml
# to backfill any existing archive data into the search index, run:
docker compose run archivebox update --index-only
See the wiki page on Upgrading or Merging Archives: Upgrading with Docker Compose for instructions. β‘οΈ
You can use docker compose run archivebox [subcommand]
just like the non-Docker archivebox [subcommand]
CLI.
First, make sure you're cd
'ed into the same folder as your docker-compose.yml
file (e.g. ~/archivebox
):
docker compose run archivebox help
To add an individual URL, pass it in as an arg or via stdin:
docker compose run archivebox add 'https://example.com'
# OR
echo 'https://example.com' | docker compose run -T archivebox add
To add multiple URLs at once, pipe them in via stdin, or place them in a file inside ./data/sources
so that ArchiveBox can access it from within the container:
# pipe URLs in from a file outside Docker
docker compose run -T archivebox add < ~/Downloads/example_urls.txt
# OR ingest URLs from a file mounted inside Docker
docker compose run archivebox add --depth=1 /data/sources/example_urls.txt
# OR pipe in URLs from a remote source
curl 'https://example.com/some/rss/feed.xml' | docker compose run archivebox add
docker compose run archivebox add --depth=1 'https://example.com/some/rss/feed.xml'
The --depth=1
flag tells ArchiveBox to look inside the provided source and archive all the URLs within:
# this archives just the RSS file itself (probably not what you want)
docker compose run archivebox add 'https://example.com/some/feed.rss'
# this archives the RSS feed file + all the URLs mentioned inside of it
docker compose run archivebox add --depth=1 'https://example.com/some/feed.rss'
The outputted archive data is stored in data/
(relative to the project root), or whatever folder path you specified in the docker-compose.yml
volumes:
section. Make sure the data/
folder on the host has permissions initially set to 777
so that the ArchiveBox command is able to set it to the specified OUTPUT_PERMISSIONS
config setting on the first run.
To access the results directly via the filesystem, open ./data/archive/<timestamp>/index.html
(timestamp is shown in output of previous command).
Alternatively, to use the web UI, start the server with:
docker compose up # add -d to run in the background
Then open http://127.0.0.1:8000
.
ArchiveBox running with docker compose
accepts all the same config options as other ArchiveBox distributions, see the full list of options available on the Configuration page.
The recommended way configure ArchiveBox in Docker Compose is using archivebox config --set ...
or by editing ArchiveBox.conf
.
docker compose run archivebox config --set MEDIA_MAX_SIZE=750mb
# OR
echo 'MAX_MEDIA_SIZE=750mb' >> ./data/ArchiveBox.conf
This will apply the config to all containers or archivebox instances that access the collection.
If you're only running one container, or if you want to scope config options to only apply to a particular container, you can set them in that container's environment:
section:
...
services:
archivebox:
...
environment:
- USE_COLOR=False
- SHOW_PROGRESS=False
- CHECK_SSL_VALIDITY=False
- RESOLUTION=1900,1820
- MEDIA_TIMEOUT=512000
...
You can also specify an env file via CLI when running compose using docker compose --env-file=/path/to/config.env ...
although you must specify the variables in the environment:
section that you want to have passed down to the ArchiveBox container from the passed env file.
If you want to access your archive server with HTTPS, put a reverse proxy like Nginx or Caddy in front of http://127.0.0.1:8000
to do SSL termination. Here is an example ArchiveBox nginx container + nginx.conf
that you can modify to add your preferred TLS settings.
Fetch and run the ArchiveBox Docker image to create your initial archive.
docker pull archivebox/archivebox
mkdkir -p ~/archivebox/data && cd ~/archivebox/data
docker run -it -v $PWD:/data archivebox/archivebox init --setup
(You can create a collection in any directory you want, ~/archivebox/data
is just used as an example here)
If you encounter permissions issues, you may need configure user/group ownership explicitly with PUID
/PGID
.
See the wiki page on Upgrading or Merging Archives: Upgrading with plain Docker for instructions. β‘οΈ
The Docker CLI docker run ... archivebox/archivebox [subcommand]
works just like the non-Docker archivebox [subcommand]
CLI.
First, make sure you're cd
'ed into your collection data folder (e.g. ~/archivebox/data
).
docker run -it -v $PWD:/data archivebox/archivebox help
To add a single URL, pass it as an arg or pipe it in via stdin:
docker run -it -v $PWD:/data archivebox/archivebox add 'https://example.com'
# OR
echo 'https://example.com' | docker run -i -v $PWD:/data archivebox/archivebox add
To archive multiple URLs at once, pass text containing URLs in via stdin:
docker run -i -v $PWD:/data archivebox/archivebox add < urls.txt
# OR
curl 'https://example.com/some/rss/feed.xml' | docker run -i -v $PWD:/data archivebox/archivebox add
You can also use the --depth=1
flag to tell ArchiveBox to recursively archive the URLs within a provided source.
docker run -it -v $PWD:/data archivebox/archivebox add --depth=1 'https://example.com/some/rss/feed.xml'
The docker run
-v /path/on/host:/path/inside/container
flag specifies where your data dir lives on the host.
For example to use a folder on an external USB drive (instead of the current directory $PWD
or ~/archivebox/data
):
docker run -it -v /media/USB-DRIVE/archivebox/data:/data archivebox/archivebox ...
Then to view your data, you can look in the folder on the host /media/USB-DRIVE/archivebox/data
, or use the Web UI:
docker run -it -v /media/USB_DRIVE/archivebox/data:/data -p 8000:8000 archivebox/archivebox
# then open https://127.0.0.1:8000
The easiest way is to use archivebox config --set KEY=value
or edit ./ArchiveBox.conf
(in your collection dir).
For example, this sets MEDIA_TIMEOUT=120
as a persistent setting for the collection:
docker run -it -v $PWD:/data archivebox/archivebox config --set MEDIA_TIMEOUT=120
# OR
echo 'MEDIA_TIMEOUT=120' >> ./ArchiveBox.conf
ArchiveBox in Docker also accepts config as environment variables, see more on the Configuration page.
For example, this applies FETCH_SCREENSHOT=False
to a single run (without persisting for other runs):
docker run -it -v $PWD:/data -e FETCH_SCREENSHOT=False archivebox/archivebox add 'https://example.com'
# OR
echo 'FETCH_SCREENSHOT=False' >> ./.env
docker run ... --env-file=./.env archivebox/archivebox ...
- π’ Quickstart
- π₯οΈ Install
- π³ Docker
- β‘οΈ Supported Sources
- β¬ οΈ Supported Outputs
- οΉ©Command Line
- π Web UI
- 𧩠Browser Extension
- πΎ REST API / Webhooks
- π Python API / REPL / SQL API
- βοΈ Configuration
- π¦ Dependencies
- πΏ Disk Layout
- π Security Overview
- π Developer Documentation
- Upgrading
- Setting up Storage (NFS/SMB/S3/etc)
- Setting up Authentication (SSO/LDAP/etc)
- Setting up Search (rg/sonic/etc)
- Scheduled Archiving
- Publishing Your Archive
- Chromium Install
- Cookies & Sessions Setup
- Merging Collections
- Troubleshooting
- βοΈ Web Archiving Community
- Background & Motivation
- Comparison to Other Tools
- Architecture Diagram
- Changelog & Roadmap