How to build and run in Docker

aponb edited this page Jul 3, 2019 · 23 revisions

This document describes how to pull a pre-built version of OpenWayback from DockerHub, or build it locally from source code, and run it to serve WARC files, all within Docker. This is handy for development and testing in different environments without polluting the host machine with different versions of dependencies. The OpenWayback source code includes a Dockerfile. The generated Docker image is kept minimal, which makes it suitable for running in production as well.

Requirements

Docker (version 17.05 or later is required for building the image).
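A quick pre-flight check can confirm that the installed Docker is new enough. The sketch below is an illustration rather than part of OpenWayback: version_ge is a hypothetical helper, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# version_ge A B: succeed when version A is greater than or equal to version B.
# Hypothetical helper; relies on `sort -V` (GNU coreutils version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the local Docker client version against the 17.05 minimum.
if command -v docker >/dev/null 2>&1; then
    docker_version="$(docker version --format '{{.Client.Version}}' 2>/dev/null)"
    if version_ge "$docker_version" 17.05; then
        echo "Docker $docker_version supports building the image"
    else
        echo "Docker $docker_version is older than 17.05" >&2
    fi
else
    echo "docker not found on PATH" >&2
fi
```

The check only reads the client version, so it works even when the Docker daemon is not running.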

Official Images

OpenWayback provides up-to-date official Docker images on DockerHub that are automatically built from the source on GitHub. The latest tag points to the latest stable release, while the master tag points to an image built from the bleeding-edge code on the master branch of the repo. These two tags are overwritten whenever a corresponding build completes successfully. In contrast, versioned tags such as openwayback-2.4.0 are meant to be permanent.

Running OpenWayback Server Containers

To run a test instance of OpenWayback, we first need to prepare the environment. The default configuration of OpenWayback uses the automatic BDB Indexer and expects WARC files at ${WAYBACK_BASEDIR}/files1/ or ${WAYBACK_BASEDIR}/files2/. By default, WAYBACK_BASEDIR is set to the /data volume in the Docker image. Create the necessary directory structure on the host machine for testing and populate it with some test WARC files.

$ mkdir -p /tmp/owb/files1
$ wget -P /tmp/owb/files1/ https://github.com/iipc/openwayback-sample-overlay/raw/master/sample/warcs/example.com.warc.gz

In the above example, we created a folder for testing at /tmp/owb/files1 and used wget to download a sample WARC file named example.com.warc.gz into it. Alternatively, if you have WARC files available locally, copy them into that folder.

$ cp /path/to/sample/*.warc /tmp/owb/files1/
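Before starting a container, it can help to confirm that the folder actually holds WARC files. The following is a minimal sketch: count_warcs and the WARC_DIR variable are hypothetical names introduced here, and the check only looks at files directly inside the folder.

```shell
# count_warcs DIR: print how many *.warc and *.warc.gz files sit directly in DIR.
# Hypothetical helper for a pre-flight check; not part of OpenWayback.
count_warcs() {
    find "$1" -maxdepth 1 -type f \( -name '*.warc' -o -name '*.warc.gz' \) | wc -l | tr -d ' '
}

# Folder used in the examples above; set WARC_DIR to check a different one.
warc_dir="${WARC_DIR:-/tmp/owb/files1}"
if [ -d "$warc_dir" ]; then
    echo "$(count_warcs "$warc_dir") WARC file(s) in $warc_dir"
else
    echo "$warc_dir does not exist yet" >&2
fi
```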

With WARC files in place, we can pull the iipc/openwayback image from DockerHub and then run a Docker container with appropriately mounted volumes and port mapping. By default, the container runs the Tomcat server.

$ docker pull iipc/openwayback
$ docker container run -it --rm -v /tmp/owb:/data -p 8080:8080 iipc/openwayback

Once the WARC files are indexed, they should be ready for lookup at http://localhost:8080/wayback/. If you used the sample example.com.warc.gz file above, you can search for the http://example.com/ URL using the search form and should find a capture of it.
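Tomcat needs some time to start, so scripted tests may want to wait until the server responds before querying it. The sketch below is one way to do that, with wait_until as a hypothetical helper. It only shows that the server answers, not that indexing has finished. The usage line assumes curl is installed and is left commented out so the block does not depend on a running container.

```shell
# wait_until TRIES DELAY CMD...: run CMD up to TRIES times, sleeping DELAY
# seconds between attempts; succeed as soon as CMD succeeds.
# Hypothetical helper, not part of OpenWayback.
wait_until() {
    tries="$1"; delay="$2"; shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        # Skip the pause after the final failed attempt.
        if [ "$tries" -gt 0 ]; then sleep "$delay"; fi
    done
    return 1
}

# Example: poll the Wayback front page for up to about a minute.
# wait_until 30 2 curl -fsS http://localhost:8080/wayback/ && echo "OpenWayback is up"
```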

OpenWayback allows certain configuration overrides through environment variables that can be set when running a container, but these customizations are very limited.

WAYBACK_HOME=/usr/local/tomcat/webapps/ROOT/WEB-INF
WAYBACK_BASEDIR=/data
WAYBACK_URL_SCHEME=http
WAYBACK_URL_HOST=localhost
WAYBACK_URL_PORT=8080
WAYBACK_URL_PREFIX=http://localhost:8080
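Such variables are passed with Docker's -e flag. The sketch below prints, rather than executes, a run command that overrides the host name, so it can be reviewed first. It assumes that WAYBACK_URL_PREFIX is not derived automatically from the other variables and therefore has to be kept consistent by hand. owb_run_command and archive.example.org are hypothetical names used only for illustration.

```shell
# owb_run_command HOST: print (not execute) a docker command that sets the
# WAYBACK_URL_HOST and WAYBACK_URL_PREFIX variables for the given host name.
# Hypothetical helper; the port stays at the default 8080.
owb_run_command() {
    host="$1"
    printf 'docker container run -it --rm -v /tmp/owb:/data -p 8080:8080 \\\n'
    printf '    -e WAYBACK_URL_HOST=%s \\\n' "$host"
    printf '    -e WAYBACK_URL_PREFIX=http://%s:8080 \\\n' "$host"
    printf '    iipc/openwayback\n'
}

# archive.example.org is a placeholder for the name users will type in a browser.
owb_run_command archive.example.org
```

The printed command can be pasted into a terminal once the values look right.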

However, by strategically mounting certain volumes, it is possible to run the OpenWayback server with custom configuration files.

$ docker container run -it --rm -p 8080:8080 \
    -v /tmp/owb:/data \
    -v /path/to/custom/wayback.xml:/usr/local/tomcat/webapps/ROOT/WEB-INF/wayback.xml \
    -v /path/to/custom/CDXCollection.xml:/usr/local/tomcat/webapps/ROOT/WEB-INF/CDXCollection.xml \
    iipc/openwayback

This way of mounting configuration files is handy for testing. For production, however, it is better to create a derived image that overrides the configuration files with custom ones. For more details on custom configuration, read the basic configuration documentation.
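A derived image can be as small as a Dockerfile that starts from the official image and copies the custom files over the stock ones. The following sketch writes such a Dockerfile. BUILD_DIR and the image name my-openwayback are hypothetical, and the two XML files are assumed to be your own customized copies placed in the build folder.

```shell
# Write a minimal Dockerfile for a derived image into a build folder.
# BUILD_DIR is a hypothetical location; wayback.xml and CDXCollection.xml are
# expected to be customized copies that you place in that folder yourself.
BUILD_DIR="${BUILD_DIR:-./owb-custom}"
mkdir -p "$BUILD_DIR"

cat > "$BUILD_DIR/Dockerfile" <<'EOF'
FROM iipc/openwayback
# Replace the stock configuration with the customized files.
COPY wayback.xml /usr/local/tomcat/webapps/ROOT/WEB-INF/wayback.xml
COPY CDXCollection.xml /usr/local/tomcat/webapps/ROOT/WEB-INF/CDXCollection.xml
EOF

echo "Wrote $BUILD_DIR/Dockerfile"
# Next step, once the two XML files are in place (my-openwayback is a placeholder):
#   docker image build -t my-openwayback "$BUILD_DIR"
```

Pinning the FROM line to a versioned tag instead of the default tag would make rebuilds of the derived image more predictable.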

Building Custom OpenWayback Images

While the official DockerHub-hosted iipc/openwayback[:<TAG>] images are quicker and easier to use, they are built with whatever the latest stable versions of Maven, JDK, Tomcat, and JRE were at build time. One can locally build a custom image with a customized environment while still using the Dockerfile provided in the OpenWayback repo. Building locally is also useful for development and testing with code changes that have not yet been pushed to the upstream repo.

First, acquire the source code.

$ git clone https://github.com/iipc/openwayback.git
$ cd openwayback

Make any changes to the source code if needed. Then build the Docker image.

$ docker image build -t iipc/openwayback .

This will download dependencies, compile the code, run tests, package, and place the necessary components in appropriate places to build a minimal Docker image with the name iipc/openwayback. This process may take a while, depending on network bandwidth and processor speed. It uses Docker's multi-stage build feature to exclude the compile-time environment and dependencies from the final image, which makes the image both more secure and smaller.

By default, the source is built using the latest versions of Maven and JDK, and the image is then packaged with the latest versions of Tomcat and JRE. However, it is possible to build and package with custom combinations of these dependencies using the MAVEN_TAG and TOMCAT_TAG build arguments. These variations can be helpful for both testing and production without any changes to the Dockerfile.

$ docker image build \
    --build-arg=MAVEN_TAG=3.5-jdk-7 \
    --build-arg=TOMCAT_TAG=7-jre7-alpine \
    -t iipc/openwayback:custom .

The above command builds an image named iipc/openwayback with the tag custom. The source code is built using Maven 3.5 with JDK 7, and the built artifacts are then packaged in a small Alpine Linux image with Tomcat 7 and JRE 7. See available values of MAVEN_TAG and TOMCAT_TAG build arguments.

Now, run the OpenWayback server using this custom image and access it from a web browser.

$ docker container run -it --rm -v /tmp/owb:/data -p 8080:8080 iipc/openwayback:custom

Utilities

The Docker image contains various executable utilities, along with their dependencies, that can be used in one-off mode. The following command illustrates one possible use of cdx-indexer: a one-off container with appropriate volume mounting indexes a WARC file into a CDX file on the host machine.

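$ docker container run --rm -v /tmp/owb:/data iipc/openwayback cdx-indexer /data/files1/sample1.warc > /tmp/owb/index1.cdx

The -it flags are left out of this command because allocating a pseudo-TTY makes Docker emit CRLF line endings, which would end up in the redirected CDX file.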

Alternatively, access the bash prompt of the container to run utility scripts inside or perform debugging.

$ docker container run -it --rm -v /tmp/owb:/data iipc/openwayback bash
[CONTAINER ID]# cdx-indexer /data/files1/sample1.warc > /data/index1.cdx

IMPORTANT: If you use the sort command to sort CDX files, you must set the environment variable LC_ALL=C. This makes sort compare lines byte by byte, which matches the order OpenWayback expects in CDX indexes.
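The effect of the locale setting is easy to see on a small input. The sketch below defines sort_cdx, a hypothetical helper that merges one or more CDX files in byte order, and then shows how the C locale orders mixed-case lines.

```shell
# sort_cdx OUT IN...: merge one or more CDX files into OUT in byte order.
# Hypothetical helper; LC_ALL=C applies to this sort invocation only.
sort_cdx() {
    out="$1"; shift
    LC_ALL=C sort -o "$out" "$@"
}

# In the C locale, uppercase ASCII letters sort before lowercase ones,
# which many other locales do not do.
printf 'b\nA\na\nB\n' | LC_ALL=C sort
# prints A, B, a, b (one per line)
```

For example, sort_cdx /tmp/owb/sorted.cdx /tmp/owb/index1.cdx would write a byte-ordered copy of the index generated above.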

For more details, read the description of all packaged utility scripts.