feat: Solrwayback 5.x
This commit upgrades solrwayback to version 5.1.0.

See the release notes
https://github.com/netarchivesuite/solrwayback/releases/tag/5.1.0

The purpose of this repository is to build the container images
used to run solrwayback in a containerized environment. In the
name of keeping it simple, this commit also removes all files and
abstractions that are not needed for this purpose.
maeb committed Apr 18, 2024
1 parent 67f48b7 commit 4d9df22
Showing 16 changed files with 155 additions and 375 deletions.
1 change: 0 additions & 1 deletion .gitattributes

This file was deleted.

18 changes: 5 additions & 13 deletions .github/workflows/main.yml
@@ -1,7 +1,6 @@
name: Test release

on:
push:
on: push

env:
REGISTRY: ghcr.io
@@ -15,20 +14,19 @@ jobs:
packages: write
strategy:
matrix:
image-type: [solr, solrwayback, warc-indexer]
image: [solrwayback, warc-indexer]
steps:
- name: Code checkout
uses: actions/checkout@v3
with:
lfs: true

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2

- name: Extract metadata (tags, labels, version) for Docker
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ github.repository_owner }}/${{ matrix.image-type }}
images: ${{ env.REGISTRY }}/${{ github.repository_owner }}/${{ matrix.image }}
tags: |
type=semver,pattern={{version}}
type=ref,event=branch
@@ -40,10 +38,4 @@ jobs:
push: false
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
file: Dockerfile.${{ matrix.image-type }}
build-args: |
SOLRWAYBACK_VERSION=4.4.2
SOLR_VERSION=7.7.3
SOLRWAYBACK_TOMCAT_VERSION=8.5.60
TOMCAT_TAG=8.5-jdk8-temurin-jammy
ECLIPSE_TEMURIN_TAG=8-jre
file: Dockerfile.${{ matrix.image }}
15 changes: 4 additions & 11 deletions .github/workflows/release.yml
@@ -19,20 +19,19 @@ jobs:
packages: write
strategy:
matrix:
image-type: [solr, solrwayback, warc-indexer]
image: [solrwayback, warc-indexer]
steps:
- name: Code checkout
uses: actions/checkout@v3
with:
lfs: true

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2

- name: Extract metadata (tags, labels, version) for Docker
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ github.repository_owner }}/${{ matrix.image-type }}
images: ${{ env.REGISTRY }}/${{ github.repository_owner }}/${{ matrix.image }}
tags: |
type=semver,pattern={{version}}
type=ref,event=branch
@@ -52,10 +51,4 @@ jobs:
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
file: Dockerfile.${{ matrix.image-type }}
build-args: |
SOLRWAYBACK_VERSION=4.4.2
SOLR_VERSION=7.7.3
SOLRWAYBACK_TOMCAT_VERSION=8.5.60
TOMCAT_TAG=8.5-jdk8-temurin-jammy
ECLIPSE_TEMURIN_TAG=8-jre
file: Dockerfile.${{ matrix.image }}
42 changes: 0 additions & 42 deletions .github/workflows/test.yml

This file was deleted.

32 changes: 0 additions & 32 deletions Dockerfile.solr

This file was deleted.

55 changes: 28 additions & 27 deletions Dockerfile.solrwayback
@@ -1,51 +1,52 @@
# This dockerfile configures a vanilla tomcat container
# with solrwayback installed and configured with properties
# from solrwayback bundle.
# This dockerfile builds a tomcat container including the webapps
# from the solrwayback bundle.
#
# See https://hub.docker.com/_/tomcat for details on how
# to configure tomcat.
# See https://hub.docker.com/_/tomcat for details on how to configure tomcat.

ARG SOLRWAYBACK_VERSION=4.4.2
ARG SOLRWAYBACK_TOMCAT_VERSION=8.5.60
ARG TOMCAT_TAG=8.5-jdk8-temurin-jammy
ARG SOLRWAYBACK_VERSION=5.1.0
ARG SOLRWAYBACK_TOMCAT_VERSION=9
ARG TOMCAT_TAG=9-jre17-temurin-jammy

FROM ubuntu:22.04 as solrwayback-bundle

ARG SOLRWAYBACK_VERSION
ARG SOLRWAYBACK_TOMCAT_VERSION
ARG SOLRWAYBACK_VERSION

RUN apt-get update \
&& apt-get install --quiet --assume-yes wget unzip python3
RUN apt-get update && apt-get install -y \
unzip \
wget

WORKDIR /build
COPY fetch_solrwayback_bundle.py .

RUN python3 fetch_solrwayback_bundle.py \
--solrwayback-version ${SOLRWAYBACK_VERSION} \
--destination /app
RUN unzip /app/solrwayback_package_${SOLRWAYBACK_VERSION}/apache-tomcat-${SOLRWAYBACK_TOMCAT_VERSION}/webapps/solrwayback.war \
-d /app/solrwayback/
RUN wget -q https://github.com/netarchivesuite/solrwayback/releases/download/${SOLRWAYBACK_VERSION}/solrwayback_package_${SOLRWAYBACK_VERSION}.zip
RUN unzip solrwayback_package_${SOLRWAYBACK_VERSION}.zip \
&& mkdir /webapps \
&& unzip -d /webapps/solrwayback solrwayback_package_${SOLRWAYBACK_VERSION}_MASTER/tomcat-${SOLRWAYBACK_TOMCAT_VERSION}/webapps/solrwayback.war \
&& cp solrwayback_package_${SOLRWAYBACK_VERSION}_MASTER/tomcat-${SOLRWAYBACK_TOMCAT_VERSION}/webapps/ROOT.war /webapps/ROOT.war

FROM tomcat:${TOMCAT_TAG}

ARG SOLRWAYBACK_TOMCAT_VERSION
ARG SOLRWAYBACK_VERSION
FROM tomcat:${TOMCAT_TAG}

# TODO: install solrwayback dependencies such as ffmpeg, imagemagick, tesseract-ocr, chromium-browser, etc.
# This is not necessary for solrwayback to work.
# It is only necessary if you want to use the page preview feature.
# It increases the size of the image by about 200MB.
RUN apt-get update && apt-get install -y \
chromium-browser \
chromium-codecs-ffmpeg \
&& rm -rf /var/lib/apt/lists/*

# CATALINA_HOME is the folder where catalina is installed.
# The main component of tomcat is called catalina.
# CATALINA_HOME is set by the tomcat image.

# Copy the extracted solrwayback.war file and ROOT.war to the webapps folder of tomcat.
# Copy the extracted solrwayback.war file
# We use the extracted solrwayback.war to be able to customize the web
# application (favicon, etc.) at runtime (using overlays).
COPY --from=solrwayback-bundle \
/app/solrwayback/ \
/webapps/solrwayback \
${CATALINA_HOME}/webapps/solrwayback

# Copy ROOT.war to the webapps folder of tomcat.
COPY --from=solrwayback-bundle \
/app/solrwayback_package_${SOLRWAYBACK_VERSION}/apache-tomcat-${SOLRWAYBACK_TOMCAT_VERSION}/webapps/ROOT.war \
/webapps/ROOT.war \
${CATALINA_HOME}/webapps/ROOT.war

# Set URL icon for the web application
COPY favicon.ico ${CATALINA_HOME}/webapps/solrwayback/
25 changes: 13 additions & 12 deletions Dockerfile.warc-indexer
@@ -1,25 +1,26 @@
# This Dockerfile creates a vanilla java container
# with warc-indexer from solrwayback bundle.
# This Dockerfile creates a vanilla java container with warc-indexer from the solrwayback bundle.

ARG SOLRWAYBACK_VERSION=4.4.2
ARG ECLIPSE_TEMURIN_TAG=8-jre
ARG SOLRWAYBACK_VERSION=5.1.0
ARG ECLIPSE_TEMURIN_TAG=17-jre

FROM ubuntu:22.04 as solrwayback-bundle

ARG SOLRWAYBACK_VERSION

RUN apt-get update \
&& apt-get install --quiet --assume-yes wget python3
RUN apt-get update && apt-get install -y \
unzip \
wget

WORKDIR /build
COPY fetch_solrwayback_bundle.py .

RUN python3 fetch_solrwayback_bundle.py \
--solrwayback-version ${SOLRWAYBACK_VERSION} \
--destination /app
RUN wget -q https://github.com/netarchivesuite/solrwayback/releases/download/${SOLRWAYBACK_VERSION}/solrwayback_package_${SOLRWAYBACK_VERSION}.zip
RUN mkdir /app \
&& unzip solrwayback_package_${SOLRWAYBACK_VERSION}.zip \
&& mv solrwayback_package_${SOLRWAYBACK_VERSION}_MASTER/* /app


FROM eclipse-temurin:${ECLIPSE_TEMURIN_TAG}

ARG SOLRWAYBACK_VERSION
COPY --from=solrwayback-bundle /app/indexing /opt/warc-indexer

COPY --from=solrwayback-bundle /app/solrwayback_package_${SOLRWAYBACK_VERSION}/indexing /opt/warc-indexer
ENTRYPOINT ["/opt/warc-indexer/warc-indexer.sh"]
107 changes: 105 additions & 2 deletions README.md
@@ -1,2 +1,105 @@
# solrwayback-branding
Contains branding for Solrwayback
# SolrWayback container images

This repository builds and publishes container images from [SolrWayback releases](https://github.com/netarchivesuite/solrwayback/releases).

## Images

### SolrWayback

The solrwayback container image does not include the configuration files `solrwayback.properties` and `solrwaybackweb.properties`. Use the ones from the official release bundle as a starting point to create your own.

These files must be placed directly under the `/root` folder, either by
using overlays at runtime or by building your own image:

```Dockerfile
FROM ghcr.io/nlnwa/solrwayback:latest

COPY solrwayback.properties solrwaybackweb.properties /root/
```
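
For the runtime overlay option, one possible approach is to bind-mount the two property files into `/root` when the container starts. The following is a minimal sketch under stated assumptions: the image name `ghcr.io/nlnwa/solrwayback:latest` is inferred from the registry and matrix settings in the workflows, the config directory is a hypothetical host path, and port 8080 is Tomcat's default. Setting `DOCKER=echo` prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: overlay the two property files at runtime with read-only bind mounts.
# Assumptions: the image name is inferred from the workflow configuration,
# and the directory passed as $1 is a hypothetical host path.

# Set DOCKER=echo to print the command instead of running it.
DOCKER="${DOCKER:-docker}"
IMAGE="${IMAGE:-ghcr.io/nlnwa/solrwayback:latest}"

run_solrwayback() {
  config_dir="$1"
  "$DOCKER" run --rm \
    --publish 8080:8080 \
    --volume "$config_dir/solrwayback.properties:/root/solrwayback.properties:ro" \
    --volume "$config_dir/solrwaybackweb.properties:/root/solrwaybackweb.properties:ro" \
    "$IMAGE"
}
```

Calling `run_solrwayback "$PWD/config"` would start the container with both files visible under `/root`. Bind mounts need absolute host paths, so a relative directory should be expanded first.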

### Warc indexer

```shell
$ docker run ghcr.io/nlnwa/warc-indexer -h

warc-indexer.sh

Parallel processing of WARC files using webarchive-discovery from UKWA:
https://github.com/ukwa/webarchive-discovery

The scripts keeps track of already processed WARCs by keeping the output
logs from processing of each WARC. These are stored in the folder
/opt/warc-indexer/status


Usage: ./warc-indexer.sh [warc|warc-folder]*


Index 2 WARC files:

./warc-indexer.sh mywarcfile1.warc.gz mywarcfile2.warc.gz

Index all WARC files in "folder_with_warc_files" (recursive descend) using
20 threads (this will take 20GB of memory):

THREADS=20 ./warc-indexer.sh folder_with_warc_files

Index all WARC files in "folder_with_warc_files" (recursive descend) using
20 threads and with an alternative Solr as receiver:

THREADS=20 SOLR_URL="http://ourcloud.internal:8123/solr/netarchive" ./warc-indexer.sh folder_with_warc_files

Note:
Each thread starts its own Java process with -Xmx1024M.
Make sure that there is enough memory on the machine.

Tweaks:
SOLR_URL: The receiving Solr end point, including collection
Value: http://localhost:8983/solr/netarchivebuilder

SOLR_CHECK: Check whether Solr is available before processing
Value: true

SOLR_COMMIT: Whether a Solr commit should be issued after indexing to
flush the buffers and make the changes immediately visible
Value: true

THREADS: The number of concurrent processes to use for indexing
Value: 2

STATUS_ROOT: Where to store log files from processing. The log files are
also used to track which WARCs has been processed
Value: /opt/warc-indexer/status

TMP_ROOT: Where to store temporary files during processing
Value: /opt/warc-indexer/status/tmp

INDEXER_JAR: The location of the warc-indexer Java tool
Value: /opt/warc-indexer/warc-indexer-3.3.1-jar-with-dependencies.jar

INDEXER_MEM: Memory allocation for each builder job
Value: 1024M

INDEXER_CONFIG: Configuration for the warc-indexer Java tool
Value: /opt/warc-indexer/config3.conf

INDEXER_CUSTOM: Custom command line options for the warc-indexer tool
Value: ""
Sample: "--collection yearly2020"
```
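
Because the image's entrypoint is `warc-indexer.sh`, the tweaks listed in the help text can be passed as environment variables and the WARC paths as arguments. The sketch below is an illustration, not a recommended setup: the `/warcs` mount point and the `warc-indexer-status` volume name are hypothetical, and the heap estimate simply multiplies the thread count by the per-thread default of 1024M mentioned in the help text. Setting `DOCKER=echo` prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: index a host folder of WARC files with the warc-indexer image.
# Assumptions: /warcs and the warc-indexer-status volume are hypothetical names.

# Set DOCKER=echo to print the command instead of running it.
DOCKER="${DOCKER:-docker}"
IMAGE="${IMAGE:-ghcr.io/nlnwa/warc-indexer}"

# Rough Java heap estimate in MB: one process per thread,
# each started with -Xmx1024M unless INDEXER_MEM is changed.
estimate_heap_mb() {
  threads="$1"
  per_thread_mb="${2:-1024}"
  echo $((threads * per_thread_mb))
}

# Usage: index_warcs <host-warc-dir> <solr-url> [threads]
index_warcs() {
  warc_dir="$1"
  solr_url="$2"
  threads="${3:-2}"
  "$DOCKER" run --rm \
    --env "THREADS=$threads" \
    --env "SOLR_URL=$solr_url" \
    --volume "$warc_dir:/warcs:ro" \
    --volume warc-indexer-status:/opt/warc-indexer/status \
    "$IMAGE" /warcs
}
```

Mounting a volume at `/opt/warc-indexer/status` keeps the processing logs between runs, so already indexed WARCs are skipped the next time. The default `SOLR_URL` points at `localhost`, which inside a container refers to the container itself, so an explicit URL is usually needed.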

### Solr

Use the [official image](https://hub.docker.com/_/solr).

To run Solr in Kubernetes, see the [official Kubernetes operator for Apache Solr](https://github.com/apache/solr-operator).

Previously this repository contained a Dockerfile that built a [Solr](https://solr.apache.org/guide/solr/latest/index.html) container image (based on the official image) with the "netarchivebuilder" configset from the [SolrWayback bundle version 4.4.2](https://github.com/netarchivesuite/solrwayback/releases/tag/4.4.2).

As of SolrWayback version 5.x, Solr is started in cloud mode, which stores configsets in [ZooKeeper](https://zookeeper.apache.org/), and the SolrWayback bundle does not include a configset (besides the default). See the [Solr 9 configset in the warc indexer repository](https://github.com/ukwa/webarchive-discovery/tree/master/warc-indexer/src/main/solr/solr9/discovery/conf) for a starting point to create your own.
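
In cloud mode a configset has to be uploaded to ZooKeeper before a collection can use it. A minimal sketch of that step with the Solr control script follows. It assumes the `solr` script is on the `PATH` (as in the official image), that the ZooKeeper address and the local `conf` directory are supplied by you, and that the collection is named `netarchivebuilder` to match the indexer's default `SOLR_URL`. Option spellings can differ between Solr versions, so check `solr zk upconfig --help` for the release you run. Setting `SOLR=echo` prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: upload a configset to ZooKeeper, then create a collection from it.
# Assumptions: the configset directory and ZooKeeper host are hypothetical inputs.

# Set SOLR=echo to print the commands instead of running them.
SOLR="${SOLR:-solr}"

# Usage: upload_configset <conf-dir> <zk-host> [configset-name]
upload_configset() {
  conf_dir="$1"
  zk_host="$2"
  name="${3:-netarchivebuilder}"
  "$SOLR" zk upconfig -z "$zk_host" -n "$name" -d "$conf_dir"
}

# Usage: create_collection [name]
# Creates a collection that uses the configset with the same name.
create_collection() {
  name="${1:-netarchivebuilder}"
  "$SOLR" create -c "$name" -n "$name"
}
```

For example, `upload_configset ./discovery/conf zookeeper:2181` followed by `create_collection` would register the configset and create a matching collection.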

## TODO

- Docker Compose file and examples.
- Kubernetes deployment files and examples.
- Rootless versions of the images.