Commit
This commit upgrades solrwayback to version 5.1.0. See the release notes: https://github.com/netarchivesuite/solrwayback/releases/tag/5.1.0

The purpose of this repository is to build the container images used to run solrwayback in a containerized environment. To keep things simple, this commit also removes all unnecessary files and abstractions that are not needed for this purpose.
Showing 16 changed files with 155 additions and 375 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,51 +1,52 @@ | ||
# This dockerfile configures a vanilla tomcat container | ||
# with solrwayback installed and configured with properties | ||
# from solrwayback bundle. | ||
# This dockerfile builds a tomcat container including the webapps | ||
# from the solrwayback bundle. | ||
# | ||
# See https://hub.docker.com/_/tomcat for details on how | ||
# to configure tomcat. | ||
# See https://hub.docker.com/_/tomcat for details on how to configure tomcat. | ||
|
||
ARG SOLRWAYBACK_VERSION=4.4.2 | ||
ARG SOLRWAYBACK_TOMCAT_VERSION=8.5.60 | ||
ARG TOMCAT_TAG=8.5-jdk8-temurin-jammy | ||
ARG SOLRWAYBACK_VERSION=5.1.0 | ||
ARG SOLRWAYBACK_TOMCAT_VERSION=9 | ||
ARG TOMCAT_TAG=9-jre17-temurin-jammy | ||
|
||
FROM ubuntu:22.04 as solrwayback-bundle | ||
|
||
ARG SOLRWAYBACK_VERSION | ||
ARG SOLRWAYBACK_TOMCAT_VERSION | ||
ARG SOLRWAYBACK_VERSION | ||
|
||
RUN apt-get update \ | ||
&& apt-get install --quiet --assume-yes wget unzip python3 | ||
RUN apt-get update && apt-get install -y \ | ||
unzip \ | ||
wget | ||
|
||
WORKDIR /build | ||
COPY fetch_solrwayback_bundle.py . | ||
|
||
RUN python3 fetch_solrwayback_bundle.py \ | ||
--solrwayback-version ${SOLRWAYBACK_VERSION} \ | ||
--destination /app | ||
RUN unzip /app/solrwayback_package_${SOLRWAYBACK_VERSION}/apache-tomcat-${SOLRWAYBACK_TOMCAT_VERSION}/webapps/solrwayback.war \ | ||
-d /app/solrwayback/ | ||
RUN wget -q https://github.com/netarchivesuite/solrwayback/releases/download/${SOLRWAYBACK_VERSION}/solrwayback_package_${SOLRWAYBACK_VERSION}.zip | ||
RUN unzip solrwayback_package_${SOLRWAYBACK_VERSION}.zip \ | ||
&& mkdir /webapps \ | ||
&& unzip -d /webapps/solrwayback solrwayback_package_${SOLRWAYBACK_VERSION}_MASTER/tomcat-${SOLRWAYBACK_TOMCAT_VERSION}/webapps/solrwayback.war \ | ||
&& cp solrwayback_package_${SOLRWAYBACK_VERSION}_MASTER/tomcat-${SOLRWAYBACK_TOMCAT_VERSION}/webapps/ROOT.war /webapps/ROOT.war | ||
|
||
FROM tomcat:${TOMCAT_TAG} | ||
|
||
ARG SOLRWAYBACK_TOMCAT_VERSION | ||
ARG SOLRWAYBACK_VERSION | ||
FROM tomcat:${TOMCAT_TAG} | ||
|
||
# TODO: install solrwayback dependencies such as ffmpeg, imagemagick, tesseract-ocr, chromium-browser, etc. | ||
# This is not necessary for solrwayback to work. | ||
# It is only necessary if you want to use the page preview feature. | ||
# It increases the size of the image by about 200MB. | ||
RUN apt-get update && apt-get install -y \ | ||
chromium-browser \ | ||
chromium-codecs-ffmpeg \ | ||
&& rm -rf /var/lib/apt/lists/* | ||
|
||
# CATALINA_HOME is the folder where catalina is installed. | ||
# The main component of tomcat is called catalina. | ||
# CATALINA_HOME is set by the tomcat image. | ||
|
||
# Copy the extracted solrwayback.war file and ROOT.war to the webapps folder of tomcat. | ||
# Copy the extracted solrwayback.war file | ||
# We use the extracted solrwayback.war to be able to customize the web | ||
# application (favicon, etc.) at runtime (using overlays). | ||
COPY --from=solrwayback-bundle \ | ||
/app/solrwayback/ \ | ||
/webapps/solrwayback \ | ||
${CATALINA_HOME}/webapps/solrwayback | ||
|
||
# Copy ROOT.war to the webapps folder of tomcat. | ||
COPY --from=solrwayback-bundle \ | ||
/app/solrwayback_package_${SOLRWAYBACK_VERSION}/apache-tomcat-${SOLRWAYBACK_TOMCAT_VERSION}/webapps/ROOT.war \ | ||
/webapps/ROOT.war \ | ||
${CATALINA_HOME}/webapps/ROOT.war | ||
|
||
# Set URL icon for the web application | ||
COPY favicon.ico ${CATALINA_HOME}/webapps/solrwayback/ |
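The new build stage relies on a fixed naming pattern for the release asset and for the folder it unpacks into. The sketch below only expands the same variables outside Docker so the resulting URL and paths can be checked by eye. It assumes the `ARG` defaults from the new Dockerfile; the names `bundle_zip`, `bundle_url`, and `webapps_dir` are hypothetical helpers and not part of the commit.

```shell
#!/bin/sh
# Values mirror the ARG defaults in the new Dockerfile.
SOLRWAYBACK_VERSION=5.1.0
SOLRWAYBACK_TOMCAT_VERSION=9

# Release asset that the build stage downloads with wget.
bundle_zip="solrwayback_package_${SOLRWAYBACK_VERSION}.zip"
bundle_url="https://github.com/netarchivesuite/solrwayback/releases/download/${SOLRWAYBACK_VERSION}/${bundle_zip}"

# Folder inside the unpacked bundle where the Dockerfile expects the WAR files.
webapps_dir="solrwayback_package_${SOLRWAYBACK_VERSION}_MASTER/tomcat-${SOLRWAYBACK_TOMCAT_VERSION}/webapps"

echo "$bundle_url"
echo "${webapps_dir}/solrwayback.war"
echo "${webapps_dir}/ROOT.war"
```

If a later release changes the `_MASTER` suffix or the `tomcat-<version>` folder name, the `unzip` and `cp` steps in the build stage would be the ones to fail, so this is the pattern to re-check when bumping `SOLRWAYBACK_VERSION`.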
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,25 +1,26 @@ | ||
# This Dockerfile creates a vanilla java container | ||
# with warc-indexer from solrwayback bundle. | ||
# This Dockerfile creates a vanilla java container with warc-indexer from the solrwayback bundle. | ||
|
||
ARG SOLRWAYBACK_VERSION=4.4.2 | ||
ARG ECLIPSE_TEMURIN_TAG=8-jre | ||
ARG SOLRWAYBACK_VERSION=5.1.0 | ||
ARG ECLIPSE_TEMURIN_TAG=17-jre | ||
|
||
FROM ubuntu:22.04 as solrwayback-bundle | ||
|
||
ARG SOLRWAYBACK_VERSION | ||
|
||
RUN apt-get update \ | ||
&& apt-get install --quiet --assume-yes wget python3 | ||
RUN apt-get update && apt-get install -y \ | ||
unzip \ | ||
wget | ||
|
||
WORKDIR /build | ||
COPY fetch_solrwayback_bundle.py . | ||
|
||
RUN python3 fetch_solrwayback_bundle.py \ | ||
--solrwayback-version ${SOLRWAYBACK_VERSION} \ | ||
--destination /app | ||
RUN wget -q https://github.com/netarchivesuite/solrwayback/releases/download/${SOLRWAYBACK_VERSION}/solrwayback_package_${SOLRWAYBACK_VERSION}.zip | ||
RUN mkdir /app \ | ||
&& unzip solrwayback_package_${SOLRWAYBACK_VERSION}.zip \ | ||
&& mv solrwayback_package_${SOLRWAYBACK_VERSION}_MASTER/* /app | ||
|
||
|
||
FROM eclipse-temurin:${ECLIPSE_TEMURIN_TAG} | ||
|
||
ARG SOLRWAYBACK_VERSION | ||
COPY --from=solrwayback-bundle /app/indexing /opt/warc-indexer | ||
|
||
COPY --from=solrwayback-bundle /app/solrwayback_package_${SOLRWAYBACK_VERSION}/indexing /opt/warc-indexer | ||
ENTRYPOINT ["/opt/warc-indexer/warc-indexer.sh"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,105 @@ | ||
# solrwayback-branding | ||
Contains branding for Solrwayback | ||
# SolrWayback container images | ||
|
||
This repository builds and publishes container images from [SolrWayback releases](https://github.com/netarchivesuite/solrwayback/releases). | ||
|
||
## Images | ||
|
||
### Solrwayback | ||
|
||
The solrwayback container image does not include the configuration files `solrwayback.properties` and `solrwaybackweb.properties`. Use the ones from the official release bundle as a starting point to create your own. | ||
|
||
These files must be placed directly under the `/root` folder using either | ||
overlays at runtime or by building your own image: | ||
|
||
```Dockerfile | ||
FROM github.com/nlnwa/solrwayback-adaption:latest | ||
|
||
# With more than one source, the COPY destination must be a directory ending in "/". | ||
COPY solrwayback.properties solrwaybackweb.properties /root/ | ||
``` | ||
|
||
### Warc indexer | ||
|
||
```shell | ||
$ docker run ghcr.io/nlnwa/warc-indexer -h | ||
|
||
warc-indexer.sh | ||
|
||
Parallel processing of WARC files using webarchive-discovery from UKWA: | ||
https://github.com/ukwa/webarchive-discovery | ||
|
||
The scripts keeps track of already processed WARCs by keeping the output | ||
logs from processing of each WARC. These are stored in the folder | ||
/opt/warc-indexer/status | ||
|
||
|
||
Usage: ./warc-indexer.sh [warc|warc-folder]* | ||
|
||
|
||
Index 2 WARC files: | ||
|
||
./warc-indexer.sh mywarcfile1.warc.gz mywarcfile2.warc.gz | ||
|
||
Index all WARC files in "folder_with_warc_files" (recursive descend) using | ||
20 threads (this will take 20GB of memory): | ||
|
||
THREADS=20 ./warc-indexer.sh folder_with_warc_files | ||
|
||
Index all WARC files in "folder_with_warc_files" (recursive descend) using | ||
20 threads and with an alternative Solr as receiver: | ||
|
||
THREADS=20 SOLR_URL="http://ourcloud.internal:8123/solr/netarchive" ./warc-indexer.sh folder_with_warc_files | ||
|
||
Note: | ||
Each thread starts its own Java process with -Xmx1024M. | ||
Make sure that there is enough memory on the machine. | ||
|
||
Tweaks: | ||
SOLR_URL: The receiving Solr end point, including collection | ||
Value: http://localhost:8983/solr/netarchivebuilder | ||
|
||
SOLR_CHECK: Check whether Solr is available before processing | ||
Value: true | ||
|
||
SOLR_COMMIT: Whether a Solr commit should be issued after indexing to | ||
flush the buffers and make the changes immediately visible | ||
Value: true | ||
|
||
THREADS: The number of concurrent processes to use for indexing | ||
Value: 2 | ||
|
||
STATUS_ROOT: Where to store log files from processing. The log files are | ||
also used to track which WARCs has been processed | ||
Value: /opt/warc-indexer/status | ||
|
||
TMP_ROOT: Where to store temporary files during processing | ||
Value: /opt/warc-indexer/status/tmp | ||
|
||
INDEXER_JAR: The location of the warc-indexer Java tool | ||
Value: /opt/warc-indexer/warc-indexer-3.3.1-jar-with-dependencies.jar | ||
|
||
INDEXER_MEM: Memory allocation for each builder job | ||
Value: 1024M | ||
|
||
INDEXER_CONFIG: Configuration for the warc-indexer Java tool | ||
Value: /opt/warc-indexer/config3.conf | ||
|
||
INDEXER_CUSTOM: Custom command line options for the warc-indexer tool | ||
Value: "" | ||
Sample: "--collection yearly2020" | ||
``` | ||
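The help text above ties memory use to the thread count: each thread is its own Java process with `-Xmx1024M`, so 20 threads need about 20 GB. A minimal sketch of that arithmetic follows, assuming the default `INDEXER_MEM` of 1024M; the variable names `indexer_mem_mb` and `heap_mb` are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Rough heap estimate for a warc-indexer run: one Java process per thread.
THREADS=20
indexer_mem_mb=1024   # numeric part of the default INDEXER_MEM=1024M

heap_mb=$((THREADS * indexer_mem_mb))
heap_gb=$((heap_mb / 1024))

echo "${THREADS} threads x ${indexer_mem_mb} MB = ${heap_mb} MB (about ${heap_gb} GB) of Java heap"
```

`-Xmx` caps only the Java heap, so actual memory use per process will be somewhat higher; treat the figure as a lower bound when setting container memory limits.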
|
||
### Solr | ||
|
||
Use the [official image](https://hub.docker.com/_/solr). | ||
|
||
To run Solr in kubernetes see [Official Kubernetes operator for Apache Solr](https://github.com/apache/solr-operator). | ||
|
||
Previously this repository contained a Dockerfile that built a [Solr](https://solr.apache.org/guide/solr/latest/index.html) container image (based on the official image) with the "netarchivebuilder" configset from the [SolrWayback bundle version 4.4.2](https://github.com/netarchivesuite/solrwayback/releases/tag/4.4.2). | ||
|
||
As of SolrWayback version 5.x, Solr is started in cloud mode, which stores the configsets in [ZooKeeper](https://zookeeper.apache.org/), and the SolrWayback bundle does not include a configset (besides the default). See the [Solr 9 configset in the warc indexer repository](https://github.com/ukwa/webarchive-discovery/tree/master/warc-indexer/src/main/solr/solr9/discovery/conf) for a starting point to create your own. | ||
|
||
## TODO | ||
|
||
- Docker compose file and examples. | ||
- Kubernetes deployment files and examples. | ||
- Rootless versions of the images. |
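
The README describes two ways to supply `solrwayback.properties` and `solrwaybackweb.properties` but only shows the build-your-own-image route. One possible reading of "overlays at runtime" is a read-only bind mount of each file into `/root`, sketched below as a dry run that prints the command instead of executing it. The image name is copied from the README's Dockerfile example, `CONF_DIR` is a hypothetical host folder holding your edited files, and port 8080 is the default of the official Tomcat image.

```shell
#!/bin/sh
# Assumption: image name as written in the README's Dockerfile example.
IMAGE="github.com/nlnwa/solrwayback-adaption:latest"
# Assumption: hypothetical host folder containing the two properties files.
CONF_DIR="${CONF_DIR:-$PWD/conf}"

# Build the docker argument list; each file is bind-mounted read-only into /root.
set -- run --rm --publish 8080:8080
for f in solrwayback.properties solrwaybackweb.properties; do
  set -- "$@" --volume "${CONF_DIR}/${f}:/root/${f}:ro"
done
set -- "$@" "$IMAGE"

# Dry run: print the command for review rather than running it.
cmd="docker $*"
echo "$cmd"
```

Mounting the files keeps the published image unchanged and lets the same image run against different Solr back ends, at the cost of having to distribute the properties files alongside the deployment.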