Configure Multiple Access Points For Multiple CDX Collections

Sawood Alam edited this page Jun 29, 2014 · 6 revisions

Introduction

This document describes, step by step, how to configure separate access points for individual collections. Each collection is a set of ARC/WARC files that is indexed in one or more CDX files. To save storage space, ARC/WARC files can be compressed, in which case they have the file extension .arc.gz or .warc.gz.

To illustrate the configuration, we will use an example with two collections, namely art and news. Each collection has a couple of .warc.gz files (other supported formats would work as well). Suppose these collections are stored in the following directory structure:

$ tree /archives
/archives
└── collections
    ├── art
    │   ├── art-20140313083412-000.warc.gz
    │   └── art-20140422132637-001.warc.gz
    └── news
        ├── news-20140315112738-000.warc.gz
        └── news-20140418034624-001.warc.gz

Suppose that our Wayback server has the domain name wayback.example.com and we want to set up three access points as follows:

  • /art/ access point only searches in the art collection.
  • /news/ access point only searches in the news collection.
  • /all/ access point searches in all the collections and gives the composite result.

Indexing

The default Wayback server comes preconfigured to use a BDB (Berkeley DB) index, which enables automatic indexing of small collections and is suitable for a single access point. For large-scale collections with multiple access points, however, manually generated CDX indexes are preferred.

In this case we need one or more CDX indexes for each collection, along with path indexes. A path index is a simple sorted text file with two columns separated by a TAB: the first column contains the ARC/WARC file name and the second column contains the corresponding full path to the file (or the full path including the domain name if the file is on a remote host). A utility called cdx-indexer ships with the Wayback download (it can be found in the bin directory) and generates a CDX index from ARC/WARC files. For large collections we might want to write a script that automates CDX generation by calling the shipped cdx-indexer script internally.
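The path index format described above can be sketched in a few lines of Python. This is only a minimal illustration, not a tool shipped with Wayback: the function names build_path_index and write_path_index and the suffix list are our own assumptions, and the entries are sorted in plain code-point order, which we assume is the ordering Wayback expects.

```python
import os

# Assumption: these are the archive file extensions we want to list.
ARCHIVE_SUFFIXES = (".arc", ".arc.gz", ".warc", ".warc.gz")


def build_path_index(collection_dir):
    """Return sorted 'file name<TAB>full path' lines for one collection."""
    lines = []
    for root, _dirs, files in os.walk(collection_dir):
        for name in files:
            if name.endswith(ARCHIVE_SUFFIXES):
                full_path = os.path.abspath(os.path.join(root, name))
                # First column: file name. Second column: full path.
                lines.append(name + "\t" + full_path)
    # The path index must be sorted; plain string order is used here.
    return sorted(lines)


def write_path_index(collection_dir, index_file):
    """Write the path index of collection_dir to index_file (hypothetical helper)."""
    with open(index_file, "w", encoding="utf-8", newline="\n") as out:
        for line in build_path_index(collection_dir):
            out.write(line + "\n")
```

With the example layout, calling write_path_index("/archives/collections/art", "/archives/path-idx/art-path-idx.txt") would produce one line per WARC file in the art collection.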

[TODO: Write a separate guide to describe the CDX generation.]

Suppose that we have generated one CDX file and one path index file for the art collection, and likewise for the news collection. There can be more than one CDX file per collection, but for the sake of simplicity we keep one CDX file per collection. We have also created an additional path index file that lists the files and paths of both collections (it can be created by merging the two path index files and sorting the result). Suppose that our archives directory now has the following directory structure:
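The merge-and-sort step for the combined path index could look roughly like the following Python sketch. The function name merge_path_indexes is hypothetical, and we assume that identical lines appearing in more than one input file should be kept only once.

```python
def merge_path_indexes(index_files, merged_file):
    """Merge several path index files into one sorted file without duplicates."""
    entries = set()
    for path in index_files:
        with open(path, encoding="utf-8") as src:
            for line in src:
                line = line.rstrip("\n")
                if line:  # skip blank lines
                    entries.add(line)
    merged = sorted(entries)  # plain string order, same as the per-collection files
    with open(merged_file, "w", encoding="utf-8", newline="\n") as out:
        for line in merged:
            out.write(line + "\n")
    return merged
```

In our example this would be called with the two files art-path-idx.txt and news-path-idx.txt as input and all-path-idx.txt as the merged output.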

$ tree /archives
/archives
├── collections
│   ├── art
│   │   ├── art-20140313083412-000.warc.gz
│   │   └── art-20140422132637-001.warc.gz
│   └── news
│       ├── news-20140315112738-000.warc.gz
│       └── news-20140418034624-001.warc.gz
├── cdx-idx
│   ├── index-art.cdx
│   └── index-news.cdx
└── path-idx
    ├── art-path-idx.txt
    ├── news-path-idx.txt
    └── all-path-idx.txt

Configuration

First of all, we need to install Apache Tomcat if it is not already installed. Once Tomcat is up and running, it will have a webapps directory. In our case it is located at /var/lib/tomcat7/webapps, but the location may differ depending on how Tomcat is configured on your machine. Next we need to obtain the latest copy of OpenWayback and install it. Please refer to the How to Install guide for further details. In this setup we assume that you have installed Wayback as the ROOT application. You can choose a different name, but the configuration is easier for the ROOT application.

We will now focus on the configuration files in the WEB-INF directory of the Wayback application. Let's start with the default wayback.xml file. Comments and unnecessary commented-out blocks have been removed to reduce the number of lines:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans-3.0.xsd"
       default-init-method="init">

  <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
    <property name="properties">
      <value>
        wayback.basedir=/tmp/wayback
        wayback.urlprefix=http://localhost:8080/wayback/
      </value>
    </property>
  </bean>

  <bean id="waybackCanonicalizer" class="org.archive.wayback.util.url.AggressiveUrlCanonicalizer" />

  <bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB">
    <property name="bdbPath" value="${wayback.basedir}/file-db/db/" />
    <property name="bdbName" value="DB1" />
    <property name="logPath" value="${wayback.basedir}/file-db/db.log" />
  </bean>

<!--
  <bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
    <property name="path" value="${wayback.basedir}/path-index.txt" />
  </bean>
-->

  <import resource="BDBCollection.xml"/>
<!--
  <import resource="CDXCollection.xml"/>
  <import resource="RemoteCollection.xml"/>
  <import resource="NutchCollection.xml"/>
-->

  <import resource="ArchivalUrlReplay.xml"/>

  <bean name="+" class="org.archive.wayback.webapp.ServerRelativeArchivalRedirect">
    <property name="matchPort" value="8080" />
    <property name="useCollection" value="true" />
  </bean>

  <bean name="standardaccesspoint" class="org.archive.wayback.webapp.AccessPoint">
    <property name="accessPointPath" value="http://localhost:8080/wayback/"/>
    <property name="internalPort" value="8080"/>
    <property name="serveStatic" value="true" />
    <property name="bounceToReplayPrefix" value="false" />
    <property name="bounceToQueryPrefix" value="false" />
    <property name="replayPrefix" value="${wayback.urlprefix}" />
    <property name="queryPrefix" value="${wayback.urlprefix}" />
    <property name="staticPrefix" value="${wayback.urlprefix}" />

    <property name="collection" ref="localbdbcollection" />
<!--
    <property name="collection" ref="localcdxcollection" />
-->

    <property name="replay" ref="archivalurlreplay" />
    <property name="query">
      <bean class="org.archive.wayback.query.Renderer">
        <property name="captureJsp" value="/WEB-INF/query/CalendarResults.jsp" />
      </bean>
    </property>

    <property name="uriConverter">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
        <property name="replayURIPrefix" value="${wayback.urlprefix}"/>
      </bean>
    </property>

    <property name="parser">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser">
        <property name="maxRecords" value="10000" />
      </bean>
    </property>
  </bean>

</beans>

[A work in progress...]