
Instructions to install and run PDI Docker

rduerr edited this page Nov 21, 2018 · 6 revisions

This is the Dockerized version of the insights portion of the Polar Deep Insights system. Two parts of the project can be installed and run in Docker containers using the instructions below: the insight-generator, a Python library used to extract information, and the insight-visualizer, a JavaScript application used for data visualization.

Prerequisites

  1. Install Docker if it isn't already installed

  2. If you normally log into a Docker registry on your machine, do so now

  3. In a terminal window, type git clone https://github.com/USCDataScience/polar-deep-insights.git

  4. Then type cd polar-deep-insights/Docker

  5. Install npm if it isn't installed already, else skip this step.

  6. Install the Elasticsearch tools. Depending on your permissions, you may have to type

    1. npm install -g elasticsearch-tools or
    2. sudo npm install -g elasticsearch-tools and enter your password at the prompt
  7. Make the script setup.sh executable by typing chmod +x setup.sh, then run it with ./setup.sh

    1. Running ./setup.sh creates a data folder and populates it with other required files and empty folders
    2. If you plan to analyze your own files, put them in the data/files folder. Any format is acceptable, though the parsers may not extract all possible data if the format is very unusual.
  8. Export the Elasticsearch index mappings.

    1. If using polar.usc.edu's Elasticsearch data, type

      es-export-mappings --url http://polar.usc.edu/elasticsearch --file data/polar/polar-data-mappings.json

    2. If using your own database, replace http://polar.usc.edu/elasticsearch in the above command with the URL of your remote or localhost Elasticsearch index, then run the command.

  9. Export the Elasticsearch index data.

    1. If using polar.usc.edu's Elasticsearch data, type

      es-export-bulk --url http://polar.usc.edu/elasticsearch --file data/polar/polar-data.json

    2. If using your own database, replace http://polar.usc.edu/elasticsearch in the above command with the URL of your remote or localhost Elasticsearch index, then run the command.

    PS: This step may take a while depending on the size of your Elasticsearch database. The Polar data set contains 100k documents and takes quite a long time (go get coffee).
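Steps 8 and 9 differ only in the tool name and the output file, so the URL substitution can be sketched with a single variable. This is a minimal sketch, not part of the project: ES_URL, OUT_DIR, and pdi_export_commands are illustrative names, and the script only prints the two commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Sketch: build the two export commands from steps 8 and 9 for one Elasticsearch URL.
# ES_URL, OUT_DIR and pdi_export_commands are hypothetical names chosen for this example.
ES_URL="${ES_URL:-http://polar.usc.edu/elasticsearch}"
OUT_DIR="${OUT_DIR:-data/polar}"

# Print the mapping export command, then the bulk data export command.
pdi_export_commands() {
  url="$1"
  dir="$2"
  echo "es-export-mappings --url $url --file $dir/polar-data-mappings.json"
  echo "es-export-bulk --url $url --file $dir/polar-data.json"
}

# Dry run: show what would be executed for the current settings.
pdi_export_commands "$ES_URL" "$OUT_DIR"
```

Setting ES_URL before running the script (for example to a localhost index URL) switches both commands at once. The printed lines can then be copied into the terminal and run from the polar-deep-insights/Docker folder.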

Insight Generator Installation

  1. Install some necessary files - description can be found here.

    1. For Unix-like OS (Ubuntu, macOS, etc.):
      1. chmod +x pre_installation.sh
      2. ./pre_installation.sh This step downloads the necessary files from the web using the wget command. If you encounter the error "wget not found", do one of the following:
        1. Install wget (e.g. for macOS: brew install wget) OR
        2. Open pre_installation.sh and replace wget with curl -O filename (capital letter O, not zero), where filename is the file named in each command OR
        3. Download the files manually, as described in the second option for Windows below
    2. For Windows OS:
      1. If you have wget for windows as mentioned here, replace wget in the pre_installation.sh file with wget for windows.
      2. A more hassle-free solution is to manually download the files from their source web pages as mentioned here.
  2. Add files to the following folders according to these instructions:

    1. data/files : Add the data files, of any file type, that you want to generate insights from
    2. data/polar : Contains mappings and data exported from the Elasticsearch URL
    3. data/ingest : Output from the PDI insight generator will be saved here under the filename ingest_data.json
    4. data/sparkler/raw : Add Sparkler crawled data from the Solr index into the sparkler_rawdata.json file in this folder
    5. data/sparkler/parsed : Sparkler data (in data/sparkler/raw/sparkler_rawdata.json) is parsed using parse.py and saved in sparkler_data.json
  3. Build Insight Generator

    1. git clone https://github.com/USCDataScience/polar-deep-insights.git && cd polar-deep-insights/Docker/insight-generator

    2. Build from local

      docker build -t uscdatascience/pdi-generator -f InsightGenDockerfile .

      OR pull from docker hub

      docker pull uscdatascience/pdi-generator

    3. PDI_JSON_PATH=/data/polar docker-compose up -d

  4. This container exposes the following ports:

    8765 - Geo Topic Parser

    9998 - Apache Tika Server

    8060 - Grobid Quantities REST API
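The wget fallback described in step 1 can also be handled by a small helper rather than by editing every command. The following is a minimal sketch under stated assumptions: pdi_fetch is a hypothetical name, it is not part of pre_installation.sh, and it assumes each download is a plain URL saved under its remote file name.

```shell
# Sketch: download a URL into the current directory with wget, or with curl if wget is missing.
# pdi_fetch is a hypothetical helper name, not something shipped with the project.
pdi_fetch() {
  url="$1"
  if command -v wget >/dev/null 2>&1; then
    wget "$url"
  elif command -v curl >/dev/null 2>&1; then
    # -O saves the file under its remote name, -L follows redirects
    curl -L -O "$url"
  else
    echo "pdi_fetch: neither wget nor curl found" >&2
    return 1
  fi
}
```

A line such as wget URL in a script would then become pdi_fetch URL, which behaves the same where wget is present and falls back to curl where it is not.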

Insight Visualizer Installation

  1. git clone https://github.com/USCDataScience/polar-deep-insights.git && cd polar-deep-insights/Docker/insight-visualizer

  2. Build from local

    docker build -t uscdatascience/polar-deep-insights -f PolarDeepInsightsDockerfile .

    OR pull from docker hub

    docker pull uscdatascience/polar-deep-insights

  3. PDI_JSON_PATH=data/polar docker-compose up -d

  4. Access application at http://localhost/pdi/

  5. Access elasticsearch at http://localhost/elasticsearch/

  6. This container exposes the following ports:

    80 - Apache2/HTTPD server

    9000 - Grunt server serving up the PDI application

    9200 - Elasticsearch 2.4.6 server

    35729 - Auto refresh port for AngularJS apps
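Once the container is up, a quick way to confirm that the listed ports are reachable is to probe each one. This is a minimal sketch, assuming an nc (netcat) build that supports the -z flag and that the ports are published on localhost; pdi_check_ports is a hypothetical helper name.

```shell
# Sketch: report which of the given TCP ports accept connections on localhost.
# Assumes nc supports -z (connect without sending data); pdi_check_ports is a hypothetical name.
pdi_check_ports() {
  for port in "$@"; do
    if nc -z localhost "$port" >/dev/null 2>&1; then
      echo "$port open"
    else
      echo "$port closed"
    fi
  done
}
```

For the visualizer this would be called as pdi_check_ports 80 9000 9200 35729, and for the generator as pdi_check_ports 8765 9998 8060. A port reported as closed may simply mean the service inside the container is still starting.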

PS: You need to add a CORS extension to your browser and enable it in order to download the concept ontology and additional precomputed information from http://polar.usc.edu/elasticsearch/ and elsewhere.

Monitoring the Container

docker logs -f container_id - replace container_id with your Docker container's ID

Logging onto the Container with a Bash Shell

docker exec -it container_id bash - replace container_id with your Docker container's ID
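If the container ID is not at hand, it can be looked up from the image name. The sketch below is an assumption-laden example: pdi_container_ids is a hypothetical helper, and it only finds containers whose image matches the name given, which may differ if docker-compose was configured with another image or tag.

```shell
# Sketch: print the IDs of running containers started from a given image.
# pdi_container_ids is a hypothetical helper; the default image is the generator image named above.
pdi_container_ids() {
  image="${1:-uscdatascience/pdi-generator}"
  # -q prints only IDs; the ancestor filter matches containers created from the image
  docker ps -q --filter "ancestor=$image"
}
```

For example, docker logs -f "$(pdi_container_ids | head -n 1)" would follow the logs of the first matching generator container, and passing uscdatascience/polar-deep-insights as the argument would look up the visualizer instead.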