Giant makes it easier for journalists to search, analyse, categorise and share unstructured data. It takes many file formats, indexes them (including converting images to text using OCR) and provides a UI for search. Users can upload their own files but it also scales up to terabytes of data.
Giant is part of the Guardian's "Platform for Investigations" suite, so you will see references to `pfi` in the code. Under development since 2017, it is written in Scala and TypeScript and is maintained by the Investigations & Reporting team.
If Giant doesn't fit your needs, check out Aleph from the OCCRP and Datashare from the ICIJ.
Giant has the following pre-requisites for local development:
Giant uses three databases, run locally in Docker through docker-compose.yaml:
- neo4j
- Elasticsearch
- minio (for S3 compatibility)
There are three optional dependencies:
- Tesseract
  - To extract text from images (OCR).
  - Install with `brew install tesseract`
- LibreOffice
  - To convert and preview Microsoft Office documents in the UI
- wkhtmltopdf
  - To preview HTML files (such as emails)
  - Install with `brew install wkhtmltopdf`
Elasticsearch requires Docker to have at least 4GB of memory, which you can set from the Docker preferences menu. With less than that, it will exit with error 137 and no log output.
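To check the memory available to Docker before starting, something like the following may help. It is a minimal sketch: `docker info --format '{{.MemTotal}}'` reports the daemon's total memory in bytes, while the `meets_es_minimum` helper and the `MIN_BYTES` threshold (4GB taken as 4 GiB) are hypothetical names and assumptions, not part of Giant's scripts. Docker can report slightly less than the figure set in preferences, so treat a near miss as a prompt to look closer rather than a hard failure.

```shell
#!/usr/bin/env bash
# Assumption: "4GB" is interpreted as 4 GiB, expressed in bytes.
MIN_BYTES=$((4 * 1024 * 1024 * 1024))

# Hypothetical helper: succeeds when the given byte count meets the minimum.
meets_es_minimum() {
  [ "$1" -ge "$MIN_BYTES" ] 2>/dev/null
}

# Only query Docker when the CLI is installed and the daemon responds.
if command -v docker >/dev/null 2>&1 &&
   mem_total=$(docker info --format '{{.MemTotal}}' 2>/dev/null); then
  if meets_es_minimum "$mem_total"; then
    echo "Docker memory looks sufficient: ${mem_total} bytes"
  else
    echo "Docker reports ${mem_total} bytes; Elasticsearch may exit with error 137" >&2
  fi
fi
```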
For Guardian developers:
- Janus credentials are not required to run Giant locally.
- The Giant Runbook
Select the correct version of node:
nvm use
Then run the setup script:
./scripts/setup.sh
Seed the configuration:
./scripts/cluster-setup.sh
Run the Scala backend:
./scripts/start-backend.sh
This will also automatically launch the databases in the background by running `docker-compose up -d`.
In a separate terminal, run the Create React App frontend:
./scripts/start-frontend.sh
The frontend script will wait for the backend to start before launching Giant at http://localhost:3000.
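If you want a similar wait in your own tooling, the idea can be sketched in bash as below. This is not the contents of `start-frontend.sh`: the `wait_for_port` function is hypothetical, it relies on bash's `/dev/tcp` pseudo-device, and the port in the usage comment is a placeholder to replace with whichever port your backend listens on.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a TCP port until it accepts a connection.
# Arguments: host, port, number of attempts (default 30), delay in seconds (default 1).
wait_for_port() {
  local host=$1 port=$2 attempts=${3:-30} delay=${4:-1}
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # Opening /dev/tcp/HOST/PORT makes bash attempt a TCP connection;
    # the subshell closes the descriptor again when it exits.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage (the port is a placeholder, not taken from Giant's configuration):
#   wait_for_port localhost 9000 60 1 && echo "backend is accepting connections"
```

A plain TCP check only shows that something is listening. A script that needs the application to be fully ready would usually poll an HTTP endpoint instead.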
Once Giant has started, follow the admin quickstart guide.
You can use dev-nginx to more easily access Giant and the backing databases whilst running locally.
dev-nginx setup-app util/nginx-mapping.yml
- Giant: https://pfi.local.dev-gutools.co.uk/
- neo4j: https://neo4j.pfi.local.dev-gutools.co.uk/
  - Enter `bob` as the password when prompted
- Elasticsearch: https://elasticsearch.pfi.local.dev-gutools.co.uk/
- Cerebro (to manage Elasticsearch): https://cerebro.pfi.local.dev-gutools.co.uk/
- Minio: https://minio.pfi.local.dev-gutools.co.uk/
  - Username: `minio-user`
  - Password: `reallyverysecret`
To run all unit tests:
sbt test
To run all integration tests:
sbt int:test
To run a specific integration test:
sbt 'int:testOnly controllers.api.WorkspacesITest'
To terminate the databases without losing data:
docker-compose down
To terminate and delete data:
docker-compose down -v
The Guardian welcomes contributions to Giant. We do not yet have a publicly accessible CI server but please ensure all tests pass by running the build script locally:
./scripts/teamcity.sh
We do not yet publish deployment templates for Giant, either for cloud hosts or for local use. If you are interested in deploying Giant, please get in touch by raising a GitHub issue on this repository.
https://docs.google.com/drawings/d/1wcTY9KLhkYqxmwzsyZ3DsWcc0v-ax5kMKWtYb4HZgF0
Giant uses the Apache 2.0 licence. Some libraries used are licensed separately:
Giant can process the following file types:
- `.rar` archives (v4 and below)
- `.zip` archives
- `.eml` RFC 5322 emails
- `.mbox` email archives
- `.msg` Outlook email files
- `.pst` Outlook email archives
- `.olm` Outlook for Mac email archives/backups
- `.png`, `.jpg`, `.tiff` images (including OCR)
- `.pdf` (including OCR)
- Microsoft Office Word, Excel and PowerPoint files
- Various plain text files (see DocumentBodyExtractor)
- Audio files
  - fully supported
    - `.wav`
    - `.mpeg`
    - `.opus`
    - `.caf`
    - `.mp4`
    - `.aac` (Tika sometimes has trouble detecting these)
  - transcribed but preview doesn't work
    - `.aff`
    - `.amr`
    - `.wma`
- Video files
  - fully supported
    - `.mov`, `.qt`
    - `.m4v`
    - `.3gpp`
    - `.mp4`
  - transcribed but preview doesn't work
    - `.flv`
    - `.wmv`
    - `.msvideo`
    - `.mpeg`
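The audio and video support levels above can be restated as a small lookup, sketched here in shell. The `support_level` function and its `full`, `transcribe-only` and `unknown` labels are hypothetical and only mirror the list, treating each listed name as a plain key. Giant's own type detection (the list mentions Tika) is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: print the support level for a media kind ("audio" or
# "video") and a type name from the list, written without the leading dot.
support_level() {
  case "$1:$2" in
    audio:wav|audio:mpeg|audio:opus|audio:caf|audio:mp4|audio:aac)
      echo "full" ;;
    audio:aff|audio:amr|audio:wma)
      echo "transcribe-only" ;;
    video:mov|video:qt|video:m4v|video:3gpp|video:mp4)
      echo "full" ;;
    video:flv|video:wmv|video:msvideo|video:mpeg)
      echo "transcribe-only" ;;
    *)
      echo "unknown" ;;
  esac
}

support_level audio mpeg   # prints "full"
support_level video mpeg   # prints "transcribe-only"
```

The lookup is keyed on both kind and type because `.mpeg` is fully supported as audio but only transcribed as video.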
Experimental features are enabled through feature flags in the Settings page:
- New UI: a simplified UI implemented using the Elastic UI toolkit
- Page Viewer: a unified document viewer showing text, OCR and search highlights inline on the original document
In addition to any contributors named in this repository, the following contributed to Giant whilst it was closed source at the Guardian: