From abae662cbe3ca2ad0414823de4bdff6d83a59b23 Mon Sep 17 00:00:00 2001
From: pedrotei
Date: Tue, 21 May 2019 14:53:24 +0100
Subject: [PATCH] Update README.md

Added docker build/run instructions. Included instructions on how to obtain
the pre-pickled vector files to use directly with the image.
---
 README.md | 44 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 7f2e4b9..58d255a 100644
--- a/README.md
+++ b/README.md
@@ -18,13 +18,40 @@ There are additional endpoints '/health', and '/reload', which will return the s
 # Build and Test
 To run a local build of this project, you will need:
-- Python 3.7
-- pipenv
 - Docker
-- Docker Compose
-
-First, the data needs to be acquired and transformed into the correct format for the word2vec container to be created.
+First, the data needs to be acquired and transformed into the correct format for the word2vec container to be created. We use pre-trained word vectors, further pickled to speed up the loading process. You can get the pickled files from our public Google Cloud Storage bucket for the following languages:
+English: [https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/glove.840B.300d.pkl](https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/glove.840B.300d.pkl)
+Spanish: [https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/wiki.es.pkl](https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/wiki.es.pkl)
+Italian: [https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/wiki.it.pkl](https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/wiki.it.pkl)
+(_Note that each Word2Vec service instance supports only one language, but you can run multiple instances of the service, each serving a different language._)
+
+Create a folder `src/datasets` and move the .pkl file into it.
+Build the Docker image with:
+```
+cd src
+docker build -t word2vec .
+```
+To run the image:
+```
+docker run \
+    -p 9090:9090 \
+    -v $(pwd)/tests:/tests -v $(pwd)/datasets:/datasets:ro \
+    -e "W2V_SERVER_PORT=9090" -e "W2V_VECTOR_FILE=/datasets/glove.840B.300d.pkl" -e "W2V_LANGUAGE=en" \
+    word2vec
+```
+(For the `W2V_VECTOR_FILE` environment variable, make sure you point at the .pkl file you downloaded, and set `W2V_LANGUAGE` to the corresponding language.)
+
+To check that the service is running, try:
+```
+curl -vv http://localhost:9090/health
+```
+You should get a 200 OK response.
+
+### Extending to use different pre-trained Word2Vec word vectors
+To use different languages, or different pre-trained Word2Vec word vectors, you will need to generate the .pkl file in a format the service understands. For this you will need:
+- Python 3.7
+- pipenv
 
 From the Stanford github page, https://github.com/stanfordnlp/GloVe#download-pre-trained-word-vectors, acquire glove.840B.300d.zip and extract the text file somewhere.
 
 To pre-process the file, from the scripts/generate_docker_dataset directory, execute:
 
@@ -37,13 +64,8 @@ We can now run the script to create the pkl file used for the container.
 
 ```python generate_pickle_data.py {path_to_glove.txt} glove```
 
-The output file glove.840B.300d.pkl will be required to load the word2vec container.
-
-In the src directory create a 'datasets' folder and move the output pkl file into there, and make sure that the 'W2V_VECTOR_FILE' variable in docker-compose.yml matches the name and directory of the pkl file.
-
-To start the service, execute:
+Now you can use the generated .pkl file with the Word2Vec service, following the build and run steps above.
 
-```docker-compose up```
 
 # Contribute
 To contribute to this project you can choose an existing issue to work on, or create a new issue for the bug or improvement you wish to make, assuming it's approval and submit a pull request from a fork into our master branch.
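
The patch points readers at the pickled vector files in the bucket and asks them to place one in `src/datasets`, but leaves the download step itself to the reader. Below is a minimal sketch of that step, assuming `curl` is available and the commands are run from the repository root; it fetches the English vectors, and the Spanish or Italian URLs listed in the patch can be substituted.
```
# Illustrative only (not part of the patch): fetch the English pickled vectors
# and put them where the docker run example above expects to find them.
# Substitute wiki.es.pkl or wiki.it.pkl to serve Spanish or Italian instead.
mkdir -p src/datasets
curl -L -o src/datasets/glove.840B.300d.pkl \
    https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/glove.840B.300d.pkl
```
With the file in place, the `docker build` and `docker run` commands from the first hunk can be used unchanged.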
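
For the extension workflow in the second hunk, the added line only says that the generated file can now be used with the Word2Vec service. A short sketch of the remaining wiring, assuming `generate_pickle_data.py` writes `glove.840B.300d.pkl` into the current directory (the actual output location may differ):
```
# Hypothetical follow-up, run from scripts/generate_docker_dataset:
# move the generated pickle next to the other datasets, then reuse the
# docker run command from the first hunk, pointing W2V_VECTOR_FILE at it.
mv glove.840B.300d.pkl ../../src/datasets/
```
The `W2V_VECTOR_FILE` and `W2V_LANGUAGE` environment variables then select the new file and its language, exactly as in the run example above.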