From abae662cbe3ca2ad0414823de4bdff6d83a59b23 Mon Sep 17 00:00:00 2001
From: pedrotei
Date: Tue, 21 May 2019 14:53:24 +0100
Subject: [PATCH] Update README.md

Added docker build/run instructions. Included instructions on how to obtain
the pre-pickled vector files to use directly with the image.
---
 README.md | 44 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 7f2e4b9..58d255a 100644
--- a/README.md
+++ b/README.md
@@ -18,13 +18,40 @@ There are additional endpoints '/health', and '/reload', which will return the s
 # Build and Test
 To run a local build of this project, you will need:
-- Python 3.7
-- pipenv
 - Docker
-- Docker Compose
-
-First, the data needs to be acquired and transformed into the correct format for the word2vec container to be created.
+First, the data needs to be acquired and transformed into the correct format for the word2vec container to be created. We use pre-trained word vectors, further pickled to speed up the loading process. You can get the pickled files from our public Google Cloud Storage bucket for the following languages:
+English: [https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/glove.840B.300d.pkl](https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/glove.840B.300d.pkl)
+Spanish: [https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/wiki.es.pkl](https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/wiki.es.pkl)
+Italian: [https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/wiki.it.pkl](https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/wiki.it.pkl)
+(_Note that each Word2Vec service instance supports only one language, but you can run multiple instances of the service, each serving a different language._)
+
+Create a folder `src/datasets` and move the .pkl file into it.
+Build the Docker image with:
+```
+cd src
+docker build -t word2vec .
+```
+To run the image:
+```
+docker run \
+    -p 9090:9090 \
+    -v $(pwd)/tests:/tests -v $(pwd)/datasets:/datasets:ro \
+    -e "W2V_SERVER_PORT=9090" -e "W2V_VECTOR_FILE=/datasets/glove.840B.300d.pkl" -e "W2V_LANGUAGE=en" \
+    word2vec
+```
+(For the `W2V_VECTOR_FILE` environment variable, make sure you point at the .pkl file you downloaded, and set `W2V_LANGUAGE` to the corresponding language.)
+
+To check that the service is running, try:
+```
+curl -vv http://localhost:9090/health
+```
+You should get a 200 OK response.
+
+### Extending to use different pre-trained Word2Vec word vectors
+To use different languages, or different pre-trained Word2Vec word vectors, you will need to generate the .pkl file in a format the service understands. For this you will need:
+- Python 3.7
+- pipenv
 
 From the Stanford github page, https://github.com/stanfordnlp/GloVe#download-pre-trained-word-vectors, acquire glove.840B.300d.zip and extract the text file somewhere.
 
 To pre-process the file, from the scripts/generate_docker_dataset directory, execute:
 
@@ -37,13 +64,8 @@ We can now run the script to create the pkl file used for the container.
 
 ```python generate_pickle_data.py {path_to_glove.txt} glove```
 
-The output file glove.840B.300d.pkl will be required to load the word2vec container.
-
-In the src directory create a 'datasets' folder and move the output pkl file into there, and make sure that the 'W2V_VECTOR_FILE' variable in docker-compose.yml matches the name and directory of the pkl file.
-
-To start the service, execute:
+Now you can use the generated .pkl file with the Word2Vec service, following the build and run steps above.
 
-```docker-compose up```
 
 # Contribute
 To contribute to this project you can choose an existing issue to work on, or create a new issue for the bug or improvement you wish to make, assuming it's approval and submit a pull request from a fork into our master branch.
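
The patch points readers at the pickled vector files in the bucket and asks them to place one in `src/datasets`, but leaves the download step itself to the reader. Below is a minimal sketch of that step, assuming `curl` is available and the commands are run from the repository root; it fetches the English vectors, and the Spanish or Italian URLs listed in the patch can be substituted.
```
# Illustrative only (not part of the patch): fetch the English pickled vectors
# and put them where the docker run example above expects to find them.
# Substitute wiki.es.pkl or wiki.it.pkl to serve Spanish or Italian instead.
mkdir -p src/datasets
curl -L -o src/datasets/glove.840B.300d.pkl \
    https://storage.googleapis.com/hutoma-datasets/word2vec_service/v2/glove.840B.300d.pkl
```
With the file in place, the `docker build` and `docker run` commands from the first hunk can be used unchanged.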
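
For the extension workflow in the second hunk, the added line only says that the generated file can now be used with the Word2Vec service. A short sketch of the remaining wiring, assuming `generate_pickle_data.py` writes `glove.840B.300d.pkl` into the current directory (the actual output location may differ):
```
# Hypothetical follow-up, run from scripts/generate_docker_dataset:
# move the generated pickle next to the other datasets, then reuse the
# docker run command from the first hunk, pointing W2V_VECTOR_FILE at it.
mv glove.840B.300d.pkl ../../src/datasets/
```
The `W2V_VECTOR_FILE` and `W2V_LANGUAGE` environment variables then select the new file and its language, exactly as in the run example above.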