Added Tutorial for NLP Processing using Gensim in Flyte Workflow (flyteorg#911)

* add tutorial for nlp
* add script and folder to cookbook
* add flytedeck
* pin pandas and profiling versions
* add docstring to script
* trigger ci
* typos
* typos
* add tutorial to panel and toc tree in rst
* add loads of descriptions
* typos and formatting
* correction to flytedeck description
* formatting and typos
* add requested changes to description and other bits
* formatting and add gitignore
* add typing to plotting task
* add in requested changes and add sklearn to requirements.in
* few more
* run pip-compile again to correct relative path to requirements-common.in
* bump resource for tasks that errored in flyte console
* whitespace
* switch from np to flyte supported types and model_ser.download
* add support for plotly and disable deck in task
* return output for word similarity and plotly layout size adjustment
* remove returned output from word sim task
* remove type and also output in wmd
* add workflow outputs and modify comments
* fix typing for return value word sim task

Signed-off-by: Ryan Nazareth <[email protected]>
Co-authored-by: Samhita Alla <[email protected]>
eapolinario · Nov 7, 2022 · fc42be0
1 parent 9b7d065
Showing 10 changed files with 767 additions and 0 deletions.
diff --git a/cookbook/case_studies/ml_training/nlp_processing/Dockerfile b/cookbook/case_studies/ml_training/nlp_processing/Dockerfile
@@ -0,0 +1,53 @@
+FROM ubuntu:focal
+
+WORKDIR /root
+ENV VENV /opt/venv
+ENV LANG C.UTF-8
+ENV LC_ALL C.UTF-8
+ENV PYTHONPATH /root
+
+RUN : \
+    && apt-get update \
+    && apt-get install -y software-properties-common \
+    && add-apt-repository ppa:deadsnakes/ppa
+
+RUN : \
+    && apt-get update \
+    && apt-get install -y python3.8 python3-pip python3-venv make build-essential libssl-dev curl vim
+
+# This is necessary for opencv to work
+RUN apt-get update && apt-get install -y libsm6 libxext6 libxrender-dev ffmpeg
+
+# Install the AWS cli separately to prevent issues with boto being written over
+RUN pip3 install awscli
+
+WORKDIR /opt
+RUN curl https://sdk.cloud.google.com > install.sh
+RUN bash /opt/install.sh --install-dir=/opt
+ENV PATH $PATH:/opt/google-cloud-sdk/bin
+WORKDIR /root
+
+# Virtual environment
+ENV VENV /opt/venv
+RUN python3 -m venv ${VENV}
+ENV PATH="${VENV}/bin:$PATH"
+
+# Install Python dependencies
+COPY nlp_processing/requirements.txt /root
+RUN ${VENV}/bin/pip install -r /root/requirements.txt
+
+# Copy the makefile targets to expose on the container. This makes it easier to register.
+COPY in_container.mk /root/Makefile
+COPY nlp_processing/sandbox.config /root
+
+# Copy the actual code
+COPY nlp_processing/ /root/nlp_processing/
+
+# Copy over the helper script that the SDK relies on
+RUN cp ${VENV}/bin/flytekit_venv /usr/local/bin/
+RUN chmod a+x /usr/local/bin/flytekit_venv
+
+# This tag is supplied by the build script and will be used to determine the version
+# when registering tasks, workflows, and launch plans
+ARG tag
+ENV FLYTE_INTERNAL_IMAGE $tag
diff --git a/cookbook/case_studies/ml_training/nlp_processing/Makefile b/cookbook/case_studies/ml_training/nlp_processing/Makefile
@@ -0,0 +1,3 @@
+PREFIX=nlp_processing
+include ../../../common/common.mk
+include ../../../common/leaf.mk
diff --git a/cookbook/case_studies/ml_training/nlp_processing/README.rst b/cookbook/case_studies/ml_training/nlp_processing/README.rst
@@ -0,0 +1,39 @@
+NLP Processing
+--------------
+
+This tutorial will demonstrate how to process text data and generate word embeddings and visualizations
+as part of a Flyte workflow. It's an adaptation of the official Gensim `Word2Vec tutorial <https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html>`__.
+
+
+About Gensim
+============
+
+Gensim is a popular open-source natural language processing (NLP) library for processing
+large corpora, including corpora that are larger than RAM.
+It provides efficient multicore implementations of algorithms such as `Latent Semantic Analysis <http://lsa.colorado.edu/papers/dp1.LSAintro.pdf>`__, `Latent Dirichlet Allocation (LDA) <https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf>`__, and
+`Word2Vec <https://arxiv.org/pdf/1301.3781.pdf>`__, which support tasks such as understanding
+document relationships, topic modeling, and learning word embeddings.
+
+You can read more about Gensim `here <https://radimrehurek.com/gensim/>`__.
+
+
+Data
+====
+
+The dataset used for this tutorial is the open-source `Lee Background Corpus <https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor>`__
+that comes with the Gensim library.
+
+
+Step-by-Step Process
+====================
+
+The following steps outline the modeling process:
+
+- Returns a preprocessed (tokenized, stop words removed, lemmatized) corpus from a custom iterator.
+- Trains the Word2Vec model on the preprocessed corpus.
+- Generates a bag of words from the corpus and trains the LDA model.
+- Saves the LDA and Word2Vec models to disk.
+- Deserializes the Word2Vec model, runs word similarity queries, and computes the Word Mover's Distance.
+- Reduces the dimensionality of the word embeddings (using t-SNE) and plots them.
+
+Let's dive into the code!
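
The preprocessing and bag-of-words steps listed in the README can be sketched without any third-party packages. The snippet below is a minimal stand-in, not the tutorial's code: the stop-word set, the `preprocess`, `build_vocab`, and `to_bow` helpers, and the sample sentences are all hypothetical, and lemmatization is omitted. It produces `(token_id, count)` pairs, which is the same shape that Gensim's `Dictionary.doc2bow` returns.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def preprocess(line):
    """Lowercase, tokenize on letters, and drop stop words (no lemmatization)."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


# Hypothetical two-document corpus, used only for illustration.
raw = ["The cat sat in the hat.", "A cat and a dog."]
docs = [preprocess(line) for line in raw]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
```

In the workflow described above, the tokenized documents would feed the Word2Vec model directly, while the bag-of-words form is what a topic model such as LDA consumes.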
diff --git a/cookbook/case_studies/ml_training/nlp_processing/__init__.py b/cookbook/case_studies/ml_training/nlp_processing/__init__.py
diff --git a/cookbook/case_studies/ml_training/nlp_processing/requirements.in b/cookbook/case_studies/ml_training/nlp_processing/requirements.in
@@ -0,0 +1,7 @@
+-r ../../../common/requirements-common.in
+numpy
+gensim
+nltk
+plotly
+pyemd
+scikit-learn
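
Before building the image, it can help to confirm that the packages listed in `requirements.in` resolve in the local environment. The sketch below is a convenience check and not part of the commit: the `missing_modules` helper is hypothetical, and the import names are assumed to be the usual ones for these distributions (`scikit-learn` is imported as `sklearn`).

```python
from importlib.util import find_spec

# Assumed import names for the distributions listed in requirements.in.
MODULES = ["numpy", "gensim", "nltk", "plotly", "pyemd", "sklearn"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(MODULES)
print("missing:", ", ".join(missing) if missing else "none")
```

The check only looks up top-level modules and does not import them, so it stays fast and has no side effects.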