Added Tutorial for NLP Processing using Gensim in Flyte Workflow (flyteorg#911)

* add tutorial for nlp
* add script and folder to cookbook
* add flytedeck
* pin pandas and profiling versions
* add docstring to script
* trigger ci
* typos
* typos
* add tutorial to panel and toc tree in rst
* add loads of descriptions
* typos and formatting
* correction to flytedeck description
* formatting and typos
* add requested changes to description and other bits
* formatting and add gitignore
* add typing to plotting task
* add in requested changes and add sklearn to requirements.in
* few more
* run pip-compile again to correct relative path to requirements-common.in
* bump resource for tasks that errored in flyte console
* whitespace
* switch from np to flyte supported types and model_ser.download
* add support for plotly and disable deck in task
* return output for word similarity and plotly layout size adjustment
* remove returned output from word sim task
* remove type and also output in wmd
* add workflow outputs and modify comments
* fix typing for return value word sim task

Signed-off-by: Ryan Nazareth <[email protected]>
Co-authored-by: Samhita Alla <[email protected]>
ryankarlos and samhita-alla authored Nov 7, 2022
1 parent 9b7d065 commit fc42be0
Showing 10 changed files with 767 additions and 0 deletions.
53 changes: 53 additions & 0 deletions cookbook/case_studies/ml_training/nlp_processing/Dockerfile
@@ -0,0 +1,53 @@
FROM ubuntu:focal

WORKDIR /root
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root

RUN : \
&& apt-get update \
&& apt-get install -y software-properties-common \
&& add-apt-repository ppa:deadsnakes/ppa

RUN : \
&& apt-get update \
&& apt-get install -y python3.8 python3-pip python3-venv make build-essential libssl-dev curl vim

# This is necessary for opencv to work
RUN apt-get update && apt-get install -y libsm6 libxext6 libxrender-dev ffmpeg

# Install the AWS cli separately to prevent issues with boto being written over
RUN pip3 install awscli

WORKDIR /opt
RUN curl https://sdk.cloud.google.com > install.sh
RUN bash /opt/install.sh --install-dir=/opt
ENV PATH $PATH:/opt/google-cloud-sdk/bin
WORKDIR /root

# Virtual environment
ENV VENV /opt/venv
RUN python3 -m venv ${VENV}
ENV PATH="${VENV}/bin:$PATH"

# Install Python dependencies
COPY nlp_processing/requirements.txt /root
RUN ${VENV}/bin/pip install -r /root/requirements.txt

# Copy the makefile targets to expose on the container. This makes it easier to register.
COPY in_container.mk /root/Makefile
COPY nlp_processing/sandbox.config /root

# Copy the actual code
COPY nlp_processing/ /root/nlp_processing/

# Copy over the helper script that the SDK relies on
RUN cp ${VENV}/bin/flytekit_venv /usr/local/bin/
RUN chmod a+x /usr/local/bin/flytekit_venv

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
3 changes: 3 additions & 0 deletions cookbook/case_studies/ml_training/nlp_processing/Makefile
@@ -0,0 +1,3 @@
PREFIX=nlp_processing
include ../../../common/common.mk
include ../../../common/leaf.mk
39 changes: 39 additions & 0 deletions cookbook/case_studies/ml_training/nlp_processing/README.rst
@@ -0,0 +1,39 @@
NLP Processing
--------------

This tutorial demonstrates how to process text data, generate word embeddings, and create visualizations
as part of a Flyte workflow. It is adapted from the official Gensim `Word2Vec tutorial <https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html>`__.


About Gensim
============

Gensim is a popular open-source natural language processing (NLP) library designed to process
large corpora, including corpora that do not fit in RAM.
It provides efficient multicore implementations of algorithms such as `Latent Semantic Analysis <http://lsa.colorado.edu/papers/dp1.LSAintro.pdf>`__, `Latent Dirichlet Allocation (LDA) <https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf>`__, and
`Word2Vec <https://arxiv.org/pdf/1301.3781.pdf>`__, which support tasks including understanding
document relationships, topic modeling, and learning word embeddings.

You can read more about Gensim `here <https://radimrehurek.com/gensim/>`__.
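
To illustrate how a corpus larger than RAM can be processed, here is a minimal sketch of a streaming corpus iterator in plain Python. It assumes one document per line. The class name ``StreamingCorpus``, the ``demo`` helper, and the file layout are illustrative assumptions and are not part of Gensim or of this tutorial's code.

```python
import os
import tempfile
from typing import Iterator, List


class StreamingCorpus:
    """Yield one tokenized document at a time.

    Hypothetical helper for illustration; assumes one document per line.
    """

    def __init__(self, path: str) -> None:
        self.path = path

    def __iter__(self) -> Iterator[List[str]]:
        # Re-open the file on every pass so the corpus can be iterated repeatedly
        # without ever holding more than one line in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


def demo() -> List[List[str]]:
    # Write a tiny two-document corpus (with a blank line in between) to a temp file.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".cor", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("Flyte runs workflows\n\nGensim streams documents\n")
        path = tmp.name
    try:
        return [doc for doc in StreamingCorpus(path)]
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Because the iterator re-reads the file on each pass, a training routine that needs several passes over the data can consume it repeatedly without loading the whole corpus.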


Data
====

The dataset used for this tutorial is the open-source `Lee Background Corpus <https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor>`__
that comes with the Gensim library.


Step-by-Step Process
====================

The workflow performs the following steps:

- Reads the corpus through a custom iterator and preprocesses it (tokenization, stop-word removal, lemmatization).
- Trains the Word2Vec model on the preprocessed corpus.
- Generates a bag of words from the corpus and trains the LDA model.
- Saves the LDA and Word2Vec models to disk.
- Deserializes the Word2Vec model, runs word similarity queries, and computes the Word Mover's Distance.
- Reduces the dimensionality of the word embeddings (using t-SNE) and plots them.
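
As a rough, dependency-free illustration of the preprocessing, bag-of-words, and word-similarity ideas in the steps above, consider the sketch below. It is not the tutorial's implementation. The stop-word list and function names are assumptions made for this example, lemmatization is omitted, and cosine similarity stands in for the similarity measure typically applied to word embeddings.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Tiny illustrative stop-word list (an assumption for this sketch only).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize on alphabetic runs, and drop stop words.

    Lemmatization is intentionally left out of this sketch.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def bag_of_words(tokens: List[str]) -> Dict[str, int]:
    """Map each distinct token to the number of times it occurs."""
    return dict(Counter(tokens))


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    tokens = preprocess("The cat sat in the hat.")
    print(tokens)
    print(bag_of_words(tokens))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In the actual workflow, trained embeddings would take the place of the hand-written vectors, and library routines would handle tokenization and stop-word filtering.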

Let's dive into the code!
Empty file.
@@ -0,0 +1,7 @@
-r ../../../common/requirements-common.in
numpy
gensim
nltk
plotly
pyemd
scikit-learn