From e06462f9bca9725129ea9499bcf1b69b18b691c3 Mon Sep 17 00:00:00 2001 From: sobrad <> Date: Fri, 4 Dec 2020 16:14:59 -0500 Subject: [PATCH 01/10] added readme --- README.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/README.md b/README.md index b3d6525..9a77684 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,14 @@ # 2020-Text-Generation +2020-Text-Generation was developed by [Chenyu Zhang](czhang21@terpmail.umd.edu), [Stefan Obradovic](sobrad@umd.edu), [Joseph Chan](jchan123@terpmail.umd.edu), and [Noah Grinspoon](ngrinspoon@gmail.com) in the [Capital One Machine Learning Stream](https://www.fire.umd.edu/coml) for the University of Maryland First-Year Innovation & Research Experience (FIRE) program with help from Senior peer mentor [Derek Zhang](dzhang21@terpmail.umd.edu) and overseen by [Dr. Raymond Tu](https://huahongtu.me/). + +A [video](https://www.youtube.com/watch?v=-vTMY6ZG2iI) of the project was presented at the College Park [FIRE Summit](https://www.fire.umd.edu/summit) on Monday, November 16, 2020. + + +# Getting Started +## Example Notebooks: +* blah + +## Running the Project: +* blah + A series of products beginning from predicting the rest of a word given a few characters. 
From d315564e860f35e5a272cfbf812b71cfde1534f5 Mon Sep 17 00:00:00 2001 From: sobrad956 <70787990+sobrad956@users.noreply.github.com> Date: Fri, 4 Dec 2020 18:14:32 -0500 Subject: [PATCH 02/10] Update README.md --- README.md | 41 +++++++++++++++++++++++++++++++++++++---- 1 file changed, 37 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 9a77684..8b46ac6 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,47 @@ # 2020-Text-Generation -2020-Text-Generation was developed by [Chenyu Zhang](czhang21@terpmail.umd.edu), [Stefan Obradovic](sobrad@umd.edu), [Joseph Chan](jchan123@terpmail.umd.edu), and [Noah Grinspoon](ngrinspoon@gmail.com) in the [Capital One Machine Learning Stream](https://www.fire.umd.edu/coml) for the University of Maryland First-Year Innovation & Research Experience (FIRE) program with help from Senior peer mentor [Derek Zhang](dzhang21@terpmail.umd.edu) and overseen by [Dr. Raymond Tu](https://huahongtu.me/). +2020-Text-Generation was developed by [Chenyu Zhang](czhang21@terpmail.umd.edu), [Stefan Obradovic](sobrad@umd.edu), [Joseph Chan](jchan123@terpmail.umd.edu), and [Noah Grinspoon](ngrinspoon@gmail.com) in the [Capital One Machine Learning Stream](https://www.fire.umd.edu/coml) for the University of Maryland First-Year Innovation & Research Experience (FIRE) program with help from senior peer mentor [Derek Zhang](dzhang21@terpmail.umd.edu) and overseen by [Dr. Raymond Tu](https://huahongtu.me/). A [video](https://www.youtube.com/watch?v=-vTMY6ZG2iI) of the project was presented at the College Park [FIRE Summit](https://www.fire.umd.edu/summit) on Monday, November 16, 2020. +# Abstract +As technology continues to advance, it is our job as programmers to figure out how to use and optimize that kind of technology for machine learning. When we talk about machine learning, we look at the way machines operate to perform tasks for advanced security purposes or for every day use. 
For example, machine learning powers face recognition for added security and can also be used to create self-driving cars using object detection. Machine learning can also be used for detecting and predicting text patterns to help optimize typing time, create chatbots, and automatically generate text. The goal of our research is to create a model that predicts a complete sentence given a text input. We train a Recurrent Neural Network model to learn the English language based on hundreds of news articles (https://archive.ics.uci.edu/ml/datasets/NYSK), and the model considers the order in which words are sequenced in a sentence rather than just the words themselves. All the data is downloaded and each news article is parsed into its own txt document; this step removes all punctuation and stop words (common words like "the" and "and") and checks that every word is a valid English word by cross-referencing the English dictionary. Every sentence is then converted into a vector representation known as a "bag of words", which shows the frequency of each word. The data generator trains the model on only a couple of text documents at a time so that the full dataset does not have to be held in RAM. After training, we can give the model a sentence (vector) and it will give a prediction of what the next sentence will be. + +# Description +The aim of this project is to create a Machine Learning Model that generates output sentences given an input sentence by training a [Recurrent Neural Network (RNN) Model](https://en.wikipedia.org/wiki/Recurrent_neural_network) on hundreds of news articles about New York v. Strauss-Kahn, a case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. After training the model, a user can give the model an input sentence and the machine learning model will automatically generate a sentence as output. + +Recurrent Neural Networks (RNNs) are state-of-the-art deep learning algorithms for sequential data like text. 
This is because they can retain information about their previous inputs, here by using Long Short-Term Memory (LSTM) units. As such, we train a model to learn the sequence in which words (encoded as vectors) appear in a sentence and then generate text by sequentially predicting the next best word given the previous sequence of words. + +### Model Architecture: +Our model consists of an input layer where each token (character) in the input sentence is used as an input; an LSTM layer that learns the sequence of the tokens (characters) as words; a second input layer, where each learned word is used as an input; a second LSTM layer that learns the sequence of the words; and a Dense layer with softmax activation that transforms the outputs of the model into probability values. + # Getting Started ## Example Notebooks: -* blah +* The file `copy_training_USE_THIS_ONE.ipynb` is a python notebook that demonstrates how to use the data generator to download the data and process it as input to train the model. It also tests the trained model by providing input sentences and outputting the generated sentences from the model. ## Running the Project: -* blah +* Environment +** `environment.yml` +** `env_checker.sh` +** `PackageChecker.ipynb` + +* Data Checker +** `data_checker_script.py` +** `file_downloader.py` +** `word_lookup.py` +** `validation_script.py` +** `validation-script.py` +** `data.zip` +** `eng_dictionary.py` +** `Data_Viz.ipynb` + +* Data Processor +** `data_generator.py` + +* Model Builder +** `model.py` + +* Training the Model +** `training.ipynb` -A series of products beginning from predicting the rest of a word given a few characters. 
+* Testing the Model From 14c4d6544a76498eb6bd565681cb62c16e965b35 Mon Sep 17 00:00:00 2001 From: sobrad956 <70787990+sobrad956@users.noreply.github.com> Date: Fri, 4 Dec 2020 18:16:36 -0500 Subject: [PATCH 03/10] Update README.md --- README.md | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index 8b46ac6..035f62a 100644 --- a/README.md +++ b/README.md @@ -21,27 +21,28 @@ Our model consists of an input layer where each token (character) in the input s ## Running the Project: * Environment -** `environment.yml` -** `env_checker.sh` -** `PackageChecker.ipynb` + * `environment.yml` + * `env_checker.sh` + * `PackageChecker.ipynb` * Data Checker -** `data_checker_script.py` -** `file_downloader.py` -** `word_lookup.py` -** `validation_script.py` -** `validation-script.py` -** `data.zip` -** `eng_dictionary.py` -** `Data_Viz.ipynb` + * `data_checker_script.py` + * `file_downloader.py` + * `word_lookup.py` + * `validation_script.py` + * `validation-script.py` + * `data.zip` + * `eng_dictionary.py` + * `Data_Viz.ipynb` * Data Processor -** `data_generator.py` + * `data_generator.py` * Model Builder -** `model.py` + * `model.py` * Training the Model -** `training.ipynb` + * `training.ipynb` * Testing the Model + * `copy_training_USE_THIS_ONE.ipynb` From 92f830a74fb7ba986c1b9fccd4091aa5424970c1 Mon Sep 17 00:00:00 2001 From: sobrad956 <70787990+sobrad956@users.noreply.github.com> Date: Fri, 4 Dec 2020 18:21:48 -0500 Subject: [PATCH 04/10] Update README.md --- README.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 035f62a..cba207d 100644 --- a/README.md +++ b/README.md @@ -19,12 +19,14 @@ Our model consists of an input layer where each token (character) in the input s ## Example Notebooks: * The file `copy_training_USE_THIS_ONE.ipynb` is a python notebook that demonstrates how to use the data generator to download the data and process it 
as input to train the model. It also tests the trained model by providing input sentences and outputting the generated sentences from the model. -## Running the Project: +## Files: * Environment * `environment.yml` + * Description * `env_checker.sh` + * Description * `PackageChecker.ipynb` - + * Description * Data Checker * `data_checker_script.py` * `file_downloader.py` @@ -34,15 +36,17 @@ Our model consists of an input layer where each token (character) in the input s * `data.zip` * `eng_dictionary.py` * `Data_Viz.ipynb` - * Data Processor * `data_generator.py` - * Model Builder * `model.py` - * Training the Model * `training.ipynb` - * Testing the Model * `copy_training_USE_THIS_ONE.ipynb` + +## Running the Project: +1. Clone the project locally + * `git clone https://github.com/umd-fire-coml/2020-Text-Generation.git` +2. Create a Conda environment using the environment.yml file + * `conda env create -f environment.yml` From 29be4604226193787c4edb054cd6b70a1ebe78ec Mon Sep 17 00:00:00 2001 From: sobrad956 <70787990+sobrad956@users.noreply.github.com> Date: Fri, 4 Dec 2020 18:47:02 -0500 Subject: [PATCH 05/10] Update README.md --- README.md | 38 +++++++++++++++++++++++++++++++------- 1 file changed, 31 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index cba207d..b2cf90a 100644 --- a/README.md +++ b/README.md @@ -22,31 +22,55 @@ Our model consists of an input layer where each token (character) in the input s ## Files: * Environment * `environment.yml` - * Description + * This yaml file defines the name of a Conda environment along with all necessary installations for the project to work * `env_checker.sh` - * Description + * This shell script checks the current environment to see if it has all required packages installed * `PackageChecker.ipynb` - * Description + * This notebook shows which packages have been installed in the current environment and which packages still need to be installed * Data Checker - * `data_checker_script.py` * 
`file_downloader.py` + * This file is used to download the dataset used for training + * `data_checker_script.py` + * This file is used to check if the dataset has downloaded correctly * `word_lookup.py` - * `validation_script.py` + * This file is used to check if a word is valid using tqdm * `validation-script.py` + * This file is used to validate whether or not the data downloading and processing step has been performed correctly * `data.zip` + * Description * `eng_dictionary.py` + * This file is used to check if a word is a valid word in the English Dictionary * `Data_Viz.ipynb` + * This file is a basic visualization for the downloaded data. It represents words in news articles as vectors and computes word similarity using Word2Vec * Data Processor * `data_generator.py` + * This file is a data generator that allows the model to train on small batches of data rather than training with the entire dataset in memory * Model Builder * `model.py` + * This file builds the RNN deep learning model for text generation * Training the Model * `training.ipynb` + * This notebook is used to train the RNN model and test it. It uses the data generator as part of training * Testing the Model * `copy_training_USE_THIS_ONE.ipynb` + * This is a demonstration notebook that shows an example of model training, testing, and output on a small subset of training data ## Running the Project: -1. Clone the project locally +1. Clone the project locally (In a terminal) * `git clone https://github.com/umd-fire-coml/2020-Text-Generation.git` -2. Create a Conda environment using the environment.yml file +2. Enter the 2020-Text-Generation folder + * `cd 2020-Text-Generation` +3. Create a Conda environment using the environment.yml file * `conda env create -f environment.yml` +4. Activate the Conda environment + * `conda activate text-generation` +5. Run the environment checker in the current directory to check if the environment has required packages installed + * `env_checker` +6. 
Run the file downloader to download the dataset + * `python file_downloader` +7. Run the data checker script to check if the data is correctly downloaded + * `python data_checker_script +8. Run the data validation script to check if the data is valid + * `python validation-script.py` +9. Run the Training notebook. This uses the data generator to generate input data, builds the model, and trains it for 20 epochs. The model testing results are also displayed for a given input sentence. + * `training.ipynb` From d1c6a714c2435c95be2fa0565dd22d98e07bc51f Mon Sep 17 00:00:00 2001 From: sobrad956 <70787990+sobrad956@users.noreply.github.com> Date: Fri, 4 Dec 2020 18:48:11 -0500 Subject: [PATCH 06/10] Update README.md --- README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index b2cf90a..0d3ee96 100644 --- a/README.md +++ b/README.md @@ -20,14 +20,14 @@ Our model consists of an input layer where each token (character) in the input s * The file `copy_training_USE_THIS_ONE.ipynb` is a python notebook that demonstrates how to use the data generator to download the data and process it as input to train the model. It also tests the trained model by providing input sentences and outputting the generated sentences from the model. 
## Files: -* Environment +* **Environment** * `environment.yml` * This yaml file defines the name of a Conda environment along with all necessary installations for the project to work * `env_checker.sh` * This shell script checks the current environment to see if it has all required packages installed * `PackageChecker.ipynb` * This notebook shows which packages have been installed in the current environment and which packages still need to be installed -* Data Checker +* **Data Checker** * `file_downloader.py` * This file is used to download the dataset used for training * `data_checker_script.py` @@ -42,16 +42,16 @@ Our model consists of an input layer where each token (character) in the input s * This file is used to check if a word is a valid word in the English Dictionary * `Data_Viz.ipynb` * This file is a basic visualization for the downloaded data. It represents words in news articles as vectors and computes word similarity using Word2Vec -* Data Processor +* **Data Processor** * `data_generator.py` * This file is a data generator that allows the model to train on small batches of data rather than training with the entire dataset in memory -* Model Builder +* **Model Builder** * `model.py` * This file builds the RNN deep learning model for text generation -* Training the Model +* **Training the Model** * `training.ipynb` * This notebook is used to train the RNN model and test it. 
It uses the data generator as part of training -* Testing the Model +* **Testing the Model** * `copy_training_USE_THIS_ONE.ipynb` * This is a demonstration notebook that shows an example of model training, testing, and output on a small subset of training data From e7e7db08e1371d693009efede1e68212a197d8e5 Mon Sep 17 00:00:00 2001 From: sobrad956 <70787990+sobrad956@users.noreply.github.com> Date: Fri, 4 Dec 2020 18:48:50 -0500 Subject: [PATCH 07/10] Update README.md --- README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 0d3ee96..3bd6666 100644 --- a/README.md +++ b/README.md @@ -65,12 +65,12 @@ Our model consists of an input layer where each token (character) in the input s 4. Activate the Conda environment * `conda activate text-generation` 5. Run the environment checker in the current directory to check if the environment has required packages installed - * `env_checker` + * `env_checker` 6. Run the file downloader to download the dataset - * `python file_downloader` + * `python file_downloader` 7. Run the data checker script to check if the data is correctly downloaded - * `python data_checker_script + * `python data_checker_script 8. Run the data validation script to check if the data is valid - * `python validation-script.py` + * `python validation-script.py` 9. Run the Training notebook. This uses the data generator to generate input data, builds the model, and trains it for 20 epochs. The model testing results are also displayed for a given input sentence. 
- * `training.ipynb` + * `training.ipynb` From da482c7ee21cd3d5bb3c1075ef712390a1d337a2 Mon Sep 17 00:00:00 2001 From: sobrad956 <70787990+sobrad956@users.noreply.github.com> Date: Fri, 4 Dec 2020 18:49:11 -0500 Subject: [PATCH 08/10] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3bd6666..013e994 100644 --- a/README.md +++ b/README.md @@ -69,7 +69,7 @@ Our model consists of an input layer where each token (character) in the input s 6. Run the file downloader to download the dataset * `python file_downloader` 7. Run the data checker script to check if the data is correctly downloaded - * `python data_checker_script + * `python data_checker_script` 8. Run the data validation script to check if the data is valid * `python validation-script.py` 9. Run the Training notebook. This uses the data generator to generate input data, builds the model, and trains it for 20 epochs. The model testing results are also displayed for a given input sentence. From 31d337ec5d6287155fa48f59df18c89ae267656a Mon Sep 17 00:00:00 2001 From: sobrad956 <70787990+sobrad956@users.noreply.github.com> Date: Fri, 4 Dec 2020 18:55:34 -0500 Subject: [PATCH 09/10] Update README.md --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 013e994..f88ee26 100644 --- a/README.md +++ b/README.md @@ -74,3 +74,6 @@ Our model consists of an input layer where each token (character) in the input s * `python validation-script.py` 9. Run the Training notebook. This uses the data generator to generate input data, builds the model, and trains it for 20 epochs. The model testing results are also displayed for a given input sentence. * `training.ipynb` + +## References: +A. Mittal, “Understanding RNN and LSTM,” Medium, 12-Oct-2019. [Online]. Available: https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e. [Accessed: 04-Dec-2020]. 
From 8aca768371bcd53af32edec26b0b71126513d5fb Mon Sep 17 00:00:00 2001 From: sobrad956 <70787990+sobrad956@users.noreply.github.com> Date: Fri, 4 Dec 2020 19:13:03 -0500 Subject: [PATCH 10/10] Update README.md --- README.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index f88ee26..74f71a4 100644 --- a/README.md +++ b/README.md @@ -76,4 +76,7 @@ Our model consists of an input layer where each token (character) in the input s * `training.ipynb` ## References: -A. Mittal, “Understanding RNN and LSTM,” Medium, 12-Oct-2019. [Online]. Available: https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e. [Accessed: 04-Dec-2020]. +1. A. Mittal, “Understanding RNN and LSTM,” Medium, 12-Oct-2019. [Online]. Available: https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e. [Accessed: 04-Dec-2020]. +2. TensorFlow, "Word2Vec," TensorFlow, 30-Oct-2020. [Online]. Available: https://www.tensorflow.org/tutorials/text/word2vec. [Accessed: 04-Dec-2020]. +3. U. Kumar, "Natural Language Generation using Sequence Models," TowardsDataScience, 29-Jun-2020. [Online]. Available: https://towardsdatascience.com/how-our-device-thinks-e1f5ab15071e. [Accessed: 04-Dec-2020] +4. D. Zhang, "RNN: Training An English Major," Medium, 17-Apr-2019. [Online]. Available: https://medium.com/@dzhang21/rnn-an-english-major-c62175228144. [Accessed: 04-Dec-2020].
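
The preprocessing that the Abstract describes (strip punctuation, drop stop words, keep only dictionary words, then count word frequencies as a "bag of words") might look roughly like the sketch below. This is only an illustration under stated assumptions: `STOP_WORDS`, `DICTIONARY`, `clean_tokens`, and `bag_of_words` are hypothetical names invented here, the tiny word sets stand in for a real English dictionary, and none of this is taken from the project's `data_generator.py` or `eng_dictionary.py`.

```python
import string
from collections import Counter

# Hypothetical stand-ins: the README says words are checked against an
# English dictionary; a five-word set keeps this sketch self-contained.
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}
DICTIONARY = {"court", "judge", "case", "dismissed", "heard"}


def clean_tokens(sentence):
    """Lowercase, strip punctuation, drop stop words and non-dictionary words."""
    table = str.maketrans("", "", string.punctuation)
    words = sentence.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS and w in DICTIONARY]


def bag_of_words(sentence, vocabulary):
    """Return word counts for the sentence, aligned to a fixed vocabulary order."""
    counts = Counter(clean_tokens(sentence))
    return [counts[word] for word in vocabulary]


# Fixed, sorted vocabulary so every sentence maps to a vector of equal length.
vocabulary = sorted(DICTIONARY)
vector = bag_of_words(
    "The judge heard the case, and the case was dismissed.", vocabulary
)
print(dict(zip(vocabulary, vector)))
```

A count vector like this discards word order, so a sequence model would still need the ordered token list (here, the output of `clean_tokens`) if it is to learn how words follow one another.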