Reddit's WallStreetBets comment generation project
Team members: Nhan Phan, Ryoko Noda.
This project was done for Aalto University's Statistical Natural Language Processing course. The project report can be found here.
The repository contains the code and data we used to create four language models that replicate posts from Reddit's WallStreetBets. Once put together, the models can generate somewhat WallStreetBets-like sentences like the ones below. These are real sentences generated by our Bi-LSTM model, which we posted on Reddit as an example (we have since removed the post):
Files here fall into four main categories:
- The datasets
- Code to scrape the datasets
- A notebook for data cleaning
- Notebooks for the language models
The repository also contains code that we used for experimental purposes before finalizing the project. The files used in the final version are listed below.
The datasets scraped from WallStreetBets can be found in the data folder. This folder contains datasets of various sizes (plus one dataset, used in an experiment, that is not from WallStreetBets). The full two-year dataset (63 MB, about 850,000 sentences) was not uploaded to GitHub. A minimal loading sketch appears after the list.
data_sample.txt: The sample dataset, 20,000 sentences.
data_sample_2x.txt: Double the size of the sample dataset, 40,000 sentences. This is our main dataset for the project report.
data_sample_4x.txt: An even bigger dataset, containing 80,000 sentences. We have several bigger datasets (not uploaded) that we used to test the limits of our hardware.
data_sample_test.txt: A very small dataset of 1,000 sentences. Can be used for tests.
reddit-cleanjokes.csv: A dataset used to run the sample LSTM models. NOT FROM WALLSTREETBETS.
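All of the plain-text datasets can be read the same way. This is a minimal sketch that assumes one sentence per line; the helper name and the choice of file are just illustrative:

```python
# Minimal sketch: load one of the plain-text datasets.
# Assumes one sentence per line; data_sample_test.txt is the small test set.
from pathlib import Path

def load_sentences(path):
    """Read a dataset file and return a list of non-empty sentences."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

sentences = load_sentences("data/data_sample_test.txt")
print(len(sentences), "sentences loaded")
print(sentences[:3])
```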
There are two web scraping scripts, one of which we abandoned after we found Pushshift. A sketch of the Pushshift approach follows the descriptions below.
WSBpmaw.py: The code used in the final version, which uses the Pushshift wrapper PMAW.
WSBPraw.py: The script that uses the more popular PRAW. It is useful for downloading live data but not historical data, so it was not used in our project.
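As an illustration of the PMAW approach (not the exact contents of WSBpmaw.py), fetching comments looks roughly like this; the time window and limit are placeholder values:

```python
# Hedged sketch of PMAW-based scraping; the parameters are illustrative,
# not the exact settings used in WSBpmaw.py.
from datetime import datetime, timezone
from pmaw import PushshiftAPI

api = PushshiftAPI()
after = int(datetime(2021, 1, 1, tzinfo=timezone.utc).timestamp())
before = int(datetime(2021, 2, 1, tzinfo=timezone.utc).timestamp())

# search_comments returns an iterable of comment dicts from Pushshift.
comments = api.search_comments(
    subreddit="wallstreetbets",
    after=after,
    before=before,
    limit=1000,
)
bodies = [c["body"] for c in comments]
print(f"Fetched {len(bodies)} comments")
```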
We used only one notebook for preprocessing the data. An illustrative cleaning sketch follows its description.
step_01_data_preprocessing.ipynb: Preprocesses the WallStreetBets datasets.
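The notebook documents the exact steps; the sketch below only illustrates the kind of cleaning involved (dropping deleted comments, stripping URLs and non-text characters). The regexes and function name are our illustrative choices, not the notebook's:

```python
import re

def clean_comment(text):
    """Illustrative cleaning: drop removed comments, strip URLs and symbols,
    lowercase, and collapse whitespace. Not the notebook's exact pipeline."""
    if text in ("[deleted]", "[removed]"):
        return ""
    text = re.sub(r"https?://\S+", "", text)            # strip URLs
    text = re.sub(r"[^a-zA-Z0-9\s'.,!?]", " ", text)    # drop emojis/markup
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_comment("BUY $GME!!! 🚀🚀 https://example.com to the moon"))
# -> "buy gme!!! to the moon"
```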
We tried four models in this project: n-grams, GRU, LSTM, and Bi-LSTM. The n-grams model has a Jupyter notebook to itself, while the GRU, LSTM, and Bi-LSTM models share a generic RNN notebook in which you can choose which model to use. Minimal sketches of both approaches appear after the file descriptions below.
step_02_ngrams.ipynb: The n-grams code for the final version. It implements a 5-gram model with absolute discounting smoothing.
step_03_RNN.ipynb: The generic RNN model used in the final version. You can choose GRU, LSTM, or Bi-LSTM within the notebook.
test_LSTM_kdnuggets.ipynb: An LSTM model from KDnuggets that we used to learn how LSTMs work.
test_LSTM_kdn_preprocess.ipynb: An experimental model where we added some data preprocessing to test_LSTM_kdnuggets.ipynb.
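For reference, this is a self-contained sketch of how 5-gram generation with absolute discounting can work. The discount value, the uniform backoff, and all names are our illustrative choices; the notebook's actual smoothing details may differ:

```python
# Hedged sketch of 5-gram generation with absolute discounting;
# not the exact implementation in step_02_ngrams.ipynb.
import random
from collections import Counter, defaultdict

N = 5            # order of the model (5-grams)
DISCOUNT = 0.75  # absolute-discount constant (illustrative value)

def train(sentences):
    """Count (N-1)-word contexts and their continuations."""
    contexts = defaultdict(Counter)
    for s in sentences:
        tokens = ["<s>"] * (N - 1) + s.split() + ["</s>"]
        for i in range(len(tokens) - N + 1):
            ctx = tuple(tokens[i:i + N - 1])
            contexts[ctx][tokens[i + N - 1]] += 1
    return contexts

def next_word_dist(contexts, ctx, vocab):
    """Absolute discounting: subtract DISCOUNT from each seen count and
    spread the freed mass uniformly over the vocabulary (a simple backoff)."""
    counts = contexts.get(ctx, Counter())
    total = sum(counts.values())
    if total == 0:
        return {w: 1 / len(vocab) for w in vocab}
    leftover = DISCOUNT * len(counts) / total
    dist = {w: max(c - DISCOUNT, 0) / total for w, c in counts.items()}
    for w in vocab:
        dist[w] = dist.get(w, 0) + leftover / len(vocab)
    return dist

def generate(contexts, vocab, max_len=20):
    ctx = ("<s>",) * (N - 1)
    out = []
    for _ in range(max_len):
        dist = next_word_dist(contexts, ctx, vocab)
        word = random.choices(list(dist), weights=dist.values())[0]
        if word == "</s>":
            break
        out.append(word)
        ctx = ctx[1:] + (word,)
    return " ".join(out)

sentences = ["buy the dip", "hold the line", "buy and hold"]
contexts = train(sentences)
vocab = {w for c in contexts.values() for w in c}
print(generate(contexts, vocab))
```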
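Similarly, here is a minimal sketch of the "choose your recurrent layer" idea from step_03_RNN.ipynb, written in Keras purely for illustration; we make no claim that the notebook uses this framework or these hyperparameters:

```python
# Hedged sketch of selecting GRU / LSTM / Bi-LSTM in one model definition;
# the framework (Keras) and all hyperparameters are illustrative.
from tensorflow.keras import layers, models

SEQ_LEN = 50  # illustrative input sequence length

def build_model(cell="bilstm", vocab_size=10000, embed_dim=128, units=256):
    """Next-word prediction model with a selectable recurrent layer."""
    rnn = {
        "gru": layers.GRU(units),
        "lstm": layers.LSTM(units),
        "bilstm": layers.Bidirectional(layers.LSTM(units)),
    }[cell]
    return models.Sequential([
        layers.Input(shape=(SEQ_LEN,)),
        layers.Embedding(vocab_size, embed_dim),
        rnn,
        layers.Dense(vocab_size, activation="softmax"),
    ])

model = build_model("bilstm")
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```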