
RetardBot

Reddit's WallStreetBets comment generation project
Team members: Nhan Phan, Ryoko Noda.

About the project

This project was done for Aalto University's Statistical Natural Language Processing course. The project report can be found here.

The repository contains the code and data we used to create four language models that replicate posts from Reddit's WallStreetBets. Once put together, the models should be able to generate somewhat WallStreetBets-like sentences. As an example, we posted sentences generated by our Bi-LSTM model on Reddit (we have since removed the post).

Files here fall into four main categories:
  1. The datasets
  2. Code to scrape the datasets
  3. A notebook for data cleaning
  4. Notebooks for the language models

The repository also contains code we used for experimental purposes before finalizing the project. The files used in the final version are listed in the sections below.

The datasets

The datasets scraped from WallStreetBets can be found in the data folder. The folder contains datasets of various sizes (plus one dataset, used in an experiment, that is not from WallStreetBets). The full two-year dataset (63 MB, about 850,000 sentences) was not uploaded to GitHub.

data_sample.txt: The sample dataset, 20,000 sentences.
data_sample_2x.txt: Double the size of the sample dataset, 40,000 sentences. This is the main dataset for our project report.
data_sample_4x.txt: A bigger dataset, containing 80,000 sentences. We also built several larger datasets (not uploaded) to test the limits of our hardware.
data_sample_test.txt: A very small dataset of 1,000 sentences, useful for quick tests.
reddit-cleanjokes.csv: A dataset used to run the sample LSTM models. NOT FROM WALLSTREETBETS.
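
The text datasets are plain files; here is a minimal loading sketch, assuming one sentence per line (an assumption to verify against the actual files):

```python
# Minimal sketch: load one of the plain-text datasets.
# Assumes one sentence per line; check the actual files to confirm.
from pathlib import Path

def load_sentences(path="data/data_sample_test.txt"):
    """Read a dataset file and return a list of non-empty sentences."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

sentences = load_sentences()
print(f"{len(sentences)} sentences, e.g. {sentences[0]!r}")
```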

Web scraping codes

There are two web scraping scripts, one of which we abandoned after we found Pushshift.

WSBpmaw.py: The script used in the final version. It uses PMAW, a Pushshift wrapper.
WSBPraw.py: The script that uses the more popular PRAW. PRAW is useful for downloading live data but not historical data, so it was not used in our project.
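
For reference, here is a minimal sketch of pulling historical comments with PMAW's PushshiftAPI. This is not the repository's WSBpmaw.py; the date range and limit are illustrative placeholders:

```python
# Minimal PMAW sketch: fetch historical WallStreetBets comments from Pushshift.
# Not the repository's WSBpmaw.py; all parameters are illustrative.
from datetime import datetime, timezone
from pmaw import PushshiftAPI

api = PushshiftAPI()
after = int(datetime(2021, 1, 1, tzinfo=timezone.utc).timestamp())
before = int(datetime(2021, 2, 1, tzinfo=timezone.utc).timestamp())

comments = api.search_comments(
    subreddit="wallstreetbets",
    after=after,
    before=before,
    limit=1000,  # small illustrative cap; the real dataset is far larger
)

with open("wsb_comments.txt", "w", encoding="utf-8") as f:
    for c in comments:
        f.write(c["body"].replace("\n", " ") + "\n")
```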

Data preprocessing

We used a single notebook for data preprocessing.

step_01_data_preprocessing.ipynb: Preprocesses the WallStreetBets datasets.
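
The notebook defines the actual pipeline; the following is only a rough sketch of the kind of cleaning such a notebook typically performs (these specific regex steps are assumptions, not a summary of the notebook):

```python
# Rough sketch of typical Reddit-text cleaning; the notebook's actual steps
# may differ. The regex patterns here are illustrative assumptions.
import re

def clean_comment(text: str) -> str:
    text = re.sub(r"http\S+", " ", text)              # drop URLs
    text = re.sub(r"&[a-z]+;", " ", text)             # drop HTML entities (&amp; etc.)
    text = re.sub(r"[^a-zA-Z0-9'.,!? ]", " ", text)   # keep only basic punctuation
    text = re.sub(r"\s+", " ", text)                  # collapse whitespace
    return text.strip().lower()

print(clean_comment("HOLD!! 🚀🚀 see https://example.com &amp; buy the dip"))
# -> "hold!! see buy the dip"
```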

Language models

We tried four models in this project: n-grams, GRU, LSTM, and Bi-LSTM. The n-grams model has a Jupyter notebook to itself, while the GRU, LSTM, and Bi-LSTM models share a generic RNN notebook in which you can choose which model to use.

step_02_ngrams.ipynb: The n-grams code for the final version. It implements a 5-gram model with absolute discounting (see the smoothing sketch after this list).
step_03_RNN.ipynb: The generic RNN model used in the final version. You can choose GRU, LSTM, or Bi-LSTM within the notebook (see the model sketch after this list).
test_LSTM_kdnuggets.ipynb: An LSTM model from KDnuggets that we used to learn what LSTM is like.
test_LSTM_kdn_preprocess.ipynb: An experimental model in which we added some data preprocessing to test_LSTM_kdnuggets.ipynb.
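
The exact implementation lives in step_02_ngrams.ipynb; the following is only a hedged illustration of absolute discounting, shown on bigrams rather than the notebook's 5-grams. Each observed count is reduced by a fixed discount d, and the freed probability mass is redistributed through a unigram backoff:

```python
# Illustration of absolute discounting on bigrams (the notebook uses 5-grams).
# Not the repository's code; a minimal sketch of the smoothing idea.
from collections import Counter

def absolute_discount_prob(bigrams, unigrams, w1, w2, d=0.75):
    """P(w2 | w1) with discounted counts and unigram backoff."""
    c1 = unigrams[w1]
    total = sum(unigrams.values())
    if c1 == 0:
        return unigrams[w2] / total  # unseen history: fall back to unigrams
    # The mass freed by discounting is spread over the backoff distribution,
    # weighted by how many distinct words followed w1.
    n_continuations = sum(1 for (a, _b) in bigrams if a == w1)
    lam = d * n_continuations / c1
    return max(bigrams[(w1, w2)] - d, 0) / c1 + lam * unigrams[w2] / total

tokens = "buy high sell low buy high hold".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(absolute_discount_prob(bigrams, unigrams, "buy", "high"))  # ~0.73
```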
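
step_03_RNN.ipynb lets you switch between GRU, LSTM, and Bi-LSTM. Here is a minimal sketch of how such a switch can look; PyTorch and every layer and dimension below are assumptions for illustration, not the notebook's actual architecture:

```python
# Minimal sketch of a switchable GRU / LSTM / Bi-LSTM model in PyTorch.
# An illustrative assumption, not the notebook's actual code.
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 model_type="bilstm"):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        bidirectional = model_type == "bilstm"            # "bilstm" -> Bi-LSTM
        rnn_cls = nn.GRU if model_type == "gru" else nn.LSTM
        self.rnn = rnn_cls(embed_dim, hidden_dim, batch_first=True,
                           bidirectional=bidirectional)
        out_dim = hidden_dim * (2 if bidirectional else 1)
        self.fc = nn.Linear(out_dim, vocab_size)

    def forward(self, x):
        out, _ = self.rnn(self.embed(x))
        return self.fc(out)  # per-step logits over the vocabulary

model = RNNLanguageModel(vocab_size=10_000, model_type="gru")
```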
