Reddit's WallStreetBets comment generation project
Team members: Nhan Phan, Ryoko Noda.
This project was done for Aalto University's Statistical Natural Language Processing course. The project report can be found here.
The repository contains the code and data we used to create four language models that replicate posts from Reddit's WallStreetBets. Once put together, the models can generate somewhat WallStreetBets-like sentences like the ones below. These are real sentences generated by our Bi-LSTM model, which we posted on Reddit as an example (we have since removed the post):
Files here fall into four main categories:
- The datasets
- Code to scrape the datasets
- A notebook for data cleaning
- Notebooks for the language models
The repository also contains code that we used for experimental purposes before finalizing the project. The files used in the final version are listed below.
The datasets scraped from WallStreetBets can be found in the data folder. This folder contains datasets of various sizes (plus one dataset, used in an experiment, that is not from WallStreetBets). The full two-year dataset (63 MB, about 850,000 sentences) was not uploaded to GitHub. A minimal loading sketch appears after the list.
data_sample.txt: The sample dataset, 20,000 sentences.
data_sample_2x.txt: Double the size of the sample dataset, 40,000 sentences. This is our main dataset for the project report.
data_sample_4x.txt: An even bigger dataset, containing 80,000 sentences. We have several bigger datasets (not uploaded) that we used to test the limits of our hardware.
data_sample_test.txt: A very small dataset of 1,000 sentences. Can be used for tests.
reddit-cleanjokes.csv: A dataset used to run the sample LSTM models. NOT FROM WALLSTREETBETS.
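All of the plain-text datasets can be read the same way. This is a minimal sketch that assumes one sentence per line; the helper name and the choice of file are just illustrative:

```python
# Minimal sketch: load one of the plain-text datasets.
# Assumes one sentence per line; data_sample_test.txt is the small test set.
from pathlib import Path

def load_sentences(path):
    """Read a dataset file and return a list of non-empty sentences."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

sentences = load_sentences("data/data_sample_test.txt")
print(len(sentences), "sentences loaded")
print(sentences[:3])
```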
There are two web scraping scripts, one of which we abandoned after we found Pushshift. A sketch of the Pushshift approach follows the descriptions below.
WSBpmaw.py: The code used in the final version, which uses the Pushshift wrapper PMAW.
WSBPraw.py: The script that uses the more popular PRAW. It is useful for downloading live data but not historical data, so it was not used in our project.
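As an illustration of the PMAW approach (not the exact contents of WSBpmaw.py), fetching comments looks roughly like this; the time window and limit are placeholder values:

```python
# Hedged sketch of PMAW-based scraping; the parameters are illustrative,
# not the exact settings used in WSBpmaw.py.
from datetime import datetime, timezone
from pmaw import PushshiftAPI

api = PushshiftAPI()
after = int(datetime(2021, 1, 1, tzinfo=timezone.utc).timestamp())
before = int(datetime(2021, 2, 1, tzinfo=timezone.utc).timestamp())

# search_comments returns an iterable of comment dicts from Pushshift.
comments = api.search_comments(
    subreddit="wallstreetbets",
    after=after,
    before=before,
    limit=1000,
)
bodies = [c["body"] for c in comments]
print(f"Fetched {len(bodies)} comments")
```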
We used only one notebook for preprocessing the data. An illustrative cleaning sketch follows its description.
step_01_data_preprocessing.ipynb: Preprocesses the WallStreetBets datasets.
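The notebook documents the exact steps; the sketch below only illustrates the kind of cleaning involved (dropping deleted comments, stripping URLs and non-text characters). The regexes and function name are our illustrative choices, not the notebook's:

```python
import re

def clean_comment(text):
    """Illustrative cleaning: drop removed comments, strip URLs and symbols,
    lowercase, and collapse whitespace. Not the notebook's exact pipeline."""
    if text in ("[deleted]", "[removed]"):
        return ""
    text = re.sub(r"https?://\S+", "", text)            # strip URLs
    text = re.sub(r"[^a-zA-Z0-9\s'.,!?]", " ", text)    # drop emojis/markup
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_comment("BUY $GME!!! 🚀🚀 https://example.com to the moon"))
# -> "buy gme!!! to the moon"
```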
We tried four models in this project: n-grams, GRU, LSTM, and Bi-LSTM. The n-grams model has a Jupyter notebook to itself, while the GRU, LSTM, and Bi-LSTM models share a generic RNN notebook in which you can choose which model to use. Minimal sketches of both approaches appear after the file descriptions below.
step_02_ngrams.ipynb: The n-grams code for the final version. It implements a 5-gram model with absolute discounting smoothing.
step_03_RNN.ipynb: The generic RNN model used in the final version. You can choose GRU, LSTM, or Bi-LSTM within the notebook.
test_LSTM_kdnuggets.ipynb: An LSTM model from KDnuggets that we used to learn how LSTMs work.
test_LSTM_kdn_preprocess.ipynb: An experimental model where we added some data preprocessing to test_LSTM_kdnuggets.ipynb.
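For reference, this is a self-contained sketch of how 5-gram generation with absolute discounting can work. The discount value, the uniform backoff, and all names are our illustrative choices; the notebook's actual smoothing details may differ:

```python
# Hedged sketch of 5-gram generation with absolute discounting;
# not the exact implementation in step_02_ngrams.ipynb.
import random
from collections import Counter, defaultdict

N = 5            # order of the model (5-grams)
DISCOUNT = 0.75  # absolute-discount constant (illustrative value)

def train(sentences):
    """Count (N-1)-word contexts and their continuations."""
    contexts = defaultdict(Counter)
    for s in sentences:
        tokens = ["<s>"] * (N - 1) + s.split() + ["</s>"]
        for i in range(len(tokens) - N + 1):
            ctx = tuple(tokens[i:i + N - 1])
            contexts[ctx][tokens[i + N - 1]] += 1
    return contexts

def next_word_dist(contexts, ctx, vocab):
    """Absolute discounting: subtract DISCOUNT from each seen count and
    spread the freed mass uniformly over the vocabulary (a simple backoff)."""
    counts = contexts.get(ctx, Counter())
    total = sum(counts.values())
    if total == 0:
        return {w: 1 / len(vocab) for w in vocab}
    leftover = DISCOUNT * len(counts) / total
    dist = {w: max(c - DISCOUNT, 0) / total for w, c in counts.items()}
    for w in vocab:
        dist[w] = dist.get(w, 0) + leftover / len(vocab)
    return dist

def generate(contexts, vocab, max_len=20):
    ctx = ("<s>",) * (N - 1)
    out = []
    for _ in range(max_len):
        dist = next_word_dist(contexts, ctx, vocab)
        word = random.choices(list(dist), weights=dist.values())[0]
        if word == "</s>":
            break
        out.append(word)
        ctx = ctx[1:] + (word,)
    return " ".join(out)

sentences = ["buy the dip", "hold the line", "buy and hold"]
contexts = train(sentences)
vocab = {w for c in contexts.values() for w in c}
print(generate(contexts, vocab))
```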
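Similarly, here is a minimal sketch of the "choose your recurrent layer" idea from step_03_RNN.ipynb, written in Keras purely for illustration; we make no claim that the notebook uses this framework or these hyperparameters:

```python
# Hedged sketch of selecting GRU / LSTM / Bi-LSTM in one model definition;
# the framework (Keras) and all hyperparameters are illustrative.
from tensorflow.keras import layers, models

SEQ_LEN = 50  # illustrative input sequence length

def build_model(cell="bilstm", vocab_size=10000, embed_dim=128, units=256):
    """Next-word prediction model with a selectable recurrent layer."""
    rnn = {
        "gru": layers.GRU(units),
        "lstm": layers.LSTM(units),
        "bilstm": layers.Bidirectional(layers.LSTM(units)),
    }[cell]
    return models.Sequential([
        layers.Input(shape=(SEQ_LEN,)),
        layers.Embedding(vocab_size, embed_dim),
        rnn,
        layers.Dense(vocab_size, activation="softmax"),
    ])

model = build_model("bilstm")
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```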