A collection of self-trained neural networks for classifying Reddit posts and comments.
Where2Spreddit classifies any sentence into a relevant category (subreddit) on the popular discussion site Reddit. The included trainer automates training, with customizable options, on four neural network architectures:
- Convolutional Neural Networks (CNNs)
- Simple Recurrent Neural Networks (RNNs)
- Long Short-Term Memory networks (LSTMs)
- Gated Recurrent Units (GRUs)
The following diagram illustrates the flow of the Where2Spreddit neural networks:
From the root directory of the repo, run:
pip3 install -r requirements.txt
It is highly recommended to use a clean virtual environment to prevent conflicts with other packages.
Run pip install git+https://github.com/crazyfrogspb/RedditScore.git
Run
python -m spacy download en_core_web_lg
python -m spacy download en_core_web_sm
If you plan to train the models on a GPU, install the following CUDA 10.2 build of PyTorch instead:
pip3 install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio===0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
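After installing the CUDA build, it can be worth confirming that PyTorch actually sees the GPU. The following is a minimal sketch and not part of the repository: pick_device is a hypothetical helper name, and it falls back to the CPU when PyTorch is missing or no CUDA device is visible.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    # Hypothetical helper for illustration; not part of Where2Spreddit.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Training would run on: {pick_device()}")
```

If this reports cpu after the GPU install, the installed wheel or the local CUDA driver is the likely place to look.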
For inference, run the following in the virtual environment:
python predictor.py
The predictor script reads a sentence from user input, classifies it, and recommends an appropriate subreddit.
Because of the convolution layer dimensions in the CNN, the input sentence must contain at least 4 words.
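The four-word minimum can be checked before a sentence reaches the models. The sketch below is an assumption about how such a guard might look, not code from predictor.py: it counts whitespace-separated words, whereas the tokenizer actually used may count tokens differently, and has_enough_words and MIN_WORDS are hypothetical names.

```python
MIN_WORDS = 4  # minimum stated in this README for the CNN


def has_enough_words(sentence: str, min_words: int = MIN_WORDS) -> bool:
    """Return True if the sentence has at least `min_words` whitespace-separated words."""
    return len(sentence.split()) >= min_words


print(has_enough_words("How much mass does a black hole have?"))  # True
print(has_enough_words("Black holes?"))  # False
```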
Example session:
Enter a sentence: How much mass does a black hole have?
----- Baseline -----
askscience: 100.0%
askreddit: 0.0%
history: 0.0%
----- CNN -----
askscience: 91.22%
jokes: 8.055%
askreddit: 0.725%
----- RNN -----
askscience: 99.874%
science: 0.126%
history: 0.0%
----- GRU -----
askscience: 100.0%
askreddit: 0.0%
science: 0.0%
----- LSTM -----
askscience: 93.295%
jokes: 6.203%
history: 0.249%
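Each block in the example lists the three highest-scoring subreddits for one model. As a rough sketch of how such a ranking might be produced from raw class scores, the code below assumes a softmax over the model outputs; this README does not say how the percentages are actually computed, and the function name, labels, and scores here are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def top_k_percentages(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Softmax the raw scores and return the k most likely labels as percentages."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - peak) for label, score in scores.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    return [(label, round(100 * value / total, 3)) for label, value in ranked[:k]]


# Hypothetical raw scores, for illustration only.
example_scores = {"askscience": 6.0, "jokes": 3.5, "askreddit": 1.0, "history": -2.0}
for label, pct in top_k_percentages(example_scores):
    print(f"{label}: {pct}%")
```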
For training, run the following in the virtual environment:
python main.py
The main script accepts the following options:
- model -- type of model to train ['baseline', 'rnn', 'cnn', 'gru', 'lstm']
- batch_size -- size of the mini-batch used for training
- epochs -- number of epochs (passes over the training data) to train for
- lr -- learning rate for the current model
- emd-dim -- number of embedding dimensions per word
- rnn-hidden-dim -- number of hidden dimensions in the RNN model
- save -- save the trained model as a .pt file
- tokenzer -- which word tokenizer to use ['spacy', 'crazy', 'nltk']
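The options above could be wired up with argparse roughly as follows. This is a sketch, not the actual contents of main.py: the -- prefixes, the types, the default values, and treating save as a boolean switch are all assumptions. Only the option names and the allowed choices come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (defaults are placeholders)."""
    parser = argparse.ArgumentParser(description="Train a Where2Spreddit model (sketch).")
    parser.add_argument("--model", choices=["baseline", "rnn", "cnn", "gru", "lstm"],
                        default="baseline")
    parser.add_argument("--batch_size", type=int, default=64)       # assumed default
    parser.add_argument("--epochs", type=int, default=10)           # assumed default
    parser.add_argument("--lr", type=float, default=0.001)          # assumed default
    parser.add_argument("--emd-dim", type=int, default=100)         # assumed default
    parser.add_argument("--rnn-hidden-dim", type=int, default=100)  # assumed default
    parser.add_argument("--save", action="store_true")              # assumed to be a switch
    parser.add_argument("--tokenzer", choices=["spacy", "crazy", "nltk"],
                        default="spacy")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--model", "gru", "--epochs", "5", "--save"])
print(args.model, args.epochs, args.emd_dim, args.save)
```

Note that argparse exposes hyphenated flags with underscores, so emd-dim becomes args.emd_dim.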