In this project, we attempt to classify research papers into their subject areas. By splitting abstracts into words and converting each word into a n-dimensional word embedding, we project the text data into a vector space with time-steps. In order to select the words, we use TF-IDF weights to find the most important words in the abstract and sort them based on the TF-IDF values. To classify the data, we use 2 flavors of Recurrent Neural networks (RNN's) namely : LSTM and GRU. We tried various Word Embedding (WE) models such as GloVe, SciBert and FastText. We also use Universal Sentence Encoder (USE) with Multi-layers perceptron (MLP) and char-level CNN to compare the performance of the above model.
- Keras
- tensorflow (tensorflow_gpu recommended)
- nltk
- pandas
- sklearn
- tensorflow_hub(for USE)
RNNs with WE: This model takes text data path (.csv file with data and label) and WE file path as inputs and classifies the data using RNN's.
In order to clean and order the data, use tf_idf sorting module. This module splits the sentences into words and removes stopwords, punctuations and numbers from the words. These words are lemmatized and then sorted based on TF-IDF values. This module takes 3 arguments:
- data_path : path to the text data
- max_len (optional) : maximum length of the words to be retained in each text sequence (in our case, abstract).
- tf_idf ordering (optional) : boolean value. set to 'True' to sort the values based on TF-IDF valuses. 'False' retains the order instead of sorting.
from tfidf_ordering import tfidf_ordering
tfidf_ordering(data_path,tfidf_sorting=True,max_len=80)
The cleaned data will be saved in a csv file named 'final_tfidf_ordered_data.csv'.
After cleaning the data, build and run the model to classify the data. Arguments for the model are:
- abstracts_path : provide the path to the cleaned data, i.e 'final_tfidf_ordered_data.csv'.
- WE_path : provide the path to the WE file.
- max_len (optional) : maximum length of the words to be retained in each text sequence (in our case, abstract). Default value-80.
- nodes (optional) : No of rnn cells in each layer. Default-128
- layers (optional) : No of rnn layers required. Default : 2
- loss (optional) : default='categorical_crossentropy',
- optimizer (optional) : default ='Adam'
- activation (optional) : default = 'tanh'
- dropout fraction (optional) : default = '0.2'
- batch_size (optional) : size of each batch for stochastic gradient descent. default=1000
- epochs (optional) : No of epochs for training
- gpus (optional) : No of gpus in case of multi gpu training. Default : None. If None, triggers cpu model.
Run the following to create a model object which takes abstracts_path and WE_path as input (add optional arguments if required) and prints accuracy and f1-score of the classification.
from RnnModelMain import Model
Model(abstracts_path, WE_path)
character-level CNN: Similarly, to implement character-level CNN model, implement the following:
from char_cnn import char_cnn
char_cnn(abstracts_path)
optional arguments are : batch_size,epochs and gpus
USE with MLP: To implement character-level CNN model, implement the following:
from use_with_mlp import MlpModelWithUSE
MlpModelWithUSE(abstracts_path)
optional arguments are : nodes, layers, loss,optimizer, activation, dropout, batch_size, epochs,gpus