malay-fake-news-classification

Malay Fake News Classification using:
1. CNN [3]
2. BiLSTM [4]
3. C-LSTM [5]
4. RCNN [6]
5. FT-BERT [7]
6. BERTCNN (a method unique to this project that feeds the sequence output of the last BERT layer into CNN layers; see the sketch below).
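
A minimal Keras sketch of the BERTCNN idea follows, assuming the Hugging Face transformers library and a generic multilingual BERT checkpoint; the project itself uses the pre-trained Malay BERT from [2], and the layer sizes here are illustrative assumptions rather than the project's tuned hyperparameters.

import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 128  # standard BERT sequence length in this project

# Placeholder checkpoint; the project uses the pre-trained Malay BERT from [2]
bert = TFBertModel.from_pretrained("bert-base-multilingual-cased")

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Sequence output of the last BERT layer: (batch, MAX_LEN, hidden_size)
sequence_output = bert(input_ids, attention_mask=attention_mask).last_hidden_state

# Token-level representations fed to CNN layers, Kim-style [3]
conv = tf.keras.layers.Conv1D(filters=128, kernel_size=3, activation="relu")(sequence_output)
pooled = tf.keras.layers.GlobalMaxPooling1D()(conv)
outputs = tf.keras.layers.Dense(2, activation="softmax")(pooled)  # real vs. fake

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])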

The preprocessed Word2Vec files on which Items 1-4 heavily depend can be obtained via:
https://www.dropbox.com/s/pm9rrynspp16det/malay_word2vec.zip?dl=0
See my "malay-word2vec-tsne" repo for how they were preprocessed.
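
For models 1-4, a typical way to use these files is to build an embedding matrix that initializes a Keras Embedding layer. The sketch below assumes gensim's KeyedVectors API and a hypothetical file name inside the zip; the actual format is documented in the "malay-word2vec-tsne" repo.

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical file name; check the contents of malay_word2vec.zip
w2v = KeyedVectors.load("malay_word2vec.kv")

# Reserve row 0 for padding; rows 1..n hold the pre-trained vectors
word_index = {word: i + 1 for i, word in enumerate(w2v.key_to_index)}
embedding_matrix = np.zeros((len(word_index) + 1, w2v.vector_size))
for word, idx in word_index.items():
    embedding_matrix[idx] = w2v[word]
# embedding_matrix can now seed a (typically frozen) Keras Embedding layer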

This project also produced a filtered Malay fake news dataset, which can be downloaded as malaya_fake_news_preprocessed_dataframe.pkl
(available via the link in malay-fake-news-dataset.txt or at
https://www.dropbox.com/s/i5yx6e426m8frgs/malaya_fake_news_preprocessed_dataframe.pkl?dl=0).
News articles from the original dataset [1] that none of the models could classify correctly are treated as outliers and filtered out.

The following Python snippet loads and displays the dataset:

import pandas as pd

# Load the preprocessed dataframe from the pickle file
df_allnews_unpickled = pd.read_pickle("./malaya_fake_news_preprocessed_dataframe.pkl")
df_allnews_unpickled

Column descriptions:
news: Original news articles with minimal cleaning: lowercased, with spaces inserted around specific symbols and the suffixes "hb" and "th", e.g. 4th/13hb -> 4 th / 13 hb.
tokens: Tokenized words from the news column, with numbers changed from digits to ordinal spellings.
rejoined: Sentences rejoined from the tokens column, used mostly for the BERT models since they have their own tokenizer.
length: Sentence length in tokens.
label: Class label: 1 for real news, 0 for fake news.
real: One-hot encoding column for real news.
fake: One-hot encoding column for fake news.
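
Purely to illustrate the schema, the length, real, and fake columns can be reconstructed from tokens and label (column names as described above):

# Illustrative only: these columns already exist in the pickle
df_allnews_unpickled["length"] = df_allnews_unpickled["tokens"].apply(len)
df_allnews_unpickled["real"] = (df_allnews_unpickled["label"] == 1).astype(int)
df_allnews_unpickled["fake"] = (df_allnews_unpickled["label"] == 0).astype(int)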
Further information regarding this dataset can be found in the following table.


The following experiments/modifications were performed before filtering the outliers to obtain the best result/dataset:

  • Normal: All fake news articles originally from [1] are considered.
  • <1000: Only news articles with fewer than 1000 words are considered, because longer ones are very few in number.
  • Trunc128: All news articles are truncated to a maximum sequence length of 128 (the standard for BERT models in this project).
  • Summarized: News articles with more than 200 words are first summarized using TF-IDF scores and a Hopfield network, then truncated to a sequence length of 128; a sketch follows this list. The summarization method can be found in my "article-summarization" project.
  • Filtered: All news articles that none of the models can classify correctly are considered outliers and removed from the original dataset.
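
Below is a minimal sketch of the Summarized/Trunc128 preparation using plain TF-IDF sentence scoring with scikit-learn. The actual pipeline also uses a Hopfield network (see the "article-summarization" project), which is omitted here, and the number of sentences kept is an assumption.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

MAX_LEN = 128         # standard BERT sequence length in this project
SUMMARIZE_OVER = 200  # articles longer than this (in words) get summarized

def summarize_and_truncate(sentences, top_k=5, max_len=MAX_LEN):
    """Keep the top_k sentences by mean TF-IDF score (in original order),
    then truncate the rejoined text to max_len tokens."""
    words = " ".join(sentences).split()
    if len(words) > SUMMARIZE_OVER and len(sentences) > top_k:
        tfidf = TfidfVectorizer().fit_transform(sentences)
        scores = np.asarray(tfidf.mean(axis=1)).ravel()
        keep = sorted(np.argsort(scores)[-top_k:])  # best sentences, original order
        sentences = [sentences[i] for i in keep]
    return " ".join(sentences).split()[:max_len]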


Disclaimer: The "how-to" files may display some old results, but the process and methodology shown are accurate.

The work done in this project is part of the following publication:
"A Benchmark Evaluation Study for Malay Fake News Classification Using Neural Network Architectures"
Published in Kazan Digital Week 2020: Methodical and Informational Science Journal, Vestnik NTsBZhD(4), pp. 5-13, 2020.
https://ncbgd.tatarstan.ru/rus/file/pub/pub_2610566.pdf
http://www.vestnikncbgd.ru/index.php?id=1&lang=en
https://kazandigitalweek.com/

The original dataset, toolkit and pre-trained BERT model are provided by:
[1] Zolkepli, Husein. "Malay-Dataset." Github-huseinzol05/Malay-Dataset: Text corpus for Bahasa Malaysia. https://github.com/huseinzol05/Malay-Dataset
[2] Zolkepli, Husein. "Malaya." Github-huseinzol05/Malaya: Natural-Language-Toolkit for Bahasa Malaysia. https://github.com/huseinzol05/Malaya

The chosen model architectures for this project are applications of the following papers:
[3] Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).
[4] Nowak, Jakub, Ahmet Taspinar, and Rafał Scherer. "LSTM recurrent neural networks for short text and sentiment classification." In International Conference on Artificial Intelligence and Soft Computing, pp. 553-562. Springer, Cham, 2017.
[5] Zhou, Chunting, Chonglin Sun, Zhiyuan Liu, and Francis Lau. "A C-LSTM neural network for text classification." arXiv preprint arXiv:1511.08630 (2015).
[6] Lai, Siwei, Liheng Xu, Kang Liu, and Jun Zhao. "Recurrent convolutional neural networks for text classification." In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[7] Devlin, Jacob. "Github-google-research/bert: TensorFlow code and pre-trained models for BERT." Github.com. https://github.com/google-research/bert (accessed March 09, 2020).