
DISCO: Comprehensive and Explainable Disinformation Detection

"DISCO" is a disinformation detection toolkit. An online demo video is available here, a preprint paper is available here.

1. Function of DISCO

  • Input: A batch of suspicious news articles

  • Output:

    • The fake news probability and real news probability for a news article query
    • A misleading-degree ranking of each word in the query article

2. Required Library

  • numpy 1.20.1
  • scipy 1.6.2
  • pandas 1.2.4
  • nltk 3.6.2
  • gensim 4.0.1
  • scikit-learn (imported as sklearn) 0.24.1
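The dependencies above can be installed with pip. Note that the `sklearn` module is distributed under the package name `scikit-learn`; the version pins below simply mirror the list above:

```shell
# Install the pinned versions listed in the README.
pip install numpy==1.20.1 scipy==1.6.2 pandas==1.2.4 \
            nltk==3.6.2 gensim==4.0.1 scikit-learn==0.24.1
```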

3. Quick Start

  • Download the code
  • Download pre-trained word2vec model (here or here) and put it in the "pretrained-word2vec" folder
  • Run the "gui_disco.py" to get the software as shown in the demo video
  • [Optional]: You can train DISCO from scratch as follows
    • First, put the raw fake news data and raw real news data in the "raw-dataset" folder and run "data_preprocessing.py". Then feature_matrix.pkl and label_matrix.pkl will be saved automatically in the "preprocessed-dataset" folder.
    • Then, run "model_training.py" to obtain the inner classifier of DISCO; it will be saved automatically in the "trained-classifier" folder.
    • Now you have the complete DISCO and can run "gui_disco.py" to launch the software shown in the demo video.
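The from-scratch pipeline above can be run as a short sequence of commands (assuming the scripts are invoked directly with `python` from the repository root, which is an assumption about the repo layout):

```shell
# Step 1: preprocess raw data into feature/label matrices
python data_preprocessing.py   # writes preprocessed-dataset/feature_matrix.pkl and label_matrix.pkl

# Step 2: train the inner classifier
python model_training.py       # writes the classifier into trained-classifier/

# Step 3: launch the GUI as shown in the demo video
python gui_disco.py
```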

4. Technical Logic of DISCO

  • Building Word Graph. We construct an undirected word graph for each input news article. Briefly, if two words co-occur in a sliding window of a specified length, an edge connects those two words. For example, for "I eat an apple" with a window of length 3, the edges would be {I-eat, I-an, eat-an, eat-apple, an-apple} (with stop words kept). More details on constructing a word graph can be found in TextRank.
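A minimal sketch of this sliding-window construction (the function `build_word_graph` is illustrative, not the repository's actual code; tokenization and stop-word handling are simplified):

```python
from itertools import combinations

def build_word_graph(tokens, window=3):
    """Build an undirected co-occurrence word graph.

    Two words are connected by an edge if they co-occur inside a
    sliding window of the given length (stop words kept, as in the
    "I eat an apple" example).
    """
    edges = set()
    for start in range(len(tokens) - 1):
        # Take the words inside the current window position.
        window_tokens = tokens[start:start + window]
        for u, v in combinations(window_tokens, 2):
            if u != v:
                edges.add(frozenset((u, v)))
    # Return edges as sorted pairs for readability.
    return {tuple(sorted(e)) for e in edges}

edges = build_word_graph(["I", "eat", "an", "apple"], window=3)
# Reproduces the README's example: {I-eat, I-an, eat-an, eat-apple, an-apple}
```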
  • Geometric Feature Extraction. We use the idea of SDG to obtain node embeddings. Briefly, a node's representation aggregates its neighbors' features weighted by its personalized PageRank vector. We then apply a pooling function (such as sum pooling or mean pooling) to aggregate the node embeddings into a graph-level representation vector for each constructed word graph.
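The PPR-weighted aggregation plus pooling step can be sketched as follows (a simplified power-iteration PPR, not SDG's actual algorithm; `ppr_matrix` and `graph_embedding` are hypothetical names):

```python
import numpy as np

def ppr_matrix(A, alpha=0.15, iters=50):
    """Approximate the personalized PageRank vector of every node
    by power iteration. A: (n, n) adjacency matrix; returns P where
    row P[i] is node i's PPR distribution over all nodes."""
    n = A.shape[0]
    # Row-normalize the adjacency matrix into a transition matrix.
    T = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    P = np.eye(n)
    for _ in range(iters):
        # Restart to the seed node with prob. alpha, else take a step.
        P = alpha * np.eye(n) + (1 - alpha) * P @ T
    return P

def graph_embedding(A, X, alpha=0.15):
    """Node embeddings = PPR-weighted combinations of node features X;
    graph-level vector = mean pooling over the node embeddings."""
    P = ppr_matrix(A, alpha)
    H = P @ X            # each row: one node's aggregated representation
    return H.mean(axis=0)
```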
  • Neural Detection. We train a model-agnostic classification module as the inner classifier of DISCO.
  • Misleading Degree Analysis. With the support of SDG, we can mask any word node in the constructed word graph and quickly track the new personalized PageRank to obtain the new graph-level embedding vector. Without fine-tuning the inner classifier of DISCO, we can then investigate each word's contribution (positive or negative) toward the ground-truth label's predicted probability.
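Conceptually, the masking analysis removes one word node at a time, re-embeds the graph, and measures the change in the classifier's probability for the ground-truth label. A toy sketch (without SDG's fast PPR tracking; `embed_fn` and `prob_fn` stand in for DISCO's embedding and inner classifier):

```python
import numpy as np

def masked_embedding(A, X, node, embed_fn):
    """Graph-level embedding with one word node removed (masked)."""
    keep = [i for i in range(A.shape[0]) if i != node]
    return embed_fn(A[np.ix_(keep, keep)], X[keep])

def misleading_degrees(A, X, embed_fn, prob_fn, true_label):
    """Score each word by how much masking it changes the classifier's
    probability for the ground-truth label. A positive score means the
    word pushed the prediction toward the truth; a negative score means
    it was misleading."""
    base = prob_fn(embed_fn(A, X))[true_label]
    scores = []
    for node in range(A.shape[0]):
        p = prob_fn(masked_embedding(A, X, node, embed_fn))[true_label]
        scores.append(base - p)
    return np.array(scores)
```

Ranking the words by these scores (most negative first) yields the misleading-degree ranking described above.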
  • [Optional]: You can access our additional repository for more thorough disinformation studies, including different inner classifiers, truncated feature dimensions, label noise injection, etc.

Reference

If you use the materials from this repository, please cite our paper.

@inproceedings{DBLP:conf/cikm/FuBTMH22,
  author    = {Dongqi Fu and
               Yikun Ban and
               Hanghang Tong and
               Ross Maciejewski and
               Jingrui He},
  editor    = {Mohammad Al Hasan and
               Li Xiong},
  title     = {{DISCO:} Comprehensive and Explainable Disinformation Detection},
  booktitle = {Proceedings of the 31st {ACM} International Conference on Information
               {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
  pages     = {4848--4852},
  publisher = {{ACM}},
  year      = {2022},
  url       = {https://doi.org/10.1145/3511808.3557202},
  doi       = {10.1145/3511808.3557202},
  timestamp = {Wed, 19 Oct 2022 17:09:02 +0200},
  biburl    = {https://dblp.org/rec/conf/cikm/FuBTMH22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}