Fake news is one of the most pressing issues facing social media ecosystems today. While fake news consumption has repeatedly been linked to the adoption of inaccurate beliefs and harmful behaviors, fake news about the COVID-19 pandemic is arguably even more dangerous because it can directly worsen public health outcomes. To combat this issue, social media companies have turned to machine learning (ML) to augment their ability to distinguish between fake and real news.
While ML has advanced our ability to identify and root out fake news, this approach is not without its limitations. Fake news, by its very nature, is constantly evolving. An ML algorithm capable of detecting fake news with high accuracy at one point in time may become significantly less accurate when applied to news sampled from a later time period. By examining how fake news detection algorithms decay over time, we hope to better understand the limitations of applying ML to the issue of fake news.
Using COVID-19 related social media data, we ask the following question: To what extent does the context-dependent and fast-moving nature of fake news represent a limitation for ML models?
Our project comprises two main components:
- Create a fake news detection algorithm using existing COVID-related real and fake news datasets
- Measure the decay of our fake news detection algorithm by applying it to recent COVID-related news
We use the COVID-19 Fake News dataset introduced by Patwa et al. in *Fighting an Infodemic: COVID-19 Fake News Dataset*. The dataset comprises 10,700 COVID-related social media posts from various platforms that have been hand-labeled as real or fake.
- Real posts comprise tweets from credible and relevant sources such as the World Health Organization (WHO) and the Centers for Disease Control and Prevention (CDC), among others.
- Fake posts include tweets, posts, and articles making claims about COVID-19 that have been labeled false by credible fact-checking sites such as PolitiFact.com.
Additionally, we use methods consistent with those employed by Patwa et al. to create our own dataset of recent real and fake news, which we use to measure how our model's performance decays over time.
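As a rough sketch of how these pieces fit together, the snippet below trains a simple TF-IDF plus logistic regression baseline on the Patwa et al. data and scores it on both the original validation split and a newer, self-collected dataset. This is an illustration only: the file names, column names (`tweet`, `label`), and label strings (`"real"`/`"fake"`) are assumptions and may differ from the actual data, and the final model need not be logistic regression.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

# Original splits from Patwa et al. (hypothetical file names)
train = pd.read_csv("data/Constraint_Train.csv")
val = pd.read_csv("data/Constraint_Val.csv")

# Newly collected, more recent posts labeled with the same procedure
recent = pd.read_csv("data/recent_covid_news.csv")

# Simple baseline: TF-IDF features fed into a logistic regression classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=50000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train["tweet"], train["label"])

# Scoring the same model on older vs. newer data quantifies decay
for name, df in [("original validation split", val), ("recent posts", recent)]:
    preds = model.predict(df["tweet"])
    print(name, f1_score(df["label"], preds, pos_label="fake"))
```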
We use the F1 score to evaluate our models' performance.
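For reference, the F1 score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Comparing this score on the original held-out data with the score on our newly collected data gives a concrete measure of how much the model decays over time; `sklearn.metrics.f1_score` computes it directly.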
The following software is required:

- Python ≥3.5
- numpy
- pandas
- matplotlib
- nltk
- sklearn ≥0.20
Project team:

- Hannah Schweren
- Marco Schildt
- Steve Kerr