The spread of misinformation and fake news on social media poses a major challenge to the well-being of individuals and societies, making robust, automated solutions for early detection of fake news imperative. Traditional approaches rely purely on the news content to classify an article as real or fake. However, the social context in which the news is published and spread can provide additional insight into the nature of fake news on social media and thereby improve the predictive power of fake news detection tools. Here we demonstrate how Amazon Neptune ML can be used to detect fake news based on both the content and the social context of news spread on social media.
In this work we use the BuzzFeed dataset from the 2018 version of FakeNewsNet. The dataset consists of a sample of news articles shared on Facebook by 9 news agencies during the week leading up to the 2016 U.S. election. Every post and the corresponding news article were fact-checked by BuzzFeed journalists. The table below summarizes key statistics of the dataset:
| Statistic | Count | Statistic | Count |
| --- | --- | --- | --- |
| # of Users | 15,257 | # of Engagements | 25,240 |
| # of Authors | 126 | # of News | 182 |
| # of Publishers | 28 | # of Fake News | 91 |
| # of Social Links | 634,750 | # of Real News | 91 |
We process the raw data from the FakeNewsNet repository and convert it into comma-separated values (CSV) files for the vertices (nodes) and edges of a heterogeneous property graph that can be readily loaded into an Amazon Neptune database and queried with Apache TinkerPop Gremlin. The constructed property graph is composed of 4 vertex types and 5 edge types, as shown in the schematic below, which together describe the social context in which each news item is published and spread.
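These CSV files follow the Neptune bulk loader's Gremlin load data format: vertex files carry `~id` and `~label` columns plus typed property columns, while edge files carry `~id`, `~from`, `~to`, and `~label` columns. The snippets below are a minimal sketch only; the `news` and `user` labels, the `spreads` edge label, and the property names are illustrative and may differ from the files generated by the first notebook. A vertex file might look like:

```
~id,~label,news_title:String,news_type:String
news_001,news,"Example headline",fake
```

and an edge file like:

```
~id,~from,~to,~label
e_001,user_042,news_001,spreads
```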
Graph Machine Learning with Amazon Neptune ML involves five main steps:
1. Data Export and Configuration – The data-export step uses the Neptune-Export service to export data from Neptune into Amazon S3 in CSV format. A configuration file named `training-data-configuration.json` is automatically generated, which specifies how the exported data can be loaded into a trainable graph.
2. Data Preprocessing – The exported dataset is preprocessed using standard techniques to prepare it for model training. Feature normalization can be performed for numeric data, and text features can be encoded using word2vec. At the end of this step, a DGL (Deep Graph Library) graph is generated from the exported dataset for the model training step. This step is implemented as a SageMaker processing job, and the resulting data is stored in an Amazon S3 location that you have specified.
3. Model Training – This step trains the machine learning model that will be used for predictions. Model training is done in two stages: the first stage uses a SageMaker processing job to generate a model training strategy configuration set that specifies what type of model and which model hyperparameter ranges will be used for training. The second stage uses a SageMaker model tuning job to try different hyperparameter configurations and select the training job that produced the best-performing model. The tuning job runs a pre-specified number of model training job trials on the processed data. At the end of this stage, the trained model parameters of the best training job are used to generate model artifacts for inference.
4. Creating an Inference Endpoint in Amazon SageMaker – The inference endpoint is a SageMaker endpoint instance launched with the model artifacts produced by the best training job. The endpoint accepts incoming requests from the graph database and returns the model predictions for the inputs in those requests.
5. Querying the ML Model using Gremlin – You can use extensions to the Gremlin query language to query predictions from the inference endpoint (a sample query is sketched after this list).
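As an illustration of step 5, the cell below sketches how a node classification prediction could be requested through the Neptune ML Gremlin extensions from a Neptune workbench notebook. The endpoint ID, the `news` vertex label, and the `news_type` property are placeholders for illustration; the `3-detect-fake-news-neptune-ml.ipynb` notebook builds the actual queries for this dataset.

```
%%gremlin
g.with("Neptune#ml.endpoint", "your-inference-endpoint-id").
  V().hasLabel('news').limit(5).
  properties('news_type').with("Neptune#ml.classification").
  value()
```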
Note: Use this CDK code to quickly spin up the required resources, including:
- an Amazon SageMaker notebook instance where you can run this code,
- an associated Amazon Neptune cluster, and
- all the configuration needed to work with Neptune ML from Amazon SageMaker.

The CDK code is specifically designed to work with this repository, so you shouldn't run into errors when running the notebooks. Note that a NotebookInstanceType of `ml.c5.9xlarge` is configured already to ensure you have enough memory to load the `UserFeature.mat` file when running the `1-create-graph-dataset.ipynb` notebook.
Once you have deployed the above CDK code:
- Navigate to the SageMaker console and expand the Notebook section on the left-hand side.
- Click Notebook instances.
- Find the notebook instance deployed by the CDK. Its name should start with `aws-neptune-notebook-for-neptunedbcluster`. If you don't see it, check that you are in the correct region and that the CDK deployment has finished.
- Click Open JupyterLab.
- Launch a new terminal.
- Run `cd SageMaker && git clone https://github.com/aws-samples/amazon-neptune-ml-fake-news-detection.git`
- You should see the folder `amazon-neptune-ml-fake-news-detection` on the left-hand side. Click into the folder, and you will see the notebooks described below as well as additional files and folders. Follow the steps below to run the notebooks.
- As an initial step, run through the `1-create-graph-dataset.ipynb` notebook to generate the graph dataset.
- Use the `2-load-graph-dataset.ipynb` notebook to load the graph dataset into an Amazon Neptune cluster.
- Use the `3-detect-fake-news-neptune-ml.ipynb` notebook to run through steps 1 to 5 of graph machine learning with Amazon Neptune ML above in an interactive way.
- Use the `4-inductive-inference.ipynb` notebook to go through an example of real-time inductive inference with Amazon Neptune ML (a sample query is sketched after this list).
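For context, inductive inference lets the endpoint return predictions for vertices added to the graph after the model was trained, by computing their embeddings at query time. The cell below is a minimal sketch of such a query, assuming a Neptune workbench notebook; the endpoint ID, the vertex ID, and the `news_type` property are placeholders, and `Neptune#ml.inductiveInference` is the Gremlin predicate Neptune ML uses to request inductive inference.

```
%%gremlin
g.with("Neptune#ml.endpoint", "your-inference-endpoint-id").
  V('id-of-a-newly-added-news-vertex').
  properties('news_type').
    with("Neptune#ml.classification").
    with("Neptune#ml.inductiveInference").
  value()
```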
If needed, you can run the `for-reference_clear-neptune-database.ipynb` notebook to clear the Amazon Neptune database. This can be helpful if you run the sample and then want to create your own data in Amazon Neptune.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.