Fake News Detector using Python & Machine Learning Techniques

Karthiga BM
4 min read · Feb 11, 2021

A data scientist on a quest to tell the fake news from the real.

Fake news detection

The ability to distinguish between reliable news stories and deliberate hoaxes or sarcastic news has become increasingly important with the spread of such information over social media networks. Big tech and social media companies are particularly interested in the reliability of content being disseminated on their platforms. These platforms would ideally like to be able to detect and flag articles suspected of being so-called “fake news” automatically.

This project covers training and testing a fake news detector using machine learning techniques. The dataset used for the task is the recently compiled and released open dataset described in this paper. This particular dataset contains headlines only: decisions about the legitimacy of the news articles must be based on the headline alone.

In the description below, I suggest some Python packages and tools that you can use to complete certain tasks. If you use another programming language for the project, you will need to find appropriate replacements.

Preprocessing

The first stage in the pipeline is to preprocess and clean the dataset.

Training and test splits

The very first thing you will need to do is split the data into training and test sets. Write a Python script to perform the split: 75% of the data for training and the remainder for testing. Take appropriate measures to ensure that the test set is not biased in any way. Store the resulting training and test sets in files using any convenient data format you like. Collect and record statistics on the resulting training and test sets, including the total numbers of real and fake news headlines in each set.

If you plan to use a validation set (as opposed to cross-validation) for model selection, this would be a good time to split off the validation set too.
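
As a concrete starting point, here is a minimal sketch of the split using pandas and scikit-learn. The file name headlines.csv and the headline/label column names are assumptions about the dataset layout, with 1 standing for real and 0 for fake.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# "headlines.csv" and the "headline"/"label" columns are assumptions
# about the dataset layout; here 1 = real and 0 = fake.
df = pd.read_csv("headlines.csv")

# Stratifying on the label keeps the real/fake proportions the same in
# both splits, so the test set is not biased toward either class.
train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=42
)

train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)

# Record the class counts for each split.
print(train_df["label"].value_counts())
print(test_df["label"].value_counts())
```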

Feature extraction

The second part of preprocessing will be to extract the features you will need for the remainder of the analysis. You may revisit this stage many times as you become more familiar with the dataset and the kinds of features that may be useful for the classification task. You may want to start by using a bag-of-words model here to transform the documents into a fixed-length representation suitable for classification. The sklearn.feature_extraction.text package may be useful here.

The features you choose will affect the performance of the final classifier, and there are many possibilities (e.g. stop word removal, TF-IDF encoding, infrequent word removal, etc.). Choose something you think is reasonable to start with and later you can experiment with alternatives on the validation set.
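
A simple bag-of-words baseline might look like the sketch below, which reuses the hypothetical train_df and test_df frames from the split step:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on the training headlines only; the test set is
# transformed with the same vocabulary to avoid information leakage.
vectorizer = CountVectorizer(
    stop_words="english",  # drop common English stop words
    min_df=2,              # drop words seen in fewer than 2 headlines
)
X_train = vectorizer.fit_transform(train_df["headline"])
X_test = vectorizer.transform(test_df["headline"])
```

Swapping CountVectorizer for TfidfVectorizer gives the TF-IDF encoding mentioned above with an otherwise identical interface.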

Exploratory data analysis

Use the training portion of the dataset to perform some exploratory data analysis. The goal at this stage is to become acquainted with the data and gain insights into the kinds of features that may be useful for classification.

Consider carefully which subset of the data should be used for exploratory analysis.

Find the top-20 most frequently used words in real and fake headlines and use a bar plot to show their relative frequencies. What can you say about these words? What changes when stop words are removed?
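
A sketch of the count for the real class, again assuming the train_df layout from the split step:

```python
from collections import Counter
import matplotlib.pyplot as plt

# Headlines labelled 1 = real (an assumption about the label encoding).
real_headlines = train_df[train_df["label"] == 1]["headline"]
counts = Counter(word for h in real_headlines for word in h.lower().split())

# Bar plot of the 20 most frequent words.
words, freqs = zip(*counts.most_common(20))
plt.bar(words, freqs)
plt.xticks(rotation=45, ha="right")
plt.title("Top-20 words in real headlines")
plt.tight_layout()
plt.show()
```

Repeat with label == 0 for the fake headlines; to answer the stop-word question, filter counts against sklearn.feature_extraction.text.ENGLISH_STOP_WORDS before plotting.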

Compare the distribution of headline lengths in real and fake headlines using appropriate plots (e.g. a boxplot). Are fake headlines usually shorter or longer?
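
One way to draw the comparison with seaborn, under the same assumptions about train_df:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Headline length measured in words.
train_df["length"] = train_df["headline"].str.split().str.len()

sns.boxplot(x="label", y="length", data=train_df)
plt.title("Headline length by class (0 = fake, 1 = real)")
plt.show()
```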

Supervised classification

Train a supervised classification model on your features and calculate validation accuracy, either on a hold-out validation set or using cross-validation. Record the final accuracy of the classifier. How many of the headlines are correctly classified by the model? How many are misclassified? Investigate the kinds of errors that are being made (e.g. using the sklearn.metrics package). Document all findings. Save the model to disk (e.g. using the Python pickle module).
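
Using MultinomialNB as an example, and reusing the X_train features and train_df labels from the sketches above:

```python
import pickle
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

y_train = train_df["label"]
clf = MultinomialNB()

# Cross-validated accuracy on the training data.
scores = cross_val_score(clf, X_train, y_train, cv=5)
print("validation accuracy: %.3f" % scores.mean())

# Cross-validated predictions let us inspect the kinds of errors made.
y_pred = cross_val_predict(clf, X_train, y_train, cv=5)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

# Fit on the full training set and save the model to disk.
clf.fit(X_train, y_train)
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)
```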

Model selection

Select multiple candidate models that you want to compare. This could include different classifiers (e.g. naive Bayes (MultinomialNB), logistic regression, SVMs, etc.), different hyperparameters, or different sets of features. Use a validation set or cross-validation to compare the accuracy of different models. Create plots to compare a subset of the models that you investigated during model selection. Retain the most effective model for evaluation.

It is important that you do a reasonably thorough investigation of different alternatives in this section.
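
As an illustrative starting point, a cross-validated comparison loop might look like this (the candidate list and hyperparameters here are examples, not a thorough search):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

candidates = {
    "naive Bayes": MultinomialNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": LinearSVC(),
}

# Compare all candidates on the same folds of the training data.
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print("%s: %.3f (+/- %.3f)" % (name, scores.mean(), scores.std()))
```

Plotting the resulting mean accuracies (e.g. a bar chart with error bars) makes the comparison easy to read, and a GridSearchCV over each model's hyperparameters makes the investigation more thorough.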

Model evaluation

Estimate the out-of-sample error for the model that you found to be most accurate during model selection by evaluating it on the held-out test set. Use the sklearn.metrics package (or similar) to benchmark the model in several ways. Create an ROC plot for the model. Compute the model’s AUC metric. Generate the confusion matrix for the model on the test set. Comment on the implications of the resulting confusion matrix for a real production classifier.
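
A sketch of this evaluation, assuming the fitted clf and the X_test/test_df objects from the earlier sketches, with 1 = real as the positive class. Note that predict_proba works for naive Bayes and logistic regression; for a LinearSVC you would use decision_function instead.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

y_test = test_df["label"]
y_pred = clf.predict(X_test)

# Probability of the positive class; column 1 corresponds to label 1
# because scikit-learn orders classes_ in ascending order.
y_score = clf.predict_proba(X_test)[:, 1]

print("AUC:", roc_auc_score(y_test, y_score))
print(confusion_matrix(y_test, y_pred))

# ROC curve on the held-out test set.
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curve on the test set")
plt.show()
```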

Code

You can use Python, MATLAB, R, or Julia for implementation, although I recommend using Python for this. You can use any external libraries you like (e.g. scikit-learn, pandas, seaborn, etc.). You can use both Python scripts and IPython notebooks for implementation as you see fit.

Find the entire script here: https://github.com/Karthiga-BM/Data-Analysis-Machine-Learning-Fake-News-Detection
