Algorithm

Last Updated: 03-04-2023
Objective

Welcome to Talewind! This project takes text as input and outputs a highlighted version of it, much like a plagiarism checker; instead of plagiarism, however, it highlights any media bias it detects. The analysis is done on a sentence-by-sentence basis and is rendered through displaCy's beautiful span visualizer (a minimal rendering sketch follows the list below). We currently support 7 types of media bias, listed as follows:


Content Bias
Partisan Bias
False Balance
Ventriloquism
Demographic Bias
Undue Weight
Corporate Bias
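
To give a taste of how the rendering works, here is a minimal sketch of highlighting a span with displaCy's span visualizer. The sentence, span boundaries, and label are illustrative only; they are not output from our model.

```python
# Minimal sketch: render one illustrative bias span with displaCy's span
# visualizer (style="span"). The sentence, token indices, and label below
# are made up for demonstration; they are not produced by the Talewind model.
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The senator's radical agenda has alarmed ordinary families.")
# Tokens 0-4 cover "The senator's radical agenda"
doc.spans["sc"] = [Span(doc, 0, 5, label="Partisan Bias")]

# Returns an HTML string that can be embedded directly in the front-end.
html = displacy.render(doc, style="span", options={"spans_key": "sc"})
```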


We hope to deliver a solution that makes news and academic reporting fairer and cleaner, with an easy-to-use front-end for maximum accessibility. The whole project is currently deployed on Vercel, and as a small prototype it is not expected to handle large amounts of traffic.



Dataset Description
No suitable dataset for this task could be found anywhere. In order to finalize a small prototype, we therefore used synthetic data generated by GPT-3 and GPT-4. The prompt used to generate the dataset is given below:

"Synthesize a dataset in plain text delimited by ';'. The dataset's header's are as follows: Sentence, Content Bias, Partisan Bias, False balance, Ventriloquism, Demographic bias, Undue Weight, Corporate

Fill the data by creating sentences from news media, articles and journalist writing pieces and sentences written inside articles. Generate long and short sentences and significantly vary the length of the sentences. Then fill the data with 1 or 0 in the following columns based on if the sentence has that bias present or not. Keep variance high in data. Also include sentences which have no bias at all. Do not use code block to generate the data. Generate as much data as possible."


Each prompt generated approximately 16-20 rows and produced many duplicates. Generating this dataset took over 24 hours of prompting, and we extend our deepest gratitude to OpenAI for its development of such a powerful tool. The final dataset contained roughly 1,800 sentence samples, each standing alone with no surrounding context. We realise that this is a relatively small dataset for a multi-label classification task with 7 labels, but as a proof of concept and a prototype, the performance has been deemed more than satisfactory. Improving data collection is definitely the first priority for future work.
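
For reference, here is a sketch of how such ';'-delimited output can be loaded and deduplicated. The file name is a placeholder and the use of pandas is an assumption; this mirrors the columns from the prompt above but is not our exact ingestion code.

```python
# Sketch: load the ';'-delimited GPT output and drop duplicate sentences.
# The file name "synthetic_bias_data.txt" is a placeholder, and pandas is
# an assumption, not necessarily the project's actual ingestion tooling.
import pandas as pd

COLUMNS = ["Sentence", "Content Bias", "Partisan Bias", "False Balance",
           "Ventriloquism", "Demographic Bias", "Undue Weight",
           "Corporate Bias"]

df = pd.read_csv("synthetic_bias_data.txt", sep=";", names=COLUMNS,
                 header=None, skipinitialspace=True)
df = df.drop_duplicates(subset="Sentence").reset_index(drop=True)
df[COLUMNS[1:]] = df[COLUMNS[1:]].astype(int)  # 1/0 bias indicators
```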

Methodology

The model is a relatively simple feed-forward neural network (FFNN); its summary() dump is reproduced below.


Layer (type)          Output Shape    Param #
---------------------------------------------
dense_4 (Dense)       (None, 384)     147,840
dropout_2 (Dropout)   (None, 384)     0
dense_5 (Dense)       (None, 128)     49,280
dropout_3 (Dropout)   (None, 128)     0
dense_7 (Dense)       (None, 32)      4,128
dense_8 (Dense)       (None, 7)       231
---------------------------------------------
Total params: 201,479
Trainable params: 201,479
Non-trainable params: 0
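
For readers who want to reproduce the architecture, the summary above is consistent with the following Keras Sequential sketch. The layer widths and parameter counts match the table exactly, and the 20% dropout rate and binary cross-entropy loss are described in the next paragraph; the ReLU activations, Adam optimizer, and sigmoid output are assumptions (sigmoid being the standard pairing with binary cross-entropy for multi-label classification).

```python
# Sketch of the architecture implied by the summary above. Layer widths and
# parameter counts match the table; activations, optimizer, and the sigmoid
# output are assumptions, not confirmed details of the original code.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(384,)),             # 384-dim sentence embedding
    layers.Dense(384, activation="relu"),   # 147,840 params
    layers.Dropout(0.2),
    layers.Dense(128, activation="relu"),   # 49,280 params
    layers.Dropout(0.2),
    layers.Dense(32, activation="relu"),    # 4,128 params
    layers.Dense(7, activation="sigmoid"),  # 231 params; one unit per bias
])

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
```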

The loss function used was Keras' default binary cross-entropy, and the model was trained for 50 epochs. We do not recommend training for that long on a dataset this small, which is also why we incorporated two Dropout layers with a 20% rate; as a prototype, however, the model functions well. The model's output is then filtered using a threshold value (undisclosed for now), and the sentences are labelled based on this filtered output. The labelled sentences are passed to displaCy to render as HTML, which is then sent back as-is to display the sentence-wise analysis.
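
As an illustration of the thresholding step, the following sketch shows how per-sentence probabilities might be turned into labels. Since the production threshold is undisclosed, 0.5 is used purely as a placeholder.

```python
# Sketch of the post-processing step: the model emits 7 probabilities per
# sentence, and a label is kept only where its probability clears a threshold.
# The production threshold is undisclosed; 0.5 below is a placeholder.
import numpy as np

BIAS_LABELS = ["Content Bias", "Partisan Bias", "False Balance",
               "Ventriloquism", "Demographic Bias", "Undue Weight",
               "Corporate Bias"]
THRESHOLD = 0.5  # placeholder value

def labels_for(probs: np.ndarray) -> list[str]:
    """Return the bias labels whose predicted probability clears the threshold."""
    return [label for label, p in zip(BIAS_LABELS, probs) if p >= THRESHOLD]

# Example: a sentence scored low everywhere except Partisan Bias.
print(labels_for(np.array([0.1, 0.8, 0.2, 0.0, 0.3, 0.1, 0.05])))
# -> ['Partisan Bias']
```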


Results

The Loss, Accuracy, Precision and Recall charts are shown below. To see how the model works in the real world, we encourage you to try our demo website: click on "Home" to get started. The input for training and inference alike is a 384-dimensional embedding of each sentence, exactly as it runs in production today; a sketch of producing such embeddings follows.
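
The embedding model is not named in this document, so the sketch below uses sentence-transformers' all-MiniLM-L6-v2, chosen only because its output dimensionality (384) matches the one stated above; it may well differ from our production encoder.

```python
# Sketch: produce 384-dimensional sentence embeddings. all-MiniLM-L6-v2 is
# shown only because it outputs 384 dimensions, matching the figure above;
# the encoder actually used in production is an open question here.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(["Sentence-level bias detection is hard."])
assert embeddings.shape == (1, 384)
```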


[Figures: training curves for Epoch Accuracy, Epoch Precision, Epoch Recall, and Epoch Loss]