How We Used Sentiment Analysis to Learn More About Our Politicians: Congress Connect

Ryan Rana
7 min read · Jul 27, 2023

Congress Connect is an app that lets people learn about congressmen's views on various subjects based on their tweets, using sentiment analysis and machine learning.

Overview

When citizens want to learn more about politicians' views on a subject, they are often overloaded with articles and information that may contain contradictions, bias, and unreliable claims. They can also read a congressman's Twitter page, but even then there is a vast number of posts to sift through before finding what they are looking for. As a result, many people are left uninformed about their congressmen's actual views, deepening the barrier between the people and Congress. I was inspired to make an app to solve this issue and better connect people with their government.

Congress Connect provides a solution to this issue, and it's easy to use. All the user has to do is open the app and enter a congressman's Twitter handle and a word relating to a subject they want to learn about; examples include masks, abortion, and gun policies. The app then displays a short statement about that congressman's political views on the topic, along with other information.

Photo by MIKE STOLL on Unsplash

Backend

Once the user submits an input, the program uses a Python library called TWINT to scrape all of that congressman's tweets that contain the keyword. This data is then stored in a DataFrame to be analyzed.

However, we still need to determine the political views. To do this, we use sentiment analysis, a technique for determining whether a piece of text expresses a positive, negative, or neutral attitude, combined with a recurrent machine learning model.

To do this, a dataset of over 14,000 tweets, each classified as positive, negative, or neutral, is used. Machines don't understand words and can only work with numbers, so we convert the text into sequences of numeric tokens, where each distinct word is represented numerically, and then map those tokens to vector embeddings. Each tweet is then padded with filler tokens so that all sequences are the same length.

The machine learning model used is an LSTM (Long Short-Term Memory) network, a type of recurrent artificial neural network. It is well suited here because it processes a tweet sequentially and feeds what it has seen in earlier parts of the tweet back into the model. The model is trained and tested with an 80/20 split. Training is essentially calibrating the model's internal weights from random values to stronger connections for higher accuracy by running over the data thousands of times; the testing stage simply evaluates the accuracy of the model. The LSTM reached 95.43% accuracy.

For example, the original tweet could read,

If true, the SCOTUS ruling would devastate half a century of precedent and progress. Congress has the ability to codify Roe v. Wade into law. We cannot wait.

The program would output,

Congressman Tom Malinowski has a positive stance on Roe v. Wade.
This is proven when he/she said, “If true, the SCOTUS ruling would devastate half a century of precedent and progress. Congress has the ability to codify Roe v. Wade into law. We cannot wait. Speak out. Organize. And vote!”

For the user's input, each tweet in the DataFrame goes through the model and is given a number as the output: 1 for negative, 2 for neutral, and 3 for positive. The average of all the tweets' outputs is calculated and rounded to negative, neutral, or positive. The program then takes the three tweets with the highest prediction confidence and formulates them into a proper argumentative stance, displayed to the user in the format “Congressman ____ has a ____ stance on ____, because he said, ____, ____, and ____.” (A sketch of this aggregation step is shown below.) The app can also display additional information, such as a table of all the tweets, congressmen with similar viewpoints, a link if there is one on the congressman's Twitter, and suggestions for other topics.
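
As an illustration, here is a minimal sketch of that aggregation step. This helper is hypothetical rather than taken from the original source code: it assumes the trained model, the tokenizer, and the pad_sequences helper from the walkthrough below, plus a DataFrame tweets_df with a 'tweet_text' column.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

LABELS = {1: 'negative', 2: 'neutral', 3: 'positive'}

# Hypothetical helper: aggregate per-tweet predictions into one overall stance
def summarize_stance(tweets_df, model, tokenizer, maxlen=50, top_k=3):
    # Tokenize and pad every tweet exactly as during training
    sequences = tokenizer.texts_to_sequences(tweets_df['tweet_text'])
    padded = pad_sequences(sequences, padding='post', maxlen=maxlen)
    probs = model.predict(padded)              # shape: (n_tweets, 3)
    sentiments = np.argmax(probs, axis=1) + 1  # 1, 2, or 3 per tweet
    confidences = np.max(probs, axis=1)        # confidence of each prediction
    overall = int(round(sentiments.mean()))    # average, rounded to one class
    # The top_k most confidently classified tweets support the final statement
    top_idx = np.argsort(confidences)[-top_k:][::-1]
    top_tweets = tweets_df['tweet_text'].iloc[top_idx].tolist()
    return LABELS[overall], top_tweets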

This app used Python for all the backend functions, including organizing and transforming the data, the sentiment analysis, and the LSTM. ReactJS was used for the front end to take in the inputs and display the timeline, the generated sentence, and the other information. React was specifically chosen so the app can be launched on all platforms, including iOS and Android.

Syntax Breakdown

Importing necessary libraries: This section imports the Python libraries required for various functionalities in the code, such as pandas for data manipulation, NumPy for numerical operations, TensorFlow and Keras for building and training the LSTM model, and Tokenizer and pad_sequences for text preprocessing.

import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

Data collection with TWINT (Assuming you have collected and stored tweets in a DataFrame named ‘tweets_df’): This section assumes that you have already collected tweets using TWINT (a Twitter scraping library in Python) and stored them in a pandas DataFrame named ‘tweets_df’. TWINT allows you to scrape tweets based on the Twitter handle of a congressman and a specific keyword.
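
For reference, here is a minimal sketch of what that collection step might look like. The handle and keyword are example values, not from the original project, and column names can vary between TWINT versions; TWINT's pandas output typically names the tweet text column 'tweet', so it is renamed here to match the 'tweet_text' column used below.

import twint

# Sketch of tweet collection with TWINT (example values)
c = twint.Config()
c.Username = "tommalinowski"   # example congressman's Twitter handle
c.Search = "Roe"               # example topic keyword
c.Pandas = True                # store results in a pandas DataFrame
c.Hide_output = True           # suppress console output
twint.run.Search(c)

tweets_df = twint.storage.panda.Tweets_df
# Rename TWINT's 'tweet' column to the 'tweet_text' name used in this walkthrough
tweets_df = tweets_df.rename(columns={'tweet': 'tweet_text'})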

Preparing the dataset for sentiment analysis: Here, we create a new data frame called ‘data’ that consists of the ‘text’ (tweet text) and ‘sentiment’ (positive, negative, or neutral) columns from the ‘tweets_df’. The ‘sentiment’ column contains the sentiment label assigned to each tweet during data collection.

# Preparing the dataset for sentiment analysis
data = pd.DataFrame({'text': tweets_df['tweet_text'], 'sentiment': tweets_df['sentiment']})

Convert sentiment labels to numerical values (0: negative, 1: neutral, 2: positive): Since machines understand numerical values rather than text labels, we map the sentiment labels ‘negative’, ‘neutral’, and ‘positive’ to the numerical values 0, 1, and 2, respectively. (Keras’s sparse_categorical_crossentropy loss expects class labels starting at 0; we add 1 back at prediction time to recover the 1/2/3 convention used elsewhere.) This allows us to use these numerical values for training the machine learning model.

# Convert sentiment labels to numerical values (0: negative, 1: neutral, 2: positive)
# Note: sparse_categorical_crossentropy expects 0-based class labels
data['sentiment'] = data['sentiment'].map({'negative': 0, 'neutral': 1, 'positive': 2})

Splitting the dataset into training and testing sets: The ‘data’ DataFrame is shuffled and then split into two sets: a training set (80% of the data) and a testing set (20% of the data). The training set is used to train the LSTM model, while the testing set is used to evaluate the model’s performance.

# Splitting the dataset into training and testing sets
# Shuffle first so the split is not biased by the original ordering
data = data.sample(frac=1, random_state=42).reset_index(drop=True)
train_size = int(0.8 * len(data))
train_data = data[:train_size]
test_data = data[train_size:]

Tokenizing and padding the text data for LSTM: Tokenization is the process of converting text into numerical tokens or sequences. The Tokenizer class from Keras is used to tokenize the text data and create a numerical representation for each word. Padding ensures that all sequences have the same length; here, shorter tweet sequences are padded with a special padding token so they are all equal in length (50 tokens in this example).

# Tokenizing and padding the text data for LSTM
max_words = 10000
tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(train_data['text'])

train_sequences = tokenizer.texts_to_sequences(train_data['text'])
train_padded = pad_sequences(train_sequences, padding='post', maxlen=50)

test_sequences = tokenizer.texts_to_sequences(test_data['text'])
test_padded = pad_sequences(test_sequences, padding='post', maxlen=50)

LSTM model:

  • The LSTM model is created using the Sequential API from Keras.
  • The Embedding layer converts the numerical token representation of words into dense vectors (embedding vectors).
  • The LSTM layer is a recurrent neural network layer that can capture the sequential nature of the tweet data.
  • The Dense layers are fully connected layers that process the output of the LSTM layer.
  • The output layer has three neurons (one for each sentiment class) and uses the softmax activation function to produce probability distributions over the classes.
# LSTM model
embedding_dim = 128
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=50))
model.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=3, activation='softmax')) # 3 output neurons for 3 sentiment classes
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Training the model:

  • The model is trained using the training data (tweet sequences and corresponding sentiment labels).
  • The model is compiled with a loss function (sparse_categorical_crossentropy) suitable for multi-class classification tasks, an optimizer (Adam), and the evaluation metric (accuracy).
  • The training is performed over 10 epochs (iterations over the entire training data) with a batch size of 32.
# Training the model
model.fit(train_padded, train_data['sentiment'], epochs=10, batch_size=32, validation_split=0.2)

Evaluating the model: After training, the model is evaluated on the testing data to measure its accuracy.

# Evaluating the model
test_loss, test_accuracy = model.evaluate(test_padded, test_data['sentiment'])
print(f"Test Accuracy: {test_accuracy*100:.2f}%")

Prediction for user input:

  • The function ‘get_congressman_stance’ takes user input as a parameter (tweet text related to a subject) and predicts the sentiment of that input using the trained LSTM model.
  • The input text is tokenized, padded, and fed into the model to get a probability distribution over the three sentiment classes.
  • The index of the class with the highest probability (plus one) is returned as the predicted sentiment (1 for negative, 2 for neutral, 3 for positive).
# Prediction for user input
def get_congressman_stance(user_input):
    input_sequence = tokenizer.texts_to_sequences([user_input])
    input_padded = pad_sequences(input_sequence, padding='post', maxlen=50)
    prediction = model.predict(input_padded)
    predicted_sentiment = np.argmax(prediction) + 1

    return predicted_sentiment  # Returns 1, 2, or 3 for negative, neutral, and positive respectively
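
For illustration, a quick usage example (the input text here is hypothetical):

# Example usage (hypothetical input text)
labels = {1: 'negative', 2: 'neutral', 3: 'positive'}
sentiment = get_congressman_stance("Congress must codify Roe v. Wade into law.")
print(f"Predicted sentiment: {labels[sentiment]}")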

In the future, this app can be expanded by tuning the model to push its accuracy closer to 100%, as well as by enhancing the layout and design. This app has real potential to help educate the people of this nation about their congressmen.

Links

View the UI: https://www.figma.com/file/IAgdleqkwSsdvUKfX57naY/Congress-Connect?type=design&node-id=0%3A1&mode=design&t=4rtQr3YslXDOQSrK-1

View the Source Code: https://github.com/RyanRana/Congress-Connect/tree/main
