SENTIMENT ANALYSIS OF THE BIBLE USING NLTK AND PYTORCH
Burning Bush by Sébastien Bourdon (augmented with GenAI)
Background
I wanted to run an updated sentiment analysis based on my previous exercise, this time using PyTorch, which has grown in popularity over the years and surpassed TensorFlow. The pre-processing of the training set is similar to last time; the training is where things will diverge a bit. I realize that in this day and age there are far more accurate and mature resources that can be leveraged to conduct proper sentiment analysis, but this is intended to be a learning exercise, so let’s get on with it.
High Level Approach
- Create an array of the most frequently occurring words from the negative and positive training data sets (a.k.a. a bag-of-words model) and lemmatize them, i.e. convert them into their simplest form.
- Create a feature/label set for the positive and negative sentiment data by counting, for each sample, the occurrences of the popular words from the array created above.
- Using those labelled features as inputs, train a three-layer feedforward neural network which will output an array containing probability scores for the positive and negative classes.
- Save the model and then run it on the Bible (The Message translation), writing the results to a SQLite database.
- Visualize the results.
Tools Used
- Python (primary scripting language)
- PyTorch (neural network)
- Pandas (dataframe management)
- NLTK (natural language processing utility)
- Altair (visualization library)
- SQLite3 (simple file-based SQL database engine)
Importing modules and basic setup
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import numpy as np
import random
import json
from collections import Counter
import time

# Setup output logging to give us better visibility into progress
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Instantiate a lemmatizer to use in creating the word stem set
# (assumes the required NLTK corpora, e.g. punkt and wordnet, are already downloaded)
lemmatizer = WordNetLemmatizer()
Pre-process our labeled movie review data set
Let’s define a method, create_lexicon(), which will comb through all of our positive and negative training data and extract the most frequently used words, namely those that appear more than 50 times (with an upper cutoff of 1,000 occurrences to skip the very most common words). These will be converted into their simplified forms and then duplicates will be removed.
For example, it would convert this:
Hello this is a test. Hello world. The sky is grey today and my clown hair is red.
The cats wearing red hats sit back on the mat and put down like a clown who doesn't frown.
The sky is maybe not so grey but actually more red, like blood, like cat blood.
Tomorrow I will drink lots of black coffee mixed with gallons of paint.
into this: (note how some words have been converted from plural to singular)
['hello', 'this', 'is', 'a', 'test', 'hello', 'world', 'the', 'sky', 'is', 'grey', 'today', 'and', 'my', 'clown', 'hair', 'is', 'red', 'the', 'cat', 'wearing', 'red', 'hat', 'sit', 'back', 'on', 'the', 'mat', 'and', 'put', 'down', 'like', 'a', 'clown', 'who', 'doe', 'frown', 'the', 'sky', 'is', 'maybe', 'not', 'so', 'grey', 'but', 'actually', 'more', 'red', 'like', 'blood', 'like', 'cat', 'blood', 'tomorrow', 'i', 'will', 'drink', 'lot', 'of', 'black', 'coffee', 'mixed', 'with', 'gallon', 'of', 'paint']
and then keep only the words that appear at least N times; with N=2 that gives:
['clown', 'a', 'sky', 'and', 'hello', 'of', 'is', 'blood', 'the', 'like', 'grey', 'red', 'cat']
which will become our reference lexicon array for creating feature maps during training and actual usage. We could also remove stop words like “is”, “the”, and “are”, but in our case I found that doing so actually decreased model accuracy. That makes some sense given that we’re analyzing the Bible, where words like He and Him probably matter more than they do in ordinary texts. (A sketch of the stop-word variant follows the function below.)
def create_lexicon(pos, neg, filename=None):
    """Create unique list of most frequently used words
    (occurring more than 50 times) from the negative
    and positive corpus
    """
    lexicon = []
    for file in [pos, neg]:
        with open(file, 'r') as f:
            contents = f.readlines()
            for l in contents:
                all_words = word_tokenize(l.lower())  # split words into list
                lexicon += list(all_words)

    # lemmatize (simplify) all these words into their core form
    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    lexicon = [word for word in lexicon if word.isalpha()]
    # Could also append "and word not in stopwords.words('english')" to the above
    # if we want to drop stop words
    w_counts = Counter(lexicon)
    final = [word for word in w_counts if 1000 > w_counts[word] > 50]

    if filename:
        logger.debug('writing lexicon to {0}'.format(filename))
        with open(filename, 'w') as lexifile:
            json.dump(final, lexifile)

    logger.debug('lexicon contains {0} words'.format(len(final)))
    logger.debug('First 25 words:')
    logger.debug(lexicon[:25])
    return final
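For reference, here is a minimal sketch of the stop-word variant mentioned above, the one we ultimately decided against. It simply filters the lemmatized tokens against NLTK’s English stop-word list before counting, and it assumes nltk.download('stopwords') has already been run:

```python
from nltk.corpus import stopwords

# Sketch of the rejected stop-word variant: drop common English words
# before building the frequency counts.
stop_words = set(stopwords.words('english'))

def filter_stop_words(tokens):
    """Keep alphabetic tokens that are not English stop words."""
    return [word for word in tokens if word.isalpha() and word not in stop_words]

print(filter_stop_words(['the', 'sky', 'is', 'grey', 'and', 'my', 'clown', 'hair', 'is', 'red']))
# -> ['sky', 'grey', 'clown', 'hair', 'red']
```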
Representing phrases as numbers
We’ll need a way to convert our input phrases and sentences into feature arrays: encode_features(). In other words, how often do our lexicon words appear in a given input phrase?
Taking our previous example:
lexicon
['clown', 'a', 'sky', 'and', 'hello', 'of', 'is', 'blood', 'the', 'like', 'grey', 'red', 'cat']
input phrase
"The blood of a calf is red like the dark sky."
gets encoded as:
output feature array
[0, 1, 1, 0, 0, 1, 1, 1, 2, 1, 0, 1, 0]
def encode_features(phrase, lexicon):
    """Given an input phrase, return an array of
    the number of occurrences of words from the
    lexicon list created prior
    """
    current_words = word_tokenize(phrase.lower())
    current_words = [lemmatizer.lemmatize(i) for i in current_words]
    features = np.zeros(len(lexicon))
    for word in current_words:
        if word.lower() in lexicon:
            index_value = lexicon.index(word.lower())
            features[index_value] += 1
    features = list(features)

    return features
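As a quick sanity check, feeding the worked example from above through the encoder reproduces the feature array shown earlier (the counts come back as floats because the array starts out as np.zeros):

```python
# Toy lexicon from the worked example above
lexicon_example = ['clown', 'a', 'sky', 'and', 'hello', 'of', 'is', 'blood',
                   'the', 'like', 'grey', 'red', 'cat']

print(encode_features("The blood of a calf is red like the dark sky.", lexicon_example))
# -> [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 1.0, 0.0, 1.0, 0.0]
```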
Split the data into training and test groups
Now we need to split our movie review training data into training and testing groups (90%:10%) so PyTorch can train and validate its results. We’ll also encode it with the above method and label it (1,0) for positive and (0,1) for negative.
def create_feature_sets_and_labels(pos, neg, test_size=0.1):
    """Take positive and negative sentiment files and
    generate a list of features and labels from the
    positive and negative sentiment data using the methods
    above
    """
    # Create frequently occurring word list
    lexicon = create_lexicon(pos, neg, 'lexicon.json')
    featureset = []

    for sentiment_file, sentiment in ((pos, (1,0)), (neg, (0,1))):
        with open(sentiment_file, 'r') as f:
            contents = f.readlines()

            for line in contents:
                featureset.append((encode_features(line, lexicon), sentiment))

    #featureset = list(featureset)
    random.shuffle(featureset)
    logger.debug('features length is {}'.format(len(featureset)))
    #logger.debug('First 5 features & labels:\n{0}'.format(featureset[:5]))

    #featureset = np.array(featureset)
    # https://stackoverflow.com/questions/67183501/setting-an-array-element-with-a-sequence-requested-array-has-an-inhomogeneous-sh
    featureset = np.asarray(featureset, dtype='object')
    testing_size = int(test_size * len(featureset))

    # x is features, y is labels
    train_x = list(featureset[:, 0][:-testing_size])
    train_y = list(featureset[:, 1][:-testing_size])

    test_x = list(featureset[:, 0][-testing_size:])
    test_y = list(featureset[:, 1][-testing_size:])

    return train_x, train_y, test_x, test_y, lexicon

# Create the train/test groups
train_x, train_y, test_x, test_y, lexicon = create_feature_sets_and_labels('pos.txt', 'neg.txt')
DEBUG:__main__:writing lexicon to lexicon.json
DEBUG:__main__:lexicon contains 406 words
DEBUG:__main__:First 25 words:
DEBUG:__main__:['the', 'rock', 'is', 'destined', 'to', 'be', 'the', 'century', 'new', 'conan', 'and', 'that', 'he', 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', 'van', 'damme']
DEBUG:__main__:features length is 10662
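With 10,662 labelled samples and a 10% test split, that works out to 9,596 training examples and 1,066 held-out test examples, each represented as a 406-element count vector (one slot per lexicon word).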
Set up and train a PyTorch model using our movie review dataset
We’re going to create a fully connected neural network (specifically a multilayer perceptron) with three hidden layers and train it to output two classes, positive and negative sentiment, each expressed as a probability between 0 and 1. This took multiple iterations, with various fine-tuning techniques applied along the way; we ultimately settled on this architecture as a balance between model simplicity and a decent distribution of output values, i.e. not polarized around +1.0 and -1.0. More to follow on this below.
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

# Define the parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 2
batch_size = 128
n_epochs = 15

# Assuming train_x, train_y, test_x, test_y are already loaded as NumPy arrays
train_x_tensor = torch.tensor(np.array(train_x), dtype=torch.float32)
train_y_tensor = torch.tensor(np.array(train_y), dtype=torch.float32)
test_x_tensor = torch.tensor(np.array(test_x), dtype=torch.float32)
test_y_tensor = torch.tensor(np.array(test_y), dtype=torch.float32)

# Create a DataLoader to manage batches
dataset = TensorDataset(train_x_tensor, train_y_tensor)
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

class NeuralNet(nn.Module):
    def __init__(self, input_size):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, n_nodes_hl1)
        #self.bn1 = nn.BatchNorm1d(n_nodes_hl1)
        self.dropout1 = nn.Dropout(p=0.3)
        self.fc2 = nn.Linear(n_nodes_hl1, n_nodes_hl2)
        #self.bn2 = nn.BatchNorm1d(n_nodes_hl2)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc3 = nn.Linear(n_nodes_hl2, n_nodes_hl3)
        self.dropout3 = nn.Dropout(p=0.3)
        self.fc4 = nn.Linear(n_nodes_hl3, n_classes)

    def forward(self, x):
        # Note: fc3/dropout3 are defined above but not used in this forward pass;
        # since n_nodes_hl2 == n_nodes_hl3, fc4 still lines up dimensionally
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc4(x)
        return x

# Initialize the model, loss function, and optimizer
input_size = len(train_x[0])
model = NeuralNet(input_size)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # added label smoothing
# weight_decay = L2 regularization to prevent overfitting
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)

# Training loop
for epoch in range(n_epochs):
    epoch_loss = 0.0
    for batch_x, batch_y in train_loader:
        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(batch_x)

        # Calculate loss
        loss = criterion(outputs, batch_y.argmax(dim=1))

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        # Accumulate batch loss
        epoch_loss += loss.item()

    print(f'Epoch {epoch+1}/{n_epochs}, Loss: {epoch_loss:.4f}')

# Evaluation
with torch.no_grad():
    model.eval()
    test_outputs = model(test_x_tensor)
    predicted = torch.argmax(test_outputs, dim=1)
    correct = (predicted == test_y_tensor.argmax(dim=1)).sum().item()
    accuracy = correct / len(test_y_tensor)
    print(f'Accuracy: {accuracy:.4f}')

# Running classification on a specific test example
test_sample = test_x_tensor[35].unsqueeze(0)
sentiment = F.softmax(model(test_sample), dim=1)
print(f'Sentiment: {sentiment.numpy()}')
Epoch 1/15, Loss: 48.9800
Epoch 2/15, Loss: 44.8075
Epoch 3/15, Loss: 43.1938
Epoch 4/15, Loss: 41.1387
Epoch 5/15, Loss: 37.9713
Epoch 6/15, Loss: 33.4742
Epoch 7/15, Loss: 29.8270
Epoch 8/15, Loss: 27.5091
Epoch 9/15, Loss: 25.7234
Epoch 10/15, Loss: 24.5619
Epoch 11/15, Loss: 23.6141
Epoch 12/15, Loss: 23.1657
Epoch 13/15, Loss: 22.6841
Epoch 14/15, Loss: 22.5726
Epoch 15/15, Loss: 22.2421
Accuracy: 0.9611
Sentiment: [[0.05425625 0.94574374]]
Review results of training
Looking at the above, our training achieved 96% accuracy, which is sufficient for this exercise. Let’s move forward.
Load the PyTorch model and run a sentiment analysis on the Bible
Let’s run each book of the Bible through the model above, saving the results to a SQLite database as we go along.
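The outline up top mentions saving the trained model for this step. The notebook simply keeps using the in-memory model object, but for completeness, a minimal sketch of the save/load round-trip might look like the following (the file name sentiment_model.pt is an assumption, not something from the original run):

```python
# Sketch only: persist the trained weights and reload them in a later session.
torch.save(model.state_dict(), 'sentiment_model.pt')

# ...later, reconstruct the architecture and load the weights back in:
loaded_model = NeuralNet(input_size)
loaded_model.load_state_dict(torch.load('sentiment_model.pt'))
loaded_model.eval()
```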
import sqlite3

# Clear previous database results
conn = sqlite3.connect('bible.db')
c = conn.cursor()
c.execute('DELETE FROM bible;')

conn.commit()
conn.close()
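The bible table itself is never created in this notebook (it presumably carries over from the earlier exercise). A schema compatible with the six-value INSERT below, using the column names referenced in the later SELECT query, might look roughly like this sketch (the column types are assumptions):

```python
import sqlite3

# Hypothetical schema; column names follow the SELECT used later on,
# column types are assumptions.
conn = sqlite3.connect('bible.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS bible (
        book TEXT,
        chapter INTEGER,
        verse INTEGER,
        sentiment INTEGER,  -- 0 for negative, 1 for positive
        pos REAL,
        neg REAL
    );
''')
conn.commit()
conn.close()
```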
# Load and use model
def run_prediction(book):
    # Parse in MSG bible in JSON format
    with open('MSG.json', 'r') as foo:
        msg = json.load(foo)

    # Read in the lexicon json defined above
    with open('lexicon.json', 'r') as foo:
        lexicon = json.load(foo)

    # Create and connect to SQLite to save results
    conn = sqlite3.connect('bible.db')
    c = conn.cursor()

    print('# (Book, Chapter, Verse, Sentiment (0 for neg, 1 for pos), % pos, % neg)')
    num = 0
    epoch = int(time.time())

    for chap in msg[book]:
        for verse in msg[book][chap].items():
            verse_info = {'chapter': chap,
                          'verse': verse[0],
                          'content': verse[1]}
            try:
                # Convert each verse into a numerical feature set which can be parsed in by the neural net
                verse_info['content'] = torch.tensor([encode_features(verse_info['content'], lexicon)], dtype=torch.float32)

                # Compute the sentiment of the verse
                # Enter eval mode (no more training)
                model.eval()
                with torch.no_grad():
                    sentiment = F.softmax(model(verse_info['content']), dim=1)

                # Store results in a local set
                summary = (book,
                           int(verse_info['chapter']),
                           int(verse_info['verse']),
                           int(torch.argmin(sentiment, dim=1).item()),
                           sentiment[0][0].item(),
                           sentiment[0][1].item())

                # Occasionally output progress (every 10th verse)
                if num % 10 == 0:
                    cur_epoch = int(time.time())
                    delta_epoch = cur_epoch - epoch
                    epoch = cur_epoch
                    print(summary, delta_epoch)
                num += 1

                # Store the results to DB
                c.execute('INSERT INTO bible VALUES (?,?,?,?,?,?)', summary)
            except Exception as e:
                print(f"Error processing verse: {verse_info['content']}, Error: {e}")

    conn.commit()
    conn.close()
= ("Genesis", "Exodus", "Leviticus", "Numbers", "Deuteronomy",
bible "Joshua", "Judges", "Ruth", "1 Samuel", "2 Samuel",
"1 Kings", "2 Kings", "1 Chronicles", "2 Chronicles", "Ezra",
"Nehemiah", "Esther", "Job", "Psalms", "Proverbs",
"Ecclesiastes", "Song of Solomon", "Isaiah", "Jeremiah", "Lamentations",
"Ezekiel", "Daniel", "Hosea", "Joel", "Amos",
"Obadiah", "Jonah", "Micah", "Nahum", "Habakkuk",
"Zephaniah", "Haggai", "Zechariah", "Malachi",
"Matthew", "Mark", "Luke", "John", "Acts",
"Romans", "1 Corinthians", "2 Corinthians", "Galatians", "Ephesians",
"Philippians", "Colossians", "1 Thessalonians", "2 Thessalonians", "1 Timothy",
"2 Timothy", "Titus", "Philemon", "Hebrews", "James",
"1 Peter", "2 Peter", "1 John", "2 John", "3 John",
"Jude", "Revelation")
for book in bible:
run_prediction(book)
# (Book, Chapter, Verse, Sentiment (0 for neg, 1 for pos), % pos, % neg)
('Genesis', 42, 24, 0, 0.31473326683044434, 0.6852666735649109) 0
('Genesis', 42, 1, 0, 0.07845938950777054, 0.9215406179428101) 0
('Genesis', 42, 11, 0, 0.019253598526120186, 0.980746328830719) 0
('Genesis', 42, 31, 0, 0.337155818939209, 0.662844181060791) 0
('Genesis', 48, 22, 1, 0.6903629302978516, 0.3096370995044708) 0
('Genesis', 48, 11, 1, 0.8064991235733032, 0.19350086152553558) 0
('Genesis', 43, 24, 1, 0.8522595763206482, 0.147740438580513) 0
('Genesis', 43, 1, 1, 0.5361000299453735, 0.4638999402523041) 0
('Genesis', 43, 10, 0, 0.05237709358334541, 0.9476228952407837) 0
('Genesis', 43, 30, 0, 0.08235025405883789, 0.9176497459411621) 0
('Genesis', 49, 22, 1, 0.6194459795951843, 0.3805539906024933) 0
('Genesis', 49, 6, 0, 0.07488685846328735, 0.9251132011413574) 0
('Genesis', 49, 16, 1, 0.7573775053024292, 0.242622509598732) 0
('Genesis', 24, 48, 0, 0.20939424633979797, 0.7906057834625244) 0
('Genesis', 24, 52, 1, 0.6681751012802124, 0.33182492852211) 0
('Genesis', 24, 46, 0, 0.0707147940993309, 0.9292851686477661) 0
('Genesis', 24, 2, 0, 0.30557090044021606, 0.6944290995597839) 0
('Genesis', 24, 38, 1, 0.7063369154930115, 0.29366305470466614) 1
('Genesis', 24, 16, 1, 0.8911573886871338, 0.10884265601634979) 0
('Genesis', 24, 55, 0, 0.22119510173797607, 0.7788048982620239) 0
('Genesis', 25, 22, 1, 0.6228322386741638, 0.3771677613258362) 0
('Genesis', 25, 6, 0, 0.08872702717781067, 0.9112730026245117) 0
('Genesis', 25, 16, 1, 0.8589553833007812, 0.14104463160037994) 0
('Genesis', 26, 26, 0, 0.34340739250183105, 0.656592607498169) 0
('Genesis', 26, 2, 0, 0.019006986171007156, 0.9809930324554443) 0
('Genesis', 26, 12, 0, 0.40904921293258667, 0.5909507870674133) 0
('Genesis', 26, 34, 1, 0.6981646418571472, 0.30183538794517517) 0
('Genesis', 27, 21, 0, 0.20408788323402405, 0.7959120869636536) 0
('Genesis', 27, 1, 0, 0.05529249086976051, 0.9447075128555298) 0
('Genesis', 27, 38, 0, 0.061701204627752304, 0.9382988214492798) 0
('Genesis', 27, 18, 0, 0.06041645631194115, 0.9395835995674133) 0
('Genesis', 20, 10, 0, 0.17973092198371887, 0.8202690482139587) 0
('Genesis', 20, 2, 1, 0.6952648162841797, 0.3047351539134979) 0
('Genesis', 21, 27, 1, 0.6198707818984985, 0.38012927770614624) 0
('Genesis', 21, 5, 0, 0.10042152553796768, 0.8995784521102905) 0
('Genesis', 21, 15, 1, 0.6906400322914124, 0.30935990810394287) 0
('Genesis', 21, 32, 0, 0.36505618691444397, 0.6349438428878784) 0
('Genesis', 22, 4, 0, 0.44092637300491333, 0.5590735673904419) 0
('Genesis', 22, 14, 0, 0.4662764370441437, 0.5337235927581787) 0
('Genesis', 23, 14, 0, 0.317392498254776, 0.6826075315475464) 0
('Genesis', 23, 4, 0, 0.1189899668097496, 0.8810100555419922) 0
('Genesis', 46, 21, 1, 0.5180943608283997, 0.48190557956695557) 0
('Genesis', 46, 7, 1, 0.7374928593635559, 0.2625071108341217) 0
('Genesis', 46, 17, 1, 0.7985047101974487, 0.20149528980255127) 0
('Genesis', 47, 25, 1, 0.8772921562194824, 0.12270782142877579) 0
('Genesis', 47, 3, 0, 0.036432985216379166, 0.9635669589042664) 0
('Genesis', 47, 13, 0, 0.03483176976442337, 0.9651682376861572) 0
('Genesis', 44, 24, 1, 0.6526842713356018, 0.34731563925743103) 0
('Genesis', 44, 1, 1, 0.7267979383468628, 0.2732020318508148) 0
('Genesis', 44, 10, 1, 0.9169881939888, 0.08301184326410294) 0
('Genesis', 44, 30, 1, 0.5791557431221008, 0.42084428668022156) 0
('Genesis', 45, 22, 1, 0.5976513624191284, 0.4023486375808716) 0
('Genesis', 45, 9, 0, 0.13538631796836853, 0.8646136522293091) 0
('Genesis', 45, 19, 1, 0.9202748537063599, 0.07972512394189835) 0
('Genesis', 28, 7, 1, 0.6391882300376892, 0.3608117699623108) 0
('Genesis', 28, 17, 0, 0.09318351745605469, 0.9068164229393005) 0
('Genesis', 29, 22, 1, 0.7644326090812683, 0.23556742072105408) 0
('Genesis', 29, 6, 0, 0.14007945358753204, 0.8599206209182739) 0
('Genesis', 29, 16, 1, 0.5568495988845825, 0.4431504011154175) 0
('Genesis', 40, 21, 0, 0.013059643097221851, 0.986940324306488) 0
('Genesis', 40, 9, 0, 0.22018669545650482, 0.7798133492469788) 0
('Genesis', 40, 19, 1, 0.9508792161941528, 0.04912075400352478) 0
('Genesis', 41, 24, 0, 0.057682204991579056, 0.94231778383255) 0
('Genesis', 41, 44, 1, 0.6582971811294556, 0.34170281887054443) 0
('Genesis', 41, 4, 0, 0.13698235154151917, 0.8630176782608032) 0
('Genesis', 41, 13, 0, 0.0966290682554245, 0.9033708572387695) 0
('Genesis', 41, 37, 0, 0.4296095669269562, 0.5703903436660767) 0
('Genesis', 1, 25, 0, 0.13194411993026733, 0.8680558800697327) 0
('Genesis', 1, 3, 1, 0.5361000299453735, 0.4638999402523041) 0
('Genesis', 1, 13, 0, 0.13031969964504242, 0.8696802854537964) 0
('Genesis', 3, 24, 0, 0.4819332957267761, 0.5180667042732239) 0
('Genesis', 3, 7, 1, 0.7470265030860901, 0.2529734671115875) 0
('Genesis', 3, 17, 0, 0.3102087378501892, 0.6897912621498108) 0
('Genesis', 2, 1, 0, 0.06801030784845352, 0.9319896697998047) 0
('Genesis', 2, 10, 0, 0.43545442819595337, 0.5645455121994019) 0
('Genesis', 5, 25, 0, 0.3923676609992981, 0.6076322793960571) 0
('Genesis', 5, 3, 0, 0.33872270584106445, 0.6612772941589355) 0
('Genesis', 5, 13, 0, 0.4439464211463928, 0.5560535192489624) 0
('Genesis', 5, 32, 0, 0.3923676609992981, 0.6076322793960571) 0
('Genesis', 4, 2, 0, 0.029812293127179146, 0.9701877236366272) 0
('Genesis', 4, 12, 1, 0.7190654873847961, 0.2809344530105591) 0
('Genesis', 7, 22, 0, 0.36084452271461487, 0.639155387878418) 0
('Genesis', 7, 8, 1, 0.627878725528717, 0.37212127447128296) 0
('Genesis', 7, 18, 0, 0.2687183618545532, 0.7312816381454468) 0
('Genesis', 6, 6, 0, 0.13707111775875092, 0.8629289269447327) 0
('Genesis', 6, 16, 1, 0.8929886817932129, 0.1070113256573677) 0
('Genesis', 9, 23, 1, 0.9597183465957642, 0.04028170183300972) 0
('Genesis', 9, 9, 1, 0.9274868965148926, 0.07251307368278503) 0
('Genesis', 9, 19, 1, 0.5752152800559998, 0.42478474974632263) 0
('Genesis', 8, 7, 0, 0.3747083842754364, 0.6252917051315308) 0
('Genesis', 8, 17, 1, 0.7899051904678345, 0.2100948840379715) 0
('Genesis', 39, 2, 0, 0.10039543360471725, 0.8996046185493469) 0
('Genesis', 39, 12, 0, 0.06997539103031158, 0.9300246238708496) 0
('Genesis', 38, 27, 0, 0.06398095190525055, 0.9360190033912659) 0
('Genesis', 38, 5, 0, 0.10673844814300537, 0.8932614922523499) 0
('Genesis', 38, 15, 1, 0.7840783596038818, 0.21592164039611816) 0
('Genesis', 11, 27, 0, 0.35753366351127625, 0.6424664258956909) 0
('Genesis', 11, 5, 0, 0.3154466450214386, 0.6845533847808838) 0
('Genesis', 11, 15, 1, 0.5943444967269897, 0.40565553307533264) 0
('Genesis', 10, 25, 1, 0.9574434161186218, 0.04255654662847519) 0
('Genesis', 10, 3, 1, 0.5361000299453735, 0.4638999402523041) 0
('Genesis', 10, 13, 0, 0.47395142912864685, 0.5260485410690308) 0
('Genesis', 10, 32, 1, 0.7979674935340881, 0.20203247666358948) 0
('Genesis', 13, 1, 0, 0.08716055005788803, 0.9128395318984985) 0
('Genesis', 12, 10, 0, 0.09384032338857651, 0.9061596393585205) 0
('Genesis', 12, 1, 1, 0.9528328776359558, 0.04716715216636658) 0
('Genesis', 15, 10, 1, 0.9160295128822327, 0.08397053182125092) 0
('Genesis', 15, 1, 0, 0.0348999947309494, 0.9651000499725342) 0
('Genesis', 14, 24, 0, 0.031359195709228516, 0.9686408638954163) 0
('Genesis', 14, 7, 1, 0.9620340466499329, 0.03796599432826042) 0
('Genesis', 14, 17, 0, 0.12473555654287338, 0.8752644658088684) 0
('Genesis', 17, 22, 1, 0.5361000299453735, 0.4638999402523041) 0
('Genesis', 17, 8, 1, 0.7164437770843506, 0.2835562825202942) 0
('Genesis', 17, 18, 1, 0.6036794185638428, 0.39632055163383484) 0
('Genesis', 16, 2, 1, 0.8493285775184631, 0.1506713628768921) 0
('Genesis', 19, 27, 0, 0.10283444076776505, 0.8971655368804932) 0
('Genesis', 19, 5, 0, 0.19853056967258453, 0.8014694452285767) 0
('Genesis', 19, 12, 1, 0.5000796318054199, 0.49992039799690247) 0
('Genesis', 19, 36, 1, 0.675009548664093, 0.32499048113822937) 0
('Genesis', 18, 21, 0, 0.03314715996384621, 0.9668529033660889) 0
('Genesis', 18, 7, 0, 0.1406669020652771, 0.8593330979347229) 0
('Genesis', 18, 17, 0, 0.11861494183540344, 0.881385087966919) 0
('Genesis', 31, 48, 0, 0.0997927337884903, 0.9002072215080261) 0
('Genesis', 31, 21, 0, 0.016087831929326057, 0.9839121699333191) 0
('Genesis', 31, 41, 1, 0.639473021030426, 0.36052706837654114) 0
('Genesis', 31, 51, 0, 0.17244765162467957, 0.8275523781776428) 0
('Genesis', 31, 16, 1, 0.7040191292762756, 0.29598090052604675) 0
('Genesis', 31, 55, 0, 0.08255169540643692, 0.9174483418464661) 0
('Genesis', 30, 21, 1, 0.8915579319000244, 0.10844206809997559) 0
('Genesis', 30, 5, 1, 0.5361000299453735, 0.4638999402523041) 0
('Genesis', 30, 13, 0, 0.13160055875778198, 0.8683993816375732) 0
('Genesis', 30, 37, 0, 0.36590564250946045, 0.6340943574905396) 0
('Genesis', 37, 20, 0, 0.28424274921417236, 0.7157572507858276) 0
('Genesis', 37, 4, 0, 0.2605959475040436, 0.739404022693634) 0
('Genesis', 37, 14, 0, 0.20547519624233246, 0.7945247888565063) 0
('Genesis', 37, 33, 1, 0.9379627704620361, 0.062037281692028046) 0
('Genesis', 36, 22, 0, 0.28905922174453735, 0.7109407186508179) 0
('Genesis', 36, 4, 1, 0.6669361591339111, 0.3330638110637665) 0
('Genesis', 36, 12, 1, 0.8958793878555298, 0.10412059724330902) 0
('Genesis', 36, 36, 0, 0.28774622082710266, 0.712253749370575) 0
('Genesis', 35, 21, 0, 0.15672820806503296, 0.843271791934967) 0
('Genesis', 35, 7, 1, 0.5473823547363281, 0.45261770486831665) 0
('Genesis', 35, 17, 0, 0.05087278038263321, 0.9491271376609802) 0
('Genesis', 34, 22, 1, 0.5315243005752563, 0.46847572922706604) 0
('Genesis', 34, 6, 1, 0.9207045435905457, 0.07929543405771255) 0
('Genesis', 34, 16, 1, 0.8754245638847351, 0.12457545101642609) 0
('Genesis', 33, 14, 0, 0.3294190466403961, 0.6705809235572815) 0
('Genesis', 33, 4, 0, 0.12549614906311035, 0.8745038509368896) 0
('Genesis', 32, 21, 0, 0.09435401111841202, 0.905646026134491) 0
('Genesis', 32, 7, 1, 0.6714303493499756, 0.32856959104537964) 0
('Genesis', 32, 17, 1, 0.7983325719833374, 0.2016674280166626) 0
('Genesis', 50, 20, 1, 0.9269453287124634, 0.07305461168289185) 0
('Genesis', 50, 6, 0, 0.42963188886642456, 0.5703681111335754) 0
('Genesis', 50, 16, 0, 0.27898475527763367, 0.7210152745246887) 0
# (Book, Chapter, Verse, Sentiment (0 for neg, 1 for pos), % pos, % neg)
('Exodus', 24, 11, 0, 0.0404474139213562, 0.959552526473999) 0
('Exodus', 24, 3, 0, 0.04048745334148407, 0.95951247215271) 0
('Exodus', 25, 26, 1, 0.7702346444129944, 0.22976532578468323) 0
('Exodus', 25, 3, 1, 0.7048089504241943, 0.29519107937812805) 0
('Exodus', 25, 11, 0, 0.3534781336784363, 0.6465218663215637) 0
('Exodus', 25, 31, 1, 0.8588312864303589, 0.1411687284708023) 0
('Exodus', 26, 26, 1, 0.6406383514404297, 0.3593616485595703) 0
('Exodus', 26, 5, 1, 0.6772359013557434, 0.3227640986442566) 0
('Exodus', 26, 15, 1, 0.5361000299453735, 0.4638999402523041) 0
('Exodus', 26, 35, 0, 0.45115604996681213, 0.5488439798355103) 0
('Exodus', 27, 17, 0, 0.1877947449684143, 0.8122052550315857) 0
('Exodus', 27, 7, 0, 0.45978397130966187, 0.5402160286903381) 0
('Exodus', 20, 22, 1, 0.94534832239151, 0.0546516515314579) 0
('Exodus', 20, 8, 1, 0.9530900716781616, 0.046909939497709274) 0
('Exodus', 20, 18, 0, 0.4750576317310333, 0.5249423980712891) 0
('Exodus', 21, 29, 1, 0.8283783197402954, 0.17162171006202698) 0
('Exodus', 21, 11, 0, 0.10481549799442291, 0.8951845765113831) 0
('Exodus', 21, 31, 0, 0.3098178505897522, 0.690182089805603) 0
('Exodus', 22, 27, 0, 0.10764598101377487, 0.8923540115356445) 0
('Exodus', 22, 5, 0, 0.2105185091495514, 0.789481520652771) 0
('Exodus', 22, 15, 0, 0.04268207773566246, 0.9573178887367249) 0
('Exodus', 23, 26, 1, 0.9677294492721558, 0.03227050229907036) 0
('Exodus', 23, 2, 0, 0.03797952085733414, 0.9620205163955688) 0
.. output truncated ..
Surveying the results
Let’s start by examining the distribution of sentiment values broken out by chapter. The model outputs sentiment as two values, positive and negative, which add up to 1. To simplify things, we’ll convert this into a single net_sentiment value by subtracting the negative value from the positive one: the closer to +1.0, the more positive the sentiment, and the closer to -1.0, the more negative. We’ll average these across each chapter and then visualize the results using a histogram.
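For example, the first Genesis row above (Genesis 42:24) scored roughly 0.315 positive and 0.685 negative, which gives a net_sentiment of about -0.37.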
import pandas as pd
import altair as alt

# Connect to SQLite3 database
conn = sqlite3.connect('bible.db')  # Update with the path to your SQLite database

# Query to fetch the relevant data
query = '''
SELECT book, chapter, verse, sentiment, pos, neg
FROM bible;
'''

# Load data into a pandas DataFrame
df = pd.read_sql_query(query, conn)

# Close the database connection
conn.close()

# Calculate net sentiment via pos - neg
df['net_sentiment'] = df['pos'] - df['neg']

# Calculate the average sentiment for each chapter
chapter_avg = df.groupby(['book', 'chapter']).agg({'net_sentiment': 'mean'}).reset_index()

# Function to generate a histogram of column values within a dataframe
def gen_histogram(df, column, title, xlabel, ylabel):
    # Calculate the mean and median of the specified column
    mean_value = df[column].mean()
    median_value = df[column].median()

    # Create the histogram
    histogram = alt.Chart(df).mark_bar().encode(
        alt.X(f'{column}:Q', bin=alt.Bin(maxbins=30), title=xlabel),
        alt.Y('count()', title=ylabel),
        tooltip=['count()']
    ).properties(
        width=900,
        height=600,
        title=(title, f'(Mean: {mean_value:.2f} Median: {median_value:.2f})')
    )

    # Add a vertical solid line for the mean
    mean_line = alt.Chart(pd.DataFrame({'value': [mean_value]})).mark_rule(color='red').encode(
        x='value:Q',
        tooltip=[alt.Tooltip('value:Q', title='Mean')]
    )

    # Add a vertical dashed line for the median
    median_line = alt.Chart(pd.DataFrame({'value': [median_value]})).mark_rule(color='red', strokeDash=[4, 4]).encode(
        x='value:Q',
        tooltip=[alt.Tooltip('value:Q', title='Median')]
    )

    return histogram + mean_line + median_line

# Display the histogram
gen_histogram(chapter_avg, 'net_sentiment', 'Distribution of Avg Chapter Sentiment Values (PyTorch)', 'Average Sentiment (-1 to 1)', 'Count of Chapters')
A roughly normal-looking distribution, though we can see that the mean and median sentiment skew slightly negative.
Visualizing every book and chapter
Let’s visualize the sentiment analysis of every book in the Bible using a heatmap of the chapter averages calculated above, ranging from -1 (fully negative, in red) to +1 (fully positive, in green).
# Create the heatmap using Altair
heatmap = alt.Chart(chapter_avg).mark_rect(stroke='white', strokeWidth=1).encode(
    x=alt.X('chapter:O', title='Chapter', axis=alt.Axis(labelAngle=70)),
    y=alt.Y('book:N', title='Book', sort=bible),
    color=alt.Color('net_sentiment:Q', scale=alt.Scale(domain=[-1, 0, 1], range=['#d73027', '#f7f7f7', '#1a9850']), title='Sentiment Rating'),
    tooltip=['book', 'chapter', 'net_sentiment']
).properties(
    width=1100,
    height=800,
    title='Bible Heatmap of Sentiment by Book and Chapter (PyTorch)'
)

# Display the heatmap
heatmap
1 and 2 Samuel, the books of Kings, the Gospels of Mark, Luke, and John, and Acts are looking consistently fiery. On the other hand, things fare more positively in Proverbs, the Thessalonian letters, and 1 John.
How did our model perform at a granular level?
Let’s go deeper and take a look at the distribution of sentiment across all verses.
# Increase the row limit (to bypass Altair's default limit of 5000 rows in a dataset)
alt.data_transformers.disable_max_rows()

# Display the histogram
gen_histogram(df, 'net_sentiment', 'Distribution of Verse Sentiment Values (PyTorch)', 'Sentiment (-1 to 1)', 'Count of Verses')
We observe a steady increase in negative sentiment going from -0.6 to -1.0 and a small spike in the neutral zone between 0.0 and 0.1.
Takeaways
As I mentioned up top, performing a sentiment analysis on the Bible using a neural network trained on a set of roughly 11,000 Rotten Tomatoes movie reviews was only ever meant to be an academic exercise in training a PyTorch neural network. Even though a modern translation of the Bible was used, the Rotten Tomatoes lexicon contained words such as filmmaker, hollywood, cinematic, and french which simply had no relevance in this context, so we understood there were limitations going in.
We probably could have benefited from incorporating a third classification, neutral, since many Bible verses don’t necessarily carry a strong sentiment either way.
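To make that idea concrete, here is a rough sketch of what a three-class setup might involve; it assumes a hypothetical neutral.txt corpus existed alongside the positive and negative files, which was not part of this exercise:

```python
# Sketch only: extending the label scheme to three classes
# (positive / negative / neutral). 'neutral.txt' is hypothetical.
n_classes = 3
label_map = (('pos.txt', (1, 0, 0)),
             ('neg.txt', (0, 1, 0)),
             ('neutral.txt', (0, 0, 1)))

# Downstream, only the output width changes:
#   self.fc4 = nn.Linear(n_nodes_hl3, n_classes)
# CrossEntropyLoss with batch_y.argmax(dim=1) targets and the final
# softmax both generalize to three classes unchanged.
```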
We went through many iterations of fine-tuning to mitigate the polarized outputs that emerged during the initial phases of training. A variety of techniques were employed, including:
- Adjusting model complexity (number of layers and nodes)
- Adding dropout to promote more even use of neurons and prevent overfitting
- Adding L2 regularization (weight decay) to improve generalization by penalizing large weights and further guard against overfitting
There were other tweaks that were ultimately not included because they didn’t provide any improvement, which reinforced just how many options one has when fine-tuning a deep neural network; it is equal parts art and science.
Some things to consider for a follow-up would be using TensorBoard logging to track results and better parameter management for fine-tuning, rather than depending on Jupyter notebook snapshots.
This was a fun exercise that’s given me some food for thought for future experiments. Stay tuned!