SENTIMENT ANALYSIS OF THE BIBLE USING NLTK AND PYTORCH

an update to the previous exercise

Published

August 23, 2023

Burning Bush by Sébastien Bourdon (augmented with GenAI)

Background

I wanted to run an updated sentiment analysis based on my previous exercise this time using PyTorch which has increased in popularity over the years surpassing Tensorflow. The pre-processing of the training set is similar to last time. The training is where things will diverge a bit. I realize in this day and age there are much more accurate and mature resources that can be leveraged to conduct proper sentiment analysis but this is intended to be a learning exercise, so let’s get on with it.

High Level Approach

Create an array of the most frequently occuring words from the negative and positive training data sets, aka bag of words model and lemmatize them, i.e. convert them into their simplest form.
Create feature/label set for positive and negative sentiment data by counting the number of popular words in each sample, from the array created above
Using the above labelled features as inputs, train a 3 layer feedfoward neural network which will output an array containing probability percentages for True and False
We will save the model and then run it on the Bible (The Message translation) outputting the results to a sqlite database
Visualize results

Tools Used

Python (Primary Scripting Language)
PyTorch (Neural Network)
Pandas (Dataframe management)
NLTK (Natural Language processing utility)
Altair (Visualization library)
SQLite3 (Simple file based SQL database engine)

Importing modules and basic setup

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import numpy as np
import random
import json
from collections import Counter
import numpy as np
import time

# Setup output logging to give us better visibility into progress
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Instatiate a lemmatizer to use in creating the word stem set
lemmatizer = WordNetLemmatizer()

Pre-process our labeled movie review data set

Let’s define a method, create_lexicon(), which will comb through all of our positive and negative training data to extract the most frequently used words which appear at least 50 times. These will be converted into their simplified forms and then duplicates will be removed.

For example it would convert this:

Hello this is a test. Hello world. The sky is grey today and my clown hair is red.
The cats wearing red hats sit back on the mat and put down like a clown who doesn't frown.
The sky is maybe not so grey but actually more red, like blood, like cat blood.
Tomorrow I will drink lots of black coffee mixed with gallons of paint.

into this: (note how some words have been converted from plural to singular)

['hello', 'this', 'is', 'a', 'test', 'hello', 'world', 'the', 'sky', 'is', 'grey', 'today', 'and', 'my', 'clown', 'hair', 'is', 'red', 'the', 'cat', 'wearing', 'red', 'hat', 'sit', 'back', 'on', 'the', 'mat', 'and', 'put', 'down', 'like', 'a', 'clown', 'who', 'doe', 'frown', 'the', 'sky', 'is', 'maybe', 'not', 'so', 'grey', 'but', 'actually', 'more', 'red', 'like', 'blood', 'like', 'cat', 'blood', 'tomorrow', 'i', 'will', 'drink', 'lot', 'of', 'black', 'coffee', 'mixed', 'with', 'gallon', 'of', 'paint']

and then count the words that only appear at least N times, if N=2:

['clown', 'a', 'sky', 'and', 'hello', 'of', 'is', 'blood', 'the', 'like', 'grey', 'red', 'cat']

which will become our reference lexicon array for creating feature maps during training and actual usage. We can also consider removing stop words like “is”,“the”,“are” but in our case I found that our model accuracy actually decreased. This makes some sense in that we’re trying to analyze the Bible, words like He and Him are probably more important than they are in regular texts.

def create_lexicon(pos, neg, filename=None):
    """Create unique list of most frequently used words 
    (occuring more than 50 times) from negative
    and postive corpus
    """
    lexicon = []
    for file in [pos, neg]:
        with open(file, 'r') as f:
            contents = f.readlines()
            for l in contents:
                all_words = word_tokenize(l.lower())  # split words into list
                lexicon += list(all_words)

    # lemmatize (simplify) all these words into their core form
    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    lexicon = [ word for word in lexicon if word.isalpha() ]
    # Could also append "and word not in stopwords.words('english')" to the above
    # if we want to drop stop words
    w_counts = Counter(lexicon)
    final = [ word for word in w_counts if 1000 > w_counts[word] > 50 ]
    
    if filename:
        logger.debug('writing lexicon to {0}'.format(filename))
        with open(filename, 'w') as lexifile:
            json.dump(final, lexifile)
            
    logger.debug('lexicon contains {0} words'.format(len(final)))
    logger.debug('First 25 words:')
    logger.debug(lexicon[:25])
    return final

Representing phrases as numbers

We’ll need a way to convert our input phrases and sentences into feature arrays, encode_features(). In other words, how often do our lexicon words appear in a given input phrase?

Taking our previous example:

lexicon

['clown', 'a', 'sky', 'and', 'hello', 'of', 'is', 'blood', 'the', 'like', 'grey', 'red', 'cat']

input phrase

"The blood of a calf is red like the dark sky."

gets encoded as:

output feature array

[0, 0, 1, 0, 0, 0, 1, 1, 2, 1, 0, 1, 0]

def encode_features(phrase, lexicon):
    """Given an input phrase, return an array of
    the number of occurences of words from the
    lexicon list created prior
    """
    current_words = word_tokenize(phrase.lower())
    current_words = [lemmatizer.lemmatize(i) for i in current_words]
    features = np.zeros(len(lexicon))
    for word in current_words:
        if word.lower() in lexicon:
            index_value = lexicon.index(word.lower())
            features[index_value] += 1
    features = list(features)
    
    return features

Split the data into training and test groups

Now we need to split our movie reviews training data into training and testing groups (90%:10%) so PyTorch can train and validate its results. We’ll also encode it with the above method and label it “(1,0)” for positive and “(0,1”) for negative.

def create_feature_sets_and_labels(pos, neg, test_size=0.1):
    """Take positive and negative sentiment files and
    generate a list of features and labels from the
    positive and negative sentiment data using the methods
    above
    """
    
    # Create frequently occuring word list
    lexicon = create_lexicon(pos, neg, 'lexicon.json')
    featureset = []
    
    for sentiment_file, sentiment in ((pos, (1,0)),(neg, (0,1))):
        with open(sentiment_file, 'r') as f:
            contents = f.readlines()
        
        for line in contents:
            featureset.append((encode_features(line, lexicon), sentiment))
    
    #featureset = list(featureset)
    random.shuffle(featureset)
    logger.debug('features length is {}'.format(len(featureset)))
    #logger.debug('First 5 features & labels:\n{0}'.format(featureset[:5]))
    
    #featureset = np.array(featureset)
    # https://stackoverflow.com/questions/67183501/setting-an-array-element-with-a-sequence-requested-array-has-an-inhomogeneous-sh
    featureset = np.asarray(featureset, dtype='object')
    testing_size = int(test_size * len(featureset))
    
    # x is features, y is labels
    train_x = list(featureset[:, 0][:-testing_size])
    train_y = list(featureset[:, 1][:-testing_size])

    test_x = list(featureset[:, 0][:-testing_size:])
    test_y = list(featureset[:, 1][:-testing_size:])

    return train_x, train_y, test_x, test_y, lexicon

# Create the train/test groups
train_x, train_y, test_x, test_y, lexicon = create_feature_sets_and_labels('pos.txt', 'neg.txt')

DEBUG:__main__:writing lexicon to lexicon.json
DEBUG:__main__:lexicon contains 406 words
DEBUG:__main__:First 25 words:
DEBUG:__main__:['the', 'rock', 'is', 'destined', 'to', 'be', 'the', 'century', 'new', 'conan', 'and', 'that', 'he', 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', 'van', 'damme']
DEBUG:__main__:features length is 10662

Setup and train a pytorch model using our movie review dataset

We’re going to create a fully connected neural network (specifically a multilayer perceptron) consisting of 3 hidden layers and train it to output two classes, positive & negative sentiment, expressed as a probability of each (between 0 and 1). This took multiple iterations, with applications of various fine-tuning techniques along the way, and ultimately we settled on this as a balance between model simplicity and decent distribution of output values, i.e. not polarized around +1.0 and -1. More to follow on this below.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

# Define the parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 2
batch_size = 128
n_epochs = 15

# Assuming train_x, train_y, test_x, test_y are already loaded as NumPy arrays
train_x_tensor = torch.tensor(np.array(train_x), dtype=torch.float32)
train_y_tensor = torch.tensor(np.array(train_y), dtype=torch.float32)
test_x_tensor = torch.tensor(np.array(test_x), dtype=torch.float32)
test_y_tensor = torch.tensor(np.array(test_y), dtype=torch.float32)

# Create a DataLoader to manage batches
dataset = TensorDataset(train_x_tensor, train_y_tensor)
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

class NeuralNet(nn.Module):
    def __init__(self, input_size):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, n_nodes_hl1)
        #self.bn1 = nn.BatchNorm1d(n_nodes_hl1)
        self.dropout1 = nn.Dropout(p=0.3)
        self.fc2 = nn.Linear(n_nodes_hl1, n_nodes_hl2)
        #self.bn2 = nn.BatchNorm1d(n_nodes_hl2)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc3 = nn.Linear(n_nodes_hl2, n_nodes_hl3)
        self.dropout3 = nn.Dropout(p=0.3)
        self.fc4 = nn.Linear(n_nodes_hl3, n_classes)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc4(x)
        return x

# Initialize the model, loss function, and optimizer
input_size = len(train_x[0])
model = NeuralNet(input_size)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1) #added label smoothing
# weight_decay = L2 regularization to prevent overfitting
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)

# Training loop
for epoch in range(n_epochs):
    epoch_loss = 0.0
    for batch_x, batch_y in train_loader:
        # Zero the gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(batch_x)
        
        # Calculate loss
        loss = criterion(outputs, batch_y.argmax(dim=1))
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        
        # Accumulate batch loss
        epoch_loss += loss.item()
    
    print(f'Epoch {epoch+1}/{n_epochs}, Loss: {epoch_loss:.4f}')

# Evaluation
with torch.no_grad():
    model.eval()
    test_outputs = model(test_x_tensor)
    predicted = torch.argmax(test_outputs, dim=1)
    correct = (predicted == test_y_tensor.argmax(dim=1)).sum().item()
    accuracy = correct / len(test_y_tensor)
    print(f'Accuracy: {accuracy:.4f}')

    # Running classification on a specific test example
    test_sample = test_x_tensor[35].unsqueeze(0)
    sentiment = F.softmax(model(test_sample), dim=1)
    print(f'Sentiment: {sentiment.numpy()}')

Epoch 1/15, Loss: 48.9800
Epoch 2/15, Loss: 44.8075
Epoch 3/15, Loss: 43.1938
Epoch 4/15, Loss: 41.1387
Epoch 5/15, Loss: 37.9713
Epoch 6/15, Loss: 33.4742
Epoch 7/15, Loss: 29.8270
Epoch 8/15, Loss: 27.5091
Epoch 9/15, Loss: 25.7234
Epoch 10/15, Loss: 24.5619
Epoch 11/15, Loss: 23.6141
Epoch 12/15, Loss: 23.1657
Epoch 13/15, Loss: 22.6841
Epoch 14/15, Loss: 22.5726
Epoch 15/15, Loss: 22.2421
Accuracy: 0.9611
Sentiment: [[0.05425625 0.94574374]]

Review results of training

Looking at the above, our training achieved 96% accuracy, which is sufficient for this exercise. Let’s move forward.

Load PyTorch model and run a sentiment analysis on the bible

Let’s run each book of the bible through the above model saving the results to a sqlite db as we go along.

import sqlite3

# Clear previous database results
conn = sqlite3.connect('bible.db')
c = conn.cursor()
c.execute('DELETE FROM bible;')
conn.commit()
conn.close()

# Load and use model
def run_prediction(book):
    # Parse in MSG bible in JSON format
    with open('MSG.json', 'r') as foo:
        msg = json.load(foo)

    # Read in the lexicon json defined above
    with open('lexicon.json', 'r') as foo:
        lexicon = json.load(foo)

    # Create and connect to SQLite to save results
    conn = sqlite3.connect('bible.db')
    c = conn.cursor()
                
    print('# (Book, Chapter, Verse, Sentiment (0 for neg, 1 for pos), % pos, % neg)')
    
    num = 0
    epoch = int(time.time())
    
    for chap in msg[book]:
        for verse in msg[book][chap].items():
            verse_info = {'chapter': chap,
                          'verse': verse[0],
                          'content': verse[1]}
            try:
                # Convert each verse into a numerical feature set which can be parsed in by the neural net
                verse_info['content'] = torch.tensor([encode_features(verse_info['content'], lexicon)], dtype=torch.float32)
                
                # Compute the sentiment of the verse
                # Enter eval mode (no more training)
                model.eval()
                with torch.no_grad():
                    sentiment = F.softmax(model(verse_info['content']), dim=1)
                
                # Store results in a local set
                summary = (book,
                           int(verse_info['chapter']),
                           int(verse_info['verse']),
                           int(torch.argmin(sentiment, dim=1).item()),
                           sentiment[0][0].item(),
                           sentiment[0][1].item())
                
                # Occasionally output progress (every 10th verse)
                if num % 10 == 0:
                    cur_epoch = int(time.time())
                    delta_epoch = cur_epoch - epoch
                    epoch = cur_epoch
                    print(summary, delta_epoch)
                num += 1

                # Store the results to DB
                c.execute('INSERT INTO bible VALUES (?,?,?,?,?,?)', summary)
            except Exception as e:
                print(f"Error processing verse: {verse_info['content']}, Error: {e}")
    conn.commit()
    conn.close()

bible = ("Genesis", "Exodus", "Leviticus", "Numbers", "Deuteronomy", 
            "Joshua", "Judges", "Ruth", "1 Samuel", "2 Samuel", 
            "1 Kings", "2 Kings", "1 Chronicles", "2 Chronicles", "Ezra", 
            "Nehemiah", "Esther", "Job", "Psalms", "Proverbs", 
            "Ecclesiastes", "Song of Solomon", "Isaiah", "Jeremiah", "Lamentations", 
            "Ezekiel", "Daniel", "Hosea", "Joel", "Amos", 
            "Obadiah", "Jonah", "Micah", "Nahum", "Habakkuk", 
            "Zephaniah", "Haggai", "Zechariah", "Malachi", 
            "Matthew", "Mark", "Luke", "John", "Acts", 
            "Romans", "1 Corinthians", "2 Corinthians", "Galatians", "Ephesians", 
            "Philippians", "Colossians", "1 Thessalonians", "2 Thessalonians", "1 Timothy", 
            "2 Timothy", "Titus", "Philemon", "Hebrews", "James", 
            "1 Peter", "2 Peter", "1 John", "2 John", "3 John", 
            "Jude", "Revelation")

for book in bible:
    run_prediction(book)

# (Book, Chapter, Verse, Sentiment (0 for neg, 1 for pos), % pos, % neg)
('Genesis', 42, 24, 0, 0.31473326683044434, 0.6852666735649109) 0
('Genesis', 42, 1, 0, 0.07845938950777054, 0.9215406179428101) 0
('Genesis', 42, 11, 0, 0.019253598526120186, 0.980746328830719) 0
('Genesis', 42, 31, 0, 0.337155818939209, 0.662844181060791) 0
('Genesis', 48, 22, 1, 0.6903629302978516, 0.3096370995044708) 0
('Genesis', 48, 11, 1, 0.8064991235733032, 0.19350086152553558) 0
('Genesis', 43, 24, 1, 0.8522595763206482, 0.147740438580513) 0
('Genesis', 43, 1, 1, 0.5361000299453735, 0.4638999402523041) 0
('Genesis', 43, 10, 0, 0.05237709358334541, 0.9476228952407837) 0
('Genesis', 43, 30, 0, 0.08235025405883789, 0.9176497459411621) 0
('Genesis', 49, 22, 1, 0.6194459795951843, 0.3805539906024933) 0
('Genesis', 49, 6, 0, 0.07488685846328735, 0.9251132011413574) 0
('Genesis', 49, 16, 1, 0.7573775053024292, 0.242622509598732) 0
('Genesis', 24, 48, 0, 0.20939424633979797, 0.7906057834625244) 0
('Genesis', 24, 52, 1, 0.6681751012802124, 0.33182492852211) 0
('Genesis', 24, 46, 0, 0.0707147940993309, 0.9292851686477661) 0
('Genesis', 24, 2, 0, 0.30557090044021606, 0.6944290995597839) 0
('Genesis', 24, 38, 1, 0.7063369154930115, 0.29366305470466614) 1
('Genesis', 24, 16, 1, 0.8911573886871338, 0.10884265601634979) 0
('Genesis', 24, 55, 0, 0.22119510173797607, 0.7788048982620239) 0
('Genesis', 25, 22, 1, 0.6228322386741638, 0.3771677613258362) 0
('Genesis', 25, 6, 0, 0.08872702717781067, 0.9112730026245117) 0
('Genesis', 25, 16, 1, 0.8589553833007812, 0.14104463160037994) 0
('Genesis', 26, 26, 0, 0.34340739250183105, 0.656592607498169) 0
('Genesis', 26, 2, 0, 0.019006986171007156, 0.9809930324554443) 0
('Genesis', 26, 12, 0, 0.40904921293258667, 0.5909507870674133) 0
('Genesis', 26, 34, 1, 0.6981646418571472, 0.30183538794517517) 0
('Genesis', 27, 21, 0, 0.20408788323402405, 0.7959120869636536) 0
('Genesis', 27, 1, 0, 0.05529249086976051, 0.9447075128555298) 0
('Genesis', 27, 38, 0, 0.061701204627752304, 0.9382988214492798) 0
('Genesis', 27, 18, 0, 0.06041645631194115, 0.9395835995674133) 0
('Genesis', 20, 10, 0, 0.17973092198371887, 0.8202690482139587) 0
('Genesis', 20, 2, 1, 0.6952648162841797, 0.3047351539134979) 0
('Genesis', 21, 27, 1, 0.6198707818984985, 0.38012927770614624) 0
('Genesis', 21, 5, 0, 0.10042152553796768, 0.8995784521102905) 0
('Genesis', 21, 15, 1, 0.6906400322914124, 0.30935990810394287) 0
('Genesis', 21, 32, 0, 0.36505618691444397, 0.6349438428878784) 0
('Genesis', 22, 4, 0, 0.44092637300491333, 0.5590735673904419) 0
('Genesis', 22, 14, 0, 0.4662764370441437, 0.5337235927581787) 0
('Genesis', 23, 14, 0, 0.317392498254776, 0.6826075315475464) 0
('Genesis', 23, 4, 0, 0.1189899668097496, 0.8810100555419922) 0
('Genesis', 46, 21, 1, 0.5180943608283997, 0.48190557956695557) 0
('Genesis', 46, 7, 1, 0.7374928593635559, 0.2625071108341217) 0
('Genesis', 46, 17, 1, 0.7985047101974487, 0.20149528980255127) 0
('Genesis', 47, 25, 1, 0.8772921562194824, 0.12270782142877579) 0
('Genesis', 47, 3, 0, 0.036432985216379166, 0.9635669589042664) 0
('Genesis', 47, 13, 0, 0.03483176976442337, 0.9651682376861572) 0
('Genesis', 44, 24, 1, 0.6526842713356018, 0.34731563925743103) 0
('Genesis', 44, 1, 1, 0.7267979383468628, 0.2732020318508148) 0
('Genesis', 44, 10, 1, 0.9169881939888, 0.08301184326410294) 0
('Genesis', 44, 30, 1, 0.5791557431221008, 0.42084428668022156) 0
('Genesis', 45, 22, 1, 0.5976513624191284, 0.4023486375808716) 0
('Genesis', 45, 9, 0, 0.13538631796836853, 0.8646136522293091) 0
('Genesis', 45, 19, 1, 0.9202748537063599, 0.07972512394189835) 0
('Genesis', 28, 7, 1, 0.6391882300376892, 0.3608117699623108) 0
('Genesis', 28, 17, 0, 0.09318351745605469, 0.9068164229393005) 0
('Genesis', 29, 22, 1, 0.7644326090812683, 0.23556742072105408) 0
('Genesis', 29, 6, 0, 0.14007945358753204, 0.8599206209182739) 0
('Genesis', 29, 16, 1, 0.5568495988845825, 0.4431504011154175) 0
('Genesis', 40, 21, 0, 0.013059643097221851, 0.986940324306488) 0
('Genesis', 40, 9, 0, 0.22018669545650482, 0.7798133492469788) 0
('Genesis', 40, 19, 1, 0.9508792161941528, 0.04912075400352478) 0
('Genesis', 41, 24, 0, 0.057682204991579056, 0.94231778383255) 0
('Genesis', 41, 44, 1, 0.6582971811294556, 0.34170281887054443) 0
('Genesis', 41, 4, 0, 0.13698235154151917, 0.8630176782608032) 0
('Genesis', 41, 13, 0, 0.0966290682554245, 0.9033708572387695) 0
('Genesis', 41, 37, 0, 0.4296095669269562, 0.5703903436660767) 0
('Genesis', 1, 25, 0, 0.13194411993026733, 0.8680558800697327) 0
('Genesis', 1, 3, 1, 0.5361000299453735, 0.4638999402523041) 0
('Genesis', 1, 13, 0, 0.13031969964504242, 0.8696802854537964) 0
('Genesis', 3, 24, 0, 0.4819332957267761, 0.5180667042732239) 0
('Genesis', 3, 7, 1, 0.7470265030860901, 0.2529734671115875) 0
('Genesis', 3, 17, 0, 0.3102087378501892, 0.6897912621498108) 0
('Genesis', 2, 1, 0, 0.06801030784845352, 0.9319896697998047) 0
('Genesis', 2, 10, 0, 0.43545442819595337, 0.5645455121994019) 0
('Genesis', 5, 25, 0, 0.3923676609992981, 0.6076322793960571) 0
('Genesis', 5, 3, 0, 0.33872270584106445, 0.6612772941589355) 0
('Genesis', 5, 13, 0, 0.4439464211463928, 0.5560535192489624) 0
('Genesis', 5, 32, 0, 0.3923676609992981, 0.6076322793960571) 0
('Genesis', 4, 2, 0, 0.029812293127179146, 0.9701877236366272) 0
('Genesis', 4, 12, 1, 0.7190654873847961, 0.2809344530105591) 0
('Genesis', 7, 22, 0, 0.36084452271461487, 0.639155387878418) 0
('Genesis', 7, 8, 1, 0.627878725528717, 0.37212127447128296) 0
('Genesis', 7, 18, 0, 0.2687183618545532, 0.7312816381454468) 0
('Genesis', 6, 6, 0, 0.13707111775875092, 0.8629289269447327) 0
('Genesis', 6, 16, 1, 0.8929886817932129, 0.1070113256573677) 0
('Genesis', 9, 23, 1, 0.9597183465957642, 0.04028170183300972) 0
('Genesis', 9, 9, 1, 0.9274868965148926, 0.07251307368278503) 0
('Genesis', 9, 19, 1, 0.5752152800559998, 0.42478474974632263) 0
('Genesis', 8, 7, 0, 0.3747083842754364, 0.6252917051315308) 0
('Genesis', 8, 17, 1, 0.7899051904678345, 0.2100948840379715) 0
('Genesis', 39, 2, 0, 0.10039543360471725, 0.8996046185493469) 0
('Genesis', 39, 12, 0, 0.06997539103031158, 0.9300246238708496) 0
('Genesis', 38, 27, 0, 0.06398095190525055, 0.9360190033912659) 0
('Genesis', 38, 5, 0, 0.10673844814300537, 0.8932614922523499) 0
('Genesis', 38, 15, 1, 0.7840783596038818, 0.21592164039611816) 0
('Genesis', 11, 27, 0, 0.35753366351127625, 0.6424664258956909) 0
('Genesis', 11, 5, 0, 0.3154466450214386, 0.6845533847808838) 0
('Genesis', 11, 15, 1, 0.5943444967269897, 0.40565553307533264) 0
('Genesis', 10, 25, 1, 0.9574434161186218, 0.04255654662847519) 0
('Genesis', 10, 3, 1, 0.5361000299453735, 0.4638999402523041) 0
('Genesis', 10, 13, 0, 0.47395142912864685, 0.5260485410690308) 0
('Genesis', 10, 32, 1, 0.7979674935340881, 0.20203247666358948) 0
('Genesis', 13, 1, 0, 0.08716055005788803, 0.9128395318984985) 0
('Genesis', 12, 10, 0, 0.09384032338857651, 0.9061596393585205) 0
('Genesis', 12, 1, 1, 0.9528328776359558, 0.04716715216636658) 0
('Genesis', 15, 10, 1, 0.9160295128822327, 0.08397053182125092) 0
('Genesis', 15, 1, 0, 0.0348999947309494, 0.9651000499725342) 0
('Genesis', 14, 24, 0, 0.031359195709228516, 0.9686408638954163) 0
('Genesis', 14, 7, 1, 0.9620340466499329, 0.03796599432826042) 0
('Genesis', 14, 17, 0, 0.12473555654287338, 0.8752644658088684) 0
('Genesis', 17, 22, 1, 0.5361000299453735, 0.4638999402523041) 0
('Genesis', 17, 8, 1, 0.7164437770843506, 0.2835562825202942) 0
('Genesis', 17, 18, 1, 0.6036794185638428, 0.39632055163383484) 0
('Genesis', 16, 2, 1, 0.8493285775184631, 0.1506713628768921) 0
('Genesis', 19, 27, 0, 0.10283444076776505, 0.8971655368804932) 0
('Genesis', 19, 5, 0, 0.19853056967258453, 0.8014694452285767) 0
('Genesis', 19, 12, 1, 0.5000796318054199, 0.49992039799690247) 0
('Genesis', 19, 36, 1, 0.675009548664093, 0.32499048113822937) 0
('Genesis', 18, 21, 0, 0.03314715996384621, 0.9668529033660889) 0
('Genesis', 18, 7, 0, 0.1406669020652771, 0.8593330979347229) 0
('Genesis', 18, 17, 0, 0.11861494183540344, 0.881385087966919) 0
('Genesis', 31, 48, 0, 0.0997927337884903, 0.9002072215080261) 0
('Genesis', 31, 21, 0, 0.016087831929326057, 0.9839121699333191) 0
('Genesis', 31, 41, 1, 0.639473021030426, 0.36052706837654114) 0
('Genesis', 31, 51, 0, 0.17244765162467957, 0.8275523781776428) 0
('Genesis', 31, 16, 1, 0.7040191292762756, 0.29598090052604675) 0
('Genesis', 31, 55, 0, 0.08255169540643692, 0.9174483418464661) 0
('Genesis', 30, 21, 1, 0.8915579319000244, 0.10844206809997559) 0
('Genesis', 30, 5, 1, 0.5361000299453735, 0.4638999402523041) 0
('Genesis', 30, 13, 0, 0.13160055875778198, 0.8683993816375732) 0
('Genesis', 30, 37, 0, 0.36590564250946045, 0.6340943574905396) 0
('Genesis', 37, 20, 0, 0.28424274921417236, 0.7157572507858276) 0
('Genesis', 37, 4, 0, 0.2605959475040436, 0.739404022693634) 0
('Genesis', 37, 14, 0, 0.20547519624233246, 0.7945247888565063) 0
('Genesis', 37, 33, 1, 0.9379627704620361, 0.062037281692028046) 0
('Genesis', 36, 22, 0, 0.28905922174453735, 0.7109407186508179) 0
('Genesis', 36, 4, 1, 0.6669361591339111, 0.3330638110637665) 0
('Genesis', 36, 12, 1, 0.8958793878555298, 0.10412059724330902) 0
('Genesis', 36, 36, 0, 0.28774622082710266, 0.712253749370575) 0
('Genesis', 35, 21, 0, 0.15672820806503296, 0.843271791934967) 0
('Genesis', 35, 7, 1, 0.5473823547363281, 0.45261770486831665) 0
('Genesis', 35, 17, 0, 0.05087278038263321, 0.9491271376609802) 0
('Genesis', 34, 22, 1, 0.5315243005752563, 0.46847572922706604) 0
('Genesis', 34, 6, 1, 0.9207045435905457, 0.07929543405771255) 0
('Genesis', 34, 16, 1, 0.8754245638847351, 0.12457545101642609) 0
('Genesis', 33, 14, 0, 0.3294190466403961, 0.6705809235572815) 0
('Genesis', 33, 4, 0, 0.12549614906311035, 0.8745038509368896) 0
('Genesis', 32, 21, 0, 0.09435401111841202, 0.905646026134491) 0
('Genesis', 32, 7, 1, 0.6714303493499756, 0.32856959104537964) 0
('Genesis', 32, 17, 1, 0.7983325719833374, 0.2016674280166626) 0
('Genesis', 50, 20, 1, 0.9269453287124634, 0.07305461168289185) 0
('Genesis', 50, 6, 0, 0.42963188886642456, 0.5703681111335754) 0
('Genesis', 50, 16, 0, 0.27898475527763367, 0.7210152745246887) 0
# (Book, Chapter, Verse, Sentiment (0 for neg, 1 for pos), % pos, % neg)
('Exodus', 24, 11, 0, 0.0404474139213562, 0.959552526473999) 0
('Exodus', 24, 3, 0, 0.04048745334148407, 0.95951247215271) 0
('Exodus', 25, 26, 1, 0.7702346444129944, 0.22976532578468323) 0
('Exodus', 25, 3, 1, 0.7048089504241943, 0.29519107937812805) 0
('Exodus', 25, 11, 0, 0.3534781336784363, 0.6465218663215637) 0
('Exodus', 25, 31, 1, 0.8588312864303589, 0.1411687284708023) 0
('Exodus', 26, 26, 1, 0.6406383514404297, 0.3593616485595703) 0
('Exodus', 26, 5, 1, 0.6772359013557434, 0.3227640986442566) 0
('Exodus', 26, 15, 1, 0.5361000299453735, 0.4638999402523041) 0
('Exodus', 26, 35, 0, 0.45115604996681213, 0.5488439798355103) 0
('Exodus', 27, 17, 0, 0.1877947449684143, 0.8122052550315857) 0
('Exodus', 27, 7, 0, 0.45978397130966187, 0.5402160286903381) 0
('Exodus', 20, 22, 1, 0.94534832239151, 0.0546516515314579) 0
('Exodus', 20, 8, 1, 0.9530900716781616, 0.046909939497709274) 0
('Exodus', 20, 18, 0, 0.4750576317310333, 0.5249423980712891) 0
('Exodus', 21, 29, 1, 0.8283783197402954, 0.17162171006202698) 0
('Exodus', 21, 11, 0, 0.10481549799442291, 0.8951845765113831) 0
('Exodus', 21, 31, 0, 0.3098178505897522, 0.690182089805603) 0
('Exodus', 22, 27, 0, 0.10764598101377487, 0.8923540115356445) 0
('Exodus', 22, 5, 0, 0.2105185091495514, 0.789481520652771) 0
('Exodus', 22, 15, 0, 0.04268207773566246, 0.9573178887367249) 0
('Exodus', 23, 26, 1, 0.9677294492721558, 0.03227050229907036) 0
('Exodus', 23, 2, 0, 0.03797952085733414, 0.9620205163955688) 0
.. output truncated ..

Surveying the results

Let’s first examine the results by examining the distribution of sentiment values broken out by chapter. The model outputted sentiment using 2 values, positive and negative, which add up to 1. To simplify things, we’ll convert this into a single net_sentiment value by subtracting the negative sentiment value from the positive. So the closer we get to +1.0, the more positive the sentiment and the closer to -1.0, vice versa. We’ll average these across each chapter and then visualize the results using a histogram.

import pandas as pd
import altair as alt

# Connect to SQLite3 database
conn = sqlite3.connect('bible.db')  # Update with the path to your SQLite database

# Query to fetch the relevant data
query = '''
    SELECT book, chapter, verse, sentiment, pos, neg
    FROM bible;  -- Replace with your actual table name
'''

# Load data into a pandas DataFrame
df = pd.read_sql_query(query, conn)

# Close the database connection
conn.close()

# Calculate net sentiment via pos - neg
df['net_sentiment'] = df['pos'] - df['neg']

# Calculate the average sentiment for each chapter
chapter_avg = df.groupby(['book', 'chapter']).agg({'net_sentiment': 'mean'}).reset_index()

# Function to generate a histogram of column values within a dataframe
def gen_histogram(df, column, title, xlabel, ylabel):

    # Calculate the mean and median of the specified column
    mean_value = df[column].mean()
    median_value = df[column].median()

    # Create the histograms
    histogram = alt.Chart(df).mark_bar().encode(
    alt.X(f'{column}:Q', bin=alt.Bin(maxbins=30), title=xlabel),
    alt.Y('count()', title=ylabel),
    tooltip=['count()']
    ).properties(
        width=900,
        height=600,
        title=(title, f'(Mean: {mean_value:.2f} Median: {median_value:.2f})')
    )
 
    # Note that
    
    # Add a vertical solid line for the mean
    mean_line = alt.Chart(pd.DataFrame({'value': [mean_value]})).mark_rule(color='red').encode(
        x='value:Q',
        tooltip=[alt.Tooltip('value:Q', title='Mean')]
    )

    # Add a vertical dashed line for the median
    median_line = alt.Chart(pd.DataFrame({'value': [median_value]})).mark_rule(color='red', strokeDash=[4, 4]).encode(
        x='value:Q',
        tooltip=[alt.Tooltip('value:Q', title='Median')]
    )

    return histogram + mean_line + median_line

# Display the histogram
gen_histogram(chapter_avg, 'net_sentiment', f'Distribution of Avg Chapter Sentiment Values (PyTorch)', 'Average Sentiment (-1 to 1)', 'Count of Chapters')

A roughly normal looking distribution, though we can see that the mean and median sentiment skews slightly negative.

Visualizing every book and chapter

Let’s visualize the sentiment analysis of every book in the Bible using a heatmap for each chapter average (as calculated above) ranging from -1 (100% negative (in red)) to +1 (100% positive (in green)).

# Create the heatmap using Altair
heatmap = alt.Chart(chapter_avg).mark_rect(stroke='white', strokeWidth=1).encode(
    x=alt.X('chapter:O', title='Chapter', axis=alt.Axis(labelAngle=70)),
    y=alt.Y('book:N', title='Book', sort=bible),
    color=alt.Color('net_sentiment:Q', scale=alt.Scale(domain=[-1, 0, 1], range=['#d73027', '#f7f7f7', '#1a9850']), title='Sentimemnt Rating'),
    tooltip=['book', 'chapter', 'net_sentiment']
).properties(
    width=1100,
    height=800,
    title='Bible Heatmap of Sentiment by Book and Chapter (PyTorch)'
)

# Display the heatmap
heatmap

1st and 2nd Samuel, the Kings, and the Gospels (Mark, Luke, John, Acts) are looking consistently fiery. On the other hand things are faring more positively with the Proverbs, Thessalonians, and 1st John.

How did our model perform at a granular level?

Let’s go deeper and take a look at the distribution of sentiment across all verses.

# Increase the row limit (in order to bypass Altair's limitation of 5000 items in a dataset)
alt.data_transformers.disable_max_rows()

# Display the histogram
gen_histogram(df, 'net_sentiment', 'Distribution of Verse Sentiment Values (PyTorch)', 'Sentiment (-1 to 1)', 'Count of Verses')

We observe a steady increase in negative sentiment going from -0.6 to -1.0 and a small spike in the neutral zone between 0.0 and 0.1.

Takeaways

As I mentioned up top, performing a sentiment analysis on the Bible using a neural network model trained on a set of 11000 rotten tomatoes movie reviews was only meant to be an academic exercise in training a PyTorch neural network. Even though a modern translation of the Bible was used, the rotten tomato lexicon contained words which simply had no relevance given the context such as filmmaker, hollywood, cinematic, and french so we understood there were limitations going in.

We probably could have benefitted from incorporating a 3rd classification of neutral since many bible verses don’t necessarily have a strong sentiment.

We went through many iterations of fine tuning to mitigate polarized value outputs that were emerging during the initial phases of training. A variety of techniques were employed including: - Adjusting model complexity (number of layers and nodes) - Adding dropouts to promote more even use of neurons and prevent overfitting - Added L2 regularization to improve generalization performance by normalizing weights (and prevent overfitting)

There were other tweaks that were ultimately not included because they didn’t provide any positive enhancements and it reinforced just how many options one has when fine-tuning a deep neural network, equal part art and science.

Some things to consider for a follow-up would be utilizing tensorboard logging to track results and better parameter management for fine-tuning, as opposed to depending on jupyter notebook snapshots.

This was a fun exercise that’s given me some food for thought for future experiments. Stay tuned!