Who Wrote Hebrews? Using Models to Guess

A while back I ran across an article that used machine learning to estimate who wrote several of The Federalist Papers whose authorship was disputed. It occurred to me that it’d be interesting, if not necessarily fruitful or conclusive, to run the same type of analysis on Hebrews.

*If you said “The Holy Spirit wrote it”, I wouldn’t argue. But that would also cut this exercise very short.

I could only find a clean, workable version of the Bible in the King James Version. Already, note that we’re going to be running text mining algorithms on a translation, and an older one at that. Not exactly ideal, but learning Greek is out of scope for this project, so we’ll press on.

*I use the ESV daily, but the larger point is that any English translation is a step removed from the original language, so certain nuances that would be helpful may be lost. This is why we let scholars and theologians, not data analysts, handle this sort of thing.

Note that we’ll limit the analysis to the New Testament only. This works on the assumption that the author of Hebrews also wrote one of the other New Testament books, something we’ll get back to later. On the technical side, I swapped the Random Forest used in the Federalist study for XGBoost, since it tends to be a bit more robust.


library(gt)
library(gtExtras)
library(hrbrthemes)
library(scriptuRs)
library(textrecipes)
library(themis)
library(tidymodels)
library(tidytext)
library(tidyverse)

set.seed(2025050)

new_testament_authors <- tibble(
  book_title = c(
    "Matthew", "Mark", "Luke", "John", "Acts",
    "Romans", "1 Corinthians", "2 Corinthians", "Galatians", "Ephesians",
    "Philippians", "Colossians", "1 Thessalonians", "2 Thessalonians",
    "1 Timothy", "2 Timothy", "Titus", "Philemon", "Hebrews",
    "James", "1 Peter", "2 Peter", "1 John", "2 John", "3 John",
    "Jude", "Revelation"
  ),
  author_name = c(
    "Matthew", "Mark", "Luke", "John", "Luke",
    "Paul", "Paul", "Paul", "Paul", "Paul",
    "Paul", "Paul", "Paul", "Paul",
    "Paul", "Paul", "Paul", "Paul", "Unknown",
    "James", "Peter", "Peter", "John", "John", "John",
    "Jude", "John"
  )
)

nt <- kjv_bible() %>%
  filter(volume_title == 'New Testament') %>%
  select(book_title, verse_title, text) %>%
  inner_join(new_testament_authors, by = "book_title")

nt_tidy_text <- nt %>%
  unnest_tokens(word, text)

nt_dtm <- nt_tidy_text %>%
  count(book_title, word, sort = TRUE) %>%
  cast_sparse(book_title, word, n)

data_split <- nt %>%
  filter(author_name != 'Unknown') %>%
  initial_split(strata = author_name)

training_data <- training(data_split)
validation_data <- testing(data_split)

testing_data <- nt %>%
  filter(author_name == "Unknown")

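# Trigram TF-IDF features. step_upsample() is skipped when baking new data,
# so only the training set is rebalanced.
rec <- recipe(author_name ~ text, data = training_data) %>%
  step_tokenize(text, token = "ngrams", options = list(n = 3)) %>%
  step_tokenfilter(text, max_tokens = 250) %>%
  step_tfidf(text) %>%
  step_upsample(author_name) %>%
  prep()

# bake(new_data = NULL) returns the processed training set (replaces juice()).
train_data <- bake(rec, new_data = NULL)
val_data <- bake(rec, new_data = validation_data)
test_data <- bake(rec, new_data = testing_data)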

xgb_spec <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_model <- xgb_spec %>%
  fit(author_name ~ ., data = train_data)

xgb_predict <- testing_data %>% 
  bind_cols(predict(xgb_model, new_data = test_data)) 

# Count how many Hebrews verses were assigned to each candidate author.
xgb_predict %>%
  count(.pred_class) %>%
  ggplot(aes(x = n, y = reorder(.pred_class, n))) +
  geom_col(fill = "#A0522D", color = "black") +
  theme_ipsum() +
  ylab("Predicted Author") +
  xlab("Number of verses predicted")
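
The code above holds out a validation split but never scores it, so there’s no way to tell how well the model separates the authors we do know. Here is a minimal sketch of that check in base R, using made-up labels rather than the real predictions. The `truth` and `pred` vectors are placeholders. In the pipeline above they would presumably come from `val_data$author_name` and `predict(xgb_model, new_data = val_data)$.pred_class`.

```r
# Made-up held-out labels and predictions, purely for illustration.
authors <- c("John", "Luke", "Paul")
truth <- factor(c("Paul", "Paul", "John", "Luke", "John", "Paul", "Luke", "Luke"),
                levels = authors)
pred  <- factor(c("Paul", "John", "John", "Luke", "Paul", "Paul", "Luke", "Paul"),
                levels = authors)

# Share of held-out verses assigned to the right author.
accuracy <- mean(truth == pred)

# Rows are true authors, columns are predicted authors.
confusion <- table(truth = truth, predicted = pred)

# Per-author recall: of each author's verses, how many were recovered.
recall <- diag(confusion) / rowSums(confusion)

print(accuracy)
print(confusion)
print(round(recall, 2))
```

If one author soaks up most of the mistakes in a table like this, a tally that leans toward that author on Hebrews deserves less weight.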

XGBoost believes that Paul wrote Hebrews, which is the traditional attribution. Luther, who doubted Paul’s authorship and suggested Apollos instead, would probably be unconvinced, not to mention befuddled about what exactly a machine learning model (or a computer, for that matter) is.

However, the model is missing Apollos as a candidate. Why? Because it was only trained on authors we already know - the idea of it being somebody else is outside of the possibilities we gave it. Maybe - maybe - if we somehow got writing samples from everybody alive at the time, we could build a model that would account for that, but we’d likely get some false positives as well.
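
To make the closed-set problem concrete, here is a toy sketch in base R. It is not the XGBoost model from this post. It is a hypothetical nearest-centroid classifier over invented word-frequency profiles, and every name and number in it is an assumption for illustration. It shows that a classifier trained on a fixed list of authors has to hand back one of them, even for a text that resembles none of them.

```r
# Invented word-frequency profiles for two "known" authors.
known <- rbind(
  paul = c(faith = 0.6, law = 0.3, light = 0.1),
  john = c(faith = 0.2, law = 0.1, light = 0.7)
)

# Nearest-centroid classifier: returns the closest known author
# and the Euclidean distance to that author's profile.
classify <- function(x, centroids) {
  d <- sqrt(rowSums(sweep(centroids, 2, x)^2))
  list(author = names(which.min(d)), distance = min(d))
}

# A text from a hypothetical third author whose profile matches neither.
unseen_author_text <- c(faith = 0.1, law = 0.8, light = 0.1)

result <- classify(unseen_author_text, known)

# The label is always one of rownames(known); "someone else" is not an option.
print(result$author)
print(result$distance)
```

The distance hints that the fit is poor, but the label itself never says so. One could reject predictions beyond some distance cutoff, though choosing that cutoff is its own assumption and brings the false positives mentioned above.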

The point here, outside of having some fun with modeling, is that your model very much depends on your assumptions. We don’t know for sure who wrote Hebrews - it may have been Paul, it may have been someone else. Our model assumed it was an author we knew, so its answer may be right, but it’d be right for the wrong reasons. When building models - machine learning or mental models while trying to answer a question - always check your assumptions.