16 Dec 2022

Using Stylometry to Determine the Author of a Quote

You can often identify the likely author of a quote by analyzing its writing style and comparing it with the styles of known authors. This technique is called stylometry.

Stylometry is the study of linguistic style, which involves analyzing the statistical patterns and characteristics of a person’s writing in order to identify the author or determine the authenticity of a document. It can be used to compare the writing styles of different authors or to determine the authorship of an anonymous or disputed document.

To do this in Python, you can use a library such as nltk (Natural Language Toolkit) or scikit-learn to extract linguistic features from the quotes, such as word frequencies, n-grams, or readability scores. (stylo, a well-known stylometry package, is written for R rather than Python.) You can then train a model on the extracted features and use it to predict the author of the quote: supervised learning fits when you have quotes labeled with their authors, while unsupervised learning groups quotes by similarity without labels.
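To make the idea of "features" concrete, here is a minimal sketch that counts character n-grams and relative word frequencies using only the standard library. The function names `char_ngrams` and `relative_word_freq` are made up for this illustration, and the whitespace tokenization is a simplifying assumption; a real project would likely use a proper tokenizer.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_word_freq(text):
    """Map each whitespace-separated word to its share of all words."""
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


sample = "the cat and the hat"
print(char_ngrams(sample, 3).most_common(3))
print(relative_word_freq(sample)["the"])
```

Relative frequencies are used instead of raw counts so that a long quote and a short quote by the same author produce comparable numbers.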

The overall workflow looks like this:

  1. Collect a set of quotes by the two authors that you want to compare. These quotes should be representative of the authors’ typical writing styles.

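  2. Preprocess the quotes by removing non-linguistic elements (e.g. numbers, special characters) and converting them to lowercase. Punctuation can also be stripped for word-frequency features, but compute any features that depend on it, such as sentence lengths, first.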

  3. Extract features from the quotes that can be used to represent the authors’ writing styles. These features could include things like word counts, word frequencies, sentence lengths, and grammatical structures.

  4. Use a machine learning algorithm to train a classifier on the feature vectors of the quotes by the two authors. This classifier should be able to predict the author of a quote based on its features.

  5. Use the trained classifier to predict the author of the quote that you want to identify.
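Step 3 mentions features beyond plain word counts. As a rough sketch, the function below computes a few of them with the standard library. The name `style_features` and the tiny `FUNCTION_WORDS` set are assumptions made for this example, not a standard list, and the sentence splitting is deliberately naive. It works on the raw quote, because stripping punctuation first would remove the sentence boundaries.

```python
import re

# Small illustrative set of English function words (an assumption, not a standard list)
FUNCTION_WORDS = {"the", "a", "an", "and", "but", "of", "in", "to", "that", "it"}


def style_features(text):
    """Return a few simple style measurements for a piece of text."""
    # Naive sentence split on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not words or not sentences:
        return {}
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        "function_word_rate": sum(w in FUNCTION_WORDS for w in words) / len(words),
    }


print(style_features("It was the best of times. It was the worst of times."))
```

Each quote becomes a small dictionary of numbers, which is the kind of feature vector a classifier in step 4 can be trained on.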

Here is an example of a Python script that uses nltk and supervised learning to determine the author of a quote:


import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer models and the stop-word list (only needed once)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer from this resource
nltk.download('stopwords')

# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

# Function to extract features from a quote
def extract_features(quote):
    # Lowercase before tokenizing so tokens match the lowercase stop-word list
    tokens = word_tokenize(quote.lower())
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]

    # Compute word frequencies
    word_freq = nltk.FreqDist(filtered_tokens)

    # Use each remaining word's count as a feature
    return dict(word_freq)

# Training quotes
quotes = [
    ('Author 1', 'This is a quote by Author 1.'),
    ('Author 1', 'Another quote by Author 1.'),
    ('Author 2', 'This is a quote by Author 2.'),
    ('Author 2', 'Another quote by Author 2.'),
]

# Extract features from the training quotes
featuresets = [(extract_features(quote), label) for label, quote in quotes]

# Split the data into training and test sets (a toy split: three quotes to train on, one to test on)
train_set, test_set = featuresets[:3], featuresets[3:]

# Train a classifier using the training data
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Test the classifier on the test data
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f'Accuracy: {accuracy:.2f}')

# Predict the author of a new quote
new_quote = 'This is a new quote.'
prediction = classifier.classify(extract_features(new_quote))
print(f'Predicted author: {prediction}')
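If you are curious what a Naive Bayes classifier does with word counts, here is a simplified standard-library sketch. It is a multinomial Naive Bayes with add-one smoothing, which is not the exact estimator NLTK uses internally. The names `train_nb` and `predict_nb` and the sample tokens are invented for this illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """samples is a list of (label, tokens) pairs; returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in samples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the best label and the log-score of every label."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior: how common this author is in the training data
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen (author, word) pairs from zeroing the score
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get), scores


# Made-up token lists standing in for real quotes
samples = [
    ("Author 1", ["whale", "sea", "ship"]),
    ("Author 1", ["sea", "captain"]),
    ("Author 2", ["tea", "garden", "ball"]),
    ("Author 2", ["garden", "letter"]),
]
model = train_nb(samples)
best, scores = predict_nb(model, ["sea", "ship"])
print(best)
```

The author whose training quotes make the new quote's words most probable wins, which is why more quotes per author generally give more reliable predictions.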

Here is another example that follows the steps above, this time using scikit-learn's TfidfVectorizer and LinearSVC:

import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# List of quotes by two different authors (placeholders; replace with real text)
quotes = [
    ("Author 1", "This is a quote by Author 1."),
    ("Author 1", "Another quote by Author 1."),
    ("Author 2", "This is a quote by Author 2."),
    ("Author 2", "Another quote by Author 2.")
]

# Preprocess the quotes by removing non-linguistic elements and converting to lowercase
def preprocess(quote):
    # Remove punctuation and numbers
    quote = quote.translate(str.maketrans('', '', string.punctuation + string.digits))
    # Convert to lowercase
    return quote.lower()

# Train a TF-IDF vectorizer and a classifier on the preprocessed quotes
def train_classifier(texts, labels):
    # TfidfVectorizer expects raw strings and computes the weighted word frequencies itself
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    classifier = LinearSVC()
    classifier.fit(X, labels)
    # Return the fitted vectorizer too: new quotes must be transformed with the same vocabulary
    return vectorizer, classifier

# Preprocess the quotes
preprocessed_quotes = [(author, preprocess(quote)) for author, quote in quotes]

# Separate the texts from their labels
texts = [quote for _, quote in preprocessed_quotes]
labels = [author for author, _ in preprocessed_quotes]

# Train the vectorizer and the classifier
vectorizer, classifier = train_classifier(texts, labels)

# Predict the author of a quote, preprocessing it the same way as the training quotes
quote = "This is an unknown quote."
X_new = vectorizer.transform([preprocess(quote)])
prediction = classifier.predict(X_new)[0]
print(f"Predicted author: {prediction}")