CAtN: Nanogenmo

catn narrative computational-approaches-to-narrative markov-chains tape

When the winter nights came on, they used to bring home turf from the precipice Of someone immense in passion, pulse, and power, The genius of poets of old lands, they do not think they slighted upon any account; and they did not see, but being blind, believed.

Nanogenmo (National Novel Generation Month) is an annual collaborative activity where participants are encouraged to write code to generate a novel, then share the output and code.

The Nanogenmo definition of a novel is very broad: the only requirement is that the output is more than 50k words.

The quoted section above is the opening of my generated novel.

Methodology

My methodology was inspired by the world of audio engineering, specifically audio mixing and modulation (cyclical) effects.

I focused on the ability to make smaller Markov models and combine them.

  • Input text is preprocessed to remove names and gendered pronouns to make models overlap more
  • An input text is treated as a sort of tape by splitting it into many small, sequential models
  • A moving window is used as a sort of tape head
    • The window size for each track is normalized to the document length
  • Several tapes are read in parallel and mixed down into a single model
    • Each track is weighted by its sentence count, so longer documents have more weight
  • A single sentence is printed at a time from the mixed model
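
The tape, head, and mixdown steps above can be sketched in a few lines. This is a simplified illustration, not the code used for the novel: it swaps markovify for plain dictionaries of word-to-next-word counts (an order-1 chain rather than the order-2 models in the appendix), and the helper names `sentence_model`, `window`, and `mix` are made up for the example.

```python
from collections import Counter, defaultdict


def sentence_model(sentence):
    """Count word-to-next-word transitions in one sentence (order-1 chain)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    model = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return model


def mix(models, weights=None):
    """Sum transition counts across models, scaling each model by its weight."""
    weights = weights or [1.0] * len(models)
    mixed = defaultdict(Counter)
    for model, weight in zip(models, weights):
        for state, followers in model.items():
            for word, count in followers.items():
                mixed[state][word] += count * weight
    return mixed


def window(tape, progress, width):
    """Slice of the tape under the head; progress and width are fractions of tape length."""
    start = int(progress * len(tape))
    end = max(int((progress + width) * len(tape)), start + 1)
    return tape[start:end]


# Two toy "tapes": each text becomes a sequence of one-sentence models.
tape_a = [sentence_model(s) for s in ["the cat sat", "the dog ran", "a bird sang", "the end came"]]
tape_b = [sentence_model(s) for s in ["the sea rose", "the wind fell"]]
tapes = [tape_a, tape_b]

# Longer tapes get more weight, scaled so the longest tape has weight 1.
longest = max(len(tape) for tape in tapes)
weights = [len(tape) / longest for tape in tapes]

# Head at the start of every tape, covering half of each tape.
tracks = [mix(window(tape, progress=0.0, width=0.5)) for tape in tapes]
mixed = mix(tracks, weights)
```

In the mixed model, `the` is followed by `cat` and `dog` with weight 1.0 each and by `sea` with weight 0.5, because the shorter tape is mixed in at half volume. Because the window is a fraction of each tape's length, every tape reaches its end at the same point in the output regardless of how long it is.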

For inputs, I used Gutenberg texts available in NLTK, a Python package for NLP.

  • austen-emma.txt
  • austen-persuasion.txt
  • austen-sense.txt
  • blake-poems.txt
  • bryant-stories.txt
  • burgess-busterbrown.txt
  • carroll-alice.txt
  • chesterton-ball.txt
  • chesterton-brown.txt
  • chesterton-thursday.txt
  • edgeworth-parents.txt
  • milton-paradise.txt
  • shakespeare-caesar.txt
  • shakespeare-hamlet.txt
  • shakespeare-macbeth.txt
  • whitman-leaves.txt

Next Steps

The next step would be to develop modulation effects.

First, I would need modulation sources, e.g. a sine or triangle wave signal. Then I could modulate variables such as tape head width and model weight.
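
As a rough sketch of what such modulation sources could look like, the functions below generate sine and triangle signals in the range 0 to 1 and use them to swing a parameter around a base value. Everything here is an assumption made for illustration: the function names, the four cycles per novel, and the depth values are hypothetical, and none of it is part of the existing code.

```python
import math


def sine_lfo(phase):
    """Sine wave in [0, 1]; phase is measured in cycles."""
    return 0.5 + 0.5 * math.sin(2 * math.pi * phase)


def triangle_lfo(phase):
    """Triangle wave in [0, 1]; rises for half a cycle, then falls."""
    position = phase % 1.0
    return 2 * position if position < 0.5 else 2 * (1 - position)


def modulate(base, depth, lfo_value):
    """Swing a parameter around its base value by up to +/- depth."""
    return base + depth * (2 * lfo_value - 1)


sentence_target = 3500  # same value as in the appendix
cycles = 4              # hypothetical: LFO cycles over the whole novel


def head_width(i):
    """Tape head width in sentences for output sentence i (50 +/- 30, hypothetical)."""
    phase = cycles * i / sentence_target
    return round(modulate(base=50, depth=30, lfo_value=sine_lfo(phase)))


def track_weight(i):
    """Multiplier for one track's weight at output sentence i (1.0 +/- 0.5, hypothetical)."""
    phase = cycles * i / sentence_target
    return modulate(base=1.0, depth=0.5, lfo_value=triangle_lfo(phase))
```

A value like `head_width(i)` would stand in for the fixed window of 50 sentences in the appendix loop, and `track_weight(i)` could scale one entry of `weights` before the mixdown. Giving each track a different phase offset would let texts fade in and out against each other.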

Alternatively, it might be worth trying to replicate time-based effects such as delay, since these might introduce some interesting recursive effects.
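
One possible reading of a delay effect, sketched under heavy assumptions: each generated sentence is written to a delay line and mixed back into the sources a fixed number of steps later at reduced weight, so echoes of echoes accumulate. The `generate` argument is a stand-in for model mixing and sentence generation, and `run_with_delay`, `join_sources`, `delay`, and `feedback` are hypothetical names chosen for this example.

```python
from collections import deque


def run_with_delay(inputs, generate, delay=2, feedback=0.5):
    """Feed each output back into the mix `delay` steps later at weight `feedback`.

    `delay` must be at least 1.
    """
    line = deque([None] * delay, maxlen=delay)  # oldest entry sits at index 0
    outputs = []
    for source in inputs:
        echo = line[0]  # what was generated `delay` steps ago
        sources = [(source, 1.0)]
        if echo is not None:
            sources.append((echo, feedback))
        result = generate(sources)
        line.append(result)  # maxlen pushes the oldest entry out
        outputs.append(result)
    return outputs


def join_sources(sources):
    """Stand-in generator: joins the source labels so the feedback path is visible."""
    return "+".join(text for text, _weight in sources)


outputs = run_with_delay(["one", "two", "three", "four", "five"], join_sources)
# outputs == ["one", "two", "three+one", "four+two", "five+three+one"]
```

The last output contains the third output, which itself contains the first: that is the recursive part. In a real version, the echo would presumably be turned into a small model and combined with the current window at weight `feedback`.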

Appendix: Code

import datetime
import itertools

from nltk.corpus import gutenberg
import spacy
import markovify


def doc_to_text(doc) -> str:
    """Preprocess text: replace proper nouns and pronouns with neutral stand-ins."""

    # Penn Treebank tag -> replacement word
    replacements = {
        "NNP": "someone",  # singular proper noun
        "NNPS": "they",  # plural proper noun
        "PRP": "they",  # personal pronoun (every one, not only gendered ones)
        "PRP$": "their",  # possessive pronoun
    }

    text_parts = []

    for tok in doc:
        if tok.tag_ in replacements:
            # keep the token's trailing whitespace so the spacing survives
            text_parts.append(replacements[tok.tag_] + tok.whitespace_)
        else:
            text_parts.append(tok.text_with_ws)

    anon_text = "".join(text_parts)

    # collapse consecutive duplicate words, e.g. "someone someone" from a two-word name
    split_words = anon_text.split(" ")
    no_consec_duplicates = [word for word, _ in itertools.groupby(split_words)]
    output_text = " ".join(no_consec_duplicates)

    return output_text


sentence_target = 3500

nlp = spacy.load("en_core_web_lg")

nltk_gutenberg_text_names = [
    "austen-emma.txt",
    "austen-persuasion.txt",
    "austen-sense.txt",
    "blake-poems.txt",
    "bryant-stories.txt",
    "burgess-busterbrown.txt",
    "carroll-alice.txt",
    "chesterton-ball.txt",
    "chesterton-brown.txt",
    "chesterton-thursday.txt",
    "edgeworth-parents.txt",
    "milton-paradise.txt",
    "shakespeare-caesar.txt",
    "shakespeare-hamlet.txt",
    "shakespeare-macbeth.txt",
    "whitman-leaves.txt",
]

data = [
    {"name": name, "raw": gutenberg.raw(name)} for name in nltk_gutenberg_text_names
]

# parse each text document with spacy
for record in data:
    doc = nlp(record["raw"])
    record.update(dict(doc=doc))

# break down each document into a list of mixable, one sentence models
for record in data:
    doc = record["doc"]

    sents = list(doc.sents)

    sent_texts = [doc_to_text(sent) for sent in sents]

    single_sentence_models = []

    for sent_text in sent_texts:
        try:
            model = markovify.Text(sent_text, state_size=2)
            single_sentence_models.append(model)
        except Exception:
            # skip sentences that markovify cannot build a model from
            pass

    record["single_sentence_models"] = single_sentence_models

outputs = []

# scale the weight of each document by its sentence count (longest document = 1.0)
max_len = max([len(record["single_sentence_models"]) for record in data])
weights = [len(record["single_sentence_models"]) / max_len for record in data]

for i in range(sentence_target):
    # tape head position and window end, as fractions of the output length
    progress = i / sentence_target
    end_window_norm = (i + 50) / sentence_target
    book_models = []
    for record in data:
        # map the normalized window onto this document's sentence models
        sentence_count = len(record["single_sentence_models"])
        start = int(progress * sentence_count)
        end = int(end_window_norm * sentence_count)
        end = max(end, start + 1)  # always read at least one sentence
        combined_model = markovify.combine(record["single_sentence_models"][start:end])
        book_models.append(combined_model)
    # mix the per-document tracks down into a single model
    multi_model = markovify.combine(book_models, weights)
    new_sent = multi_model.make_sentence(tries=1000)
    if new_sent:  # make_sentence returns None when every try fails
        outputs.append(new_sent)

output_text = " ".join(outputs)

timestamp = str(int(datetime.datetime.now().timestamp()))
filename = "novel_" + timestamp + ".txt"

with open(filename, "w", encoding="utf-8") as text_file:
    text_file.write(output_text)