Automated generation of coherent text is an area of Natural Language Processing (NLP) that has garnered a great deal of attention over the past several years. Several state-of-the-art language models have been developed that can automatically generate text at a quality approaching that of human-written text. The possible applications are wide-ranging, and while many potential use cases are seemingly benign (e.g., automated summarization of long texts, generation of sporting event recaps, generation of text for entertainment purposes), other applications such as generation of propaganda and fake news stories have been identified as real risks associated with this technology.
For this project, I will be using a powerful language model known as GPT-2 to automatically generate poetry. In the spirit of Halloween, I will be training the model using the works of one of America’s spookiest poets: Edgar Allan Poe. The goal of this project is to automatically generate grouped lines of poetry (known as stanzas) that closely resemble the writing style of Edgar Allan Poe. In an ideal world, an expert would not be able to distinguish the automatically generated poems from real poems written by Poe. Let’s see how close to that goal we can get. Code for this project is saved in this GitHub repository.
The source for all of the poems we will be using to fine-tune the pre-trained model is The Complete Poetical Works of Edgar Allan Poe, a collection of poems and essays written by Poe. An HTML version of this poetry collection is available in the public domain via Project Gutenberg. Poem text and titles were scraped from the collection using BeautifulSoup. The code I used to scrape the poems can be found in the GitHub repository for this project.
We want our model to generate complete stanzas of poetry, rather than prose. As a result, we will remove prose entries scraped from the collection from our training data. These include essays, dialogue for plays, notes/commentary, and prose poems. While prose poems are technically a form of poetry, we are omitting them to try and maintain a lyrical style to the poem stanzas that are generated.
Poetry text was scraped and split into stanzas containing multiple lines. Each stanza of text will serve as an individual text sequence when input into the model for training. The cleaned stanza data is saved to a CSV file that we will load for model training.
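As a rough sketch of the loading step, the stanzas could be read with Python’s standard csv module. The column name `text` and the helper name `load_stanzas` are assumptions made for illustration; the names used in the project’s repository may differ.

```python
import csv

def load_stanzas(csv_path, text_column="text"):
    """Read one stanza per row from a CSV file, skipping empty entries.

    The column name is a hypothetical default, not taken from the project.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        stanzas = []
        for row in reader:
            text = (row.get(text_column) or "").strip()
            if text:  # ignore rows without any stanza text
                stanzas.append(text)
        return stanzas
```

Opening the file with `newline=""` lets the csv module handle stanzas whose lines are separated by embedded newline characters.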
GPT-2 & Transformers
Generative Pre-Trained Transformer 2 (GPT-2) is a transformer-based language model developed by OpenAI. The model generated a buzz after its creators’ initial decision not to release the full trained model due to fear of its potential misuse. Smaller versions of the model were released beginning in February 2019, and the full model was eventually released in November 2019.
Text used to train the model was obtained by scraping outbound links posted on Reddit that had received at least 3 karma. Over 8 million web documents were used to train GPT-2, and the full model (“extra large”) includes over 1.5 billion parameters. Smaller versions of the model contain 762 million (“large”), 345 million (“medium”), and 117 million (“small”) parameters.
GPT-2 is an unsupervised transformer language model. Let’s break that phrase apart to get a better understanding of how GPT-2 works. Language models are simply machine learning models that take text as an input and attempt to predict the next word in a sequence using probability distributions. The goal of a language model is to identify contextual relationships between different words and to use this contextual information to develop text predictions.
A Transformer is a type of machine learning model used in NLP. The Transformer model was originally proposed in 2017 in a paper entitled Attention is All You Need. You can read more about how transformers work in this article. Transformers currently form the basis for essentially all state-of-the-art language models, although their architecture has been modified since their original development.
GPT-2 and other language models utilize unsupervised learning methods on a large quantity of data during training. Unsupervised learning methods look for patterns within a set of data rather than trying to identify a relationship between data and an associated set of labels. This latter method is known as supervised learning, and it is common throughout other branches of machine learning. Using unsupervised learning methods allows language models to develop a high number of features that abstractly represent basic rules of grammar and spelling. These features generalize well to unseen text, allowing them to be fine-tuned on target selections of text and applied to a number of different NLP tasks, including automated text generation.
This article only provides a very basic description of how GPT-2 was created. You can find more detailed information about GPT-2 in the fantastic article The Illustrated GPT-2 (Visualizing Transformer Language Models).
Also, it should be noted that more powerful language models are being developed constantly, and OpenAI announced development of GPT-3 in 2020. At this time GPT-3 has been licensed to Microsoft and is not available for public use.
To begin, we will load all of the libraries we will need to train our model and define some constants that we will be using.
The MAX_LEN constant refers to the maximum acceptable length of each text sample provided to the model for training. Text samples containing fewer tokens than this value will be padded to meet this length, while samples that exceed this value will be truncated. Selection of this value is critical, as some pre-trained text models (including GPT-2) are only able to process text sequences up to a specific length.
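The padding and truncation behaviour described above can be sketched in a few lines of plain Python. The function name and the example values are hypothetical; in practice the tokenizer usually performs this step. For reference, GPT-2 can process at most 1024 tokens per sequence, so MAX_LEN must not exceed that.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    """Return a list of exactly max_len token IDs."""
    kept = token_ids[:max_len]                 # drop tokens beyond max_len
    padding = [pad_id] * (max_len - len(kept)) # fill short samples with pad_id
    return kept + padding
```

For example, with a maximum length of 5 and a padding ID of 0, the sequence `[1, 2, 3]` becomes `[1, 2, 3, 0, 0]`.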
Tokenization is the process of converting a string of characters or words into a series of sub-components (tokens), or “building-blocks” that can be used to reconstruct the original string of characters or words. A simple example of a set of tokens is the English language alphabet. Each character represents a token, and these tokens can be arranged in meaningful ways to form words. In the same way, the English language vocabulary represents a series of tokens that can be combined to form sentences.
Both character and word-based tokenization methods are used in NLP tasks, but Transformer-based language models often use a different type of tokenization: byte-pair encoding. Byte-pair encoding splits a piece of text into character sequences, rather than individual characters or full words. This method develops tokens for combinations of characters that frequently occur together. Use of byte-pair encoding for language modeling has been found to produce superior results compared to other tokenization methods.
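To make the idea concrete, here is a toy sketch of how byte-pair encoding learns its tokens: it starts from single characters and repeatedly merges the most frequent adjacent pair. This is a simplified character-level illustration, not GPT-2’s actual tokenizer (which applies the same idea to bytes and uses a large pre-learned merge table), and the function name is hypothetical.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, list(corpus)
```

Calling `learn_bpe_merges(["more", "more", "more", "nevermore"], 3)` ends with "more" represented as a single token, because its character pairs are the most frequent in this tiny corpus.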
Let’s load the GPT-2 tokenizer that we will be using and add a few special tokens. These tokens will be used to represent the beginning (<BOS>) and end (<EOS>) of each sequence of text (in this case, each poem stanza). An additional padding token (<PAD>) is created so that all entries can be padded to the same length for model training. Note that we are loading the tokenizer using the small version of the pre-trained GPT-2 model due to memory limitations.
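A minimal sketch of this step using the Hugging Face transformers library is shown below. The function name `load_poe_tokenizer` is hypothetical, and the sketch assumes that "gpt2" (the identifier of the small pre-trained model) is the checkpoint being loaded.

```python
SPECIAL_TOKENS = {
    "bos_token": "<BOS>",  # marks the start of a stanza
    "eos_token": "<EOS>",  # marks the end of a stanza
    "pad_token": "<PAD>",  # fills stanzas up to a common length
}

def load_poe_tokenizer(model_name="gpt2"):
    """Load the GPT-2 tokenizer and register the three special tokens."""
    # Imported here so the download-dependent code only runs when called.
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    tokenizer.add_special_tokens(SPECIAL_TOKENS)
    return tokenizer
```

Because new tokens are added to the vocabulary, the model’s token embeddings have to be resized to match the tokenizer before training.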
In order to train our model, we will need to load all of our poem stanza data into a customized Dataset object that inherits from PyTorch’s Dataset class. The Dataset will take a list of poem stanzas as input and apply the GPT-2 tokenizer to each one. It will automatically insert the special beginning-of-sequence and end-of-sequence tokens at the beginning and end of each stanza. Additionally, it will pad or truncate each stanza based on the MAX_LEN value that was defined previously. Finally, it will define input IDs and attention masks for each poem stanza. Now all of this data can be accessed easily via the Dataset object.
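The structure of such a Dataset might look like the sketch below. To keep it dependency-free, this version is a plain Python class that follows the same `__len__`/`__getitem__` protocol and stores plain lists; the article’s version inherits from PyTorch’s Dataset and would typically convert the lists to tensors. The class name and the exact tokenizer arguments are assumptions.

```python
class PoemDataset:
    """Map-style dataset holding one tokenized stanza per index."""

    def __init__(self, stanzas, tokenizer, max_len):
        self.input_ids = []
        self.attn_masks = []
        for stanza in stanzas:
            # Wrap the stanza in the special tokens, then let the tokenizer
            # truncate or pad the sequence to exactly max_len tokens.
            encoded = tokenizer(
                "<BOS>" + stanza + "<EOS>",
                truncation=True,
                max_length=max_len,
                padding="max_length",
            )
            self.input_ids.append(encoded["input_ids"])
            self.attn_masks.append(encoded["attention_mask"])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # The attention mask is 1 for real tokens and 0 for padding.
        return self.input_ids[idx], self.attn_masks[idx]
```

Note that when a stanza is longer than the maximum length, truncation removes tokens from the end, so the end-of-sequence token may be cut off for very long stanzas.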
Train / Validation Split
Now that we have created a Dataset with all of our poem stanzas, we will need to split the full combined Dataset into individual training and validation Datasets. Training data will be used to train the model, while validation data will be used to validate its performance over the training process. We will be using an 80/20 training/validation split.
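An 80/20 split can be sketched in plain Python as follows. The function name and the seed value are hypothetical, and this is an illustration of the idea rather than the article’s code; PyTorch’s `random_split` utility offers similar behaviour for Dataset objects.

```python
import random

def train_val_split(items, train_fraction=0.8, seed=42):
    """Shuffle the items reproducibly and split them into two lists."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)       # local RNG, fixed seed
    n_train = int(train_fraction * len(items))
    train = [items[i] for i in indices[:n_train]]
    val = [items[i] for i in indices[n_train:]]
    return train, val
```

Shuffling before splitting keeps the validation set from consisting only of the last poems in the collection.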
Some elements of our model training process are stochastic, meaning that our training results will differ every time we run the training script. In order to ensure that our results are repeatable, we will need to define a random seed value to be used for all stochastic processes employed during training. This ensures that random values are generated in the same manner each time the training script runs, providing repeatable results. We will need to set this random seed value in several places prior to model training.
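One common way to set the seed “in several places” is a small helper like the sketch below. The helper name is hypothetical, and the NumPy and PyTorch calls are only made when those libraries are installed; the article’s script may seed a different set of libraries.

```python
import random

def set_seed(seed):
    """Seed every random number generator the training script may use."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
```

Calling `set_seed` with the same value before two runs makes Python’s own random draws identical across those runs; some GPU operations can still introduce small run-to-run differences.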
Each of our Datasets (training and validation) will need to be divided into individual batches for training. This is easily accomplished by instantiating objects using PyTorch’s DataLoader class. Each DataLoader object takes a Dataset as input and samples it into individual batches based on a user-defined batch size.
We will utilize PyTorch’s RandomSampler to randomly distribute training text among batches. This improves the probability that the stanzas contained in each batch will be from different poems, instead of from the same poem. Doing so will improve the model’s ability to generalize.
Since our validation Dataset will be used only to evaluate the model performance, we do not need to worry about randomizing the data when creating batches. As a result, we will use PyTorch’s SequentialSampler when defining our validation DataLoader. This sampler simply creates batches based on the input sequence rather than applying randomization.
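The difference between the two samplers comes down to the order in which example indices are grouped into batches. The plain-Python sketch below illustrates that behaviour only; it is not PyTorch’s implementation, and the function name is hypothetical.

```python
import random

def make_batches(dataset_size, batch_size, shuffle, seed=None):
    """Group example indices into batches, optionally in random order."""
    indices = list(range(dataset_size))
    if shuffle:
        # Random order, as a random sampler would produce for training.
        random.Random(seed).shuffle(indices)
    # Without shuffling, batches follow the original order (sequential).
    return [indices[i:i + batch_size] for i in range(0, dataset_size, batch_size)]
```

In the training script itself, the equivalent would be something like `DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=BATCH_SIZE)`, where `train_dataset` and `BATCH_SIZE` are assumed names.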
Before we start training our model, we will need to define a few additional elements. We will be reviewing the amount of time it takes to complete each training epoch as a means of evaluating overall model performance. To help make the elapsed time easier to interpret, we will define a helper function that formats the time for improved readability.
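A helper of this kind can be very small. The sketch below, with the hypothetical name `format_time`, rounds the elapsed seconds and renders them as h:mm:ss.

```python
import datetime

def format_time(elapsed_seconds):
    """Format a duration given in seconds as an h:mm:ss string."""
    rounded = int(round(elapsed_seconds))
    return str(datetime.timedelta(seconds=rounded))
```

For example, `format_time(3725.4)` returns `"1:02:05"`.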
We will also need to define a few hyperparameters prior to model training. We will be using the AdamW optimizer to update model weights during training. AdamW is a modified version of Adam, a very popular adaptive gradient descent algorithm used for training machine learning models. AdamW was originally proposed in this paper, and it is designed to decouple weight decay from gradient-based updates to the model weights. This improves optimizer regularization and increases the ability of the trained model to generalize.
The learning rate hyperparameter used by the optimizer establishes the magnitude by which the model weights are updated during each pass through the gradient descent algorithm. The higher the learning rate, the larger the adjustment to the model weights during each training pass.
Research has shown that reducing the learning rate gradually over the training process can improve overall model performance as well as reduce training time. There are several methods that can be used to decrease the learning rate, but we will be using a linear step-wise method. To do this, we will be implementing a linear learning rate scheduler. This type of learning rate scheduler begins with a learning rate of 0 and increases it linearly until it reaches a user-defined learning rate after a number of training steps known as the warm-up period. We will need to specify this warm-up period when defining our scheduler. Once the warm-up period has passed, the learning rate scheduler will decrease the learning rate linearly over the remaining training steps until the value reaches 0.
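The schedule described above can be written as a single function of the training step. This is a sketch of the schedule’s shape with hypothetical names and values; the transformers library provides a ready-made scheduler with the same shape, `get_linear_schedule_with_warmup`, which is presumably what a script like this one would use.

```python
def linear_warmup_decay_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warm-up, then linear decay to 0."""
    if step < warmup_steps:
        # Warm-up: rise linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With a base learning rate of 5e-4, 100 warm-up steps, and 1000 total steps, the rate is 0 at step 0, peaks at 5e-4 at step 100, and returns to 0 at step 1000.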
We will also define an epsilon value (eps) for the optimizer. The epsilon value is a very small quantity that is added to the denominator during optimization calculations. Including an epsilon value can help prevent potential divide-by-zero errors that can be encountered when using gradient descent algorithms.
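To show where the learning rate, the decoupled weight decay, and the epsilon value each enter the calculation, here is a sketch of one AdamW update for a single scalar parameter. It follows the standard formulation of the algorithm; the function name and default values are illustrative assumptions, and real training would rely on the optimizer class from PyTorch instead.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a scalar; t is the 1-based step count."""
    # Decoupled weight decay: shrink the weight directly, independent of grad.
    param = param * (1.0 - lr * weight_decay)
    # Running averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # eps keeps the denominator non-zero when v_hat is 0.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

When the gradient is exactly 0 on the first step, the denominator reduces to `eps` alone, so the update is well defined and only the weight decay changes the parameter.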
Finally, we will need to create a text seed that can be used to generate text using our trained model. We will be using our beginning-of-sequence token (<BOS>) as a seed for all generated text.
Now that we have all of our hyperparameters defined, we are ready to load the pre-trained GPT-2 model and begin fine-tuning it using our poem stanza training data. First, we need to load the model configuration using the GPT2Config class provided in Hugging Face’s transformers library. To set up the model configuration, we need to specify the overall vocabulary size of our corpus of text as determined by our tokenizer as well as the maximum sequence length of each text sequence based on the MAX_LEN value defined previously. Finally, we must be sure to load configuration data from the pre-trained GPT-2 model.
Next, we will load the pre-trained GPT-2 language model using the configuration data we just defined. Since we defined 3 special tokens, we need to resize the token embeddings immediately after creating our model. Otherwise we will encounter size mismatch issues since the number of token embeddings used by the pre-trained model will differ from the number used by the tokenizer defined previously.
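A hedged sketch of the loading and resizing steps is shown below. The function name is hypothetical, and unlike the configuration described above, this version leaves the pre-trained configuration unchanged and relies on `resize_token_embeddings` to account for the added special tokens, since changing the vocabulary size before loading would conflict with the shapes of the pre-trained weights.

```python
def load_poe_model(tokenizer, model_name="gpt2"):
    """Load pre-trained GPT-2 and resize its embeddings to the tokenizer."""
    # Imported here so the download-dependent code only runs when called.
    from transformers import GPT2Config, GPT2LMHeadModel

    config = GPT2Config.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name, config=config)
    # The tokenizer gained 3 special tokens, so the embedding matrix must
    # grow from the pre-trained vocabulary size to len(tokenizer).
    model.resize_token_embeddings(len(tokenizer))
    return model
```

The pre-trained GPT-2 vocabulary contains 50,257 tokens, so with three added special tokens the resized embedding matrix has 50,260 rows.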
Now let’s train our model!
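The core of a fine-tuning loop for one epoch might look like the sketch below. It assumes that each batch is an (input IDs, attention mask) pair of tensors, that the model is a Hugging Face causal language model, and that the optimizer and scheduler have already been created; the function name and structure are illustrative rather than the article’s exact code.

```python
def train_one_epoch(model, loader, optimizer, scheduler, device="cpu"):
    """Run one pass over the training batches and return the mean loss."""
    model.train()
    total_loss = 0.0
    for input_ids, attention_mask in loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        optimizer.zero_grad()
        # For causal language modeling the labels are the inputs themselves;
        # the model shifts them internally to predict each next token.
        outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
        loss = outputs[0]          # the loss is the first output when labels are given
        loss.backward()            # compute gradients
        optimizer.step()           # update the weights
        scheduler.step()           # advance the learning rate schedule
        total_loss += loss.item()
    return total_loss / len(loader)
```

In this simple form, padding positions also contribute to the loss. A validation pass would follow the same pattern with the model in evaluation mode and without the backward and step calls.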
Generate Poem Stanzas
Now that the model has been trained, let’s use it to generate some “original” Poe poem stanzas.
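Generation with a fine-tuned Hugging Face model could be sketched as follows, using the beginning-of-sequence token as the seed. The function name and the sampling settings (`top_k`, `top_p`, `max_length`) are assumptions chosen for illustration, not values taken from the project.

```python
def generate_stanzas(model, tokenizer, seed_text="<BOS>", num_stanzas=3,
                     max_length=100, device="cpu"):
    """Sample several stanzas from the model, starting from the seed text."""
    model.eval()
    input_ids = tokenizer.encode(seed_text, return_tensors="pt").to(device)
    outputs = model.generate(
        input_ids,
        do_sample=True,                  # sample instead of greedy decoding
        top_k=50,                        # keep only the 50 most likely tokens
        top_p=0.95,                      # nucleus sampling threshold
        max_length=max_length,
        num_return_sequences=num_stanzas,
        pad_token_id=tokenizer.pad_token_id,
    )
    # Drop the special tokens so only the stanza text remains.
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs]
```

Because sampling is random, each call produces different stanzas unless the random seed is fixed beforehand.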
Here’s an example of some poem stanzas generated by the model:
Pretty good, but not perfect. However, Poe’s style is definitely detectable in many of the generated lines, as well as in the general stanza formats that have been generated. Keep in mind we are using the smallest version of GPT-2 available, which has fewer than a tenth of the parameters of the full-scale (XL) version. Also, this model was developed simply to illustrate the power of GPT-2 for novel text generation. There is still plenty of room for improvement and optimization that could yield even more convincing Poe poems.
Thanks for Reading!
Be sure to check out the code for this project in its GitHub repository. Also, make sure to give this article some claps if you found it interesting or helpful!