Textual Style Transfer with GPT-2

Group 8 - Michael Downs, Cameron Hickert, Wen Rui Liau, Wisoo Song

Description of problem and the need for HPC and/or Big Data

Description of solution and comparison with existing work on the problem

Description of your model and/or data in detail

Technical description of the parallel application, programming models, platform and infrastructure

Links to repository with source code, evaluation data sets and test cases

Technical description of the software design, code baseline, dependencies, how to use the code, and system and environment needed to reproduce your tests

Performance evaluation (speed-up, throughput, weak and strong scaling) and discussion about overheads and optimizations done

Description of advanced features

Final discussion about goals achieved, improvements suggested, lessons learnt, future work, interesting insights

Citations

Problem

In our project, we explore the problem of textual style transfer with GPT-2. Two facets of any written work are style and content. Content pertains to the events or ideas being described, whilst style pertains to the manner in which they are described. Different authors and mediums have different styles. Textual style transfer aims to modify the style of one work to mimic the style of another while keeping the content relevant to the original. We use the GPT-2 model developed by OpenAI to perform textual style transfer on large and diverse datasets. As the field of textual style transfer is still nascent, our approach is naive and has no formal mechanism to preserve the content of the original text.

Need for HPC and Big Data

There is a need for big compute in our project: training the GPT-2 model for textual style transfer requires substantial compute resources. GPT-2 comes in various sizes, all of which have a significant number of parameters to train. The "small" model that we are working with has 117 million parameters, and our rough estimate is that we require 6 hours on a single GPU to get reasonable (conditional) initial results using 7MB of text and 33k iterations. HPC will allow us to speed this up.

On the Big Data front, the original GPT-2 model was trained on 40GB of internet text (the WebText dataset). Other common textual datasets include WikiText, which contains 100 million tokens from verified “good” and “featured” Wikipedia articles, as well as Treebank-2, which contains 1 million tokens from WSJ articles. In our project, we did not require these "Big Data" datasets and stayed within the realm of gigabytes with our chosen datasets.

OpenAI Existing Work

In scoping the existing work on this problem, we refer to the paper and associated codebase written by OpenAI linked here. OpenAI provides the GPT-2 model architecture, which they tested on various Natural Language Processing (NLP) tasks. In their paper, they outline their efforts in using the model for zero-shot reading comprehension, summarization, translation and question answering. However, they do not report on the effectiveness of the model for textual style transfer.

Our Solution

In our project, we use the original GPT-2 117-million-parameter model to perform textual style transfer from a source text to a target text. Our solution involves re-training (fine-tuning) this large model on the source text. We also train the parameters in parallel, which we elaborate on in the technical descriptions below. Overall, we find that our application of GPT-2 is largely novel compared to existing work in the field and led to interesting findings, which we describe in the sections below.

Description of Data

For our source textual data, we used the Zothique stories obtained from this link and Project Gutenberg books obtained using this repository. The raw Project Gutenberg corpus is almost 9GB compressed and requires processing. We used the super_cleaner functionality from this repository to strip out all of the Project Gutenberg boilerplate. The pgcorpus repository also generates a metadata file, which we used to separate the Project Gutenberg books into the genres: juvenile, 19th century, science fiction, adventure, fairy tales, and poetry. Of those, we only used the first three.

Description of Model

The core model we used for textual style transfer is GPT-2. GPT-2 is a transformer-based neural network for unsupervised multitask learning. In contrast to other transformers, this model consists of decoder blocks only (as opposed to a combination of encoders and decoders); the model architecture is shown below. In the original paper by OpenAI, this model is able to perform zero-shot reading comprehension, summarization, translation and question answering with varying degrees of success.

Parallel Application

Our parallel application exploits several levels of parallelism, including many-node and many-core parallelism: we used multiple GPUs on a single node as well as multiple single-GPU nodes. There are two main areas that can be parallelized: the training phase and the generation phase.

We chose to parallelize the training phase of our project: the transformer computations and gradient updates in GPT-2 fine-tuning will be parallelized.

Programming Model

In deciding on the proper model for data parallelism, we chose the ring all-reduce algorithm via Horovod, which utilizes Open MPI. Compared to the alternative of distributed TensorFlow's worker/parameter-server model, Horovod's ring all-reduce algorithm is superior in terms of reducing the network bandwidth bottleneck, because it does not need a parameter server. An overview of this parallel algorithm is shown below.
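To make the ring all-reduce idea concrete, the following is a minimal single-process simulation in plain Python. It is only an illustrative sketch, not Horovod's implementation: the function name `ring_allreduce` and the list-of-chunks data layout are assumptions made for this example, and real implementations overlap the sends over the network.

```python
def ring_allreduce(worker_chunks):
    """Simulate a sum ring all-reduce over n workers.

    worker_chunks[i][c] is chunk c (a list of floats) held by worker i.
    Every worker's gradient is assumed to be pre-split into exactly n chunks.
    Returns the data held by each worker after the reduction.
    """
    n = len(worker_chunks)
    data = [[list(chunk) for chunk in worker] for worker in worker_chunks]

    # Phase 1: scatter-reduce. In step s, worker i passes chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends in a step are simultaneous.
        sends = [(i, (i - step) % n, list(data[i][(i - step) % n])) for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]

    # After phase 1, worker i holds the fully summed chunk (i + 1) mod n.
    # Phase 2: all-gather. The finished chunks travel once more around the ring,
    # overwriting the partial sums on every other worker.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(data[i][(i + 1 - step) % n])) for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            data[dst][c] = payload
    return data


# Three workers, each holding a 3-chunk "gradient".
grads = [
    [[1.0], [2.0], [3.0]],
    [[10.0], [20.0], [30.0]],
    [[100.0], [200.0], [300.0]],
]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Each worker only ever talks to its right neighbour and sends one chunk per step, which is why no single node becomes a bandwidth bottleneck.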

On the other hand, we also explored the possibility of model parallelism by exploiting the transformer architecture. A key property of GPT-2 is that in the decoder blocks, the input word in each position flows through its own path. Up to the self-attention layer, there exist positional dependencies between embedded word vectors, but the inputs to the feed-forward layer do not have those dependencies. Thus, each of the paths that the attention-weighted vectors take can be executed in parallel while flowing through the feed-forward layer and the projection layer. It follows from this observation that we define ‘c’ in Amdahl’s law to be the ratio of the number of parameters in the feed-forward network of each decoder block, multiplied by the number of decoder blocks, to the total number of parameters in GPT-2. In the case of the smallest GPT-2 model, we have 28.3 million parameters that can be updated in parallel. This is about 23% of the total number of parameters in GPT-2. Even more promising is the fact that the bigger versions of GPT-2 differ mostly in the number of decoder blocks, meaning the parallelizable portion only increases with bigger models.
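To see what a parallel fraction of about 23% implies, the short sketch below evaluates Amdahl's law for a few worker counts. It assumes, as the definition above does, that the share of parameters is a reasonable proxy for the share of runtime; the function name `amdahl_speedup` is chosen for this example only.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


C = 0.23  # parallelizable share quoted in the text for the smallest GPT-2

for n in (2, 4, 8):
    print(n, "workers ->", round(amdahl_speedup(C, n), 3))

# Limit as the number of workers grows without bound.
limit = 1.0 / (1.0 - C)
print("asymptotic limit ->", round(limit, 3))
```

Under these assumptions the bound is roughly 1.21x with four workers and can never exceed roughly 1.30x, which gives a sense of how much the model-parallel portion alone could contribute.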

Platform and Infrastructure

Our language of choice is Python, which gives us access to a wide range of libraries for our project. Our machine learning frameworks are TensorFlow and Keras, together with Horovod as mentioned above. Our operating system image is the Deep Learning Amazon Machine Image (Ubuntu), which comes with TensorFlow and Horovod pre-configured. Our infrastructure of choice is Amazon's Elastic Compute Cloud (EC2). Originally, we planned to use Harvard's FAS Cannon supercomputer as well, but we had issues with the parallel implementation, which we outline later. For the single-node training on AWS, we used the g4dn.12xlarge EC2 instance. This instance has four GPUs, each with 16GB of memory. For the multi-node training, we used four g4dn.xlarge instances, each of which has one GPU with 16GB of memory. To do so, we established a virtual private cloud (VPC) using the process outlined in Lab I7. In line with the lab, we configured an NFS (Network File System) so that the master could export a directory which the clients mount in order to exchange data. Since the instances we employed are all members of AWS’ G4 family, the GPUs they use are NVIDIA T4 Tensor Core GPUs. These decisions were motivated by a need for increased GPU memory on the one hand, and AWS’ vCPU limits on the other.

Source Code

Our entire codebase for this project is hosted on our team's GitHub repository linked here. In that repository, we have also included a folder titled results, which contains our outputs from training our GPT-2 model on a variety of node configurations on AWS. These outputs allow us to better understand the accuracy and scaling behaviour of GPT-2 training, which we present in the plots below. We have also included several style-transferred texts generated from our trained GPT-2 model, located under the style_transferred folder. The target texts for style transfer are a Wikipedia article on MC Escher, a news article on hurricanes in the Gulf of Mexico, and a short story by Jorge Luis Borges. The filenames follow the naming convention style_numepochs_target_window_step. For example: 19th_1200_escher_9_4. These style-transferred texts are how we qualitatively evaluated the effectiveness of our model in performing textual style transfer. As an example, here is an excerpt taken from juv_900_hurricance_9_4.

May is the country in which most of these men in America travel in the late-season and in a level sense of the country, so strong an effort to move across to reinforce new shipping during the autumn will appear with gaining strength at extreme hours when both men are already together. Thus far only, that time before the 1915 sick-leave of the African coast of Mexico has been fostering with the passage between San Diego and the Rio Guayana Warinar region.

Excerpt from juv_900_hurricance_9_4

To see the effect of the source text on style transfer, here is an excerpt taken from the style-transferred generated text in zothique_500_hurricane_9_4.

The Black was said to him; the sea foisted on its interior shores, and barring the desert of the Silver Death, the vessels entered the flooded cavernous haven in the flooded baskamber, turning their camels to foam on the beach. Then, with unbroken arches, the winds thinned out, the highest isle to the west, and the shore of the haven was a wilderness of green trees and grass. Then, with many crags and deep-shelved recesses, it entered the haven, following the orangey paths of gentle waters to the south, and to the east, it entered the haven for many miles, following that orange shore which was the white mistress of the orient sea. Forgetting the haven's fragments that had come upon them, the Mior Lumivix peered at his littered shoulders, longing for long vermilion in the sanguine boughs.

Excerpt from zothique_500_hurricane_9_4

The reader will notice a clear narrative difference between the two samples, despite both being generated using the same base hurricane article. For example, the “juvenile” sample uses simpler sentence structures, shorter words and a more matter-of-fact tone. On the other hand, the “Zothique” sample is replete with fantastical, ominous descriptions, more complex structures, and a more inventive style overall.

During our training phase, we time the training of GPT-2 on two different source texts, from Eldritchdark and Project Gutenberg: the Zothique text and the Juvenile text. The Zothique text is the smallest source text we used (201KB zipped), whilst the Juvenile text from Project Gutenberg is the largest (484MB zipped). Using this spectrum of sizes allows us to better appreciate the performance of our model outlined in Section 7 below.

Software Design

Data Parallelism

Since data parallelism is explained in detail in section 4 and section 8, and owes a lot to the implementation of the Horovod library, we elaborate here on the software design relevant to model parallelism.

Model Parallelism

Model parallelism refers to the parallelization paradigm that uses the same data for every thread, but splits the model among threads. As mentioned briefly in section 4, the multilayer perceptrons (feed-forward layers) and the projection layers can be extracted from each of the 12 decoder blocks and distributed evenly among processes/nodes. Next, these layers can be applied independently to the attention-weighted matrix chunks to process them in parallel. Following this method, after the initialization of the full architecture and its distribution, the master process is only responsible for embedding the corpus matrix, layer normalization and gathering the processed vectors. To illustrate the architecture, we provide the following image.

The red box indicates the weight and bias parameters contained in the first decoder block. 'c_fc' stands for the perceptron layer, while 'c_proj' stands for the projection layer. Both of these can be distributed as a package to exploit model parallelism. The below code section demonstrates how we implemented this idea.

From line 56 to line 61, the master process serializes the decoder blocks' MultiLayerPerceptron instances. Then, in line 64, the serialized MLPs are broadcast to all the nodes using OpenMPI (mpi4py).

The worker nodes selectively reconstruct the MLP portion of the decoder blocks and receive attention-weighted matrix chunks from the master node. The workers execute forward runs and project the results back to the embedding dimensions. Finally, they report the results back to the master process. More details can be found in /src/parallel_test.py.
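The reason the feed-forward portion can be farmed out is that it acts on each position independently. The sketch below illustrates this with a tiny pure-Python position-wise MLP: processing row chunks separately and concatenating them gives the same result as processing all rows at once. It is a simplified stand-in rather than the code in /src/parallel_test.py; the helper names, the toy dimensions and the random weights are assumptions for illustration, and the "workers" are simulated in a single process instead of using mpi4py.

```python
import math
import random


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(vec, weight, bias):
    # vec: [d_in], weight: [d_in][d_out], bias: [d_out]
    return [sum(v * weight[i][j] for i, v in enumerate(vec)) + bias[j]
            for j in range(len(bias))]


def mlp_block(rows, params):
    """Apply c_fc -> GELU -> c_proj to every row (position) independently."""
    w_fc, b_fc, w_proj, b_proj = params
    out = []
    for row in rows:
        hidden = [gelu(h) for h in linear(row, w_fc, b_fc)]
        out.append(linear(hidden, w_proj, b_proj))
    return out


def split_rows(rows, n_workers):
    # Contiguous, near-even chunks of positions, one per simulated worker.
    size = max(1, math.ceil(len(rows) / n_workers))
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_mlp(rows, params, n_workers):
    # Each chunk could be sent to a different worker; here they run in sequence.
    results = [mlp_block(chunk, params) for chunk in split_rows(rows, n_workers)]
    return [row for chunk in results for row in chunk]  # gather step


random.seed(0)
d_model, d_ff, n_positions = 4, 16, 6  # toy sizes, far smaller than GPT-2


def rand_matrix(n_rows, n_cols):
    return [[random.uniform(-1, 1) for _ in range(n_cols)] for _ in range(n_rows)]


params = (rand_matrix(d_model, d_ff), [0.0] * d_ff,
          rand_matrix(d_ff, d_model), [0.0] * d_model)
rows = rand_matrix(n_positions, d_model)

full = mlp_block(rows, params)
sharded = parallel_mlp(rows, params, n_workers=3)
print(full == sharded)
```

Because no row depends on any other row inside the MLP, the only communication needed is scattering the chunks and gathering the results.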

Dependencies

Python

python=3.7
numpy
tensorflow==1.15
keras
horovod
re
mpi4py
nltk
json
pickle
fire==0.3.1
gpt-2-simple==0.7.1
pandas==1.0.3
pytz==2019.3
regex==2020.4.4
tqdm==4.45.0
requests==2.21.0
toposort==1.5
tensorflow-gpu==1.15.2

System and Environment

AWS instance: The g4dn.12xlarge

Operating System / Image: Deep Learning Amazon Machine Image (Ubuntu)

Usage Instructions

The following are the usage instructions for parallel training on a single-node and multi-nodes:

Single-node:

Launch a g4dn.12xlarge instance (which has 4 GPUs) using AWS’ EC2 service. When doing so, launch the instance using the Ubuntu 16.04 Deep Learning Amazon Machine Image (DLAMI)
Once the instance has launched, clone the “singlenode_parallelism” folder from the project repo into the newly-created instance
Activate the "tensorflow_p36" Conda environment that comes with the DLAMI; this should be as simple as running the command `source activate tensorflow_p36`
Install the necessary packages using the command pip install -r requirements.txt
Download the model by running the command python download_model.py 117M in the same directory
Now you are ready to encode your dataset. Navigate to the src directory, and make sure you have your raw text data in .txt files in a folder in the src directory. To generate the training data, run the command python encode.py foldername foldername.npz where “foldername” is the name of the folder where you have stored your .txt files of retraining data
To train the model (still in the src directory), run the command below, replacing the "4" in -np 4 -H localhost:4 with the number of GPUs you want to use, replacing the dataset with your dataset, and substituting your own run name for the --run_name flag. The fresh argument here ensures that each time we retrain, we start from the same base GPT-2 model. You may also alter the flags associated with saving and sampling as you like.
```
mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib python train-horovod.py --dataset foldername.npz --restore_from fresh --run_name my_name --sample_every 100 --save_every 100
```
This should commence the training and output the epochs and loss to your terminal. It will save checkpoints at the intervals you have specified, and also save a checkpoint when you stop training using Ctrl+C. You can use these saved models to generate text, as is specified further below on this webpage. Be sure to stop your AWS EC2 instance once you are finished to avoid unwanted charges.

Multi-node (training the 117-million-parameter GPT-2 model with multi-node parallelism):

Launch four g4dn.xlarge instances (each has one GPU) using AWS’ EC2 service. When doing so, launch the instances using the Ubuntu 16.04 Deep Learning Amazon Machine Image (DLAMI).
Configure them in a virtual private cloud (VPC) using the instructions outlined in Lab I7. For our tests (and due to AWS vCPU limits), we used a single master node and three worker nodes.
Once you have launched the instances and configured the VPC, clone the multinode_parallelism folder from the project repo into the NFS (corresponding to “cloud” in the lab) to which the master and all nodes have access.
On each instance, activate the "tensorflow_p36" Conda environment that comes with the DLAMI and install the necessary packages using the command pip install -r requirements.txt
Follow the procedure as outlined above to download the model and generate the .npz encoding of your raw .txt training data.
Navigate to the directory that simply contains the src directory – in this case, you do not have to enter the src directory, as we have copied the relevant files up one level.
Change the path argument given to the “--dataset” flag in the train_horovod.sh file to the absolute path to your .npz training file. You may also change the other options as mentioned in the single-node parallelism case, if you wish.
Finally, enter the command ./train_horovod.sh 4 to train the model in parallel on the four GPU nodes. To train on a different number of nodes, change the "4" in this command to the number of nodes on which you wish to train (up to the number of nodes you have configured). This should commence the training and output the epochs and loss to your terminal. It will save checkpoints at the intervals you have specified, and also save a checkpoint when you stop training using Ctrl+C. You can use these saved models to generate text, as is specified further below on this webpage. Be sure to stop your AWS EC2 instances once you are finished to avoid unwanted charges.

The following are the usage instructions for text generation:

Given the metadata file obtained from pgcorpus, to obtain the distribution of subjects: python ./bin/parse_metadata.py generate_subjects --infile "./data/metadata.csv.gz" --outfile "./data/subjects.csv"
Given the metadata file and a subject, to get the works that correspond to that subject (for example, juvenile): python ./bin/parse_metadata.py get_subject_records --search_phrase "juvenile" --infile "./data/metadata.csv.gz" --outfile "./data/juvenile.csv"
Given the zipped output of pgcorpus and a csv file derived from metadata.csv (see juvenile.csv above), obtain a compressed file consisting only of those works: python ./bin/make_gutenberg_subset.py --genre_file "./data/juvenile.csv" --pg_raw "./data/project_gutenberg_raw.zip" --out_zip "./data/juvenile.zip"
To separate the sentences in a target text file into newlines: python ./bin/generate_prompts.py --infile "./data/target_texts/the_circular_ruins.txt" --outfile "./data/target_texts/the_circular_ruins_lines.txt"
Given a pre-trained GPT-2, generate text using the sliding window approach: python ./bin/do_style_transfer.py --output_file ./output/juv_1500_cr_9_4 --checkpoint_dir /home/msdowns/cs205_project/data/checkpoint/juvenile_gen1/1500/ --prompts_file "./data/target_texts/the_circular_ruins_lines.txt"

Text Generation

This is the process of generating new text from the corpus of data outlined in the sections above. The algorithm for generating text takes as input a pre-trained GPT-2 and a text file. The text file is chunked into contiguous groups of sentences, which are used as prompts for GPT-2. The contiguous group is then shifted by some step size. For this project, the chunk/window size was 9 sentences and the step size was 4. Four sentences were generated at a time using GPT-2. We used nltk as the sentence tokenizer. A diagram showing this algorithm is shown below:
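The sliding-window prompting described above can be sketched as follows. This is an illustrative reconstruction, not the code in do_style_transfer.py: the function names are chosen for this example, the input is assumed to be already split into sentences (for instance with nltk), `generate` is a hypothetical callable standing in for GPT-2, and how the real script treats sentences left over at the end of the file is not specified here.

```python
def sliding_windows(sentences, window=9, step=4):
    """Return contiguous groups of `window` sentences, advancing by `step`."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = max(len(sentences) - window, 0)
    return [sentences[start:start + window]
            for start in range(0, last_start + 1, step)]


def style_transfer(sentences, generate, window=9, step=4):
    """Use each window as a prompt and join whatever `generate` returns."""
    pieces = []
    for group in sliding_windows(sentences, window, step):
        prompt = " ".join(group)
        pieces.append(generate(prompt))
    return " ".join(pieces)


def echo_last_sentence(prompt):
    # Placeholder "model" for demonstration: returns the prompt's last sentence.
    return prompt.split(". ")[-1]


sentences = [f"s{i}." for i in range(20)]
for group in sliding_windows(sentences):
    print(group[0], "...", group[-1])
print(style_transfer(sentences, echo_last_sentence))
```

With a window of 9 and a step of 4, consecutive prompts overlap by five sentences, so each generated passage is conditioned on context that partly repeats the previous prompt.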

Speedup/Scaling

As mentioned above, we ran our parallel model on an AWS EC2 g4dn.12xlarge instance. This instance has 4 NVIDIA T4 Tensor Core GPUs, which allowed us to test and exploit parallelism in our code. As a baseline, we first ran our code on a single node and increased the number of processes on that node. Afterwards, we plot the multi-node parallel results of our code. Instead of varying the number of processes on each node, we now vary the number of nodes for our training phase. As the means of comparison, we plot loss against time taken for the training phase of our code. This gives us an idea of how quickly we are able to achieve a desired level of accuracy with our different experimental setups. We present our plots for this on Zothique and Juvenile below:

We also include a table to highlight the speed-up results from our experiments on both a single-node and multi-node system on the Juvenile dataset.
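One way to turn loss-versus-time logs into speed-up figures is to measure the time needed to reach a fixed target loss and compare it across configurations. The sketch below shows that calculation; the function names are chosen for this example, and the log values in it are made-up placeholders for illustration, not our measurements.

```python
def time_to_loss(log, target):
    """Return the first elapsed time at which loss <= target, or None.

    `log` is a list of (elapsed_seconds, loss) pairs in time order.
    """
    for elapsed, loss in log:
        if loss <= target:
            return elapsed
    return None


def speedup_and_efficiency(t_serial, t_parallel, workers):
    """Speed-up is T1 / Tp; efficiency is speed-up divided by worker count."""
    speedup = t_serial / t_parallel
    return speedup, speedup / workers


# Placeholder logs (NOT measured data): (seconds, training loss).
log_1_gpu = [(0, 4.0), (200, 3.4), (400, 2.9)]
log_4_gpu = [(0, 4.0), (80, 3.3), (160, 2.9)]

target = 3.0
t1 = time_to_loss(log_1_gpu, target)
t4 = time_to_loss(log_4_gpu, target)
print(speedup_and_efficiency(t1, t4, workers=4))
```

Comparing time to a common loss target, rather than time per step, accounts for the fact that different worker counts change the effective batch size and therefore the loss trajectory.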

Discussion of Overheads and Optimizations

By varying both the problem size and the number of nodes, we have shown the weak and strong scaling of our parallel application. As expected, running our code in parallel allowed us to speed up training, reaching lower training errors in a shorter period of time. Interestingly, our multi-node case is slower than our single-node case for the same overall number of processes. Comparing the two plots, there is almost a 2x slowdown in the time needed to train on the Juvenile and Zothique datasets when 4 nodes/processes are involved (to achieve similar training accuracies). This implies that there are significant overheads in our multi-node experiment compared to our single-node experiment. Upon further investigation, we found that the communication costs of our application are immense. As supported by this article here, data parallelism does not yield good results when a large amount of data needs to be passed between the nodes. During each backward pass of our training phase, every GPU needs to share all of its gradient values with all other GPUs. With 117 million parameters to train in our smallest example of GPT-2, this represents a huge communication overhead, which offsets the computational benefit that we gained from scaling the number of processors. Theoretically, there are optimizations that we could have pursued to reduce communication costs in our project. We could have reduced the number of parameters with layers such as max pooling or convolution layers. However, that would alter the fundamental structure of GPT-2 that was empirically tested by OpenAI. Moreover, with the initial 117 million parameters to train, it was not feasible to decide which layers to modify in the timeframe that we had.
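A back-of-the-envelope calculation shows why the gradient exchange is so costly. The sketch below assumes 32-bit floating-point gradients with no compression, and the link bandwidth is a purely illustrative parameter rather than a measured figure for our instances.

```python
def gradient_payload_bytes(n_params, bytes_per_value=4):
    """Size of one full set of gradients (4 bytes per value assumes float32)."""
    return n_params * bytes_per_value


def transfer_seconds(n_bytes, link_gbit_per_s):
    """Ideal time to move n_bytes once over a link, ignoring latency."""
    return n_bytes * 8 / (link_gbit_per_s * 1e9)


payload = gradient_payload_bytes(117_000_000)
print(payload / 1e6, "MB of gradients per training step")

# Hypothetical 10 Gbit/s link, chosen only to illustrate the order of magnitude.
print(transfer_seconds(payload, link_gbit_per_s=10), "seconds per full copy")
```

Under these assumptions each step has to move on the order of 468 MB of gradients, so even a single ideal copy over a 10 Gbit/s link would take a few tenths of a second, which is significant when repeated at every training step.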

Horovod + Ring all-reduce algorithm

The main advanced feature in our parallel programming model is our use of Horovod and the ring all-reduce algorithm. This algorithm was not discussed in class, but we found it to have great potential for speeding up parallel applications. Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. It uses MPI for communication between processes and was developed at Uber Engineering. In class, we mostly explored master-worker models of all-reduce, as commonly seen in vanilla uses of OpenMPI. However, Horovod and the ring all-reduce algorithm have advantages over this vanilla implementation. As part of our exploration of this model, we also performed an asymptotic analysis of these two algorithms and found that the ring all-reduce algorithm was superior in terms of preventing the network bandwidth bottleneck. This is outlined in the image below:
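A simplified version of such a comparison can be written down under a few assumptions. This is an illustrative cost model, not necessarily the exact analysis in the figure: there are N workers, each holding a gradient of S bytes; latency is ignored; and we count only the bytes crossing the busiest link per training step.

```latex
\begin{align*}
\text{master-worker:}\quad
  & B_{\text{master}}
    = \underbrace{(N-1)\,S}_{\text{gather}}
    + \underbrace{(N-1)\,S}_{\text{broadcast}}
    = 2(N-1)\,S = O(NS) \\
\text{ring all-reduce:}\quad
  & B_{\text{worker}}
    = \underbrace{(N-1)\,\tfrac{S}{N}}_{\text{scatter-reduce}}
    + \underbrace{(N-1)\,\tfrac{S}{N}}_{\text{all-gather}}
    = \tfrac{2(N-1)}{N}\,S < 2S
\end{align*}
```

In this model the traffic through the master grows linearly with the number of workers, whereas the traffic per worker in the ring stays below 2S regardless of N.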

Adopting this advanced feature was a challenging task for our team, as there was less support and documentation available compared to vanilla solutions, including fewer academic papers and tools. This became especially relevant during the scaling phase of our project, as we were aware from internal benchmarking tools and peer-reviewed papers that Horovod’s communication overhead increases significantly as the number of nodes increases (especially with a model as large as ours).

Assessing Textual Style Transfer

Another advanced feature that we explored was the possibility of incorporating quality metrics in our final project. Quality metrics would provide us with a structured methodology for assessing the effectiveness of our textual style transfer from a source to a target text. We explored different approaches to this from academic papers, such as one from the MIT Media Lab linked here. This assessment would have been based on 3 facets of evaluation, namely:

Style transfer intensity: How different are the texts?
Content Preservation: How similar is the content?
Naturalness: How natural is the output?

After further investigation into this, we chose to adopt a qualitative approach to assessing style transfer instead. Reimplementing the textual style transfer metrics from the MIT academic paper would have involved a lot of extra effort and we wanted our overall project to be mostly focused on the computational aspect rather than the Natural Language Processing aspect. The metrics would have involved more data conditioning and training+debugging at least 3 additional auxiliary models on top of our existing GPT-2 model. A qualitative approach to assessing style transfer based on a 1-5 scale allowed us to save time whilst giving us a general idea of how our models are performing.

Goals Achieved

In our project, we have achieved most of the goals that we set out to achieve. We ran our parallel application for textual style transfer on both a single-node and a multi-node system. We have also investigated how the model's training scales with different numbers of nodes and processes. This led us to interesting findings about the disadvantages of Horovod in data-intensive applications. In this project, we have also shown the use of the novel parallel programming model of Horovod's ring all-reduce algorithm on an AWS EC2 cluster.

Lessons learnt/interesting insights/challenges with running Horovod on Harvard Cannon

Our original plan was to use Harvard’s Cannon research computing cluster to run the training. We were able to implement the serialized training of the GPT2 language model on the Cannon cluster, but the Horovod parallelization was significantly more difficult. The team spent a considerable amount of time pursuing this option, but ultimately opted to utilize AWS compute resources. The error messages were slightly cryptic, and after we realized the segmentation fault on the AWS implementation was a simple Out-Of-Memory case that could be cured by boosting RAM, we attempted to boost the memory available to the job we were running on Cannon’s GPU partition. However, the issue persisted. Eventually, it appeared that the source of the segmentation fault lay in the specific versions of OpenMPI and GCC that we were installing via the modules ready for easy installation on the cluster. Horovod relies on GCC version 4.9 and either version 3.1.2 or version 4.0.0 of OpenMPI. Unfortunately, installing the correct OpenMPI module required an upgrade to a more recent version of GCC. Thus, the fault persisted. One potential path to deal with the situation would be to manually install the correct versions of GCC and OpenMPI from source into our home directory. We pursued this course in parallel with our implementation on AWS. Eventually, we were successful in the AWS implementation and found that the speedup produced allowed us to train the model in a reasonable amount of wall-clock time. Given this success in combination with ongoing obstacles in the manual installation of the correct OpenMPI and GCC versions (and the recognition that the bug perhaps would persist regardless of the new installation), we opted for the AWS path. 
Additionally, we noted that the times per training step for our serial training implementations on the Cannon and AWS systems were comparable, leading us to conclude that investing the additional hours to re-implement the parallelized Horovod training on the Cannon cluster would not be necessary.

Future Work

There are several areas of future work that we propose for our project. Given more time, we can use the larger GPT-2 models as well as experiment with the vanilla all-reduce algorithm on MPI. On an ambitious level, we can also modify the core GPT-2 model from OpenAI to be more suited for this task of textual style transfer. We would also like to look further into running the parallelized Horovod model training on Harvard’s Cannon cluster, which would save money and allow us to scale further. Lastly, we can also frame the textual style transfer problem more rigorously and use the metrics we originally proposed from the MIT Media Lab academic paper.

Citations

MPI on AWS - Harvard CS205 - Spring 2020 - Infrastructure Guide - I7
Blog Post by OpenAI on GPT-2 and language models
Associated repository for the paper on GPT-2
Orange Erotic Bible repository for initial project inspiration (Warning: this repository is potentially offensive)
Article by Tim Dettmers on parallelizing deep learning on GPUs
Paper by MIT Media Lab on assessing textual style transfer.
Project Gutenberg text used as a source text
EldritchDark text used as a source text
Cleaner functionality repository used to clean the source text
GPT-2 in Keras and Tensorflow2.0 modified and used to develop /src/parallel_test.py
Illustrated GPT-2 detailed analysis on GPT-2 structure and photos used for illustration
