CS360 Lab 8: Transformers

Due: Thursday, April 18 at 11:59pm


Overview

The goals of this week’s lab:

Note that there is a check-in for this lab. By Tuesday, April 16, you should be finished with the RNN and the transformer tutorial.


Introduction

To get started, accept the GitHub repo for this lab assignment. Partners are recommended for this lab but not required.

You will have or create the following files:

Unfortunately, for this lab we’ll need to use a more recent version of Python and TensorFlow. In your .bashrc file, comment out the lines for python3.7.7 and add the lines below for python3.10.10 (log out and log back in for the change to take effect).

export PATH=/packages/python3.10.10/bin:$PATH
export LD_LIBRARY_PATH=/packages/python3.10.10/lib:/usr/local/TensorRT-7.2.3.4/targets/x86_64-linux-gnu/lib/:/usr/local/cuda-11.1/targets/x86_64-linux/lib

Part 1: Data Pre-processing

We will start by using a text file that contains all the works of Shakespeare:

/homes/smathieson/Public/cs360/shakespeare/input.txt

The following steps will help you prepare the data for input into either the RNN model or the transformer model. Put these in util.py.

  1. Read in the entire file into a single string (and convert to lowercase).

  2. Create a set of all the characters in the file. Hint:

>>> data = "to be or not to be"
>>> set(data)
{'t', 'b', ' ', 'r', 'n', 'e', 'o'}

I recommend converting the set of characters to a list and then sorting it. You should get 39 characters total; this is the vocab_size.

  3. Create mappings (i.e., dictionaries) between each character and an integer (encoding) and vice versa (decoding); see the sketch after this list.

  4. Encode the entire dataset. This should produce a list of integers of length 1,115,394.

  5. Choose a window size and use the provided to_dataset function to create TensorFlow batches with this window size (the length argument).

  6. Split the data into train (the first 80% of the data), validation (the next 10%), and test (the last 10%) sets. Shuffle the training data but not the validation or test data. You may want to do this part in rnn_gen.py.
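
For reference, here is a minimal sketch of steps 1-4 (the names char2idx, idx2char, and encoded are illustrative assumptions, not required names):

# step 1: read the entire file into one lowercase string
with open("/homes/smathieson/Public/cs360/shakespeare/input.txt") as f:
    data = f.read().lower()

# step 2: sorted list of the unique characters (the vocabulary)
vocab = sorted(set(data))
vocab_size = len(vocab)    # should be 39

# step 3: mappings between characters and integers
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = {i: ch for i, ch in enumerate(vocab)}

# step 4: encode the entire dataset (a list of 1,115,394 integers)
encoded = [char2idx[ch] for ch in data]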


Part 2: Recurrent Neural Network

Next we will create an RNN model using gated recurrent units (GRUs), in the file rnn_gen.py. A template for this is provided in the starter code, and your task is to fill in the TODOs and make sure the data is getting passed in correctly.

Here we will use the Sequential model instead of our own custom models. See the Sequential documentation for more details. The model will consist of:

We will then compile and fit the model, using training data and validation data (but not yet test data). Think about which loss function you should use, as well as which optimizer.
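
As a point of reference, a minimal Sequential model for this task might look like the sketch below. The specific layers (Embedding, GRU, Dense), sizes, loss, and optimizer shown here are assumptions for illustration; follow the layer list above and the starter code for the actual architecture. Here train_set and valid_set stand in for the batched datasets built in Part 1.

import tensorflow as tf

model = tf.keras.Sequential([
    # map each character index to a dense vector
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=16),
    # return_sequences=True so the model predicts the next character at every position
    tf.keras.layers.GRU(128, return_sequences=True),
    # one probability per vocabulary character at each position
    tf.keras.layers.Dense(vocab_size, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])

history = model.fit(train_set, validation_data=valid_set, epochs=10)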

Train your model for 10 epochs on the GPU machines and report your accuracy on the test data (in terms of predicting the next character correctly given the preceding test-data context). Finally, given some context (i.e., a prompt), call your model to produce new text, one character at a time, for at least 100 characters. You can choose the most likely character each time (implementing a temperature to sample from the softmax distribution is optional).
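
One way to structure the generation loop is sketched below; the helper name generate_text and the greedy argmax choice are assumptions (swap in temperature-based sampling if you implement it), and char2idx/idx2char are the mappings from Part 1.

import numpy as np

def generate_text(model, prompt, n_chars=100):
    # encode the prompt as a list of integers
    context = [char2idx[ch] for ch in prompt.lower()]
    for _ in range(n_chars):
        # add a batch dimension: shape (1, current length)
        probas = model.predict(np.array([context]), verbose=0)
        # distribution over the last time step; take the most likely character
        next_idx = int(np.argmax(probas[0, -1]))
        context.append(next_idx)
    return "".join(idx2char[i] for i in context)

print(generate_text(model, "to be or not to be", n_chars=100))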

Include a sample of your generated text in your README.md. You may need to experiment with the window length and other hyperparameters.


Part 3: Transformers

For this part we will first walk through a tutorial about how to create a transformer model for machine translation, then adapt the code for text generation. To obtain the data (English and Spanish sentence pairs), run the following code:

import tensorflow as tf
from pathlib import Path

url = "https://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
path = tf.keras.utils.get_file("spa-eng.zip", origin=url, cache_dir="datasets", extract=True)
text = (Path(path).with_name("spa-eng") / "spa.txt").read_text()

In the file transformer_gen.py, follow the section in the textbook (Géron) starting from “Attention is all you need” in Chapter 16 (page 609). Use non-trainable positional encodings. The final part of this step should be fitting the model with this code from page 619:

Y_proba = tf.keras.layers.Dense(vocab_size, activation="softmax")(Z)
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
    outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
    metrics=["accuracy"])
model.fit((X_train, X_train_dec), Y_train, epochs=10, 
    validation_data=((X_valid, X_valid_dec), Y_valid))
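
For the non-trainable positional encodings mentioned above, one common choice is the fixed sinusoidal encoding; the sketch below is one way to write such a layer, with the class name, max_length, and the assumption of an even embed_dim as illustrative choices (the textbook’s version is the reference implementation):

import numpy as np
import tensorflow as tf

class PositionalEncoding(tf.keras.layers.Layer):
    """Fixed (non-trainable) sinusoidal positional encoding; assumes embed_dim is even."""
    def __init__(self, max_length, embed_dim, **kwargs):
        super().__init__(**kwargs)
        pos = np.arange(max_length)[:, np.newaxis]        # shape (max_length, 1)
        i = np.arange(embed_dim // 2)[np.newaxis, :]       # shape (1, embed_dim // 2)
        angles = pos / (10000 ** (2 * i / embed_dim))
        pe = np.zeros((max_length, embed_dim))
        pe[:, 0::2] = np.sin(angles)                       # even dimensions
        pe[:, 1::2] = np.cos(angles)                       # odd dimensions
        self.pos_encoding = tf.constant(pe, dtype=tf.float32)

    def call(self, inputs):
        # add the encodings for the first seq_len positions to the embeddings
        seq_len = tf.shape(inputs)[1]
        return inputs + self.pos_encoding[:seq_len]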

Once this is working, you should have all the building blocks to create a text-generation transformer model. This is the creative part! Devise a way to adapt this transformer architecture for text generation (no longer the English/Spanish dataset, but the Shakespeare dataset). Hint: instead of the outputs being the other language, the outputs should be the text window shifted one character to the right. We have already processed the dataset in this way, so you should be able to rearrange the pieces so that the goal matches the RNN’s.
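
To make the shift concrete, here is a tiny example of what one (input, target) pair looks like when the target is the window shifted one character to the right (the provided to_dataset function already builds pairs like this):

>>> window = "to be or"
>>> inputs, targets = window[:-1], window[1:]
>>> inputs
'to be o'
>>> targets
'o be or'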

Finally, use your text generation model to create novel text (similar to Part 2) and provide the results in your README.md.


Part 4: Comparison and Application to a New Dataset

Think about which model performed better and generated more realistic text. Re-train your best architecture on a dataset of your choice. This could be a book from your favorite author or other digitized text, as long as you can get characters out of it.

Project Gutenberg has over 70,000 free digitized books and is a good place to start. Provide an example of this generated text in your README.md.
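
If you go with a Project Gutenberg book, one option is to download it the same way as the Spanish-English data above; the URL below is a placeholder (substitute the “Plain Text UTF-8” link for the book you choose), and the filename is arbitrary.

from pathlib import Path
import tensorflow as tf

# placeholder URL: replace XXXX with the ID of your chosen book
url = "https://www.gutenberg.org/cache/epub/XXXX/pgXXXX.txt"
path = tf.keras.utils.get_file("my_book.txt", origin=url, cache_dir="datasets")
data = Path(path).read_text(encoding="utf-8").lower()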


Analysis

Include the following results and answer the following questions in your README.md:

  1. Include an example of your prompt and resulting generated text for the RNN model (trained on Shakespeare):

  2. Include an example of your prompt and resulting generated text for the transformer model (trained on Shakespeare):

  3. Include an example of your prompt and resulting generated text for the transformer model (trained on a dataset of your choice):

  4. How did the style of the generated text change (if at all) based on the two text choices?

  5. What would you expect to happen if a different text (e.g. the comments section of an Internet post) was used?

  6. Would you expect problematic text to be generated if none of the training text was problematic? Consider different definitions of problematic, including factually incorrect, biased, inappropriate, or illegal.

Acknowledgements: based on materials by: