CS360 Lab 8: Transformers

Due: Thursday, April 18 at 11:59pm


The goals of this week’s lab:

Note that there is a check-in for this lab: by Tuesday, April 16 you should be finished with the RNN and the transformer tutorial.


To get started, accept the GitHub repo for this lab assignment. Partners are recommended for this lab but not required.

You will have or create the following files:

Unfortunately, for this lab we’ll need to use a more recent version of Python and TensorFlow. In your .bashrc file, comment out the lines for python3.7.7 and add the lines below for python3.10.10 (log out and log back in for the change to take effect).

export PATH=/packages/python3.10.10/bin:$PATH
export LD_LIBRARY_PATH=/packages/python3.10.10/lib:/usr/local/TensorRT-

Part 1: Data Pre-processing

We will start by using a text file that contains all the works of Shakespeare:


The following steps will help you prepare the data for input into either the RNN model or the transformer model. Put these in util.py.

  1. Read in the entire file into a single string (and convert to lowercase).

  2. Create a set of all the characters in the file. Hint:

>>> data = "to be or not to be"
>>> set(data)
{'t', 'b', ' ', 'r', 'n', 'e', 'o'}

I recommend converting the set of characters to a list and then sorting it. You should get 39 total characters - this is the vocab_size.

  3. Create a mapping (i.e. dictionaries) between each character and an integer (encoding) and vice versa (decoding).

  4. Encode the entire dataset. This should produce a list of integers of length 1,115,394.

  5. Choose a window size and use the provided to_dataset function to create TensorFlow batches with this window size (the length argument).

  6. Split the data into training (first 80% of the data), validation (next 10%), and test (last 10%) sets. Shuffle the training data but not the validation or test data. You may want to do this part in rnn_gen.py.
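The first few steps above can be sketched in plain Python. This toy version uses a short string instead of the Shakespeare file; on the real file you should see a vocab_size of 39 and an encoded length of 1,115,394.

```python
# Toy version of the util.py pre-processing (steps 1-4 above).
data = "to be or not to be".lower()   # step 1: read the file, lowercase it

chars = sorted(set(data))             # step 2: sorted list of unique characters
vocab_size = len(chars)               # 39 on the full Shakespeare file

# step 3: encoding/decoding dictionaries
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}

# step 4: encode the entire dataset as a list of integers
encoded = [char_to_int[c] for c in data]
```

Steps 5 and 6 then pass the encoded list to the provided to_dataset function and split the resulting batches 80/10/10.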

Part 2: Recurrent Neural Network

Next we will create an RNN model using gated recurrent units (GRUs), in the file rnn_gen.py. A template for this is provided in the starter code, and your task is to fill in the TODOs and make sure the data is getting passed in correctly.

Here we will use the Sequential model instead of our own custom models. See the Sequential documentation for more details. The model will consist of:

We will then compile and fit the model, using training data and validation data (but not yet test data). Think about which loss function you should use, as well as which optimizer.

Train your model for 10 epochs on the GPU machines and report your accuracy on the test data (in terms of predicting the next letter correctly given the start of the test data). Finally, given some context (i.e. a prompt), call your model to produce new text, one letter at a time, for at least 100 characters. You can choose the most likely character each time (implementing a temperature that samples from the softmax distribution is optional).
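A minimal greedy-decoding sketch of the generation loop is below. The function name is made up; char_to_int and int_to_char are the Part 1 dictionaries, and model is any trained model that maps a batch of integer sequences to per-position softmax probabilities (like the RNN above).

```python
import numpy as np

def generate(model, prompt, char_to_int, int_to_char, n_chars=100):
    """Greedily extend `prompt` one character at a time."""
    text = prompt.lower()
    for _ in range(n_chars):
        x = np.array([[char_to_int[c] for c in text]])  # shape (1, t)
        probs = np.asarray(model(x))[0, -1]             # next-char distribution
        text += int_to_char[int(np.argmax(probs))]      # greedy: most likely
    return text
```

For temperature sampling, you would divide the logits by a temperature and draw from the resulting distribution instead of taking the argmax.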

Include a sample of your generated text in your README.md. You may need to experiment with the window length and other hyperparameters.

Part 3: Transformers

For this part we will first walk through a tutorial about how to create a transformer model for machine translation, then adapt the code for text generation. To obtain the data (English and Spanish), follow this code:

from pathlib import Path
import tensorflow as tf

url = "https://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
path = tf.keras.utils.get_file("spa-eng.zip", origin=url, cache_dir="datasets", extract=True)
text = (Path(path).with_name("spa-eng") / "spa.txt").read_text()

In the file transformer_gen.py, follow the section in the textbook (Géron) starting from “Attention Is All You Need” in Chapter 16 (page 609). Use non-trainable positional encodings. The final part of this step should be fitting the model with this code from page 619:

Y_proba = tf.keras.layers.Dense(vocab_size, activation="softmax")(Z)
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit((X_train, X_train_dec), Y_train, epochs=10,
          validation_data=((X_valid, X_valid_dec), Y_valid))

After this is working, you should have all the building blocks to create a text generation transformer model. This is the creative part! Devise a way to adapt this transformer architecture for text generation (so no longer the English/Spanish dataset, but the Shakespeare dataset). Hint: instead of the outputs being the new language, the outputs should be the text window shifted one character to the right. We have already processed the dataset like this, so you should be able to rearrange the pieces so that the goal is similar to the RNN's.
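The shifted-target idea in the hint can be illustrated on a single window (the integer codes below are made up):

```python
# A window of encoded text, as stand-in integer codes for six characters.
encoded_window = [20, 16, 0, 3, 6, 0]

# Input/target pair for next-character prediction: the target is the same
# window shifted one position to the right.
X = encoded_window[:-1]   # characters 0 .. n-2
Y = encoded_window[1:]    # characters 1 .. n-1
```

At each position the model sees the characters so far and is trained to predict the one that follows, just as in the RNN setup.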

Finally, use your text generation model to create novel text (similar to Part 2) and provide the results in your README.md.

Part 4: Comparison and Application to a new dataset

Think about which model performed better and generated more realistic text. Re-train your best architecture on a dataset of your choice. This could be a book from your favorite author or any other digitized text, as long as you can extract characters from it.

Project Gutenberg has over 70,000 free digitized books and is a good place to start. Provide an example of this generated text in your README.md.


Include the following results and answer the following questions in your README.md:

  1. Include an example of your prompt and resulting generated text for the RNN model (trained on Shakespeare):

  2. Include an example of your prompt and resulting generated text for the transformer model (trained on Shakespeare):

  3. Include an example of your prompt and resulting generated text for the transformer model (trained on a dataset of your choice):

  4. How did the style of the generated text change (if at all) based on the two text choices?

  5. What would you expect to happen if a different text (e.g. the comments section of an Internet post) was used?

  6. Would you expect problematic text to be generated if none of the training text was problematic? Consider different definitions of problematic, including text that is factually incorrect, biased, inappropriate, or illegal.

Acknowledgements: based on materials by: