CS68 Lab 1: Sequences and the Central Dogma

Due: Wednesday, January 31 at 11:59pm

Before you start

Make sure you've filled out the pre-course survey (in email)
Read the entire syllabus on the course webpage

Credit for this lab: Ameet Soni

Overview

The goal of this week's lab is to reinforce basic concepts of biology and bioinformatics. You and your partner will implement a short program that will simulate many aspects of the central dogma, and provide practice working with sequence data.

This lab will be done in pairs, which have been randomly assigned for this lab. In future labs, you will have freedom to choose a partner. Your starting point files (and submissions) will be handled using the GitHub Enterprise interface (see below). This is similar to what most of you have seen in CS31, CS35, and other upper-level courses. Your partner for this week will be listed in the name of your Lab 1 repo (see below). You may discuss concepts with a fellow classmate, especially if you are having difficulty with the details of transcription or translation. You may not, however, share code with students who are not your lab partner.

Your program should be broken into two files. The first, sequences.py, will contain class definitions for DNA, RNA, and Protein objects. The second, dogma.py, will have you implement a main program that allows a user to read in a DNA sequence and convert it to a protein using transcription and translation functionality. The user will be able to search a large genome for potential proteins of interest.

Next week we will start implementing and applying algorithms - for this first week the main goals are to practice Python if it's been a while, get used to working with sequences, and understand the central dogma of molecular biology.

Getting Started

To get started, first create a cs68 directory in your home directory, and add a labs subdirectory to it:

mkdir cs68
cd cs68
mkdir labs
cd labs
pwd

We will be using git repos hosted on the college's GitHub server for labs in this class. If you have not used git or the college's GitHub server before, here are some instructions: Using Git (follow the instructions for repos on Swarthmore's GitHub Enterprise server).

Next, find your git repo for this lab assignment on the GitHub server for our class: cs68-s18

Clone your git repo with the lab 1 starter files into your labs directory:

cd ~/cs68/labs
git clone [the ssh url to your repo]

Then cd into your lab01-id1_id2 subdirectory (id1 and id2 being replaced by the user ids of you and your partner).

Make sure to always use python3!

Sequences and the Central Dogma

In this lab, you will create a Python library and main program to simulate operations described in the central dogma in order to better understand the link between a DNA sequence and resulting protein sequence(s).

One of our examples will be the green fluorescent protein (GFP), one of the most well studied proteins in molecular biology. Its discovery was awarded the Nobel Prize in Chemistry in 2008 for redefining how fluorescent microscopy is utilized in biology.

First, you will construct 3 class definitions, one each for DNA, RNA, and Protein. I will describe the main functionalities that are expected; feel free to add additional information/methods. All three should be defined in a file sequences.py.

Sequence classes

First, define a DNA class. Your class should have, at a minimum, the following functionality:

A constructor (i.e., __init__()) that takes in a string, strand, for the DNA strand. It should create an instance variable to store the strand and initialize any other instance variables you want to maintain.
An __len__() method to get the length of the sequence
A getStrand() method to return the raw sequence as a string
A getSubStrand(start, stop) method to retrieve a portion of the sequence. This should take in a start and stop index and return a string containing all bases from start up to, but not including, the stop index (i.e., Pythonic slicing).
An __str__() method for converting the object to a string. It should return a strand summary, including directionality. That is, the start of the string should be "5' " and the end should be " 3'". If the strand is longer than 30 bases, show the first 15 bases, a series of dots, and then the last 15 bases. E.g., "5' TTTGAGCAAGTCAAA...TTTTATTCGTGTGTA 3'"
An invert() method to replace (not return) the current strand with its reverse complement. That is, the other half of the DNA double strand. You should always think of sequences as 5' to 3', so you will not only need to find the complement of each base, but also reverse the sequence. For example, AAGG should become CCTT.
A transcription() method that returns a list of RNA objects. Each RNA object will represent the sequence between one pair of start/stop codons in the same reading frame. That is, the distance between them is evenly divisible by three. A naive way to implement this method is to search for all possible start codons (ATG). For each start codon, search the rest of the strand, incrementing by three, for a stop codon (TAG, TGA, or TAA). If there is no stop codon, do not add the encoding to the list. There may be overlaps in encodings (ATG can code for a regular Methionine or a start one). Be sure to substitute U's for T's when constructing the RNA object. You should pass the index of the first nucleotide after the start codon and the index of the beginning of the stop codon to each RNA object's constructor (as well as the sequence).
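
As a rough illustration of the invert() and transcription() logic described above, here is a minimal sketch written as standalone functions on plain strings rather than as methods. The names reverse_complement and find_coding_regions are hypothetical, and returning tuples instead of RNA objects is an assumption made to keep the example short; your class design may differ.

```python
# Hypothetical helpers; names and return types are illustrative only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAG", "TGA", "TAA"}


def reverse_complement(strand):
    """Return the opposite strand, read 5' to 3'."""
    # Complement every base, then reverse the whole string.
    return strand.translate(COMPLEMENT)[::-1]


def find_coding_regions(strand):
    """Return (start, stop, rna) tuples for each ATG with an in-frame stop.

    start is the index of the first base after ATG; stop is the index
    where the stop codon begins.
    """
    regions = []
    pos = strand.find("ATG")
    while pos != -1:
        start = pos + 3
        # Step through the rest of the strand one codon at a time.
        for i in range(start, len(strand) - 2, 3):
            if strand[i:i + 3] in STOP_CODONS:
                rna = strand[start:i].replace("T", "U")
                regions.append((start, i, rna))
                break
        # Resume one base later so overlapping start codons are also found.
        pos = strand.find("ATG", pos + 1)
    return regions
```

For example, reverse_complement("AAGG") gives "CCTT", matching the example in the invert() description. A start codon with no in-frame stop codon simply contributes nothing to the list.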

Next, define an RNA class. Your class should have the following methods:

A constructor that takes in an RNA strand, as well as start and stop indices for where the encoding can be found in the original DNA sequence. You should store these three items as well as any other data members you see fit.
An __len__() method to return the length of the sequence
A getStrand() method to return the raw RNA strand as a string
An __str__() method similar to above, but it should also include the indices, e.g., "16-22: 5' CUGCCA 3'"
A translate() method that returns a Protein object containing the translation of the mRNA sequence. This method should take in a codon table as input and use this to produce the translation. You should pass the start/stop index in to the Protein constructor.
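
The summary format and the translation step can be sketched as follows. This is only one possible approach, shown on plain strings; the function names translate, summarize, and rna_summary are hypothetical, and the sketch assumes the codon table is a dictionary mapping each RNA codon to a one-letter amino acid.

```python
def translate(rna, codon_table):
    """Translate an mRNA string codon by codon using a codon -> amino acid dict."""
    # Any trailing one or two bases that do not form a full codon are ignored.
    return "".join(codon_table[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def summarize(strand, limit=30, edge=15):
    """Shorten strands longer than limit to first/last edge symbols with dots."""
    if len(strand) > limit:
        return strand[:edge] + "..." + strand[-edge:]
    return strand


def rna_summary(rna, start, stop):
    """Format an RNA strand with its indices and 5'/3' labels."""
    return "{}-{}: 5' {} 3'".format(start, stop, summarize(rna))
```

With a table containing CUG -> L and CCA -> P, translate("CUGCCA", table) produces "LP", and rna_summary("CUGCCA", 16, 22) produces "16-22: 5' CUGCCA 3'". A strand of exactly 30 symbols is not shortened, since the rule applies only to strands longer than 30.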

Lastly, you should create a Protein class. This class will look exactly the same as the RNA class (e.g., it has an amino acid strand, start, and stop), minus the translate method. The constructor will take in an amino acid sequence and a start and stop index for finding the original encoding region in the DNA sequence. You do not need to print out directions for a protein sequence (i.e., there is no 5' to 3' designation).

Main program

You will define your main program in dogma.py. At a high level, your program should:

Greet the user
Prompt the user for a sequence file; load the sequence as a DNA object and print the sequence's summary
Prompt the user for the codon table file; load the table into a dictionary
Go into the program's main loop for allowing a user to interact with the sequence. The loop should exit when the user selects option "0"

The main loop can be as creative as you like. At a minimum, you should define behavior for the following options:

Print the raw DNA sequence (the entire sequence with no 5' or 3' labels, using getStrand())
Display a subsequence of the DNA strand. This should prompt the user for a start and stop location and display just the nucleotides between these two indices.
Allow the user to invert the DNA sequence, and then print the sequence summary (i.e., use its str() method). Any RNA or Protein sequences that have been stored should be cleared as they no longer apply.
Transcribe the DNA sequence. As described above, this should produce a list of all mRNA strands that could be produced from the sequence. You should print the number of mRNA molecules produced and their summary.
Print all of the raw mRNA sequences (entire sequence, no directionality)
Translate each mRNA sequence to a protein (make sure you clear out any previous proteins from your list); print a summary of each protein
Write raw protein sequences to an output file, one protein per line
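
The last option, writing one protein per line, could look something like the sketch below. The function name write_proteins is hypothetical, and it assumes the proteins are passed in as raw strings (for instance, the result of calling getStrand() on each Protein object).

```python
def write_proteins(proteins, filename):
    """Write each raw protein sequence on its own line of filename."""
    with open(filename, "w") as outfile:
        for protein in proteins:
            outfile.write(protein + "\n")
```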

Hints and Tips

Reading FASTA file

FASTA is a standardized format used across the field to represent DNA and/or protein sequences. You can read in detail about the format at the NCBI manual page. For this lab, you only need to know that there are two types of lines in the file: description lines and sequence lines. For example:

>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP

The first line describes the gene and can be ignored for this lab. The next four lines are the gene's protein sequence. When loading your file, you can ignore description lines. The first character on a description line will be the greater than symbol ">". Each line below the description line is part of the sequence, typically with no more than 80 characters per line (the example above wraps at 70). To load the sequence, read the rest of the file line by line, strip the trailing newline from each line, and concatenate the lines together to create one large string.
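
A minimal sketch of this file-reading step is shown below. The function name read_fasta is hypothetical, and the sketch assumes the file holds a single sequence, as the lab's sample files appear to.

```python
def read_fasta(filename):
    """Return the sequence in a FASTA file as one string, skipping '>' lines."""
    pieces = []
    with open(filename) as infile:
        for line in infile:
            line = line.strip()  # drop the newline and stray whitespace
            if line and not line.startswith(">"):
                pieces.append(line)
    return "".join(pieces)
```

Collecting the lines in a list and joining once at the end avoids repeatedly rebuilding a long string, which matters for a genome-sized file.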

Reading Codon Table

A codon table maps each three-letter RNA codon to the single-letter amino acid that it produces. Look at the codon.txt file and note that each line contains the amino-acid abbreviation first, and then a list of all codons that map to that amino acid. You should load this file into a dictionary that maps each codon to its amino acid equivalent. E.g., codonTable["AUU"] = 'I'
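
Loading the table might be sketched as follows. The function name load_codon_table is hypothetical, and the sketch assumes the fields on each line are separated by whitespace and/or commas; check codon.txt and adjust the splitting if its layout differs.

```python
def load_codon_table(filename):
    """Build a dict mapping each codon to its one-letter amino acid."""
    table = {}
    with open(filename) as infile:
        for line in infile:
            # Assumed layout: amino acid first, then its codons.
            fields = line.replace(",", " ").split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            amino_acid = fields[0]
            for codon in fields[1:]:
                table[codon] = amino_acid
    return table
```

Inverting the file's layout this way means a translation step needs only one dictionary lookup per codon.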

Program Requirements

In addition to the requirements listed above, you should ensure your code satisfies these general guidelines:

Use good top-down design principles. In fact, your solution should be very short (about 125-150 lines in dogma.py) if you design your solution well.
Make sure to practice defensive programming. Make sure the user enters in valid file names and numeric choices for the menu.
Be sure to comment non-trivial sections of your code (include headers for each function and method as well)
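
One way to approach the defensive-programming guideline is to wrap each prompt in a small loop that repeats until the input is valid. The sketch below is only illustrative: the names get_choice and get_existing_file are hypothetical, and the read parameter (defaulting to input) is an assumption added so the functions can be exercised without a keyboard.

```python
import os


def get_choice(low, high, read=input):
    """Prompt until the user enters an integer between low and high."""
    while True:
        text = read("Enter choice: ").strip()
        try:
            choice = int(text)
        except ValueError:
            choice = None  # not a number at all
        if choice is not None and low <= choice <= high:
            return choice
        print("Please enter a number from {} to {}.".format(low, high))


def get_existing_file(prompt, read=input):
    """Prompt until the user names a file that exists."""
    while True:
        name = read(prompt).strip()
        if os.path.isfile(name):
            return name
        print("Could not find file '{}'; please try again.".format(name))
```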

Command line arguments

If you would like to set up command line arguments so you don't have to type in the filenames every time, I recommend the optparse library. This is completely optional - if you do use command line arguments, make sure a helpful message is printed when I just run python3 dogma.py.
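
For reference, here is a minimal sketch of optional command line handling. Note that it uses argparse, the standard library's newer replacement for optparse, rather than optparse itself; that substitution, the function name parse_args, and the flag names -f/--fasta and -c/--codons are all assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the FASTA and codon table file names from the command line."""
    parser = argparse.ArgumentParser(
        description="Simulate transcription and translation of a DNA sequence.")
    # Hypothetical flag names; both are required so that a bare run
    # prints a usage message instead of failing later.
    parser.add_argument("-f", "--fasta", required=True,
                        help="FASTA file containing the DNA sequence")
    parser.add_argument("-c", "--codons", required=True,
                        help="codon table file")
    return parser.parse_args(argv)
```

Because both options are marked required, running the program with no arguments makes argparse print a usage line and an error message, then exit. Running with -h prints the full help text.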

Sample Runs

In your labs directory, I have placed two sample sequence files, test.fasta and gfp.fasta. The latter is the sequence for the green fluorescent protein, while the former is a toy example for which I have results below. Try your code on the test file first, and then see what happens with your GFP gene. If you want to try a large example, try running your code on the E. coli UTI89 genome in ecoli_uti89.fasta. It is located at /home/smathieson/public/cs68/ecoli_uti89.fasta. DO NOT COPY this file; it is quite large. Note that your program will take a while to run for certain operations since it is a large sequence.

Welcome to the gene translator

Enter FASTA file name: test.fasta
Enter Codon Table file name: codon.txt

DNA sequence of length 126 successfully loaded:
5' TTAATAGCGTGGAAT...CATTTTATTTTAAAA 3'

Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 1

Entire DNA sequence:
TTAATAGCGTGGAATGATCCTTATTAAAGAGTGTCACGAAGAGTCGGAATAGAATATGGAGGCGACAGTCGAGGGTGGGATAGAGTCCTAAAGATAACATTAAGTGTTAATCATTTTATTTTAAAA

Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 3

2 Resulting mRNA sequences:
16-49: 5' AUCCUUAUUAAAGAG...CACGAAGAGUCGGAA 3'
58-88: 5' GAGGCGACAGUCGAGGGUGGGAUAGAGUCC 3'

Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 4

mRNA Sequence 0
AUCCUUAUUAAAGAGUGUCACGAAGAGUCGGAA
mRNA Sequence 1
GAGGCGACAGUCGAGGGUGGGAUAGAGUCC

Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 5

2 Resulting protein sequences:
16-49: ILIKECHEESE
58-88: EATVEGGIES

Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 6
Enter output filename: test.pro

File output complete

Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 2

DNA sequence successfully inverted:
5' TTTTAAAATAAAATG...ATTCCACGCTATTAA 3'

Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 3

1 Resulting mRNA sequences:
15-24: 5' AUUAACACU 3'

Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 5
1 Resulting protein sequences:
15-24: INT

Options:
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 0

The output protein files for the other test cases are available as well:

Submitting your work

Be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded.