CS364 Lab 1: String Search

Before you start

Read the entire syllabus on the course webpage

Goals

This first lab is a short introduction to string search, a fundamental task in computational biology but also an essential task in daily life (i.e. web search!) We’ll see throughout the course how there are tradeoffs in most algorithms between runtime, memory, pre-processing time etc and how these depend on which tasks we need to repeat many times. Specifically in this lab the goals are:

Re-familiarize yourself with Python and GitHub if it’s been a while.
Implement a string search algorithm (Boyer-Moore).
Test the runtime of this algorithm against a naive string search algorithm.
Visualize your results using matplotlib.

This first lab is individual, but there will be opportunities for pair programming later on.

Getting Started

We will be using GitHub classroom this semester. Follow the link (on Piazza) to accept this assignment and clone your repo. VScode is encouraged but you are welcome to use an editor of your choice. You can keep all your code for this lab in string_search.py, but you are also welcome to add helper files.

Part 1: Naive String Search

In the file string_search.py, implement a function:

def naive_string_search(P, S):

That takes two string arguments (the pattern P and the search string S) and returns two values:

a list of positions in S where P occurs (empty list if no occurrences)
the number of character comparisons performed

Notes: you may not use any slicing for this algorithm, or any Python built-ins like index or even in. You should manually check the characters of P and S until you see a mismatch, then shift over by 1 and start again (adding found indices to the return value as necessary).

For example if:

S = "CTGCACCTCATTTTGCCATTTATTTT"
P = "ATTTT"

Then you should return:

list of indices (in S) where P occurs: [9, 21]
number of comparisons: ~~15~~ 35

Your number of comparisons might be slightly off depending on your implementation.

Part 2: Boyer-Moore, bad character table

Now we will start to implement the Boyer-Moore algorithm, starting with the bad character table. See class notes for more details, but the idea is to match right-to-left. If we see a mismatch, we will shift the pattern over to the next character that would match the search string. Therefore we need not just one table of shifts, but a table for each character in the alphabet (representing what we would be mismatching with in the search string). One way to do this is with a dictionary, where the keys are the characters in the alphabet, and the values are lists of shifts (corresponding to each index in the pattern). For example if

P = "ATTTT"

and our alphabet is A,C,G,T then the bad character table is:

{'A': [0, 1, 2, 3, 4],
 'C': [1, 2, 3, 4, 5],
 'G': [1, 2, 3, 4, 5],
 'T': [1, 0, 0, 0, 0]}

I would recommend having one function to compute this table, and another to orchestrate the Boyer-Moore algorithm. Start comparing the pattern to the search string (right-to-left). Whenever a mismatch is reached, use the pre-computed bad character table to compute the shift. Move the pattern over this amount, then repeat the process. Whenever a complete instance of the pattern is found, add the appropriate index (in the search string) to the list of found occurrences.

The output of Boyer-Moore should be the same as naive in terms of the list of occurrences, but the number of comparisons should be smaller. Using the same example as in Part 1, the output should be:

list of indices (in S) where P occurs: [9, 21]
number of comparisons: ~~12~~ 17

Again the exact number of comparisons might be slightly different. With these small examples there is not a dramatic improvement for Boyer-Moore, but there often is for longer search strings and patterns.

Part 3: Boyer-Moore, good suffix table

Note: Part 1, 2, and 4 are worth most of the credit.

In this part we’ll add the idea of the good suffix table, then change Boyer-Moore slightly such that the shift is:

max(bad character shift, good suffix shift)

I would again recommend a separate function to compute the good suffix table, which is a single list of shifts (one for each position in the pattern). Here are a few examples:

P = "GCGACG"
good_suffix_table = [5, 5, 5, 3, 5, 1]

P = "TAAAA"
good_suffix_table = [5, 1, 2, 3, 4]

Finally, put the good suffix table with the bad character table together for the full Boyer-Moore algorithm and test with a variety of patterns and search strings.

Part 4: Runtime, Analysis, and Command line args

There are a few last components to this assignment to solidify the impact of fast search algorithms.

Command line arguments

To make your program more flexible, add command line arguments. For this first lab we’ll use the Python sys library, but in the future we’ll be a bit more robust. Here is an example of how we should be able to run your program on the command line:

python3 string_search.py <pattern> <search string file>

For example:

python3 string_search.py ATTTT rand_str.txt

This should print out the list of occurrences and the number of comparisons (for both naive and Boyer-Moore). In code, you can accomplish this with:

import sys

P = sys.argv[1]
S = sys.argv[2]

Runtime

This part is open-ended, but the requirement is to create a plot (using matplotlib) that compares the runtime (in terms of number of comparisons) between the naive algorithm and Boyer-Moore. One idea is to create a function the returns a random string of size n. Then you can vary n in terms of either the pattern length or the search string length. Or you can vary the size of the alphabet. To summarize, you can pick one of:

increase the length of the search string, keeping the pattern and the alphabet the same
increase the length of the pattern, keeping the search string and the alphabet the same
increase the size of the alphabet, keeping the pattern and the search string the same length

Complete this part in runtime.py. When we run this from the command line, a plot should be created (with labeled axes, a legend, and a title). Look back on CS260 assignments for more details about matplotlib.

Analysis and Submission

Finally, fill out the README.md in the git repo, which asks a few questions about runtime.

Make sure your code is easy to read, well-commented, and non-redundant. Style will be a smaller part of this first lab, but will be graded in future labs. Also be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded.