Project 3.5: Global Alignment

We are taking a somewhat different approach to the Needleman-Wunsch algorithm than are St. Clair and Visick.

We are using Python instead of Perl (duh!).
- As is our standard practice, rather than writing a single program, we write separate procedures to handle the different parts of the problem.
Instead of building the directional table by post-processing the score table, we are building that table simultaneously.
- Instead of using the letters D, H, and V in that table, we are using the strings 'Mat' (for match), 'Del' (for deletion of an element from sequence 1) and 'Ins' (for insertion of an element into sequence 1; which also corresponds to a deletion of an element from sequence 2).
We provide some of the procedures and modifications they suggest you write.
- In constrast, we don't provide all of the features they include.
We make our alignment instructions from left-to-right, rather than from right-to-left.

Preparation

The file needleman-wunsch.py contains most of the procedures you will need to explore Needleman-Wunsch pattern matching. Make your own copy.

1. Basic Explorations

a. The procedure buildNWTables builds the two Needleman-Wunsch tables (scores and directional instructions). Run this procedure on the two sample strings from the reading (CACGTAT and CGCA).

>>> (values,directions) = buildNWTables('CACGTAT', 'CGCA')

Verify that you get the same tables as are represented by Figure 3.8 on p. 72. (These lists are not quite as conveniently represented, so you'll need to think a bit about them.)

b. The procedure oneAlignment takes the directions table from the previous procedure and a position in the table, and finds one set of alignment instructions. Use it to find one alignment of the two strings. Explain what the parts of the alignment mean. (Ask for help if you're not sure.)

c. Looking at Figure 3.8, make a list of all possible lists of alignment instructions for the optimal score.

d. The procedure allAlignments is supposed to find all of the alignment instructions from the directions table. Determine whether or not it finds the same set of alignments that you found.

e. The procedure showAlignment shows how to strings align, using a set of alignment instructions (as produced by oneAlignment or allAlignments). Test it with each of the alignments produced by allAlignments.

f. Rather than replicating these three steps (build the table, find the alignment instructions, show the alignment) by hand, we can combine them into a procedure. Write a procedure that does that for one alignment.

g. We've written similar procedures called nw1 and nw2. Test both procedures on the sequences CACGTAT and CGCA. Then compare their code to yours.

2. Other Alignments

All the optimal alignments of the two sequences from the reading consist of only matches and deletions.

a. Find a pair of strings, each of length at least 4, in which an optimal alignment involves insertions (that is, we'll see a '-' in sequence 1 where there is a letter in sequence 2)

b. Find a pair of strings, each of length at least 4, in which an optimal alignment involves both insertions and deletions (that is, we'll see a '-' in both sequences).

3. Other Scoring Metrics

Right now, the scoring metric of insertions have a value of -1, deletions have a value of -1, mismatches have a value of 0, and matches have a value of 1 is hard-coded into the algorithm. But that's not the only scoring metric we may care to use. For example, we may want to give mistmatches a value of -1, or matches a higher value.

a. Rewrite the code so that it is easier to specify the values for insertions, deletions, matches, and mismatches.

b. Set up a metric in which insertion, deletions, and mistmatches all have a value of -1, and matches have a value of 1. Does this new metric affect how the algorithm aligns CACGTAT and CGCA? Does this new metric affect how the algorithm aligns your strings from the previous exercise?

c. Set up a metric in which insertion, deletions, and mistmatches all have a value of -1, and matches have a value of 2. Does this new metric affect how the algorithm aligns CACGTAT and CGCA? Does this new metric affect the how the algorithm aligns your strings from the previous eercise?

4. Sequences of Insertions and Deletions

You may not complete this problem. That's okay. Do your best to think about it.

As currently written, our algorithm makes the penalty for a sequence of insertions (or a sequence as deletions) a multiple of the length of the sequence. Frequently, it makes more sense to say that the basic step of inserting (or deleting) is expensive, but the length is much less important.

Rewrite the algorithm so that the cost of an insertion (or deletion) is independent of the length of the insertion (or deletion).

In computing the value, instead of looking at just the cells in the table immediately above and to the left (for insertion and deletion), you'll need to look at all the cells above and to the left. (This computation will make the algorithm much less efficient, but is necessary.)

Note that you'll need to put more information in the directions table. Instead of just 'Del' or 'Ins' you'll also need to note how many deletions or insertions to do.

Note that the process for extracting alignments from the table is a bit more complicated using this cost algorithm. Replace the allAlignments procedure with one that takes your new format into account.