Exploring Bioinformatics with Python
Basic:
[Skip To Body]
[Front Door]
|
[Reference]
[Labs]
[Projects]
Courses:
[BIO/CSC295.01 2009F]
[BIO/CSC295.01 2011F]
Python:
[python.org]
[biopython.org]
Misc:
[Exploring Bioinformatics site]
We are taking a somewhat different approach to the Needleman-Wunsch algorithm than are St. Clair and Visick.
'Mat'
(for match),
'Del'
(for deletion of an element from sequence 1)
and 'Ins'
(for insertion of an element into sequence 1;
which also corresponds to a deletion of an element from sequence 2).
The file needleman-wunsch.py contains most of the procedures you will need to explore Needleman-Wunsch pattern matching. Make your own copy.
a. The procedure buildNWTables
builds the two Needleman-Wunsch
tables (scores and directional instructions). Run this procedure on the
two sample strings from the reading (CACGTAT and CGCA).
>>> (values,directions) = buildNWTables('CACGTAT', 'CGCA')
Verify that you get the same tables as are represented by Figure 3.8 on p. 72. (These lists are not quite as conveniently represented, so you'll need to think a bit about them.)
b. The procedure oneAlignment
takes the directions table
from the previous procedure and a position in the table, and finds one
set of alignment instructions. Use it to find one alignment of the two
strings. Explain what the parts of the alignment mean. (Ask for help
if you're not sure.)
c. Looking at Figure 3.8, make a list of all possible lists of alignment instructions for the optimal score.
d. The procedure allAlignments
is supposed to find all of
the alignment instructions from the directions table. Determine whether
or not it finds the same set of alignments that you found.
e. The procedure showAlignment
shows how to strings
align, using a set of alignment instructions (as produced by
oneAlignment
or allAlignments
). Test it with
each of the alignments produced by allAlignments
.
f. Rather than replicating these three steps (build the table, find the alignment instructions, show the alignment) by hand, we can combine them into a procedure. Write a procedure that does that for one alignment.
g. We've written similar procedures called nw1
and
nw2
. Test both procedures on the sequences CACGTAT and CGCA.
Then compare their code to yours.
All the optimal alignments of the two sequences from the reading consist of only matches and deletions.
a. Find a pair of strings, each of length at least 4, in which an optimal alignment involves insertions (that is, we'll see a '-' in sequence 1 where there is a letter in sequence 2)
b. Find a pair of strings, each of length at least 4, in which an optimal alignment involves both insertions and deletions (that is, we'll see a '-' in both sequences).
Right now, the scoring metric of insertions have a value of -1,
deletions have a value of -1, mismatches have a value of 0, and matches
have a value of 1
is hard-coded into the algorithm. But that's not
the only scoring metric we may care to use. For example, we may want to
give mistmatches a value of -1, or matches a higher value.
a. Rewrite the code so that it is easier to specify the values for insertions, deletions, matches, and mismatches.
b. Set up a metric in which insertion, deletions, and mistmatches all have a value of -1, and matches have a value of 1. Does this new metric affect how the algorithm aligns CACGTAT and CGCA? Does this new metric affect how the algorithm aligns your strings from the previous exercise?
c. Set up a metric in which insertion, deletions, and mistmatches all have a value of -1, and matches have a value of 2. Does this new metric affect how the algorithm aligns CACGTAT and CGCA? Does this new metric affect the how the algorithm aligns your strings from the previous eercise?
You may not complete this problem. That's okay. Do your best to think about it.
As currently written, our algorithm makes the penalty for a sequence of insertions (or a sequence as deletions) a multiple of the length of the sequence. Frequently, it makes more sense to say that the basic step of inserting (or deleting) is expensive, but the length is much less important.
Rewrite the algorithm so that the cost of an insertion (or deletion) is independent of the length of the insertion (or deletion).
In computing the value, instead of looking at just the cells in the table immediately above and to the left (for insertion and deletion), you'll need to look at all the cells above and to the left. (This computation will make the algorithm much less efficient, but is necessary.)
Note that you'll need to put more information in the directions
table. Instead of just 'Del'
or 'Ins'
you'll also need to note how many deletions or insertions to do.
Note that the process for extracting alignments from the table is a bit
more complicated using this cost algorithm. Replace the
allAlignments
procedure with one that takes your new
format into account.
This page was generated by
Siteweaver on Thu Sep 15 13:11:37 2011.
The source to the page was last modified on Thu Sep 15 13:11:34 2011.
This page may be found at http://www.cs.grinnell.edu/~rebelsky/ExBioPy/project-3.5.html
.
You may wish to validate this page's HTML
Samuel A. Rebelsky