Approximate overview
How are we interpreting the grades?
Grade on assignments are irrelevant, other than right/wrong.
The only thing that matters is the number of learning goals you’ve demonstrated that you’ve achieved.
Can I challenge the grades?
Sure.
Given two strings, s and t, find the minimum edit distance that
will convert s to t given a cost of d to delete a character and
a to add a character and r to replace a character.
A lot of TPS
Here’s the table. s{i} represents “the first i characters of s”
T A R G E T
C t{0} t{1} t{2} t{3} t{4} ...
+----+----+----+----+----+--
S s{0}| 0 | a | 2a | | | ...
+----+----+----+----+----+--
O s{1}| d | | | | | ...
+----+----+----+----+----+--
U s{2}| 2d | | | | | ...
+----+----+----+----+----+--
R s{3}| | | | | | ...
+----+----+----+----+----+--
C s{4}| | | | | | ...
+----+----+----+----+----+--
E s{5}| | | | | | ...
. . . . . .
. . . . . .
. . . . . .
What’s the recursive statement about the minimum edit distance
between two prefixes, s{i} and t{j}, which we phrase as
cost(s{i},t{j})? You can use s[i] for the ith character of i.
Reminder: For most dynamic programming problems, writing the recursive statement is 90% of the work.
cost(s{i}, t{j}) =
if (0 == i)
return a*j
else if (0 == j)
return d*i
else if (s[i] == t[j])
return cost(s{i-1}, t{j-1})
else
return min(d + cost(s{i-1},t{j}), // deletion
a + cost(s{i},t{j-1}), // insertion
r + cost(s{i-1},t{j-1})) // replacement
“Mathematics is invariant of notation.”
“Computer science likes to pretend it’s mathematics.”
How would we update the edit distance algorithm so that we know not only the edit distance, but also step necessary to change the source to the target?
Set up another table, as you insert or delete, add to the table. In each cell, you mark which choice you’ve made (add, delete, replace, keep). You can then read the steps backwards in the table.
What if s is much smaller than t and we care about finding the
optimal alignment? E.g., we match “habit” or “phat” or “shab”
against “alphabetical”?
Hint: Change the table initialization.
Related idea: Update the table so that it’s okay to start anywhere
in t.
C t{0} t{1} t{2} t{3} t{4} ...
+----+----+----+----+----+--
S s{0}| 0 | 0 | 0 | 0 | 0 | ...
+----+----+----+----+----+--
Running time O(mn), where m is the length of s and n is the length of t.
Probably as good as we can do.
There are many variants of the problem.
s and t, find the index of the first match of s to a substring of t.s and t, find the index of any one match of s to a substring of t.s and t, find all indices of the matches of s to a substring of t.s and t, does s match any substrings of t?We’ll focus on the first. E.g., find “ska” in “triskaidekaphobia”
TPS
What’s the “obvious” brute force algorithm?
for pos = 0 to (length(t) - length(s))
if (0 == strcmp(s, t[pos,pos+length(s)-1])) then
return pos
return "not found"
What is the cost of that algorithm? O(mn); no better than the one above.
What is an instance in which we find a match and still come close to the upper bound? s should be length 5, t should length 100.
Mediocre example: abcde vs abcd-abcd-abcd-abcd-abcd-abcd-….abcde. Cost: Approximately 2n.
vwxyzvsabcdefghijklmnopqrstuvwxyz. This is O(n), which is pretty fast.
Source
aaaactargetaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacThis is O(mn).
What is an instance in which we don’t find a match but are far from the upper bound? (E.g., one that is close to length(t).) s should be length 5 (or, preferably, any length), t should be length 100.
vwxyvvsabcdefghijklmnopqrstuvwxyz. This is O(n), which is pretty fast.
TPS
Suppose we are doing many matches. Are there ways we can pre-process the target string so that the amortized cost is much less? Does it matter if all the source strings are the same length?
What other techniques can we use to solve this problem more quickly?
Sam’s favorite mechanism …
t of the appropriate length
(that is, length(s)).s appears, compute its hash code and look in
the hash table. (You may have to filter that index in the hash table.)Criticisms
Here’s a good hash algorithm to use that solves the problem.
Our string is t[1] … t[m]. p is a prime. t[1]xp^m + t[2]xp^(m-1) + s[3]xp^(m-2) + … s[m-1]xp^2 + s[m]xp^1.
If we have the hash code for t[1..m], can we easily compute the hash code for t[2..m+1]? (“easily” = “in constant time”)
We will return to the question next class.