EBoard 40: String matching, concluded

Approximate overview

Admin
Discussion of grading
Analysis of the hashing technique for string matching
Other algorithms

Administrative stuff

Notes

Please use the booking assistant on Teams or Outlook to book a meeting or more with me during finals week.
I’ve released some grades. More are forthcoming.
I’ve also added all the LOs.

Upcoming token-generating activities

Equity in CS talk tonight. See email from SamR yesterday.
Summer Opportunities in CS tomorrow at 4:00 p.m. in 3821.

Other fun things

“I Dig You” this weekend. Ticketing starts today, we think.

Upcoming work

HW12 due TOMORROW
- Some dynamic programming problems (pseudocode)
- Topological sort (real code)
HW11 due LAST WEEK (ish)
- Implement skip lists

Q&A

Do we have class on Friday?

Yes! I expect you to be here. With a pen.

Why do we need a pen?

It’s a surprise!

Can I use a tablet?

Probably not. I want output on paper.

Grading

Approach: Count how many of the Learning Outcomes (LOs) you’ve demonstrated.
There are 21 one of them.
- All 21: A+
- 20: A
- 19: A-
- 18: B+
- …
What are the Learning Outcomes?
- They are on the syllabus. (Not necessarily in the most sensible order.)
How do you demonstrate that you’ve met them?
- Usually, with successful completion of a homework assignment.
- In some cases, it’s up to you to come up with the demonstration.
We’ll look at them together.

Substring matching with hash codes

Compute the hash code of each substring of t of the appropriate length (that is, length(s)).
To determine everywhere s appears, compute its hash code and look in the hash table. (You may have to filter that index in the hash table.)
Hash algorithm/function
- Our string is t[1] … t[n]. We’re hashing t[1] … t[m].
- p is some prime.
- Hash code h[1] is … t[1]xp^m + t[2]xp^(m-1) + t[3]xp^(m-2) + … t[m-1]xp^2 + t[m]xp^1.

Question:

If we have h[1], the hash code for t[1..m], can we quickly compute h[2], the hash code for t[2..m+1]? (“quickly” = “in constant time”)

If we have h[2], the hash code for t[2..m+1], can we quickly compute h[3], the hash code for t[3..m+2]?

If we have h[i], the hash code for t[i..i+m-1], can we quickly compute h[i+1], the hash code for t[i+1..i+m]?

If so, the hash algorithm is O(m+n), which seems about as good as we can do.

Strategy?

t[i]xp^m + t[i+1]xp^(m-1) + t[i+2]xp^(m-2) + … t[i+m-2]xp^2 + t[i+m-1]xp^1.
t[i+1]xp^m + t[i+2]xp^(m-1) + t[i+3]xp^(m-2) + … t[i+m-1]xp^2 + t[i+m]xp^1.

We’ve dropped the first term of the first one and added the last term of the second one.

h[i+1] = (h[i] - t[i]xp^m)xp + t[i+m]xp^1
We can have pre-computed p^m., so it’s just three mult, one subtraction, and one addition. (We can write it in other ways that may change that slightly.)

Yay! We have an O(m+n) algorithm for exact substring matching.

Advantages

Straightforward
Relatively easy to implement.
Can give all matching positions, not just first.
Optional: We can amortize it over all searches of length m.

Disadvantages

May have overlapping hash codes, once we mod by the table size.
Need to choose an appropriate prime. (2 and 3 are probably not good.)
Requires extra space for the hash table.

Can we do without the hash table?

Assume that we’re only looking for the first match, can we use a similar strategy to find the match in O(m+n) time and O(1) additional space?

m = strlen(s);
hs = hash(s,m); // Compute hash code of first m letters of s
ht = hash(t,m); // Compute hash code of first m letters of t

for (int i = 0; i < n-m-1; i++) 
  {
    if (hs == ht) 
      {
        if (0 == strncmp(s,t+i,m)) 
          {
            return i;   // Position of match
          }
      }
    else 
      {
        ht = ...; formula above
      }
  } // for
return -1;      // "no match"

Morals

We don’t always have to use dynamic programming.
Good choices of hash functions can make an important difference. (The one we used has a name that I forget.)
My classmate says smart things.
Sometimes it’s useful to amortize, sometimes you just want to solve it once and be done.
“Can I do better?”
We have a large toolbox; sometimes the unexpected tool is best for the job.
Analyze carefully.
Ask questions.
KISS: Keep it simple, Sam

Other approaches to substring matching

A bit of preparation

The source strings that give us the most problem are things like aaaaaab or ababababc or abcabcabcd or …; that is, strings that repeat their prefixes.

Why? Consider matching aaaaaab against aaaaaaab.... When we fail to match the b against that last a, what should we do?

Consider matching ababababac against ababababababababac.

In constrast, consider, instead, matching abcdefg, against abcdefabcdefg. When we fail to match the g against the a, what should we do?

How might those examples inspire us to develop a new algorithm?

Algorithm design time

TPS

A new algorithm?

Idea: When you hit a mismatch, don’t back up all the way to the beginning of s, don’t back up at all in t.

Instead, make a table that says how far you need to back up in s upon a mismatch.

aaaaaaab
*******1

aaaacaaaaa

ababababab
**23......

An alternate; this will make no sense to anyone not in class

_Idea: Move backwardds in the string.

Source: baaaaaaaa

Target: abaaaaaaaa

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

This website was built using Jekyll, Twitter Bootstrap, and the Bootswatch Cosmo Theme.