Held: Monday, 12 September 2011
Summary:
Today we consider finite automata, which serve as another, more
computational, representation of data for lexical analysis.
Related Pages:
 EBoard.
 Reading: Aho et al., 3.6.
Notes:
 Don't forget that you are supposed to drop me a short note reflecting on my notes on your first assignment.
 Some of you have asked about my grading style for this class. As you may have noted, I respond to your work in a variety of ways. I note potential and real errors. I note potential stylistic problems. I give suggestions for improving the code. I may give example inputs for which I am not sure your program will work correctly. I may write short notes like "ugh" or "bleh", which indicate that you've done something unpleasant at that point in your code. (I expect that you are sophisticated enough to figure it out, and can ask me if you're unsure.) In the past, I have found that some students do not take the time to look at the comments, and that those who do look at the comments don't always think about them carefully. Hence, this semester, I am not going to summarize my comments and will instead ask you to come up with a summary. You may certainly come talk to me if you have questions about the details of particular comments or about the big picture of your assignment.
 Do you have to do the second assignment? No. But if you don't do the assignment, you will get a zero on the assignment, which is between two and five points off on your final grade. In addition, it is likely that you will find the next assignment harder.
 Can you use extended regular expression syntax on the second assignment? See Piazza.
 EC for President's speech on Thursday at 11.
 EC for Walker Group's speech on Thursday at 4:30.
Overview:
 Finite Automata.
 Deterministic Finite Automata.
 Examples.
 Nondeterministic Finite Automata.
 Looking Ahead.
 Regular expressions provide a simple, easy to understand, and
elegant mechanism for describing the tokens of a language.
 Unfortunately, it is difficult to compute with regular expressions.
 Do you know how to execute a regular expression?
 How about a set of regular expressions?
 Regular expressions (which we like to write) are declarative, not imperative. That is, they say what we want to match, but not how.
 Hence, we will also consider a more computational approach to lexical
analysis: finite automata.
 What are finite automata?
 They are very simple computing machines (so simple that there are things they cannot compute that more complex computing machines can).
 What do they compute?
 Typically, they compute membership in simple languages.
 That is, given a string, they determine whether or not
that string is in the language.
 As with regular expressions, they can be modified to find
tokens in larger sequences of symbols.
 Because they are simple, finite automata can easily be translated to machine code.
 Essentially, a finite automaton is a series of states and transitions.
 Each state represents knowledge about the part of the string seen so far.
 Each transition indicates how state changes depending on the next input character.
 We will often draw finite automata as simple graphs (states are nodes,
transitions are labeled edges).
 Each state is represented by a circle.
 Final states have a double circle.
 The start state has an incoming arrow.
 Sigma is not represented explicitly.
 Formally, a deterministic finite automaton is a five tuple <Q,Sigma,delta,q0,F> where
 Q is a set of states;
 Sigma is the base alphabet, that is, a finite set of symbols;
 delta is a transition function, a function from
state/symbol pairs to states;
 q_{0} in Q is the start state;
 F subset Q is the set of final states.
 How do we use an automaton to determine whether a string is in the language
given by that automaton? Using a matching function.
/**
 * Determine whether a string is in the language given by the
 * current automaton.
 */
public boolean inLanguage(String candidate) {
  State currentState = q0;
  for (char sym : candidate.toCharArray()) {
    currentState = delta(currentState, sym);
    if (currentState == null) {
      return false; // undefined transition: reject immediately
    }
  }
  return finalStates.contains(currentState);
}
 Alternately, one could instead end with "return the token associated with the current state" or "execute the code associated with the current state".
 As this example suggests, it is also helpful to have an explicit or implicit failure state.
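As a concrete illustration of both points (finite automata translate easily to code, and an explicit failure state is handy), here is a minimal hand-compiled sketch of the later example "strings that start and end with a", over the alphabet {a, b}. The class name, state numbering, and the assumption that the one-character string "a" counts are mine:

```java
public class StartEndA {
    public static boolean inLanguage(String s) {
        // 0 = start, 1 = last symbol was b, 2 = last symbol was a (final),
        // 3 = explicit failure state (first symbol was not a)
        int state = 0;
        for (char c : s.toCharArray()) {
            switch (state) {
                case 0: state = (c == 'a') ? 2 : 3; break;
                case 1: state = (c == 'a') ? 2 : 1; break; // non-a treated as b
                case 2: state = (c == 'a') ? 2 : 1; break;
                case 3: break; // once failed, stay failed
            }
        }
        return state == 2;
    }
}
```

Each state became a case label, so the automaton is ordinary code with no table lookups at all.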
 You may find it easier to represent some tokens with finite automata and others with regular expressions.
 They have the same computational power.
 Neither can represent the language "strings of a's and b's with equal numbers of a's and b's".
 Neither can represent palindromes.
 We'll try doing each of these as a regular expression and then as a DFA.
 "Strings that start with a, have an arbitrary number of b's, and end with a".
 "Strings that start and end with a".
 "Comments in C" (to simplify things notationally: strings that start with
ab, end with ba, and have no intervening ba's).
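Here is a minimal table-driven sketch of the first of these languages ("strings that start with a, have an arbitrary number of b's, and end with a", i.e., ab*a). The class name and state numbering are mine; a missing table entry plays the role of the implicit failure state. Note that under the ab*a reading, the one-character string "a" is rejected, since the starting and ending a's are distinct:

```java
import java.util.Map;

public class AbStarA {
    // delta encoded as a nested map: state -> (symbol -> next state).
    private static final Map<Integer, Map<Character, Integer>> DELTA = Map.of(
        0, Map.of('a', 1),            // start state: must see a first
        1, Map.of('b', 1, 'a', 2),    // reading b's; a trailing a accepts
        2, Map.of()                   // final state: no further input allowed
    );

    public static boolean inLanguage(String candidate) {
        Integer state = 0;
        for (char sym : candidate.toCharArray()) {
            state = DELTA.get(state).get(sym);
            if (state == null) {
                return false;         // undefined transition: reject
            }
        }
        return state == 2;            // accept only in the final state
    }
}
```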
 It turns out that it is interesting and useful to consider a variant of
finite automata in which
 some transitions are free, in that they do not require any symbols in order to make the transition. Typically such edges are labeled with epsilon.
 there may be multiple edges with the same label exiting the same node.
 Such finite automata are often called nondeterministic finite
automata or NFAs.
 When is a string in the language given by such an automaton? When
there is some path through the automaton that uses the
appropriate symbols and free transitions. [We'd like the computer
to guess the correct path.]
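Rather than guessing, a standard trick is to simulate the automaton by tracking the set of all states reachable so far, following free transitions via an epsilon closure. The NFA below is made up for illustration: it accepts strings over {a, b} that end with a, with two a-edges leaving state 0 and one free transition; the names and the '\0'-for-epsilon encoding are mine:

```java
import java.util.*;

public class NfaSim {
    static final char EPS = '\0'; // stands in for an epsilon label

    // Transitions: state -> symbol -> set of next states.
    // 0 -a-> {0, 1}, 0 -b-> {0}, 1 -eps-> {2}; state 2 is final.
    static final Map<Integer, Map<Character, Set<Integer>>> DELTA = Map.of(
        0, Map.of('a', Set.of(0, 1), 'b', Set.of(0)),
        1, Map.of(EPS, Set.of(2)));
    static final Set<Integer> FINALS = Set.of(2);

    // All states reachable from the given states using only free transitions.
    static Set<Integer> epsilonClosure(Set<Integer> states) {
        Set<Integer> closure = new HashSet<>(states);
        Deque<Integer> todo = new ArrayDeque<>(states);
        while (!todo.isEmpty()) {
            int q = todo.pop();
            for (int next : DELTA.getOrDefault(q, Map.of())
                                 .getOrDefault(EPS, Set.of())) {
                if (closure.add(next)) todo.push(next);
            }
        }
        return closure;
    }

    public static boolean inLanguage(String candidate) {
        Set<Integer> current = epsilonClosure(Set.of(0)); // start at state 0
        for (char sym : candidate.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int q : current) {
                next.addAll(DELTA.getOrDefault(q, Map.of())
                                 .getOrDefault(sym, Set.of()));
            }
            current = epsilonClosure(next);
        }
        return !Collections.disjoint(current, FINALS);
    }
}
```

Tracking sets of states in this way is, in essence, what the conversion from NFAs to DFAs does ahead of time.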
 Do we gain any power from this nondeterminism? Surprisingly, no.
As we'll see next class, nondeterministic finite automata can
easily be converted to deterministic finite automata.
 Why is all this important?
 Regular expressions are declarative: they describe what, not how.
 Finite automata, on the other hand, include a nice procedure for matching.
 Fortunately, with a little bit of effort, you can turn any regular expression (or set of regular expressions) into a DFA.
 Convert each regular expression to an NFA.
 Annotate the final state for each regular expression with the token type or other useful information.
 Join all of the NFAs together. That gives a new NFA.
 Convert the NFA to a DFA.
 We'll do all that over the next few classes. If you are feeling particularly
energetic, you can write a program to do it for you.