Held: Friday, 6 February 2004
Summary:
Today we consider finite automata, which serve as another, more
computational, representation of data for lexical analysis.
Related Pages:
Notes:
- Monday at 2:15 in 2417, there is a talk entitled Beyond Pizza and Code: The ACM Code of Ethics. Extra credit for attending.
- Monday at 4:30, there is a talk entitled GUI: More than just a pretty interface. Extra credit for attending.
- I understand that there are questions on my code. I'd appreciate it if you'd share them with me.
Overview:
- Finite automata
- Deterministic Finite Automata
- Examples
- Nondeterministic Finite Automata
- Looking ahead
- Regular expressions provide a simple, easy to understand, and elegant mechanism for describing the tokens of a language.
- Unfortunately, it is difficult to compute with regular expressions.
- Hence, we will consider a more computational approach to lexical
analysis: finite automata.
- What are finite automata?
- They are very simple computing machines (simple enough that there are things that they cannot compute that more complex computing machines can compute).
- What do they compute?
- Typically, they compute membership in simple languages.
- That is, given a string, they determine whether or not
that string is in the language.
- As with regular expressions, they can be modified to find
tokens in larger sequences of symbols.
- Because they are simple, finite automata can easily be translated to machine code.
- Essentially, a finite automaton is a series of states and transitions.
- Each state represents knowledge about the part of the string seen so far.
- Each transition indicates how state changes depending on the next input character.
- Formally, a deterministic finite automaton is a five tuple <Q,Sigma,delta,q0,F> where
- Q is a set of states;
- Sigma is the base alphabet, that is a finite set of symbols;
- delta is a transition function, a function from
state/symbol pairs to states;
- q_{0} in Q is the start state;
- F subset Q is the set of final states.
- Typically, we represent finite automata pictorially.
- Each state is represented by a circle.
- The transition function, delta, is represented by directed and
labeled edges between states. For example, if (q1,b,q2) is in
delta, then there is an edge from q1 to q2 labeled b.
- Final states have a double circle.
- The start state has an incoming arrow.
- Sigma is not represented explicitly.
- How do we use an automaton to determine whether a string is in the language
given by that automaton? Using a matching function.
/**
* Determine whether a string is in the language given by the
* current automaton.
*/
public boolean inLanguage(String candidate)
begin
State current_state = q0;
for each symbol sym in candidate
current_state = delta(current_state ,sym)
if current_state is undefined return false
end for
return (current_state is a final state);
end inLanguage
- As this example suggests, it is also helpful to have an explicit or implicit failure state.
- You may find it easier to represent some tokens with finite automata and others with regular expressions.
- They have the same computational power.
- Neither can represent the language
strings of a
's and
b
's with equal numbers of a
's and
b
's
.
- We'll try doing each of these as a regular expression and then as a DFA.
- "Strings that start with a, have an arbitrary number of b's and end with a"
- "Strings that start and end with a".
- "Comments in C" (to simplify things notationally: strings that start with
ab, end with ba, and have no intervening ba's).
- It turns out that it's interesting and useful to consider a variant of
finite automata in which
- some transitions are
free
, in that they do not require
any symbols in order to make the transition. Typically
such edges are labeled with epsilon.
- there may be multiple edges with the same label exiting
the same node
- Such finite automata are often called nondeterministic finite
automata or NFAs.
- When is a string in the language given by such an automaton? When
there is some path through the automaton that uses the
appropriate symbols and free transitions. [We'd like the computer
to guess the correct path.]
- Do we gain any power from this nondeterminism? Surprisingly, no.
As we'll see next class, nondeterministic finite automata can
easily be converted to deterministic finite automata.
- Why is all this important?
- Because regular expressions (which we like to write), are declarative, not imperative. That is, they say what we want to match, but now how.
- Finite automata, on the other hand, include a nice procedure for matching.
- Fortunately, with a little bit of effort, you can turn any regular expression (or set of regular expressions) into a DFA.
- First, you convert each regular expression to an NFA.
- Then you join the NFA's together, giving a new NFA.
- Then you convert the NFA to a DFA.
- We'll do all that in the next few classes.