Held Monday, September 9, 2002
Summary
Today we consider finite automata, which serve as another, more
computational, representation of data for lexical analysis.
Notes
- In lab tomorrow, we'll explore regular expressions in the Unix
environment.
- In preparation for the first phase of your compiler, you should look
for a list of tokens in Pascal.
Due
- Regular expression for
strings that are not 'if'
(or strings that are not 'while'
)
(or strings that are neither 'if' nor 'while'
).
Overview
- Finite automata
- Deterministic Finite Automata
- Examples
- Nondeterministic Finite Automata
- Looking ahead
- Regular expressions provide a simple, easy to understand, and elegant
mechanism for describing the tokens of a language.
- Unfortunately, it is difficult to compute with regular expressions.
- Hence, we will consider a more computational approach to lexical
analysis: finite automata.
- What are finite automata?
- They are very simple computing machines
(simple enough that there are things that they cannot compute that
more complex computing machines can compute).
- What do they compute?
- Typically, they compute membership in simple languages.
- That is, given a string, they determine whether or not
that string is in the language.
- As with regular expressions, they can be modified to find
tokens in larger sequences of symbols.
- Because they are simple,
finite automata can easily be translated to machine code.
- Essentially, a finite automaton is a series of states and transitions.
- Each state represents knowledge about the part of the string seen
so far.
- Each transition indicates how state changes depending on the next
input character.
- Formally, a deterministic finite automaton is a five tuple
<Q,Sigma,delta,q0,F> where
- Q is a set of states;
- Sigma is the base alphabet, that is a finite set of symbols;
- delta is a transition function, it is a function from
state/symbol pairs to states;
- q in Q is the start state;
- F subset Q is the set of final states.
- Typically, we represent finite automata pictorially.
- Each state is represented by a circle.
- The transition function, delta, is represented by directed and
labeled edges between states. For example, if (q1,b,q2) is in
delta, then there is an edge from q1 to q2 labeled b.
- Final states have a double circle.
- The start state has an incoming arrow.
- Sigma is not represented explicitly.
- How do we use an automaton to determine whether a string is in the language
given by that automaton? Using a matching function.
/**
* Determine whether a string is in the language given by the
* current automaton.
*/
public boolean inLanguage(String candidate)
begin
State current_state = q0;
for each symbol sym in candidate
current_state = delta(current_state ,sym)
if current_state is undefined return false
end for
return (current_state is a final state);
end inLanguage
- As this suggests, it is also helpful to have an explicit or implicit
failure state.
- You may find it easier to represent some tokens with finite
automata and others with regular expressions.
- They have the same computational power.
- Neither can represent the language
strings of a
's and
b
's with equal numbers of a
's and
b
's
.
- We'll try doing each of these as a regular expression and then as a
DFA.
- "Strings that start with a, have an arbitrary number of b's and end with a"
- "Strings that start and end with a".
- "Comments in C" (to simplify things notationally: strings that start with
ab, end with ba, and have no intervening ba's).
- It turns out that it's interesting and useful to consider a variant of
finite automata in which
- some transitions are
free
, in that they do not require
any symbols in order to make the transition. Typically
such edges are labeled with epsilon.
- there may be multiple edges with the same label exiting
the same node
- Such finite automata are often called nondeterministic finite
automata or NFAs.
- When is a string in the language given by such an automaton? When
there is some path through the automaton that uses the
appropriate symbols and free transitions. [We'd like the computer
to guess the correct path.]
- Do we gain any power from this nondeterminism? Surprisingly, no.
As we'll see next class, nondeterministic finite automata can
easily be converted to deterministic finite automata.
- Why is all this important?
- Because regular expressions (which we like to write), are
declarative, not imperative. That is, they say what we want
to match, but now how.
- Finite automata, on the other hand, include a nice procedure
for matching.
- Fortunately, with a little bit of effort, you can turn any regular
expression (or set of regular expressions) into a DFA.
- First, you convert each regular expression to an NFA.
- Then you join the NFA's together, giving a new NFA.
- Then you convert the NFA to a DFA.
- We'll do all that in the next few classes.
Thursday, 29 August 2002
- First version, based somewhat on outlines from
CS362 2001S.
Monday, 9 September 2002
- Removed section on using regular expressions for lexical analysis
(covered in a slightly different format in Friday's class).
- Updated wording somewhat.