Back to Regular Expressions, Continued. On to From Specification to NFA.

**Held** Friday, February 2, 2001

**Summary**

Today we consider finite automata, which serve as another representation of languages for lexical analysis.

**Notes**

- A few of you didn't show up to lab yesterday and didn't warn me in advance. That's a no-no.
- I'd like to know what you came up with for ``strings that neither start nor end with an a''.
- Don't forget all the cool Thursday events in the coming weeks
- Feb. 8: Summer CS Opportunities
- Feb. 15: Summer Math Research Opportunities
- Feb. 22: Pete Broadwell, Elizabeth Hulse, Chris Kern, and Erin Nichols talk about Web Raveler.
- March 1: Josh Vickery talks about annotations. Rachel Heck, Kumail Nanjiani, Ellie Raulerson, and Isabel Staicut talk about invading privacy.
- March 8: Lea Wittie, alumni scholar, talks about debugging.

- Don't forget the Grinnell Entrepreneurs' forthcoming conference on Saturday, Feb. 17th.

**Overview**

- Context: What we're doing and why
- Regular expressions
- Finite automata
- Deterministic
- Nondeterministic

- Looking ahead

- How will we learn about lexical analysis in this class?
- We will conclude our discussion of *regular expressions*.
- We will then consider how we might combine regular expressions to describe all of the tokens of a language. [Today]
- Because there seems to be a gap between regular expressions and computation, we will then consider *finite automata*, a more computational way to express tokens and, more importantly, the process of tokenizing. [Today]
- Although finite automata are covered in some depth in CS341, we will spend some time considering some issues in the language of finite automata. [Today]
- Since regular expressions are often easier to develop and write than finite automata, we will consider how to automatically translate regular expressions to finite automata. [Monday]
- We will spend lab next week exploring a tool for building lexical analyzers, such as `lex`, `Jlex`, and `flex`.

- As you might guess, we use regular expressions during lexical analysis by writing regular expressions to describe the tokens of our language.
- An identifier is (more or less) given by *[a-zA-Z][a-zA-Z0-9]**.
- The keyword `if` is given by *if*.
- Integers are almost given by *[+-]?[0-9]+*.
- Are there problems with using these definitions for lexical analysis? Yes.
- One problem is that different regular expressions can match the same string. For example, `if` is matched by the regular expression for the if keyword. However, it is also matched by the definition of an identifier.
- Most languages don't accept keywords as identifiers.
- Some do, but that can make it easier to write unreadable code, as in the following (which I've formatted for more readability; keywords are in bold): **if** then = if **then** if = else **else** if = then

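The overlap above is easy to check directly with Java's regex library. A quick illustration (the patterns are just the sketches from the bullets above, not a complete token specification):

```java
import java.util.regex.Pattern;

public class TokenOverlap {
    public static void main(String[] args) {
        Pattern keyword = Pattern.compile("if");
        Pattern identifier = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
        Pattern integer = Pattern.compile("[+-]?[0-9]+");

        // "if" is matched by both the keyword pattern and the
        // identifier pattern -- the ambiguity described above.
        System.out.println(keyword.matcher("if").matches());     // true
        System.out.println(identifier.matcher("if").matches());  // true

        // "if3" matches the identifier pattern as a single token,
        // even though it could also be read as "if" followed by "3".
        System.out.println(identifier.matcher("if3").matches()); // true
        System.out.println(integer.matcher("if3").matches());    // false
    }
}
```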
- In addition, regular expressions are normally used for the membership test (is *this string* in *this language*?). For membership testing, we already know the bounds of a string. During lexical analysis, we may not know these bounds. For example, is `if3` one token or two? How many tokens are there in `if 3 then`?
- There are languages in which the answer to the second question is ``one''.

- Two basic principles are used to answer these questions.
- The *principle of longest substring* says that if regular expression A indicates that the first n characters of the string form a token, regular expression B indicates that the first m characters of the string form a token, and n is less than m, then the first m characters of the string form a token of type B.
- This principle works when we are deciding between tokens of differing lengths, but what if we have two tokens of the same length? The *principle of rule priority* says that the highest-priority rule (regular expression plus name) is used. Which one has highest priority? Usually the first.

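The two principles combine naturally in a scanner loop. A minimal sketch, using a hypothetical rule set (the class and method names here are illustrative, not from any particular lexer generator):

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunch {
    // Rules in priority order: earlier rules win ties in length.
    record Rule(String name, Pattern pattern) {}

    static final List<Rule> RULES = List.of(
        new Rule("IF", Pattern.compile("if")),
        new Rule("ID", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*")),
        new Rule("NUM", Pattern.compile("[+-]?[0-9]+")));

    /** Name of the token at the start of the input, or null if none matches. */
    static String nextToken(String input) {
        String bestName = null;
        int bestLength = 0;
        for (Rule rule : RULES) {
            Matcher m = rule.pattern().matcher(input);
            // Principle of longest substring: a strictly longer match wins.
            // On a tie, the earlier (higher-priority) rule is kept --
            // the principle of rule priority.
            if (m.lookingAt() && m.end() > bestLength) {
                bestName = rule.name();
                bestLength = m.end();
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        System.out.println(nextToken("if3"));  // ID  (longest substring)
        System.out.println(nextToken("if "));  // IF  (tie broken by priority)
        System.out.println(nextToken("42+x")); // NUM
    }
}
```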
- Regular expressions provide a simple, easy to understand, and elegant
mechanism for describing the tokens of a language.
- Unfortunately, it is difficult to compute with regular expressions.
- Hence, we will consider a more computational approach to lexical
analysis:
*finite automata*.

- What are finite automata?
- They are very simple computing machines (simple enough that there are things that they cannot compute that more complex computing machines can compute).

- What do they compute?
- Typically, they compute membership in simple languages.
- That is, given a string, they determine whether or not that string is in the language.
- As with regular expressions, they can be modified to find tokens in larger sequences of symbols.

- Because they are simple, finite automata can easily be translated to machine code.
- Formally, a *deterministic finite automaton* is a five-tuple <Q, Sigma, delta, q0, F> where
- Q is a set of *states*;
- Sigma is the base *alphabet*, that is, a finite set of symbols;
- delta is a *transition function*; it is a function from state/symbol pairs to states;
- q0 in Q is the *start state*;
- F subset Q is the set of *final states*.
- How do we use an automaton to determine whether a string is in the language given by that automaton? Using a matching function.

```
/**
 * Determine whether a string is in the language given by the
 * current automaton.
 */
public boolean inLanguage(String candidate)
begin
  State current_state = q0;
  for each symbol sym in candidate
    current_state = delta(current_state, sym)
    if current_state is undefined
      return false
  end for
  return (current_state is a final state);
end inLanguage

- As this suggests, it is also helpful to have an explicit or implicit *failure state*.
- Typically, we represent finite automata pictorially.
- Each state is represented by a circle.
- The transition function, delta, is represented by directed and labeled edges between states. For example, if (q1,b,q2) is in delta, then there is an edge from q1 to q2 labeled b.
- Final states have a double circle.
- The start state has an incoming arrow.
- Sigma is not represented explicitly.

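The matching function above can be made concrete. A minimal Java sketch, assuming states are numbered and delta is a partial function whose missing entries stand for the implicit failure state (the automaton here accepts *[0-9]+*, an illustrative choice):

```java
import java.util.Set;

public class Dfa {
    // States: 0 = start, 1 = seen at least one digit.
    static final int START = 0;
    static final Set<Integer> FINAL_STATES = Set.of(1);

    // delta as a partial function: null means "undefined", i.e., the
    // implicit failure state.
    static Integer delta(int state, char sym) {
        if (Character.isDigit(sym) && (state == 0 || state == 1)) {
            return 1;
        }
        return null;
    }

    /** Determine whether a string is in the language of this automaton. */
    static boolean inLanguage(String candidate) {
        Integer current = START;
        for (char sym : candidate.toCharArray()) {
            current = delta(current, sym);
            if (current == null) {
                return false; // fell into the failure state
            }
        }
        return FINAL_STATES.contains(current);
    }

    public static void main(String[] args) {
        System.out.println(inLanguage("2001")); // true
        System.out.println(inLanguage(""));     // false (no digits seen)
        System.out.println(inLanguage("20a1")); // false
    }
}
```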
- You may find it easier to represent some tokens with finite automata and others with regular expressions.
- However, they have the same computational power.
- Neither can represent the language ``strings of `a`'s and `b`'s with equal numbers of `a`'s and `b`'s''.

- It turns out that it's interesting and useful to consider a variant of
finite automata in which
- some transitions are ``free'', in that they do not require any symbols in order to make the transition. Typically such edges are labeled with epsilon.
- there may be multiple edges with the same label exiting the same node

- Such finite automata are often called *nondeterministic finite automata* or NFAs.
- When is a string in the language given by such an automaton? When there is *some path* through the automaton that uses the appropriate symbols and free transitions. [We'd like the computer to guess the correct path.]
- Do we gain any power from this nondeterminism? Surprisingly, no. As we'll see next class, nondeterministic finite automata can easily be converted to deterministic finite automata.

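The ``guessing'' can be simulated deterministically by tracking the *set* of states the NFA might be in after each symbol. A sketch with a made-up example automaton (strings over {a, b} ending in ``ab''); the epsilon marker and edge encoding are assumptions for illustration:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Nfa {
    static final char EPS = '\0'; // marker for a free (epsilon) transition
    record Edge(int from, char label, int to) {}

    static final List<Edge> DELTA = List.of(
        new Edge(0, 'a', 0),
        new Edge(0, 'b', 0),
        new Edge(0, 'a', 1),  // two edges labeled 'a' leave state 0
        new Edge(1, 'b', 2),
        new Edge(2, EPS, 3)); // a free transition into the final state

    static final Set<Integer> FINAL_STATES = Set.of(3);

    // Add every state reachable by free transitions alone.
    static Set<Integer> epsilonClosure(Set<Integer> states) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> closure = new HashSet<>(states);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (Edge e : DELTA) {
                if (e.from() == q && e.label() == EPS && closure.add(e.to())) {
                    work.push(e.to());
                }
            }
        }
        return closure;
    }

    /** Accept iff SOME path through the NFA spells out the candidate. */
    static boolean inLanguage(String candidate) {
        Set<Integer> current = epsilonClosure(Set.of(0));
        for (char sym : candidate.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (Edge e : DELTA) {
                if (current.contains(e.from()) && e.label() == sym) {
                    next.add(e.to());
                }
            }
            current = epsilonClosure(next);
        }
        for (int q : current) {
            if (FINAL_STATES.contains(q)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inLanguage("aab")); // true
        System.out.println(inLanguage("aba")); // false
    }
}
```

This set-of-states idea is exactly what the NFA-to-DFA conversion mentioned above builds on.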
- Why is all this important?
- Because regular expressions (which we like to write) are declarative, not imperative. That is, they say what we want to match, but not how.
- Finite automata, on the other hand, include a nice procedure for matching.
- Fortunately, with a little bit of effort, you can turn any regular expression (or set of regular expressions) into a DFA.
- First, you convert each regular expression to an NFA.
- Then you join the NFA's together, giving a new NFA.
- Then you convert the NFA to a DFA.
- We'll do all that tomorrow.

Monday, 22 January 2001

- Created as a blank outline.

Thursday, 1 February 2001

- Filled in the details, many of which were taken from outline 6 of the previous version of this course.
- Updated formatting.


**Disclaimer**:
I usually create these pages on the fly. This means that they
are rarely proofread and may contain bad grammar and incorrect details.
It also means that I may update them regularly (see the history for
more details). Feel free to contact me with any suggestions for changes.

This page was generated by Siteweaver on Mon Apr 30 10:51:47 2001.

This page may be found at `http://www.cs.grinnell.edu/~rebelsky/Courses/CS362/2001S/outline.06.html`.

The source was last modified Thu Feb 1 20:33:44 2001.