Held Monday, January 29, 2001
Summary
Today we begin our consideration of lexical analysis through a visit
to the regular expressions that are typically used to denote lexemes.
Notes
- I've finally set up the blackboard discussion board for posting your
sample programs (and comments on them).
- I probably won't be available this afternoon (day care was cancelled
because of the snow and Michelle needs help with the kids).
- Rable rousing: Grinnell entrepreneurs are violating basic Grinnell
precepts.
Overview
- The process of lexical analysis
- Lexical analyzer generators
- Regular expressions
- Lexical analysis is traditionally the first ``real'' step in compilation.
During lexical analysis, one identifies the simple tokens
(also called lexemes) that make up
a program.
- What are these tokens? Things like identifiers, particular keywords,
symbols, and such.
- Some tokens have associated semantic values, such as the
name of an identifier or the value of an integer.
- During the tokenizing phase, a compiler converts a sequence of characters
(sometimes thought of as a sequence of bytes) to a sequence of tokens.
- During this conversion we often eliminate ``unnecessary'' components, such
as whitespace (spaces, tabs, newlines) and comments.
- However, there are reasons to preserve some related information, such
as the line number (used in printing error messages).
- There are many techniques for writing lexical analyzers (sometimes
called lexers), including
- Hand coding (as we did in class)
- Relying on lexical analyzer generators
- Relying on parser generators
- In general, it's best to use a lexical analyzer generator because
it gives you the best performance.
- Lexical analyzer generators typically take as input a description
of the various tokens in the language.
- They output code that will read characters from some input and
then generate tokens.
- Often, lexers operate on a demand-driven basis: when you ask for the
next token, they read enough input to get it.
- They may also provide a
peek
operation.
- Regular expressions can only describe relatively simple languages,
but they permit very efficient membership tests and matching.
- Provlem: need a language to describe regular expressions, but
we're only now learning how to describe languages, which leads to
an interesting problem in recursion.
- We'll use an informal description of regular
expressions.
- Like all languages, the language of regular expressions is based
on an underlying alphabet, Sigma.
- We'll denote the set of utterances (i.e., the langauge)
a regular expression denotes by L(exp).
- There is a special symbol, epsilon, not in Sigma.
- epsilon is written as the greek letter epsilon
- L(epsilon) is the set of the empty string, { "" }.
- Any single symbol in Sigma is a regular expression.
- The regular expression for that symbol is that symbol
- For each symbol s in Sigma, L(s) = { s }.
- The concatenation of any two regular expressions is a regular
expression denoting the combinations of strings from those two
regular expressions.
- If R and S are regular expressions, the concatenation of R and S
is written (R.S)
- L((R.S)) = { concat(x,y) | x in L(R), y in L(S) }
- The concatenation of any two strings is the two strings written in
sequence with no intervening space.
- concat("hello", "world") = "helloworld"
- epsilon is the identity string for concatenation
- concat(epsilon,s) = concat(s,epsilon) = s
- We sometimes will want shorthand
- (R.R) may also be written as (R)^{2}; ((R.R).(R)) as
(R)^{3}; and so on
and so forth.
- The alternation of any two regular expressions is a regular
expression denoting strings in either language.
- If R and S are regular expressions, the alternation of R and S is
written (R|S).
- L((R|S)) = L(R) union L(S)
- If R is a regular expression, then the Kleene star of R is a regular
expression.
- If R is a regular expression, the Kleene star of R is written
(R*).
- L((R*)) = L(epsilon) union L((R)^{1}) union
L((R)^{2}) ...
- Because the parentheses are cumbersome, you may remove them if the
meaning is obvious without them. In addition, we often don't write
the concatenation symbol.
- Kleene star has highest precedence; concatenation next; alternation
least.
- ab* is (a.(b*)) and not ((a.b)*)
- a|bc is (a|(b.c)) and not ((a|b).c)
- There are a number of shorthands for regular expression. None add
any expressive power and all can be ``compiled'' into normal regular
expressions. These include
- A postfix plus sign (+) for ``at least one instance''. a+ is
(a.a*).
- Brackets for alternation of larger sets. [abc] is a|b|c.
[a-c] is also a|b|c.
- A postfix question mark for ``optional''. a? is shorthand for
(a|epsilon).
- What are some sample regular expressions?
- All strings of a's and b's: (a+b)*
- Strings of a's and b's with exactly one b: a*ba*
- Strings of one or more a's: aa* (also a*a, also a*aa*)
Monday, 22 January 2001
- Created as a blank outline.
Monday, 29 January 2001