Admin
Class in 2428 on Monday
More links on course web
Class news finally updated (just replicates outlines)
Success in Phase 2 meetings
How would you describe the mental well being of
someone who is (1) Taking two other CS courses;
(2) Taking a CS MAP; (3) Taking accounting;
(4) Decides to audit this course while doing the work;
and (5) claims to spend 40 HPW in MathLAN?
Where are Erik and Mark and Adam and ...?
Overview
Review
Simplifying Regular Expressions
Common Shorthands
Limitations
Describing Tokens
Review
* A regular expression is something you can use to
describe a language: a set of strings.
* At the simplest level, you can build regular expressions
in the following ways:
+ The Kleene *, if R is a regular expression, (R*) is
a regular expression
{ epsilon } union L(R) union L(R.R) ...
+ Concatenation RS is ...
+ The empty string is a regular expression (denoting the
set of the empty string)
+ Alternations, (R|S).
L((R|S|)) = L(R) union L(S)
+ s in Sigma L(s) = { s }
Simplifying Regular Expressions
* Two letter sequences from the alphabet { a, b, c }
(((a|b)|c).((a|b)|c))
* Do we really need all of them?a
* Without parenthesis, we have ambiguity
R1: a.b|c : [((a.b)|c)] or (a.(b|c))
R2: a.b.c : ((a.b).c) or (a.(b.c))
R3: a.b*: [(a.(b*))] or ((a.b)*)
R4: R.S.T : ((R.S).T) or (R.(S.T))
* EBA's Conjencture: Concatenation is associative
+ Daren's proof: Watch those hands wave.
* Assign precedence to operations:
+ Highest: * (like exponentiation)
+ Medium: . (like multiplication)
+ Low: | (like addition)
* Don't care about associativity
* Also: Don't bother writing concatenation operation.
Common Shorthands
* + as a shorthand for "one or more"
R+ = RR*
* Set notation
[abc] is shorthand for (a|b|c)
[a-e] is shorthand for (a|b|c|d|e)
[^a-e] is shorthand for [f-z], if Sigma is "lc letters"
* Name and reuse names
digit: [0-9]
letter: [a-z]
identifier: letter(digit|letter)*
* Question mark: 0 or 1
* R? (binds tightly) is (R|epsilon)
NO RECURSION OR REPETITION!
* What is a "numeric constant" in your favorite language?
It can start with a plus or a minus or nothing?
A digit next
zero or more digits
an optional (dot plus one or more digits)
an optional (e plus ...)
an optional F|f|L|l (in C)
an optional u|U|S|s
[+-]?[0-9]+(.[0-9]+)?(e[+-]?[0-9]+)
Limitations
* Can't describe "strings of a's and b's with equal numbers
of a's and b's"
* Can't describe "palindromes"
* Test: All palindromes over the alphabet { a }
+ a(aa)* : All odd length palindromes
+ a* : All palindromes
* When is the string c1 ... cn a palindrome?
When cn ... c1 is the same as c1 ... cn
Some Fun Things to Describe
* Strings over { a, b} with exactly one b.
(a*)b(a*)
* Strings over { a, b} with exactly two bs.
(a*)b(a*)b(a*)
* Strings over { a, b} with no more than two bs?
(a*)b?(a*)b?(a*)
* Strings over { a, b} without exactly two b's?
Describing Tokens
* We start by using regular expressions to describe
the tokens in our language.
INT_CONSTANT = [+-]?(0|[1-9][0-9]*)
OCT_CONSTANT = [+-]?0[0-9]*
FLOAT_CONSTANT = [+-]?(0|[1-9][0-9]*).[0-9]+
WHILE = [Ww][Hh][Ii][Ll][Ee]
IDENTIFIER = [_A-Za-z][0-9_A-Za-z]*
Two related problems:
* The same sequence of characters can be matched by
two regexps.
0 could be an INT_CONSTANT or OCT_CONSTANT
* Two different parts of a sequence can be matched
differently by two regexps.
3212.23 could be INT_CONSTANT(321) followed by
FLOAT_CONSTATN(2.23) or FLOAT_CONSTATN(3212.23) ...
WHILE_I_AM_STILL_TEACHING+
* Two principles for lexical analysis
(1) Match the longest possible string
(2) For two equal length matches, choose the one that
comes first in the list of regexps
"a number character sequence starting with a character"