Admin
  Class in 2428 on Monday
  More links on course web
  Class news finally updated (just replicates outlines)
  Success in Phase 2 meetings
  How would you describe the mental well being of 
    someone who is (1) Taking two other CS courses; 
    (2) Taking a CS MAP; (3) Taking accounting; 
    (4) Decides to audit this course while doing the work;
    and (5) claims to spend 40 HPW in MathLAN?
  Where are Erik and Mark and Adam and ...?

Overview
  Review
  Simplifying Regular Expressions
  Common Shorthands
  Limitations
  Describing Tokens

Review

* A regular expression is something you can use to
  describe a language: a set of strings.
* At the simplest level, you can build regular expressions
  in the following ways:
  + The Kleene *, if R is a regular expression, (R*) is
    a regular expression 
    { epsilon } union L(R) union L(R.R) ...
  + Concatenation RS is ...
  + The empty string is a regular expression (denoting the
    set of the empty string)
  + Alternations, (R|S).
    L((R|S|)) = L(R) union L(S)
  + s in Sigma  L(s) = { s }

Simplifying Regular Expressions

* Two letter sequences from the alphabet { a, b, c }
  (((a|b)|c).((a|b)|c))
* Do we really need all of them?a
* Without parenthesis, we have ambiguity
  R1: a.b|c : [((a.b)|c)] or (a.(b|c))
  R2: a.b.c : ((a.b).c) or (a.(b.c))
  R3: a.b*: [(a.(b*))] or ((a.b)*)
  R4: R.S.T : ((R.S).T) or (R.(S.T))
* EBA's Conjencture: Concatenation is associative
  + Daren's proof: Watch those hands wave.
* Assign precedence to operations:
  + Highest: * (like exponentiation)
  + Medium: . (like multiplication)
  + Low: | (like addition)
* Don't care about associativity
* Also: Don't bother writing concatenation operation.

Common Shorthands

* + as a shorthand for "one or more"
  R+ = RR*
* Set notation
  [abc] is shorthand for (a|b|c)
  [a-e] is shorthand for (a|b|c|d|e)
  [^a-e] is shorthand for [f-z], if Sigma is "lc letters"
* Name and reuse names
digit: [0-9]
letter: [a-z]
identifier: letter(digit|letter)*
* Question mark: 0 or 1
* R? (binds tightly) is (R|epsilon)
NO RECURSION OR REPETITION!

* What is a "numeric constant" in your favorite language?
  It can start with a plus or a minus or nothing?
  A digit next
  zero or more digits
  an optional (dot plus one or more digits)
  an optional (e plus ...)
  an optional F|f|L|l  (in C)
  an optional u|U|S|s
  [+-]?[0-9]+(.[0-9]+)?(e[+-]?[0-9]+)

Limitations
* Can't describe "strings of a's and b's with equal numbers
  of a's and b's"
* Can't describe "palindromes"

* Test: All palindromes over the alphabet { a }
  + a(aa)* : All odd length palindromes
  + a* : All palindromes
* When is the string c1 ... cn a palindrome?
  When cn ... c1 is the same as c1 ... cn

Some Fun Things to Describe
* Strings over { a, b} with exactly one b.
  (a*)b(a*)
* Strings over { a, b} with exactly two bs.
  (a*)b(a*)b(a*)
* Strings over { a, b} with no more than two bs?
  (a*)b?(a*)b?(a*)
* Strings over { a, b} without exactly two b's?

Describing Tokens
* We start by using regular expressions to describe
  the tokens in our language.
INT_CONSTANT = [+-]?(0|[1-9][0-9]*)
OCT_CONSTANT = [+-]?0[0-9]*
FLOAT_CONSTANT = [+-]?(0|[1-9][0-9]*).[0-9]+
WHILE = [Ww][Hh][Ii][Ll][Ee]
IDENTIFIER = [_A-Za-z][0-9_A-Za-z]*

Two related problems:
* The same sequence of characters can be matched by
  two regexps.
  0 could be an INT_CONSTANT or OCT_CONSTANT
* Two different parts of a sequence can be matched
  differently by two regexps.
  3212.23  could be INT_CONSTANT(321) followed by
  FLOAT_CONSTATN(2.23) or FLOAT_CONSTATN(3212.23) ...
  WHILE_I_AM_STILL_TEACHING+
* Two principles for lexical analysis
  (1) Match the longest possible string
  (2) For two equal length matches, choose the one that
      comes first in the list of regexps

"a number character sequence starting with a character"