CSC362 2004S, Class 6: Representing Tokens with Regular Expressions

Admin:
* Late: Stephane, Oge
* Missing: Daniel S-S, James, Cassie (Excused)
* Don't send me .sxw files (.txt is nice)
* Questions on project, phase 1?  Does anyone need a group?
  * Email me your groups
  * Is there a set list of tokens to use? (The "official" list)
  * Should we use Sam's classes?  Certainly.  
    * You must implement rebelsky.compiler.lexer.TokenStream.java
    * You must use rebelsky.compiler.lexer.Token.java
    * You can (and probably should) use 
      rebelsky.compiler.lexer.{IntegerToken,StringToken,Identifier}.java
* Printable versions of pages often available with .pdf suffix
* Vote!
* New draft virtual community documents to be online soon
* Talk Wednesday at 4:15.

Overview:
* What is a language?
* About lexical analysis
* Lexical analyzer generators
* Regular expression basics

The traditional linguistics view: "A predetermined system of signs used to communicate (vocally) among humans."
The computer science view (Something extremely abstract): A language is (1) a set of strings (over a particular alphabet); (2) an associated *semantics* that assigns meaning to those strings.
Theoreticians often ignore (2).

What techniques do we use for describing sets of strings?
* Enumerate/list all the members. (Only works for finite sets.)
* Give a "set predicate": If a value meets the predicate, it's in the set; If a value doesn't meet the predicate, it's not.
* Give a rule (or rules) for constructing the elements of a set.
  * Whole numbers:
    * 0 is a whole number
    * The successor of any whole number is a whole number
* Use set operations on finite sets
* Computer scientists focus on predicates

It is possible to categorize languages based on how hard it is determine membership in the language.

Simplest languages: Regular languages
* Determine membership with "finite automaton" (a program with fixed memory usage)

N key ways to describe regular languages:
* A program that matches
  * Hand-written
  * Automatically generated from a description of the language
* Description that is non-computational
  "Strings of a's and b's that start and end with a"

* "A hand-written program is easier to troubleshoot"
* An automatically-generated program may be easier to read (that is, you read the description, not the program)

The most common way to describe regular languages is with "regular expressions"

* The special symbol phi is a regular expression (the empty set)
  L(phi) = { }
* The special symbol epsilon is a regular expression (the empty string)
  L(epsilon) = { epsilon }
  concat(epsilon,X) = concat(X,epsilon) = X
* For all symbols, s, in the alphabet, s is a regular expression.
  L(s), the language denoted by s, is { s }
* For all regular expressions R and S, (R)|(S) is a regular expression
  * L((R)|(S)) = L(R) union L(S)
  * L((a)|(b)) = { a, b }
* For all regular expressions, R and S, (R)(S) is a regular expression
  * L((R)(S)) = { concat(X,Y) | X in L(R) and Y in L(S) }
  * L((a)(b)) = { ab }
  * L(((a)|(b))(b)) = { ab, bb }
* For all regular expressions, R, and whole numbers k, (R)^k is a regular expression
  L((R)^0) = { epsilon }
  L((R)^1) = L(R)
  L((R)^k) = L((R)(R^(k-1)))
  L((a)^5) = { aaaaa }
  L(((a)|(b))^3) { aaa, aab, aba, abb, baa, bab, bba, bbb }
* NOTE: ALL OF THESE ARE FINITE!  HOW DO WE DESCRIBE INFINITE SETS?
  * The Kleene *
  * If R is a regular expression, (R)* is a regular expression
  * L((R)*) = R^0 union R^1 union R^2 ...
  * L((a)*) = { epsilon, a, aa, aaa, aaaa, aaaaa, .... }

How would you describe "strings of a's and b's that begin and end with a" (using regular expressions) (assume: at least length two)

(a)((((a)|(b))*)(a))
a(a|b)*a