Today in CS362: Lexical Analysis

Admin

* Surveys
* Pascal programs
* Lab
* ECA

Topics

* What is a language?
* About lexical analysis
* Hand-coding lexical analysis
* Generating lexical analyzers
* Regular expressions

What is a language?

* Something you can use to express ...?
* NOT a description of a set of tasks
* PERHAPS a set of kinds of tasks the computer can accomplish
* NOT JUST something accepted by a finite automaton
* A language IS a set
* Usually we build languages from alphabets (usually denoted Sigma).
  + A language is a set of strings built from the "letters" in Sigma.
* Examples:
  + Sigma1 = { a, b }
  + Lang1 = { a, b }, "all one-character strings from Sigma1"
  + Lang2 = "all strings from Sigma1 with equal numbers of a's and b's"
    Lang2 = { empty-string, ab, ba, abba, baab, abab, baba, aabb, ... }
  + Lang3 = "all palindromes over Sigma1"
* Different ways to describe languages:
  + English text
  + List all the elements
  + Present representative elements and hope the "reader" intuits the rest:
    1, 1, 2, 3, 5, 8, 13, ...
  + We need formal mechanisms for describing these sets of strings.
* We also need to be able to do something interesting with the descriptions:
  + Check membership (see the Lang2 sketch at the end of these notes)
  + Generate elements of the language (see the L((a*)) sketch at the end
    of these notes)
  + Associate meaning with elements of the language
  + Translate into other languages
* Might also do set operations

About Lexical Analysis

* Deals with simple languages.
* The first phase of a compiler.
* Goal: group individual characters into useful and interesting tokens for
  later parts of the compiler.
* Normal model: read a sequence of characters and produce a sequence of
  tokens. That is, a function from Sigma* to Token*.
* Tends to be done in a demand-driven way: the parser asks for one token
  at a time (see the tokenizer sketch at the end of these notes).
* Three basic functions:
  + nextToken() -- get the next token
  + initialize()
  + cleanup()
* How do we get tokenizers?
  + Hand code them:

      Token nextToken(Stream str) {
        // Skip over whitespace and comments
        ...
        // Dispatch on the next character to read the next token
        char ch = str.peekChar();
        if (isAlphabetic(ch)) {
          // Read characters until you hit a non-alphabetic
          // character; group them into a word token
        } else if (isDigit(ch)) {
          // Read characters until you hit a non-digit
          // character; group them into a number token
        } // if/else
      } // nextToken(Stream)

  + Describe the tokens in a formal language and build the tokenizer from
    that formal description:
    - By hand: use standard instructions for turning formal descriptions
      into tokenizers
    - Automatically: use tokenizer generators

We need a formal mechanism for describing tokens

* Regular expressions!
* A regular expression is a formal mechanism for describing a language.
* The simplest regular expression is the regular expression for the empty
  string. We write it epsilon.
  + L(epsilon) = { "" }
* For each s in Sigma, s is a regular expression.
  + L(s) = { "s" }
* Concatenation: if r1 and r2 are regular expressions, (r1.r2) is a
  regular expression.
  + L((r1.r2)) = { concat(s1,s2) | s1 in L(r1), s2 in L(r2) }
  + Suppose L(r1) = "strings starting with a" and
    L(r2) = "strings ending with b".
    Then L((r1.r2)) = "strings starting with a and ending with b".
  + L(((a).(b))) = { "ab" }
* Alternation: if r1 and r2 are regular expressions, (r1|r2) is a regular
  expression.
  + L((r1|r2)) = L(r1) union L(r2)
  + L((a|b)) = { "a", "b" }
  + L(((a|d).c)) = { "ac", "dc" }
* We also need a mechanism for describing infinite languages.
* The Kleene star provides this mechanism. If r is a regular expression,
  (r*) is a regular expression.
  + L((r*)) = L(epsilon) union L(r) union L((r.r)) union L(((r.r).r))
    union ...
  + L((a*)) = { "", "a", "aa", "aaa", "aaaa", ... }
* Shorthand: (r+) = (r.(r*))
* Numbers = ([0123456789]+)
  (See the regular-expression sketch at the end of these notes for
  checking strings against such descriptions.)
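
Sketches in Java

The bullets above promised a few sketches. First, checking membership: a
minimal sketch of a test for Lang2 (equal numbers of a's and b's over
Sigma1 = { a, b }). The class and method names here are just for
illustration; they are not fixed by anything above.

    // A minimal sketch of a membership test for Lang2, the strings over
    // { a, b } with equal numbers of a's and b's.
    public class Lang2 {
      /** Determine whether str is in Lang2. */
      public static boolean contains(String str) {
        int balance = 0; // (number of a's seen) - (number of b's seen)
        for (char ch : str.toCharArray()) {
          if (ch == 'a') {
            balance = balance + 1;
          } else if (ch == 'b') {
            balance = balance - 1;
          } else {
            return false; // not even a string over Sigma1
          }
        }
        return balance == 0;
      }

      public static void main(String[] args) {
        System.out.println(contains(""));     // true: the empty string
        System.out.println(contains("abba")); // true
        System.out.println(contains("aab"));  // false
      }
    }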
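
Second, generating elements: a minimal sketch that prints the first few
elements of L((a*)) in order of length. (Generating an infinite language
means generating it lazily or up to some bound; the bound of five here is
an arbitrary choice.)

    // A minimal sketch of generating elements of L((a*)):
    // "", "a", "aa", "aaa", ...
    public class AStar {
      public static void main(String[] args) {
        String s = ""; // start with the empty string, the epsilon case
        for (int i = 0; i < 5; i++) {
          System.out.println("\"" + s + "\"");
          s = s + "a"; // one more a: the next element of L((a*))
        }
      }
    }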
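
Next, a minimal sketch of the demand-driven model: initialize, then call
nextToken() repeatedly until it reports end of input. The design choices
here (a PushbackReader so we can "peek" at the next character, tokens
represented as plain strings, only word and number tokens) are simplifying
assumptions, not anything fixed by the notes above.

    import java.io.IOException;
    import java.io.PushbackReader;
    import java.io.StringReader;

    public class Tokenizer {
      private PushbackReader input;

      /** initialize */
      public Tokenizer(String text) {
        this.input = new PushbackReader(new StringReader(text));
      }

      /** nextToken: return the next token, or null at end of input. */
      public String nextToken() throws IOException {
        int ch = input.read();
        // Skip over whitespace.
        while (ch != -1 && Character.isWhitespace((char) ch)) {
          ch = input.read();
        }
        if (ch == -1) {
          return null; // end of input
        }
        StringBuilder token = new StringBuilder();
        if (Character.isLetter((char) ch)) {
          // Read characters until we hit a non-alphabetic character.
          while (ch != -1 && Character.isLetter((char) ch)) {
            token.append((char) ch);
            ch = input.read();
          }
          if (ch != -1) input.unread(ch); // put back the lookahead
        } else if (Character.isDigit((char) ch)) {
          // Read characters until we hit a non-digit character.
          while (ch != -1 && Character.isDigit((char) ch)) {
            token.append((char) ch);
            ch = input.read();
          }
          if (ch != -1) input.unread(ch); // put back the lookahead
        } else {
          token.append((char) ch); // any other character stands alone
        }
        return token.toString();
      }

      /** cleanup */
      public void close() throws IOException {
        input.close();
      }
    }

For example, on the input "count = 42", successive calls to nextToken()
return "count", "=", "42", and then null.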
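
Finally, checking strings against regular expressions with a library.
This sketch uses Java's java.util.regex, whose notation extends the
notation above (no explicit concatenation dot, fewer parentheses); the
notes do not prescribe this or any particular tool.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexDemo {
      public static void main(String[] args) {
        // Alternation then concatenation: ((a|d).c)
        Pattern adc = Pattern.compile("(a|d)c");
        System.out.println(adc.matcher("ac").matches()); // true
        System.out.println(adc.matcher("dc").matches()); // true
        System.out.println(adc.matcher("bc").matches()); // false

        // Kleene star: (a*)
        Pattern aStar = Pattern.compile("a*");
        System.out.println(aStar.matcher("").matches());    // true
        System.out.println(aStar.matcher("aaa").matches()); // true

        // Numbers = ([0123456789]+)
        Pattern numbers = Pattern.compile("[0123456789]+");
        System.out.println(numbers.matcher("2024").matches()); // true
        System.out.println(numbers.matcher("").matches());     // false: + needs a digit

        // A matcher can also pull number tokens out of a longer string,
        // much as a tokenizer would.
        Matcher m = numbers.matcher("x = 42 + 7");
        while (m.find()) {
          System.out.println(m.group()); // prints 42, then 7
        }
      }
    }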