CSC362 2004S, Class 6: Representing Tokens with Regular Expressions
Admin:
* Late: Stephane, Oge
* Missing: Daniel S-S, James, Cassie (Excused)
* Don't send me .sxw files (.txt is nice)
* Questions on project, phase 1? Does anyone need a group?
* Email me your groups
* Is there a set list of tokens to use? (The "official" list)
* Should we use Sam's classes? Certainly.
* You must implement rebelsky.compiler.lexer.TokenStream.java
* You must use rebelsky.compiler.lexer.Token.java
* You can (and probably should) use
rebelsky.compiler.lexer.{IntegerToken,StringToken,Identifier}.java
* Printable versions of pages often available with .pdf suffix
* Vote!
* New draft virtual community documents to be online soon
* Talk Wednesday at 4:15.
Overview:
* What is a language?
* About lexical analysis
* Lexical analyzer generators
* Regular expression basics
The traditional linguistics view: "A predetermined system of signs used to communicate (vocally) among humans."
The computer science view (Something extremely abstract): A language is (1) a set of strings (over a particular alphabet); (2) an associated *semantics* that assigns meaning to those strings.
Theoreticians often ignore (2).
What techniques do we use for describing sets of strings?
* Enumerate/list all the members. (Only works for finite sets.)
* Give a "set predicate": If a value meets the predicate, it's in the set; If a value doesn't meet the predicate, it's not.
* Give a rule (or rules) for constructing the elements of a set.
* Whole numbers:
* 0 is a whole number
* The successor of any whole number is a whole number
* Use set operations on finite sets
* Computer scientists focus on predicates
It is possible to categorize languages based on how hard it is determine membership in the language.
Simplest languages: Regular languages
* Determine membership with "finite automaton" (a program with fixed memory usage)
N key ways to describe regular languages:
* A program that matches
* Hand-written
* Automatically generated from a description of the language
* Description that is non-computational
"Strings of a's and b's that start and end with a"
* "A hand-written program is easier to troubleshoot"
* An automatically-generated program may be easier to read (that is, you read the description, not the program)
The most common way to describe regular languages is with "regular expressions"
* The special symbol phi is a regular expression (the empty set)
L(phi) = { }
* The special symbol epsilon is a regular expression (the empty string)
L(epsilon) = { epsilon }
concat(epsilon,X) = concat(X,epsilon) = X
* For all symbols, s, in the alphabet, s is a regular expression.
L(s), the language denoted by s, is { s }
* For all regular expressions R and S, (R)|(S) is a regular expression
* L((R)|(S)) = L(R) union L(S)
* L((a)|(b)) = { a, b }
* For all regular expressions, R and S, (R)(S) is a regular expression
* L((R)(S)) = { concat(X,Y) | X in L(R) and Y in L(S) }
* L((a)(b)) = { ab }
* L(((a)|(b))(b)) = { ab, bb }
* For all regular expressions, R, and whole numbers k, (R)^k is a regular expression
L((R)^0) = { epsilon }
L((R)^1) = L(R)
L((R)^k) = L((R)(R^(k-1)))
L((a)^5) = { aaaaa }
L(((a)|(b))^3) { aaa, aab, aba, abb, baa, bab, bba, bbb }
* NOTE: ALL OF THESE ARE FINITE! HOW DO WE DESCRIBE INFINITE SETS?
* The Kleene *
* If R is a regular expression, (R)* is a regular expression
* L((R)*) = R^0 union R^1 union R^2 ...
* L((a)*) = { epsilon, a, aa, aaa, aaaa, aaaaa, .... }
How would you describe "strings of a's and b's that begin and end with a" (using regular expressions) (assume: at least length two)
(a)((((a)|(b))*)(a))
a(a|b)*a