CSC362 2004S, Class 6: Representing Tokens with Regular Expressions Admin: * Late: Stephane, Oge * Missing: Daniel S-S, James, Cassie (Excused) * Don't send me .sxw files (.txt is nice) * Questions on project, phase 1? Does anyone need a group? * Email me your groups * Is there a set list of tokens to use? (The "official" list) * Should we use Sam's classes? Certainly. * You must implement rebelsky.compiler.lexer.TokenStream.java * You must use rebelsky.compiler.lexer.Token.java * You can (and probably should) use rebelsky.compiler.lexer.{IntegerToken,StringToken,Identifier}.java * Printable versions of pages often available with .pdf suffix * Vote! * New draft virtual community documents to be online soon * Talk Wednesday at 4:15. Overview: * What is a language? * About lexical analysis * Lexical analyzer generators * Regular expression basics The traditional linguistics view: "A predetermined system of signs used to communicate (vocally) among humans." The computer science view (Something extremely abstract): A language is (1) a set of strings (over a particular alphabet); (2) an associated *semantics* that assigns meaning to those strings. Theoreticians often ignore (2). What techniques do we use for describing sets of strings? * Enumerate/list all the members. (Only works for finite sets.) * Give a "set predicate": If a value meets the predicate, it's in the set; If a value doesn't meet the predicate, it's not. * Give a rule (or rules) for constructing the elements of a set. * Whole numbers: * 0 is a whole number * The successor of any whole number is a whole number * Use set operations on finite sets * Computer scientists focus on predicates It is possible to categorize languages based on how hard it is determine membership in the language. Simplest languages: Regular languages * Determine membership with "finite automaton" (a program with fixed memory usage) N key ways to describe regular languages: * A program that matches * Hand-written * Automatically generated from a description of the language * Description that is non-computational "Strings of a's and b's that start and end with a" * "A hand-written program is easier to troubleshoot" * An automatically-generated program may be easier to read (that is, you read the description, not the program) The most common way to describe regular languages is with "regular expressions" * The special symbol phi is a regular expression (the empty set) L(phi) = { } * The special symbol epsilon is a regular expression (the empty string) L(epsilon) = { epsilon } concat(epsilon,X) = concat(X,epsilon) = X * For all symbols, s, in the alphabet, s is a regular expression. L(s), the language denoted by s, is { s } * For all regular expressions R and S, (R)|(S) is a regular expression * L((R)|(S)) = L(R) union L(S) * L((a)|(b)) = { a, b } * For all regular expressions, R and S, (R)(S) is a regular expression * L((R)(S)) = { concat(X,Y) | X in L(R) and Y in L(S) } * L((a)(b)) = { ab } * L(((a)|(b))(b)) = { ab, bb } * For all regular expressions, R, and whole numbers k, (R)^k is a regular expression L((R)^0) = { epsilon } L((R)^1) = L(R) L((R)^k) = L((R)(R^(k-1))) L((a)^5) = { aaaaa } L(((a)|(b))^3) { aaa, aab, aba, abb, baa, bab, bba, bbb } * NOTE: ALL OF THESE ARE FINITE! HOW DO WE DESCRIBE INFINITE SETS? * The Kleene * * If R is a regular expression, (R)* is a regular expression * L((R)*) = R^0 union R^1 union R^2 ... * L((a)*) = { epsilon, a, aa, aaa, aaaa, aaaaa, .... } How would you describe "strings of a's and b's that begin and end with a" (using regular expressions) (assume: at least length two) (a)((((a)|(b))*)(a)) a(a|b)*a