Admin Class in 2428 on Monday More links on course web Class news finally updated (just replicates outlines) Success in Phase 2 meetings How would you describe the mental well being of someone who is (1) Taking two other CS courses; (2) Taking a CS MAP; (3) Taking accounting; (4) Decides to audit this course while doing the work; and (5) claims to spend 40 HPW in MathLAN? Where are Erik and Mark and Adam and ...? Overview Review Simplifying Regular Expressions Common Shorthands Limitations Describing Tokens Review * A regular expression is something you can use to describe a language: a set of strings. * At the simplest level, you can build regular expressions in the following ways: + The Kleene *, if R is a regular expression, (R*) is a regular expression { epsilon } union L(R) union L(R.R) ... + Concatenation RS is ... + The empty string is a regular expression (denoting the set of the empty string) + Alternations, (R|S). L((R|S|)) = L(R) union L(S) + s in Sigma L(s) = { s } Simplifying Regular Expressions * Two letter sequences from the alphabet { a, b, c } (((a|b)|c).((a|b)|c)) * Do we really need all of them?a * Without parenthesis, we have ambiguity R1: a.b|c : [((a.b)|c)] or (a.(b|c)) R2: a.b.c : ((a.b).c) or (a.(b.c)) R3: a.b*: [(a.(b*))] or ((a.b)*) R4: R.S.T : ((R.S).T) or (R.(S.T)) * EBA's Conjencture: Concatenation is associative + Daren's proof: Watch those hands wave. * Assign precedence to operations: + Highest: * (like exponentiation) + Medium: . (like multiplication) + Low: | (like addition) * Don't care about associativity * Also: Don't bother writing concatenation operation. Common Shorthands * + as a shorthand for "one or more" R+ = RR* * Set notation [abc] is shorthand for (a|b|c) [a-e] is shorthand for (a|b|c|d|e) [^a-e] is shorthand for [f-z], if Sigma is "lc letters" * Name and reuse names digit: [0-9] letter: [a-z] identifier: letter(digit|letter)* * Question mark: 0 or 1 * R? (binds tightly) is (R|epsilon) NO RECURSION OR REPETITION! * What is a "numeric constant" in your favorite language? It can start with a plus or a minus or nothing? A digit next zero or more digits an optional (dot plus one or more digits) an optional (e plus ...) an optional F|f|L|l (in C) an optional u|U|S|s [+-]?[0-9]+(.[0-9]+)?(e[+-]?[0-9]+) Limitations * Can't describe "strings of a's and b's with equal numbers of a's and b's" * Can't describe "palindromes" * Test: All palindromes over the alphabet { a } + a(aa)* : All odd length palindromes + a* : All palindromes * When is the string c1 ... cn a palindrome? When cn ... c1 is the same as c1 ... cn Some Fun Things to Describe * Strings over { a, b} with exactly one b. (a*)b(a*) * Strings over { a, b} with exactly two bs. (a*)b(a*)b(a*) * Strings over { a, b} with no more than two bs? (a*)b?(a*)b?(a*) * Strings over { a, b} without exactly two b's? Describing Tokens * We start by using regular expressions to describe the tokens in our language. INT_CONSTANT = [+-]?(0|[1-9][0-9]*) OCT_CONSTANT = [+-]?0[0-9]* FLOAT_CONSTANT = [+-]?(0|[1-9][0-9]*).[0-9]+ WHILE = [Ww][Hh][Ii][Ll][Ee] IDENTIFIER = [_A-Za-z][0-9_A-Za-z]* Two related problems: * The same sequence of characters can be matched by two regexps. 0 could be an INT_CONSTANT or OCT_CONSTANT * Two different parts of a sequence can be matched differently by two regexps. 3212.23 could be INT_CONSTANT(321) followed by FLOAT_CONSTATN(2.23) or FLOAT_CONSTATN(3212.23) ... WHILE_I_AM_STILL_TEACHING+ * Two principles for lexical analysis (1) Match the longest possible string (2) For two equal length matches, choose the one that comes first in the list of regexps "a number character sequence starting with a character"