CSC362 2004S, Class 7: Regular Expressions, Continued Admin: * Late: Choed, Ananta * Missing: * Yes, I will record this every day * Bonus for Paul correcting my home page. * Questions on the homework? * Do we have to use the subclasses of Token? No, but I'd recommend that you do. * Do we have to follow you asinine numbering scheme? Yes. * Don't forget lab tomorrow * Cool convo tomorrow * Talk today at 4:15 * Self-Governance Committee Overview: * Review * Avoiding over-parenthesization * Common "extensions" * Limitations * Tokenizing Review: * Regular expressions are an easy way to represent languages * A set of strings (over an alphabet, sigma) * phi is a RE, L[phi] = { } * s in Sigma is a RE L[s] = { s } * Alternation: S,R are RE (R)|(S) is a RE L[(R)|(S)] = L[R] union L[S] * Concatenation: S,R are RE (R)(S) is a RE L[(R)(S)] = { concat(x,y) | x in L[R] and y in L[S] } * Exponentionation: R is a RE, (R)^k is a RE L[(R)^0] = { "" } L[(R)^1] = R L[(R)^k] = L[(R)((R)^(k-1))] * Kleene *: if R is a RE then (R)* is a RE L[(R)*] = UNION L[(R)^i] s.t. i >= 0 * Stupid side note: Church advised Kleene Kleene advised Constable Constable advised O'Donnell O'Donnell advised Rebelsky There are too many damn parentheses! * Parentheses remove ambiguity * ab|c * L[(a)((b)|(c))] = { ab, ac } * L[((a)(b))|(c)] = L[(a)(b)] union L[(c)] = { ab, c } * a*b+c * Just as we remove parentheses in arithmetic expressions by using rules of precedence, so can we remove them in regular expressions * Highest priority: Kleene * * Middle priority: Concatenation * Lowest priority: Alternation (|) * More ambiugity * L[a|bc] = { a, bc } * L[a|b*c|abc] = { a, abc, c, bc, bbc, bbbc, bbbbc, bbbbbc, ... } * a|b|c * (a|b)|c * a|(b|c) * abc * (ab)c * a(bc) * R|S|T * L[(R|S)|T] = L[R|S] union L[T] = (L[R] union L[S]) union L[T] * L[R|(S|T)] = L[R] union L[S|T] = L[R] union (L[S] union L[T]) * concat(concat(x,y), z) = concat(x, concat(y, z)) * 3 * 3 * 3 * 3 * ... * 3 * 1/3 * 1/3 * 1/3 * 1/3 * ... * 1/3 More common abbreviations in regular expressions * R+ "1 or more copies" * R? "0 or 1 copies" = (R|epsilon) * [xyz] = (x|y|z) * [a-z] = (a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z) [3-a] * [a-zA-Z] * [^asfasd] SIGMA - { a, s, f, a, s, d } * [^a-zA-Z0-9] What is the language for { sam } * sam Putting this all together, Integers: * [\+\-]?[0-9]+ * (+|-|epsilon)(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)* How do lexical analyzer and language writers use all this? NAME/REGEXP pairs BEGIN: [Bb][Ee][Gg][Ii][Nn] INTEGER: [\+\-]?[0-9]+ IDENTIFIER: [_a-zA-Z][_a-zA-Z0-9]* As you start turning this into a program * What happens if two regular expressions match? * Whichever one appears first wins * Tokenizers read through characters until they find a token. What happens if one pattern continues when another one stops? * Rule of longest substring: Take the longer one * Consider the language Ab: a Boo: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab input is aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab x:=y