CSC362 2004S, Class 7: Regular Expressions, Continued
Admin:
* Late: Choed, Ananta
* Missing:
* Yes, I will record this every day
* Bonus for Paul correcting my home page.
* Questions on the homework?
* Do we have to use the subclasses of Token? No, but I'd recommend that you do.
* Do we have to follow you asinine numbering scheme? Yes.
* Don't forget lab tomorrow
* Cool convo tomorrow
* Talk today at 4:15
* Self-Governance Committee
Overview:
* Review
* Avoiding over-parenthesization
* Common "extensions"
* Limitations
* Tokenizing
Review:
* Regular expressions are an easy way to represent languages
* A set of strings (over an alphabet, sigma)
* phi is a RE, L[phi] = { }
* s in Sigma is a RE
L[s] = { s }
* Alternation: S,R are RE (R)|(S) is a RE
L[(R)|(S)] = L[R] union L[S]
* Concatenation: S,R are RE (R)(S) is a RE
L[(R)(S)] = { concat(x,y) | x in L[R] and y in L[S] }
* Exponentionation: R is a RE, (R)^k is a RE
L[(R)^0] = { "" }
L[(R)^1] = R
L[(R)^k] = L[(R)((R)^(k-1))]
* Kleene *: if R is a RE then (R)* is a RE
L[(R)*] = UNION L[(R)^i] s.t. i >= 0
* Stupid side note:
Church advised Kleene
Kleene advised Constable
Constable advised O'Donnell
O'Donnell advised Rebelsky
There are too many damn parentheses!
* Parentheses remove ambiguity
* ab|c
* L[(a)((b)|(c))] = { ab, ac }
* L[((a)(b))|(c)] = L[(a)(b)] union L[(c)] = { ab, c }
* a*b+c
* Just as we remove parentheses in arithmetic expressions by using rules of precedence, so can we remove them in regular expressions
* Highest priority: Kleene *
* Middle priority: Concatenation
* Lowest priority: Alternation (|)
* More ambiugity
* L[a|bc] = { a, bc }
* L[a|b*c|abc] = { a, abc, c, bc, bbc, bbbc, bbbbc, bbbbbc, ... }
* a|b|c
* (a|b)|c
* a|(b|c)
* abc
* (ab)c
* a(bc)
* R|S|T
* L[(R|S)|T] = L[R|S] union L[T]
= (L[R] union L[S]) union L[T]
* L[R|(S|T)] = L[R] union L[S|T]
= L[R] union (L[S] union L[T])
* concat(concat(x,y), z) = concat(x, concat(y, z))
* 3 * 3 * 3 * 3 * ... * 3 * 1/3 * 1/3 * 1/3 * 1/3 * ... * 1/3
More common abbreviations in regular expressions
* R+ "1 or more copies"
* R? "0 or 1 copies" = (R|epsilon)
* [xyz] = (x|y|z)
* [a-z] = (a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)
[3-a]
* [a-zA-Z]
* [^asfasd] SIGMA - { a, s, f, a, s, d }
* [^a-zA-Z0-9]
What is the language for { sam }
* sam
Putting this all together, Integers:
* [\+\-]?[0-9]+
* (+|-|epsilon)(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*
How do lexical analyzer and language writers use all this?
NAME/REGEXP pairs
BEGIN: [Bb][Ee][Gg][Ii][Nn]
INTEGER: [\+\-]?[0-9]+
IDENTIFIER: [_a-zA-Z][_a-zA-Z0-9]*
As you start turning this into a program
* What happens if two regular expressions match?
* Whichever one appears first wins
* Tokenizers read through characters until they find a token. What happens if one pattern continues when another one stops?
* Rule of longest substring: Take the longer one
* Consider the language
Ab: a
Boo: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
input is aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
x:=y