CSC362 2004S, Class 12: Grammars for Parsing
Admin
* HW 1 due.
* No ragged edges!
* Project, phase 2, assigned.
* Input: TokenStream
* Output: Tree
* Process: Hand-Coded Recursive-Descent Parsing
* Grading of phase 1 to be done soon.
* Note-taking seminar at 4:15
About Phase I
statement ::= IDENTIFIER := expression
statement ::= WHILE expression DO statement
Turn these rules into a procedure, parseStatement, that looks at the tokens,
determines which rule applies, and then parses the right-hand side.
private Node parseStatement(TokenStream ts)
{
    Node me = new Node();
    switch (ts.peek()) {
        case IDENTIFIER:
            me.setSymbol(Nonterminal.ASSIGN);
            me.addChild(ts.next());            // the identifier
            skipOver(ASSIGN, ts);              // the := token
            me.addChild(parseExpression(ts));
            return me;
        case WHILE:
            me.setSymbol(Nonterminal.WHILE_LOOP);
            skipOver(WHILE, ts);
            me.addChild(parseExpression(ts));
            skipOver(DO, ts);
            me.addChild(parseStatement(ts));
            return me;
        default:
            throw new RuntimeException("statement expected, saw " + ts.peek());
    }
} // parseStatement
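The skipOver helper and the Node/TokenStream classes come from the project code and aren't shown above. As a rough, self-contained sketch of the same recursive-descent idea (all names below are stand-ins, not the actual project classes; trees are printed as strings to keep the example small):

```java
import java.util.Arrays;
import java.util.List;

public class MiniParser {
    // Token kinds for the toy statement language (hypothetical names).
    public enum Tok { IDENTIFIER, ASSIGN, WHILE, DO, EXP, EOF }

    private final List<Tok> toks;
    private int pos = 0;

    public MiniParser(List<Tok> toks) { this.toks = toks; }

    private Tok peek() { return pos < toks.size() ? toks.get(pos) : Tok.EOF; }
    private Tok next() { return toks.get(pos++); }

    // Consume the expected token or report an error (analogue of skipOver).
    private void skipOver(Tok expected) {
        if (peek() != expected)
            throw new RuntimeException("expected " + expected + ", saw " + peek());
        next();
    }

    // statement ::= IDENTIFIER := expression | WHILE expression DO statement
    public String parseStatement() {
        switch (peek()) {
            case IDENTIFIER:
                next();                        // the identifier
                skipOver(Tok.ASSIGN);          // the := token
                return "(assign id " + parseExpression() + ")";
            case WHILE:
                skipOver(Tok.WHILE);
                String cond = parseExpression();
                skipOver(Tok.DO);
                return "(while " + cond + " " + parseStatement() + ")";
            default:
                throw new RuntimeException("expected a statement, saw " + peek());
        }
    }

    // Expressions are a single EXP token in this sketch.
    public String parseExpression() { skipOver(Tok.EXP); return "exp"; }

    public static void main(String[] args) {
        // Tokens for: WHILE exp DO id := exp
        MiniParser p = new MiniParser(Arrays.asList(
            Tok.WHILE, Tok.EXP, Tok.DO, Tok.IDENTIFIER, Tok.ASSIGN, Tok.EXP));
        System.out.println(p.parseStatement()); // prints (while exp (assign id exp))
    }
}
```

Note how each nonterminal gets its own parse method, and a one-token peek is enough to decide which rule applies.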
Overview:
* Limitations of regular expressions and similar things
* BNF Grammars
* Parsing with BNF grammars
Regular expressions, DFAs, and NFAs are good because:
* Seem to be a compact way to represent languages
* They are inter-convertible (RE <-> NFA <-> DFA)
* DFAs are easy to implement on most computers
* DFAs are fast:
* can determine membership in O(length of string)
* can usually tokenize a string of length N in O(N); if the language requires backtracking, it may not be O(N)
* One caveat: the DFA may be huge (exponential in the size of the RE)
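The O(length) membership test is just a loop that does one constant-time step per input character. A sketch with a tiny hand-written DFA (even number of a's over the alphabet {a, b}; the example is mine, not from class):

```java
public class DfaDemo {
    // DFA with two states over {a, b}, accepting strings that contain an
    // even number of a's.  State 0 = "even so far" (accepting), state 1 = "odd".
    public static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {  // one step per character: O(n)
            if (c == 'a') state = 1 - state;  // 'a' flips the parity
            // 'b' leaves the state unchanged
        }
        return state == 0;                    // accept iff we end in state 0
    }

    public static void main(String[] args) {
        System.out.println(accepts("abab")); // true: two a's
        System.out.println(accepts("ab"));   // false: one a
    }
}
```

No backtracking, no counting beyond the fixed state set: the whole machine is the transition logic inside the loop.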
Unfortunately, these mechanisms are also weak. Consider the language of strings of a's and b's with equal numbers of a's and b's:
* It seems likely that we'd need to count the a's and b's as we read them
* Counting arbitrarily high with a finite number of states sounds difficult
* More formal proof (by contradiction):
* Assume there exists a DFA for the language above
* That machine has some finite number of states, say k
* Give the machine the input a^(k+1): by the pigeonhole principle, some state must be visited twice
* That is, there exist i and j, with i != j, such that the state of the DFA after reading a^i equals the state after reading a^j
* Now consider the input a^i b^j: the DFA accepts it if and only if it accepts a^j b^j. Since a^j b^j is in the language, the DFA accepts a^i b^j, which has unequal numbers of a's and b's. Contradiction.
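The pigeonhole step can be watched concretely: feed a's to a DFA and record the state after each prefix; within k+1 prefixes some state must repeat. A sketch using a made-up 3-state DFA that counts a's mod 3 (the DFA is an arbitrary example, not from class):

```java
public class Pigeonhole {
    static final int K = 3; // number of states in the example DFA

    // Transition on input 'a' for the example DFA: count a's mod K.
    static int step(int state) { return (state + 1) % K; }

    // Find a pair (i, j), i < j, with equal DFA state after a^i and a^j.
    public static int[] repeatedState() {
        int[] firstSeen = new int[K];           // firstSeen[s] = first i reaching s
        java.util.Arrays.fill(firstSeen, -1);
        int state = 0;
        for (int i = 0; i <= K; i++) {          // K+1 prefixes, only K states
            if (firstSeen[state] != -1)
                return new int[]{ firstSeen[state], i };
            firstSeen[state] = i;
            state = step(state);                // read one more 'a'
        }
        throw new AssertionError("unreachable: pigeonhole guarantees a repeat");
    }

    public static void main(String[] args) {
        int[] ij = repeatedState();
        System.out.println("state after a^" + ij[0] + " = state after a^" + ij[1]);
    }
}
```

For this DFA the repeat is at i = 0 and j = 3; the same search succeeds for any DFA within k+1 prefixes, which is exactly the pigeonhole argument in the proof.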
Computer scientists searched for more powerful mechanisms for describing languages
* Idea: BNF Grammars
* A BNF grammar is a set of rules for building/generating parts of a language
SYMBOL ::= SYMBOL SYMBOL SYMBOL SYMBOL
sentence ::= subject predicate PERIOD
Two kinds of symbols:
* Terminals, the building blocks of our language (tokens, alphabet) (e.g., period, noun)
* Nonterminals, the structures of our language (e.g., sentence, predicate, subject)
Two ways to use these grammars:
(1) Start with a designated "start nonterminal". Repeatedly replace a nonterminal (the left-hand side of some rule) with the corresponding right-hand side. Keep doing this until you run out of nonterminals to replace. The resulting string of terminals is in the language.
a,b are nonterminals
A, B are terminals
a ::= A a B
a ::= b
a ::= epsilon
b ::= a a
a -> A a B -> A b B -> A a a B -> A A a B a B -> A A a B B -> A A B B
(at each arrow, one nonterminal is replaced using one of the rules above)
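The top-down derivation above can be replayed mechanically as string rewriting, treating lowercase letters as nonterminals (a small sketch; the rule applied at each step is noted in a comment):

```java
public class Derivation {
    // Replace the first occurrence of `from` in `s` by `to`.
    public static String rewrite(String s, String from, String to) {
        int i = s.indexOf(from);
        return s.substring(0, i) + to + s.substring(i + from.length());
    }

    // Replay a derivation of AABB from the start symbol a.
    public static String derive() {
        String s = "a";
        s = rewrite(s, "a", "AaB"); // a ::= A a B    -> AaB
        s = rewrite(s, "a", "b");   // a ::= b        -> AbB
        s = rewrite(s, "b", "aa");  // b ::= a a      -> AaaB
        s = rewrite(s, "a", "AaB"); // a ::= A a B    -> AAaBaB
        s = rewrite(s, "a", "");    // a ::= epsilon  -> AABaB
        s = rewrite(s, "a", "");    // a ::= epsilon  -> AABB
        return s;
    }

    public static void main(String[] args) {
        System.out.println(derive()); // prints AABB
    }
}
```

(This version always rewrites the leftmost matching nonterminal, so the intermediate forms differ slightly from the derivation above, but the final terminal string is the same.)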
(2) Start with a sequence of terminals and repeatedly replace right-hand sides by left-hand sides until you reach the start symbol
A B <- A epsilon B <- A a B <- a
In practice, we draw both kinds of derivations as trees
a
/ | \
A a B
Parsing: The process of building a tree for an input string (either top down or bottom up)
A grammar is *ambiguous* if there are two parse trees for the same input
string.
* Ambiguity is fine if all we care about is whether a string is in the language
* Not so fine if we attach "meaning" to parse trees
s ::= IF E THEN s
s ::= IF E THEN s ELSE s
s ::= X
Rule: the "ELSE" binds with the most recent "IF"
Goal: For ???: Rewrite the grammar so that there is no ambiguity
"Disambiguate the grammar"
s ::= IF E THEN s FI
s ::= IF E THEN s ELSE s FI
s ::= X
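With the FI markers there is only one way to parse nested IFs: every IF is closed by a matching FI, so the parser always knows whether an ELSE belongs to the statement it is currently parsing. A self-contained sketch (token spellings and the tree-as-string output are made up):

```java
public class FiParser {
    private final String[] toks;
    private int pos = 0;

    public FiParser(String... toks) { this.toks = toks; }

    private String peek() { return pos < toks.length ? toks[pos] : "EOF"; }

    private void expect(String t) {
        if (!peek().equals(t))
            throw new RuntimeException("expected " + t + ", saw " + peek());
        pos++;
    }

    // s ::= IF E THEN s FI | IF E THEN s ELSE s FI | X
    public String parseS() {
        if (peek().equals("X")) { pos++; return "X"; }
        expect("IF"); expect("E"); expect("THEN");
        String thenPart = parseS();
        String elsePart = null;
        if (peek().equals("ELSE")) { pos++; elsePart = parseS(); }
        expect("FI");                     // the FI closes exactly this IF
        return elsePart == null
            ? "(if " + thenPart + ")"
            : "(if " + thenPart + " else " + elsePart + ")";
    }

    public static void main(String[] args) {
        // IF E THEN IF E THEN X FI ELSE X FI
        // The inner IF is closed by its own FI, so the ELSE can only
        // belong to the outer IF: no ambiguity.
        FiParser p = new FiParser("IF", "E", "THEN", "IF", "E", "THEN",
                                  "X", "FI", "ELSE", "X", "FI");
        System.out.println(p.parseS()); // prints (if (if X) else X)
    }
}
```

Contrast this with the FI-less grammar, where the same token sequence minus the FIs has two parse trees.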
a ::= A a B
a ::= b
a ::= epsilon
b ::= a a

a ::= A a B
a ::= a a
a ::= epsilon

a ::= epsilon
a ::= b
b ::= A B
b ::= A b B
b ::= A b b B
b ::= epsilon
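As a sanity check that the first rewriting (inlining b's rule into a) preserves the language, one can enumerate the short terminal strings each grammar derives and compare the sets. A bounded brute-force sketch (the bounds are arbitrary choices; this checks short strings only and is not a proof of equivalence):

```java
import java.util.*;

public class GrammarCompare {
    // Collect all terminal strings of length <= maxOut derivable from the
    // start symbol 'a', exploring sentential forms of length <= maxForm.
    // Lowercase characters with a rule are nonterminals; everything else
    // is a terminal.
    public static Set<String> language(Map<Character, List<String>> rules,
                                       int maxOut, int maxForm) {
        Set<String> seen = new HashSet<>();
        Set<String> out = new TreeSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.add("a");
        seen.add("a");
        while (!work.isEmpty()) {
            String form = work.poll();
            boolean allTerminal = true;
            for (int i = 0; i < form.length(); i++) {
                List<String> rhss = rules.get(form.charAt(i));
                if (rhss == null) continue;          // terminal symbol
                allTerminal = false;
                for (String rhs : rhss) {            // apply each rule here
                    String next = form.substring(0, i) + rhs
                                + form.substring(i + 1);
                    if (next.length() <= maxForm && seen.add(next))
                        work.add(next);
                }
            }
            if (allTerminal && form.length() <= maxOut) out.add(form);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Character, List<String>> g1 = new HashMap<>();
        g1.put('a', Arrays.asList("AaB", "b", ""));  // a ::= A a B | b | epsilon
        g1.put('b', Arrays.asList("aa"));            // b ::= a a
        Map<Character, List<String>> g2 = new HashMap<>();
        g2.put('a', Arrays.asList("AaB", "aa", "")); // b inlined into a
        Set<String> l1 = language(g1, 4, 8);
        Set<String> l2 = language(g2, 4, 8);
        System.out.println(l1.equals(l2)); // same short strings?
        System.out.println(l1);
    }
}
```

The same harness can be pointed at the third grammar to see whether it, too, derives the same short strings.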