CSC362 2004S, Class 12: Grammars for Parsing

Admin
* HW 1 due.
* No ragged edges!
* Project, phase 2, assigned.
  * Input: TokenStream
  * Output: Tree
  * Process: hand-coded recursive-descent parsing
* Grading of phase 1 to be done soon.
* Note-taking seminar at 4:15.

About Phase 2

Consider grammar rules like these:

  statement ::= IDENTIFIER := exp
  statement ::= WHILE exp DO statement

Turn these rules into a procedure, parseStatement, that looks at the next token, determines which rule applies, and then parses the right-hand side.

  private Node parseStatement(TokenStream ts) {
    Node me = new Node();
    switch (ts.peek()) {
      case IDENTIFIER:
        me.setSymbol(Nonterminal.ASSIGN);
        me.addChild(ts.next());           // the identifier
        skipOver(ASSIGN, ts);             // the :=
        me.addChild(parseExpression(ts));
        return me;
      case WHILE:
        me.setSymbol(Nonterminal.WHILE_LOOP);
        skipOver(WHILE, ts);
        me.addChild(parseExpression(ts));
        skipOver(DO, ts);
        me.addChild(parseStatement(ts));
        return me;
      default:
        // No rule applies: a real parser reports a syntax error here.
        throw new IllegalStateException("unexpected token: " + ts.peek());
    }
  } // parseStatement

Overview
* Limitations of regular expressions and similar mechanisms
* BNF grammars
* Parsing with BNF grammars

Regular expressions, DFAs, and NFAs are good because:
* They seem to be a compact way to represent languages.
* They are inter-convertible.
* DFAs are easy to implement on most computers.
* They are fast:
  * We can determine membership in O(length of string).
  * We can usually tokenize a string of length N in O(N); if the language requires backtracking, it may not be O(N).
* (One cost: the DFA may be huge, exponential in the size of the RE.)

Unfortunately, they are also weak.
* Consider the language of strings of a's and b's with equal numbers of a's and b's.
  * It seems likely that we'd need to count the a's and b's we see.
  * Counting on a finite machine sounds difficult.
* More formal proof (by contradiction):
  * Assume there exists a DFA for the language above, with k states.
  * Give the machine the input a^(k+1): it passes through more prefixes than it has states, so some state must be duplicated.
  * That is, there exist i, j with i != j such that the state of the DFA after reading a^i equals the state after reading a^j.
  * Now consider the inputs a^j b^j and a^i b^j: Accepted?
The DFA must accept a^j b^j, which has equal numbers of a's and b's. But it is in the same state after reading a^i as after reading a^j, so it must also accept a^i b^j, which has unequal numbers. Contradiction: no such DFA exists.

Computer scientists searched for more powerful mechanisms for describing languages.
* Idea: BNF grammars.
* A BNF grammar is a set of rules for building/generating parts of a language:

  SYMBOL ::= SYMBOL SYMBOL SYMBOL SYMBOL
  sentence ::= subject predicate PERIOD

Two kinds of symbols:
* Terminals, the building blocks of our language (tokens, the alphabet), e.g., PERIOD, noun.
* Nonterminals, the structures of our language, e.g., sentence, predicate, subject.

Two ways to use these grammars:

(1) Top down: start with a designated "start nonterminal". Repeatedly replace a left-hand side with a corresponding right-hand side. Keep doing this until you run out of nonterminals to replace. The resulting string of terminals is in the language.

  a, b are nonterminals; A, B are terminals.

  a ::= A a B
  a ::= b
  a ::= epsilon
  b ::= a a

  a -> A a B -> A b B -> A a a B -> A A a B a B -> A A a B B -> A A B B

(2) Bottom up: start with a sequence of terminals and repeatedly replace right-hand sides by left-hand sides.

  A B <- A epsilon B <- A a B <- a

In practice, we draw both kinds of derivations as trees:

       a
     / | \
    A  a  B

Parsing: the process of building a tree for an input string (either top down or bottom up).

A grammar is *ambiguous* if there are two parse trees for the same input string.
* Fine if all we care about is language membership.
* Not so fine once we attach "meaning" to parse trees.

The classic example, the dangling else:

  s ::= IF E THEN s
  s ::= IF E THEN s ELSE s
  s ::= X

Rule: the ELSE binds with the most recent IF.

Goal: rewrite the grammar so that there is no ambiguity ("disambiguate the grammar"). One fix: close every conditional with FI.

  s ::= IF E THEN s FI
  s ::= IF E THEN s ELSE s FI
  s ::= X

Revisiting the earlier grammar:

  a ::= A a B
  a ::= b
  a ::= epsilon
  b ::= a a

A revised version:

  a ::= A a B
  a ::= a a
  a ::= epsilon
  a ::= b
  b ::= A B
  b ::= A b B
  b ::= A b b B
  b ::= epsilon
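The claim that a DFA can determine membership in O(length of string) is easy to see in code: one table lookup per input symbol. Here is a minimal table-driven sketch; the particular machine, its states, and all names are invented for illustration (it accepts strings over {a, b} that end in "ab").

```java
// Table-driven DFA membership test: O(N) in the input length.
// Example machine (invented for illustration): accepts strings over {a, b}
// that end in "ab". States: 0 = start, 1 = just saw 'a', 2 = just saw "ab".
public class DfaDemo {
    // transition[state][symbol], where symbol 0 = 'a' and 1 = 'b'
    private static final int[][] TRANSITION = {
        {1, 0},  // from state 0
        {1, 2},  // from state 1
        {1, 0},  // from state 2
    };
    private static final int START = 0;
    private static final boolean[] ACCEPTING = {false, false, true};

    public static boolean accepts(String input) {
        int state = START;
        for (char c : input.toCharArray()) {
            // one array lookup per symbol: constant work per character
            state = TRANSITION[state][c == 'a' ? 0 : 1];
        }
        return ACCEPTING[state];
    }
}
```

The whole machine is data (the transition table); the driver loop never changes, which is why RE-to-DFA tools can emit such tables mechanically.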
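A BNF grammar also translates directly into a recursive-descent recognizer, in the same style as parseStatement above. As a sketch (the class and method names are our own invention), here is a recognizer for the two rules a ::= A a B and a ::= epsilon, i.e., the strings A^n B^n, whose "counting" is exactly the kind of thing the pumping argument showed a DFA cannot do.

```java
// Recursive-descent recognizer for the grammar
//   a ::= A a B
//   a ::= epsilon
// i.e., the language { A^n B^n : n >= 0 }.
// A minimal sketch: one method per nonterminal, recursion depth does the counting.
public class AnBnRecognizer {
    private final String input;
    private int pos = 0;

    public AnBnRecognizer(String input) { this.input = input; }

    // a ::= 'A' a 'B' | epsilon
    private boolean parseA() {
        if (pos < input.length() && input.charAt(pos) == 'A') {
            pos++;                          // consume 'A'
            if (!parseA()) return false;    // parse the nested a
            if (pos < input.length() && input.charAt(pos) == 'B') {
                pos++;                      // consume the matching 'B'
                return true;
            }
            return false;                   // missing matching 'B'
        }
        return true;                        // epsilon case
    }

    public boolean accepts() {
        pos = 0;
        return parseA() && pos == input.length();
    }
}
```

Each recursive call of parseA corresponds to one application of the rule a ::= A a B, so the call stack remembers how many B's are still owed, which is precisely the unbounded memory a finite automaton lacks.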