Today in CS362: Parsing Missing: Erik, Mark Admin Watermelon Lab yesterday Questions on the project? Rant about 151 Overview From grammar to parser Predictive Parsing A sample grammar A membership tester Some problems A tree builder Generalizing About Grammars and Parsers * Grammars describe languages * Grammars give informal membership tests: If you can generate the string, it's in the language * Parsing: A deterministic technique for determining membership (by showing its derivation). * Design decisions in parsers: + From start symbol to desired string (top-down) vs. From desired string back to start symbol (bottom-up) + In top-down parsers, which nonterminal do you expand next? left-to-right or right-to-left or highest-first Predictive Parsing (Recursive Descent) * Top-down, Left-to-right parsing strategy * Popular because it's easy to hand code * Basic idea: Write a "parseX" procedure for each nonterminal, X. * The procedure looks at some input tokens, decides what RHS to match, and then matches the RHS, calling stuff along the way. * Simple example: Suppose we've decided to apply the rule Statement -> if Exp then Statement We match this by (1) Read a token and make sure it's if. (2) Call parseExp procedure (3) Read a token and make sure it's then (4) Call parseStatement procedure * Side note: Picking the right right hand side is hard; for now, we'll assume that a black box tells us how. Example: "Palindromes over a and b with c in the middle" Pal -> c Pal -> a Pal a Pal -> b Pal b boolean parsePal(Lexer l) { Token t = l.nextToken(); case t of Token(a): if (!parsePal(l)) return false; if (l.nextToken() != Token(a)) return false; return true; Token(b): if (!parsePal(l)) return false; if (l.nextToken() != Token(b)) return false; return true; Token(c): return true; default: System.err.println("Trying to match a palindrome, saw a" + t.toString()); return false; } // parsePal What happens if we change our language to "Even-length palindromes over a and b" R1: Pal -> R2: Pal -> a Pal a R3: Pal -> b Pal b Parsing is harder. Why? When you see an "a" is it (1) a lhs a, so apply rule 2 (2) a rhs a, so apply rule 1 This grammar is not amenable to the predictive parsing technique. (Wah!) However, there are techniques for transforming some grammars to those amenable by predictive parsing. Let's look at a variant of our first grammar Pal -> c Pal -> APal Pal -> BPal APal -> a Pal a BPal -> b Pal b In this case, the decision is obvious. If you see an a, call parseAPal, if you see a b, call parseBPal However, there's a problem. if parseAPal expects to see an a, you can't consume it in parsePal. Solution: "peekToken" Sometimes the decision as to RHS can be hard. Consider R1: S -> c R2: S -> AB R3: S -> C R4: A -> a A R5: A -> epsilon R6: B -> b R7: C -> d If you're trying to match an S and you see a a: R2 (just peek) b: R2 (just peek, assume A derives epsilon) c: R1 (consume) d: R3 (just peek) other: If you're trying to match an A and you see a a: R4 (consume) b: R6 (just peek) c: (die a horrible death since you shouldn't see a c) Two potential topics: * Using a similar technique for building parse trees * Formalizing this mechanism so that the computer can do it (or so that you can do it for more complicated problems) Building Parse Trees for original c-centered palindrome grammar. class Palindrome subclass RealPalindrome fields palchar and center subclass CenterPalindrome field palchar Palindrome parsePal(Lexer l) throws ParseException { Token t = l.peekToken(); case t of Token(a): t = l.readToken(); Palindrome center = parsePal(l); if (l.nextToken() != Token(a)) throw new ParseException("Failed to match a pal"); return new RealPalindrome(Token(a), center); Token(b): // Similar changes Token(c): return new CenterPalindrome(l.readToken()); default: System.err.println("Trying to match a palindrome, saw a" + t.toString()); return false; } // parsePal ----- Can we formalize the construction of predictive parsers? Technique is to build some "helper tables" from each grammar. First For each nonterminal, N, gives all tokens that can begin strings derivable from N. Follow For each nonterminal, N, gives all tokens that can follow N in a string of terminals and nonterminals derivable from some nonterminal. Nullable For each nonterminal, N, tells whether or not N can derive epsilon (directly or indirectly) Two questions to address next class (Sam's homework, not yours): * How do we use these tables to build predictive parsers? * How do we build these tables?