CSC362 2004S, Class 15: Predictive Parsing (1)

Admin:
* Talk Wednesday.  Pizza afterwards (for those who replied early)!
* Applications for summer research way down.  Bleh.
* Don't ask me about first and follow until after Wednesday's class.

Overview:
* Parsing basics
* Predictive parsing
* Problems of predictive parsing
* Producing parse trees
* Looking ahead: Useful tables and functions

For regular languages: regular expressions describe languages; lexers
match (and tokenize).
Now: grammars describe languages; parsers match (and build parse
trees).

There are many ways to go from grammars to parsers, each of which has
its own advantages and disadvantages.  The simplest is "predictive
parsing":
* Can be automated
* Can be implemented by hand relatively easily

This is a top-down matching strategy: given the symbol you're
currently matching and the next input token, determine which
right-hand side applies.

R1: s ::= A s A
R2: s ::= B s B
R3: s ::= C

If you're matching an s and see an A, attempt to apply R1.
If you're matching an s and see a B, attempt to apply R2.
If you're matching an s and see a C, attempt to apply R3.

In the by-hand version of this, we write a procedure, parse_s, that
looks something like the following.

/*
 * Procedure:
 *   match
 * Parameters:
 *   tok, a token
 *   ts, a token stream
 * Purpose:
 *   Verify that the next token in the token stream is tok and
 *   consume that token.
 * Produces:
 *   consumed, a boolean
 * Preconditions:
 *   (None)
 */

boolean parse_s(TokenStream ts) {
  switch (ts.peek()) {
    case A:
      return match(ts, A) && parse_s(ts) && match(ts, A);
    case B:
      return match(ts, B) && parse_s(ts) && match(ts, B);
    case C:
      return match(ts, C);
    default:
      return false;
  } // switch
} // parse_s(TokenStream)

Note that the pictures we drew were very stack-like.  The automated
implementation uses an explicit stack:

  push the start symbol
  while symbols remain in the input and symbols remain on the stack
    if the top symbol on the stack is a token, match it against the
      next input token (and return false if they don't match)
    if the top symbol on the stack is a nonterminal, find the
      "appropriate" rhs and push it in right-to-left order

The problem can get a little or a lot harder if you change the
grammar (or language):

  s ::= x
  s ::= y
  s ::= C
  x ::= A s A
  y ::= B s B

Can we teach the computer which rule to apply, even though choosing
one requires some "reasoning"?  Probably.

  s ::= A s A
  s ::= B s B
  s ::= epsilon

This grammar is not parsable by predictive parsing: when we're
matching an s and see an A, we can't tell whether that A begins a
new "A s A" or closes an enclosing one, so we can't choose between
the first rule and the epsilon rule.

----

When we parse, we can also build a parse tree.  (In this version,
match returns the leaf Node for the token it consumes.)

Node parse_s(TokenStream ts) {
  Node result = new Node(Nonterminal_s);
  switch (ts.peek()) {
    case A:
      result.addChild(match(A, ts));
      result.addChild(parse_s(ts));
      result.addChild(match(A, ts));
      return result;
    case B:
      result.addChild(match(B, ts));
      result.addChild(parse_s(ts));
      result.addChild(match(B, ts));
      return result;
    case C:
      result.addChild(match(C, ts));
      return result;
    default:
      throw new RuntimeException("You seem unable to write palindromes.");
  } // switch
} // parse_s(TokenStream)

Detour: Context-free grammars vs.
context-sensitive grammars

BNF grammars describe languages by derivation.  A string is a
sequence of symbols (nonterminals and terminals).  A string, alpha,
is in the language of a grammar G iff
(1) alpha consists only of terminals, and
(2) there exists a finite series of strings
      alpha_0, alpha_1, alpha_2, ..., alpha_n
    in which alpha_0 is the start symbol, alpha_n is alpha, and each
    alpha_i is derivable from alpha_(i-1) by replacing some substring
    (the left-hand side of a rule) by the corresponding rhs of that
    rule.

In context-free grammars, the left-hand sides of rules have length
one.  In context-sensitive grammars, the left-hand sides of rules
can be longer (but must include a nonterminal).

As compiler writers, we work only in the world of context-free
grammars, for which we can (usually) build efficient parsers.
* When we need extra stuff (e.g., declaration before use), we rely
  on semantic analysis.

Returning to the issue of predictive parsing:
* Given a nonterminal and an input symbol, how do you decide which
  rule to apply?
* We pre-compute the answer for every possible pair.  (If we can't
  pre-compute those answers, we say the grammar isn't parsable by
  this technique and rewrite it.)
* How do we compute that table?
* We begin by computing other tables:
    first[NONTERMINAL] = all terminals that can begin strings
      derivable from NONTERMINAL
    follow[NONTERMINAL] = all terminals that can follow NONTERMINAL
      in some derivable string
    nullable[NONTERMINAL] = can this nonterminal derive epsilon?

Read about first and follow in the book; come with questions.
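The by-hand parse_s earlier can be made runnable with a small
harness.  A minimal sketch, assuming tokens are single characters and
a hypothetical TokenStream class with peek() and advance(); the class
names and the '$' end-of-input marker are placeholders, not the
course's actual code:

```java
import java.util.List;

// Hypothetical token stream over characters.
class TokenStream {
  private final List<Character> toks;
  private int pos = 0;
  TokenStream(List<Character> toks) { this.toks = toks; }
  char peek() { return pos < toks.size() ? toks.get(pos) : '$'; } // '$' = end of input
  void advance() { pos++; }
}

class Predictive {
  // match: verify the next token is tok and consume it.
  static boolean match(TokenStream ts, char tok) {
    if (ts.peek() == tok) { ts.advance(); return true; }
    return false; // wrong token: fail without consuming
  }

  // Predictive parser for s ::= A s A | B s B | C.
  static boolean parseS(TokenStream ts) {
    switch (ts.peek()) {
      case 'A': return match(ts, 'A') && parseS(ts) && match(ts, 'A');
      case 'B': return match(ts, 'B') && parseS(ts) && match(ts, 'B');
      case 'C': return match(ts, 'C');
      default:  return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(parseS(new TokenStream(List.of('A','B','C','B','A')))); // true
    System.out.println(parseS(new TokenStream(List.of('A','B','C','A','B')))); // false
  }
}
```

Note that the short-circuiting && gives us the "return false as soon
as a match fails" behavior for free.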
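The explicit-stack pseudocode earlier can also be written out
concretely.  A sketch for the same palindrome grammar
s ::= A s A | B s B | C, with the "appropriate rhs" choice hardcoded
in a switch; in general that choice comes from the pre-computed parse
table:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StackParser {
  static final char S = 's';  // the lone nonterminal; 'A','B','C' are terminals

  static boolean parse(String input) {
    Deque<Character> stack = new ArrayDeque<>();
    stack.push(S);                        // push the start symbol
    int pos = 0;
    while (!stack.isEmpty() && pos < input.length()) {
      char top = stack.pop();
      char tok = input.charAt(pos);
      if (top != S) {                     // top is a terminal: match it
        if (top != tok) return false;
        pos++;
      } else {                            // top is s: choose a rhs by lookahead
        switch (tok) {
          // push each rhs in right-to-left order, so its leftmost
          // symbol ends up on top of the stack
          case 'A': stack.push('A'); stack.push(S); stack.push('A'); break;
          case 'B': stack.push('B'); stack.push(S); stack.push('B'); break;
          case 'C': stack.push('C'); break;
          default:  return false;
        }
      }
    }
    // success = stack and input exhausted together
    return stack.isEmpty() && pos == input.length();
  }
}
```

Tracing parse("ABCBA") reproduces the stack pictures from class: the
stack grows to A s A, then A B s B A, shrinking as terminals match.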
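The first[] table mentioned above can be computed by iterating to a
fixed point: keep adding terminals until no set grows.  A sketch for
the grammar s ::= x | y | C, x ::= A s A, y ::= B s B; there are no
epsilon rules here, so nullable is false everywhere and only each
rhs's leading symbol matters.  (Encoding nonterminals as lowercase
characters is just a convenience for this sketch.)

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FirstSets {
  static Map<Character, Set<Character>> compute() {
    // Rules as lhs -> list of right-hand sides; lowercase = nonterminal.
    Map<Character, List<String>> rules = Map.of(
        's', List.of("x", "y", "C"),
        'x', List.of("AsA"),
        'y', List.of("BsB"));
    Map<Character, Set<Character>> first = new HashMap<>();
    for (char nt : rules.keySet()) first.put(nt, new HashSet<>());
    boolean changed = true;
    while (changed) {                     // iterate until no set grows
      changed = false;
      for (var e : rules.entrySet()) {
        for (String rhs : e.getValue()) {
          char lead = rhs.charAt(0);      // no epsilon: only the lead symbol counts
          Set<Character> add = Character.isLowerCase(lead)
              ? first.get(lead)           // nonterminal: copy its first set
              : Set.of(lead);             // terminal: just itself
          changed |= first.get(e.getKey()).addAll(add);
        }
      }
    }
    return first;
  }
}
```

On this grammar the loop stabilizes after two passes, with
first[x] = {A}, first[y] = {B}, and first[s] = {A, B, C}; those sets
are exactly what lets the parser pick a rule for s from one token of
lookahead.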