CSC362 2004S, Class 17: Predictive Parsing (3): Helpful Tables Admin: * The late list returns: Ananta (abs), Piper (abs), Davis * Questions or comments on yesterday's lab? * Feel free to copy code (and cite it) * Feel free to email questions * Questions on your parsers? * Will Sam write the tests for you? No. * Do we have to do hand-coded predictive parsers. No. However, if you do something else and things go wrong: (1) It's your own damn fault and (2) I'll be less likely to be able to help you. * What are some things likely to cause problems? * Statements include assignment statements or procedure calls; both assignment statements and procedure calls begin with identifiers * Factors include both function calls and identifiers; both begin with identifiers * Pascal makes the stupid decision that zeroary functions and procedures don't need parens * Solution one: Handle it at the type-checking phase * Solution two: Change the language [Sam permits it in this case] * How do we deal with things that may have many different forms (e.g., variables)? * Check each in turn (sorry) What do we do when lots of right hand sides begin with the same thing variable ::= identifier variable ::= identifier . identifier vairable ::= identifier [ expression ] variable ::= variable^ variable ::= function_call function_call ::= identifier ( expression_list ) Traditionally when we have this problem, we *left factor* the grammar. Extract common part of RHSs, create new nonterminal for the stuff you have to decide between variable ::= identifier extra-variable-stuff extra-variable-stuff ::= epsilon extra-variable-stuff ::= DOT identifier extra-variable-stuff ::= LBRACKET expression RBRACKET extra-variable-stuff ::= UPARROW extra-variable-stuff ::= LPAREN expression_list RPAREN public Node parseVariable(TokenStream tokens) { if (tokens.peek() instanceof Identifier) parseExtraVariableStuff(tokens.next(), tokens); else return error("You seem unaware of what a variable is"); } // parseVariable(TokenStream) public Node parseExtraVariableStuff(Token id, TokenStream tokens) { if (tokens.peek() == PascalTokens.T_LEFT_BRACE) { Vector children = new Vector; children.add(new Node(id, null)); consume(tokens, PascalTokens.T_LEFT_BRACE); children.add(parseExpression(tokens)); consume(tokens, PascalTokens.T_RIGHT_BRACE); return new Node(PascalNonterminals.ARRAY_REFERENCE, children); } ... else { return new Node(id, null); // Just the identifier } } // parseExtraVariableStuff(Token, TokenStream) Overview: * Three helpful tables and two helpful functions * Why we have those tables * The first function * Building the First table * The nullable function * Building the Nullable table * Building the Follow table /Sample Grammars/ Grammar One: R1: s ::= b s X R2: s ::= A R3: b ::= epsilon R4: b ::= B Grammar Two: R1: s ::= b X R2: s ::= A C R3: s ::= c d R4: b ::= epsilon R5: b ::= B R6: c ::= C R7: c ::= epsilon R8: d ::= D R9: d ::= epsilon Overall goal of recursive-descent analysis: * Given a nonterminal to match and the next input token, determine precisely which RHS to apply. * Easy part: If the next input token can begin exactly one rhs for the nonterminal, you're set. * For rhs's that begin with tokens, this is easy. * For rhs's that begin with nonterminals, we need to do more work (first(rhs)) * Hard part: What happens if the rhs can be empty? * How do we determine if a rhs can be empty? nullable(rhs) * How do we decide whether or not to apply a => empty rule? * The hack answer: Whenever nothing else applies, apply that rule * The better answer: If the next input token can *follow* the nonterminal in a derived string, feel free to nullify the nonterminal (catches errors early; also helps identify potential problems in the grammar) Follow[n] So far: One table (Follow) and two functions (first, nullable) In addition: Two more tables: First[n] and Nullable[n] Tables take nonterminals as "parameters"/indices Functions take sequences of symbols as parameters Why does first need to be a function rather than a table? * Saves space: There is a very large number of sequences of symbols; There is a much smaller number of nonterminals / first returns the set of tokens that can begin strings derivable from los SetOfSymbols first(ListOfSymbols los) { // If the list of symbols is empty ... if (los.isEmpty()) return theEmptySet; // If the list of symbols begins with a terminal/token else if (isToken(los.car())) return new Set(los.car()) // If the list of symbols begins with a nonterminal else if (Nullable(los.car())) { return First[los.car()] union first(los.cdr()) } return First[los.car()] } // first(ListOfSymbols) How do we build the First table? repeat for each rule n ::= rhs First[n] := First[n] union first(rhs) until no more changes are made Does this process ever terminate? Davis claims that it does. How do we show that it terminates? The total size of First is bounded: (one entry for each nonterminal, each nonterminal has finitely many tokens that begin strings derivable from that nonterminal) R1: s ::= b X R2: s ::= A C R3: s ::= c d R4: b ::= epsilon R5: b ::= B R6: c ::= C R7: c ::= epsilon R8: d ::= D R9: d ::= epsilon First[s] = { X, A, B, C, D } First[b] = { B } First[c] = { C } First[d] = { D } Nullable and nullable public boolean nullable(ListOfSymbols los) { // If the list of symbols is empty ... if (los.isEmpty()) return true // If the list of symbols begins with a terminal/token else if (los.car() is a token) return false // If the list of symbols begins with a nonterminal return Nullable[los.car()] && nullable(los.cdr()); } for each nonterminal n Nullable[n] = false; repeat for each rule n ::= rhs if nullable(rhs), Nullable[n] = true; until no more changes are made Have a great afternoon!