CSC362 2004S, Class 17: Predictive Parsing (3): Helpful Tables

Admin:
* The late list returns: Ananta (abs), Piper (abs), Davis
* Questions or comments on yesterday's lab?
  * Feel free to copy code (and cite it)
  * Feel free to email questions
* Questions on your parsers?
  * Will Sam write the tests for you?  No.
  * Do we have to do hand-coded predictive parsers.  No.  However, if you do something else and things go wrong: (1) It's your own damn fault and (2) I'll be less likely to be able to help you.
  * What are some things likely to cause problems?
    * Statements include assignment statements or procedure calls; both assignment statements and procedure calls begin with identifiers
    * Factors include both function calls and identifiers; both begin with identifiers
    * Pascal makes the stupid decision that zeroary functions and procedures don't need parens
      * Solution one: Handle it at the type-checking phase
      * Solution two: Change the language [Sam permits it in this case]
  * How do we deal with things that may have many different forms (e.g., variables)?
    * Check each in turn (sorry)

What do we do when lots of right hand sides begin with the same thing
variable ::= identifier
variable ::= identifier . identifier
vairable ::= identifier [ expression ]
variable ::= variable^
variable ::= function_call 
function_call ::= identifier ( expression_list )

Traditionally when we have this problem, we *left factor* the grammar.
Extract common part of RHSs, create new nonterminal for the stuff you have to decide between

variable ::= identifier extra-variable-stuff

extra-variable-stuff ::= epsilon
extra-variable-stuff ::= DOT identifier
extra-variable-stuff ::= LBRACKET expression RBRACKET
extra-variable-stuff ::= UPARROW
extra-variable-stuff ::= LPAREN expression_list RPAREN

public Node parseVariable(TokenStream tokens)
{
  if (tokens.peek() instanceof Identifier)
    parseExtraVariableStuff(tokens.next(), tokens);
  else
    return error("You seem unaware of what a variable is");
} // parseVariable(TokenStream)

public Node parseExtraVariableStuff(Token id, TokenStream tokens)
{
  if (tokens.peek() == PascalTokens.T_LEFT_BRACE) {
    Vector children = new Vector;
    children.add(new Node(id, null));
    consume(tokens, PascalTokens.T_LEFT_BRACE);
    children.add(parseExpression(tokens));
    consume(tokens, PascalTokens.T_RIGHT_BRACE);
    return new Node(PascalNonterminals.ARRAY_REFERENCE, children);
  }
  ...
  else {
    return new Node(id, null); // Just the identifier
  }
} // parseExtraVariableStuff(Token, TokenStream)

Overview:
* Three helpful tables and two helpful functions
* Why we have those tables
* The first function
* Building the First table
* The nullable function
* Building the Nullable table
* Building the Follow table

/Sample Grammars/
Grammar One:
R1: s ::= b s X
R2: s ::= A
R3: b ::= epsilon
R4: b ::= B

Grammar Two:
R1: s ::= b X
R2: s ::= A C
R3: s ::= c d
R4: b ::= epsilon
R5: b ::= B
R6: c ::= C
R7: c ::= epsilon
R8: d ::= D
R9: d ::= epsilon

Overall goal of recursive-descent analysis:
* Given a nonterminal to match and the next input token, determine precisely which RHS to apply.
* Easy part: If the next input token can begin exactly one rhs for the nonterminal, you're set.
  * For rhs's that begin with tokens, this is easy.
  * For rhs's that begin with nonterminals, we need to do more work (first(rhs))
* Hard part: What happens if the rhs can be empty?
  * How do we determine if a rhs can be empty?  nullable(rhs)
  * How do we decide whether or not to apply a => empty rule?
    * The hack answer: Whenever nothing else applies, apply that rule
    * The better answer: If the next input token can *follow* the nonterminal in a derived string, feel free to nullify the nonterminal (catches errors early; also helps identify potential problems in the grammar)  Follow[n]

So far: One table (Follow) and two functions (first, nullable)
In addition: Two more tables: First[n] and Nullable[n]

Tables take nonterminals as "parameters"/indices
Functions take sequences of symbols as parameters

Why does first need to be a function rather than a table?
* Saves space: There is a very large number of sequences of symbols; There is a much smaller number of nonterminals

/ first returns the set of tokens that can begin strings derivable from los
SetOfSymbols first(ListOfSymbols los)
{
  // If the list of symbols is empty ...
  if (los.isEmpty())
    return theEmptySet;
  // If the list of symbols begins with a terminal/token
  else if (isToken(los.car()))
    return new Set(los.car())
  // If the list of symbols begins with a nonterminal
  else 
    if (Nullable(los.car())) {
      return First[los.car()] union first(los.cdr())
    }
    return First[los.car()]
} // first(ListOfSymbols)

How do we build the First table?

repeat
  for each rule n ::= rhs
    First[n] := First[n] union first(rhs)
until no more changes are made

Does this process ever terminate?  Davis claims that it does.
How do we show that it terminates?
  The total size of First is bounded: (one entry for each nonterminal, each nonterminal has finitely many tokens that begin strings derivable from that nonterminal)
R1: s ::= b X
R2: s ::= A C
R3: s ::= c d
R4: b ::= epsilon
R5: b ::= B
R6: c ::= C
R7: c ::= epsilon
R8: d ::= D
R9: d ::= epsilon

First[s] = { X, A, B, C, D }
First[b] = { B }
First[c] = { C }
First[d] = { D }

Nullable and nullable

public boolean nullable(ListOfSymbols los)
{
  // If the list of symbols is empty ...
  if (los.isEmpty()) return true
  // If the list of symbols begins with a terminal/token
  else if (los.car() is a token) return false
  // If the list of symbols begins with a nonterminal
  return Nullable[los.car()] && nullable(los.cdr());
}

for each nonterminal n
  Nullable[n] = false;
repeat
  for each rule n ::= rhs
    if nullable(rhs), Nullable[n] = true;
until no more changes are made

Have a great afternoon!