CSC362 2004S, Class 15: Predictive Parsing (1)

Admin:
* Talk Wednesday.  Pizza afterwards (for those who replied early)!
* Applications for summer research way down.  Bleh.
* Don't ask me about first and follow until after Wednesday's class.

Overview:
* Parsing basics
* Predictive parsing
* Problems of predictive parsing
* Producing parse trees
* Looking ahead: Useful tables and functions

For regular languages: regular expressions describe languages; lexers
match (and tokenize).
Now: grammars describe languages; parsers match (and build parse
trees).

There are many ways to go from grammars to parsers, each of which has
its own advantages and disadvantages.  The simplest is "predictive
parsing":
* Can be automated
* Can be implemented by hand relatively easily

This is a top-down matching strategy: given the symbol you're
currently matching and the next input token, determine which
right-hand side applies.

R1: s ::= A s A
R2: s ::= B s B
R3: s ::= C

If you're matching an s and see an A, attempt to apply R1.
If you're matching an s and see a B, attempt to apply R2.
If you're matching an s and see a C, attempt to apply R3.

In the by-hand version of this, we write a procedure, parse_s, that
looks something like the following.

/*
 * Procedure:
 *   match
 * Parameters:
 *   tok, a token
 *   ts, a token stream
 * Purpose:
 *   Verify that the next token in the token stream is tok and
 *   consume that token.
 * Produces:
 *   consumed, a boolean
 * Preconditions:
 *   (None)
 */

boolean parse_s(TokenStream ts) {
  switch (ts.peek()) {
    case A:
      return match(ts, A) && parse_s(ts) && match(ts, A);
    case B:
      return match(ts, B) && parse_s(ts) && match(ts, B);
    case C:
      return match(ts, C);
    default:
      return false;
  } // switch
} // parse_s(TokenStream)

Note that the pictures we drew were very stack-like.  The automated
implementation uses an explicit stack:

  push the start symbol
  while symbols remain in the input and symbols remain on the stack
    if the top symbol on the stack is a token, match it against the
      next input token (and return false if they don't match)
    if the top symbol on the stack is a nonterminal, find the
      "appropriate" rhs and push it in right-to-left order

The problem can get a little or a lot harder if you change the
grammar (or language):

  s ::= x
  s ::= y
  s ::= C
  x ::= A s A
  y ::= B s B

Can we teach the computer which rule to apply, even though choosing
one requires some "reasoning"?  Probably.

  s ::= A s A
  s ::= B s B
  s ::= epsilon

This grammar is not parsable by predictive parsing: when we're
matching an s and see an A, we can't tell whether that A begins a
new "A s A" or closes an enclosing one, so we can't choose between
the first rule and the epsilon rule.

----

When we parse, we can also build a parse tree.  (In this version,
match returns the leaf Node for the token it consumes.)

Node parse_s(TokenStream ts) {
  Node result = new Node(Nonterminal_s);
  switch (ts.peek()) {
    case A:
      result.addChild(match(A, ts));
      result.addChild(parse_s(ts));
      result.addChild(match(A, ts));
      return result;
    case B:
      result.addChild(match(B, ts));
      result.addChild(parse_s(ts));
      result.addChild(match(B, ts));
      return result;
    case C:
      result.addChild(match(C, ts));
      return result;
    default:
      throw new RuntimeException("You seem unable to write palindromes.");
  } // switch
} // parse_s(TokenStream)

Detour: Context-free grammars vs.
context-sensitive grammars

BNF grammars describe languages by derivation.  A string is a
sequence of symbols (nonterminals and terminals).  A string, alpha,
is in the language of a grammar G iff
(1) alpha consists only of terminals, and
(2) there exists a finite series of strings
      alpha_0, alpha_1, alpha_2, ..., alpha_n
    in which alpha_0 is the start symbol, alpha_n is alpha, and each
    alpha_i is derivable from alpha_(i-1) by replacing some substring
    (the left-hand side of a rule) by the corresponding rhs of that
    rule.

In context-free grammars, the left-hand sides of rules have length
one.  In context-sensitive grammars, the left-hand sides of rules
can be longer (but must include a nonterminal).

As compiler writers, we work only in the world of context-free
grammars, for which we can (usually) build efficient parsers.
* When we need extra stuff (e.g., declaration before use), we rely
  on semantic analysis.

Returning to the issue of predictive parsing:
* Given a nonterminal and an input symbol, how do you decide which
  rule to apply?
* We pre-compute the answer for every possible pair.  (If we can't
  pre-compute those answers, we say the grammar isn't parsable by
  this technique and rewrite it.)
* How do we compute that table?
* We begin by computing other tables:
    first[NONTERMINAL] = all terminals that can begin strings
      derivable from NONTERMINAL
    follow[NONTERMINAL] = all terminals that can follow NONTERMINAL
      in some derivable string
    nullable[NONTERMINAL] = can this nonterminal derive epsilon?

Read about first and follow in the book; come with questions.
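The by-hand parse_s earlier can be made runnable with a small
harness.  A minimal sketch, assuming tokens are single characters and
a hypothetical TokenStream class with peek() and advance(); the class
names and the '$' end-of-input marker are placeholders, not the
course's actual code:

```java
import java.util.List;

// Hypothetical token stream over characters.
class TokenStream {
  private final List<Character> toks;
  private int pos = 0;
  TokenStream(List<Character> toks) { this.toks = toks; }
  char peek() { return pos < toks.size() ? toks.get(pos) : '$'; } // '$' = end of input
  void advance() { pos++; }
}

class Predictive {
  // match: verify the next token is tok and consume it.
  static boolean match(TokenStream ts, char tok) {
    if (ts.peek() == tok) { ts.advance(); return true; }
    return false; // wrong token: fail without consuming
  }

  // Predictive parser for s ::= A s A | B s B | C.
  static boolean parseS(TokenStream ts) {
    switch (ts.peek()) {
      case 'A': return match(ts, 'A') && parseS(ts) && match(ts, 'A');
      case 'B': return match(ts, 'B') && parseS(ts) && match(ts, 'B');
      case 'C': return match(ts, 'C');
      default:  return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(parseS(new TokenStream(List.of('A','B','C','B','A')))); // true
    System.out.println(parseS(new TokenStream(List.of('A','B','C','A','B')))); // false
  }
}
```

Note that the short-circuiting && gives us the "return false as soon
as a match fails" behavior for free.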
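The explicit-stack pseudocode earlier can also be written out
concretely.  A sketch for the same palindrome grammar
s ::= A s A | B s B | C, with the "appropriate rhs" choice hardcoded
in a switch; in general that choice comes from the pre-computed parse
table:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StackParser {
  static final char S = 's';  // the lone nonterminal; 'A','B','C' are terminals

  static boolean parse(String input) {
    Deque<Character> stack = new ArrayDeque<>();
    stack.push(S);                        // push the start symbol
    int pos = 0;
    while (!stack.isEmpty() && pos < input.length()) {
      char top = stack.pop();
      char tok = input.charAt(pos);
      if (top != S) {                     // top is a terminal: match it
        if (top != tok) return false;
        pos++;
      } else {                            // top is s: choose a rhs by lookahead
        switch (tok) {
          // push each rhs in right-to-left order, so its leftmost
          // symbol ends up on top of the stack
          case 'A': stack.push('A'); stack.push(S); stack.push('A'); break;
          case 'B': stack.push('B'); stack.push(S); stack.push('B'); break;
          case 'C': stack.push('C'); break;
          default:  return false;
        }
      }
    }
    // success = stack and input exhausted together
    return stack.isEmpty() && pos == input.length();
  }
}
```

Tracing parse("ABCBA") reproduces the stack pictures from class: the
stack grows to A s A, then A B s B A, shrinking as terminals match.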
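The first[] table mentioned above can be computed by iterating to a
fixed point: keep adding terminals until no set grows.  A sketch for
the grammar s ::= x | y | C, x ::= A s A, y ::= B s B; there are no
epsilon rules here, so nullable is false everywhere and only each
rhs's leading symbol matters.  (Encoding nonterminals as lowercase
characters is just a convenience for this sketch.)

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FirstSets {
  static Map<Character, Set<Character>> compute() {
    // Rules as lhs -> list of right-hand sides; lowercase = nonterminal.
    Map<Character, List<String>> rules = Map.of(
        's', List.of("x", "y", "C"),
        'x', List.of("AsA"),
        'y', List.of("BsB"));
    Map<Character, Set<Character>> first = new HashMap<>();
    for (char nt : rules.keySet()) first.put(nt, new HashSet<>());
    boolean changed = true;
    while (changed) {                     // iterate until no set grows
      changed = false;
      for (var e : rules.entrySet()) {
        for (String rhs : e.getValue()) {
          char lead = rhs.charAt(0);      // no epsilon: only the lead symbol counts
          Set<Character> add = Character.isLowerCase(lead)
              ? first.get(lead)           // nonterminal: copy its first set
              : Set.of(lead);             // terminal: just itself
          changed |= first.get(e.getKey()).addAll(add);
        }
      }
    }
    return first;
  }
}
```

On this grammar the loop stabilizes after two passes, with
first[x] = {A}, first[y] = {B}, and first[s] = {A, B, C}; those sets
are exactly what lets the parser pick a rule for s from one token of
lookahead.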