Today in CS362: Parsing

  Missing: Erik, Mark

Admin
  Watermelon
  Lab yesterday
  Questions on the project?
  Rant about 151

Overview
  From grammar to parser
  Predictive Parsing
  A sample grammar
  A membership tester
  Some problems
  A tree builder
  Generalizing

About Grammars and Parsers
* Grammars describe languages
* Grammars give informal membership tests: If you can generate
  the string, it's in the language
* Parsing: A deterministic technique for determining membership
  (by showing its derivation).
* Design decisions in parsers:
  + From start symbol to desired string (top-down) vs.
    From desired string back to start symbol (bottom-up)
  + In top-down parsers, which nonterminal do you expand
    next?  left-to-right or right-to-left or highest-first

Predictive Parsing (Recursive Descent)
* Top-down, Left-to-right parsing strategy
* Popular because it's easy to hand code
* Basic idea: Write a "parseX" procedure for each
  nonterminal, X.
* The procedure looks at some input tokens, decides
  what RHS to match, and then matches the RHS, calling
  stuff along the way.
* Simple example: Suppose we've decided to apply the
  rule 
    Statement -> if Exp then Statement
We match this by
  (1) Read a token and make sure it's if.
  (2) Call parseExp procedure
  (3) Read a token and make sure it's then
  (4) Call parseStatement procedure
* Side note: Picking the right right hand side is hard;
  for now, we'll assume that a black box tells us how.

Example: "Palindromes over a and b with c in the middle"

Pal -> c
Pal -> a Pal a
Pal -> b Pal b

boolean parsePal(Lexer l) {
  Token t = l.nextToken();
  case t of 
    Token(a):
      if (!parsePal(l)) return false;
      if (l.nextToken() != Token(a)) return false;
      return true;
    Token(b):
      if (!parsePal(l)) return false;
      if (l.nextToken() != Token(b)) return false;
      return true;
    Token(c):
      return true;
    default:
      System.err.println("Trying to match a palindrome, saw a"
        + t.toString());
      return false;
} // parsePal

What happens if we change our language to
"Even-length palindromes over a and b"

R1: Pal -> 
R2: Pal -> a Pal a
R3: Pal -> b Pal b

Parsing is harder.  Why?  When you see an "a" is it
(1) a lhs a, so apply rule 2
(2) a rhs a, so apply rule 1

This grammar is not amenable to the predictive parsing 
technique.  (Wah!)  However, there are techniques for
transforming some grammars to those amenable by predictive
parsing.

Let's look at a variant of our first grammar

Pal -> c
Pal -> APal
Pal -> BPal
APal -> a Pal a
BPal -> b Pal b

In this case, the decision is obvious.  If you see an a,
call parseAPal, if you see a b, call parseBPal

However, there's a problem.  if parseAPal expects to see an
a, you can't consume it in parsePal.  Solution: "peekToken"

Sometimes the decision as to RHS can be hard.  Consider

R1: S -> c
R2: S -> AB
R3: S -> C
R4: A -> a A
R5: A -> epsilon
R6: B -> b
R7: C -> d

If you're trying to match an S and you see a
a: R2 (just peek)
b: R2 (just peek, assume A derives epsilon)
c: R1 (consume)
d: R3 (just peek)
other:

If you're trying to match an A and you see a
a: R4 (consume)
b: R6 (just peek)
c: (die a horrible death since you shouldn't see a c)

Two potential topics:
* Using a similar technique for building parse trees
* Formalizing this mechanism so that the computer can
  do it (or so that you can do it for more complicated
  problems)

Building Parse Trees for original c-centered palindrome
grammar.

class Palindrome
  subclass RealPalindrome
    fields palchar and center
  subclass CenterPalindrome
    field palchar
    
Palindrome parsePal(Lexer l) 
  throws ParseException
{
  Token t = l.peekToken();
  case t of 
    Token(a):
      t = l.readToken();
      Palindrome center = parsePal(l);
      if (l.nextToken() != Token(a)) 
        throw new ParseException("Failed to match a pal");
      return new RealPalindrome(Token(a), center);
    Token(b):
      // Similar changes
    Token(c):
      return new CenterPalindrome(l.readToken());
    default:
      System.err.println("Trying to match a palindrome, saw a"
        + t.toString());
      return false;
} // parsePal

-----

Can we formalize the construction of predictive parsers?

Technique is to build some "helper tables" from each grammar.

First
  For each nonterminal, N, gives all tokens that can
  begin strings derivable from N.
Follow
  For each nonterminal, N, gives all tokens that can
  follow N in a string of terminals and nonterminals
  derivable from some nonterminal.
Nullable
  For each nonterminal, N, tells whether or not N can
  derive epsilon (directly or indirectly)

Two questions to address next class (Sam's homework, not
yours):
* How do we use these tables to build predictive parsers?
* How do we build these tables?