Today in CS362: Fun with Grammars

ADMIN
[bcdf]
Question d: How do I do "strings of digits with no
  repetitions?"
Alternate: How do I approach the problem?
* Start with a simpler alphabet (e.g., 1-3 rather than 0-9)
* Look at all the cases
  + Length
  + Starting digit
  + Repeated digit
  + ...
* Name regular expressions
* Think about characteristics
  + Maximum length

Question f: Even # of 0's and odd # of 1's.
* Use ideas from today's class.
* Name regular expressions and make them recursive
evenodd: (eveneven 1) | (oddodd 0) | ...
eveneven: (evenodd 1) | ...

Quiz: When is tokenizing ambiguous in Pascal?
  QUIZ FOR CREDIT ON WEDNESDAY

Question on 3.17: How many accepting states?  
Answer from SamR: (Just one)

ON TO THE TOPIC OF TODAY'S CLASS: GOING BEYOND REGEXPS

Problem: Some fairly simple languages cannot be implemented
with regular expressions or finite automata

Example: "Strings of a's and b's with equal numbers of 
a's and b's".
Intuition: Need a way to count.  How do you count?  Using
states.  But you'll run out of states for enough a's or b's.

Theorem: There does not exist a DFA for the language "strings
of a's and b's with equal numbers of a's and b's"
Proof (by contradiction): 
* Assume there is a DFA of n states for that language.
* Trace the sequence of states it visits for the string
  a^(n+1).
* That sequence must contain a loop. [Pigeonhole principle]
* Formally,  delta*(q0,a^i) = delta*(q0,a^j) for some i 
  and j.
* Observation: delta*(q0,a^ib^i) is an accepting state.
* Hence: delta*(q0,a^jb^i) is an accepting state.
* QED (or, for the computer scientists, CAB)

Larger conclusion: Finite automata fail to work for many
useful languages:
* Matched parentheses

Observation:
* Equal numbers of a's and b's is hard.
* Will always tax memory.

What do we do (about the regexp/fa problem)?
* Need something with a theoretically infinite memory
* But a finite description

Standard model: Grammar
* A whole lot like named regular expressions
* Except that the named expression can be recursive

S : epsilon | (a S b)

Exp: NUMBER | ID | (OP Exp) | (Exp OP Exp) | '(' Exp ')'

Statement : IfStatement | WhileStatement | Assignment | ...

IfStatement : IF Exp THEN Statement ELSE Statement

StatementSeq : Statement | Statement SEMI StatementSeq

FORMAL DEFINITION:
* A grammar is a four-tuple:
  + T, a set of tokens/terminals/symbols (individual
    building blocks)
  + N, a set of nonterminals (instructions for building
    stuff in the language)
  + S in N, a designated start-symbol
  + P, a function from sequences of terminals and
    nonterminals to sequences of terminals and nonterminals
    
A string, str, in T*, is in the language of a grammar, 
G, (the language
 is L(G))
if

Start with S, repeatedly apply rules from P and end up
with str.

Useful for membership testing and for generation.

Questions for the theoretician:
* Are there times I want more than one thing on the left?
* Do I gain anything from putting more than one thing on the
  left?

Questions for the hacker:
* Can I build efficient programs that check membership from 
  these nondeterministic-seeming descriptions?