# Class 09: Introduction to Grammars and Parsing

Back to From NFA to Optimal DFA. On to Ambiguous Grammars.

Held Friday, February 9, 2001

Summary

Today we move from lexing to parsing.

Notes

• For the second week in a row: Lab Attendance is Required!. Attendance includes showing up on time.
• I'd like to know who each of you is planning to work with on phase 1 of the project.

Overview

• Limits of regular expressions
• BNF Grammars
• Form
• Some examples
• Context-Free Grammars

## Limits of Regular Expressions and Finite Automata

• As we've mentioned before, there are some limits of regular expressions and finite automata.
• In particular, there are some ``simple'' languages that we cannot express.
• Consider the language ``strings of `a`'s and `b`'s with equal numbers of `a`'s and `b`'s''
• This language cannot be expressed by a finite automaton.
• The proof is a proof by contradiction. Suppose it could be expressed by a deterministic finite automaton, A.
• A has a finite number of states, n
• Record the sequence of states the automaton goes through given `a`(n+1) as input.
• At least one state, q, must be duplicated in that sequence. Suppose it happens after `a`i and `a`j, with i and j different.
• If the automaton accepts `a`i`b`i and `a`j`b`j (which it should), then it will also accept `a`j`b`i and `a`i`b`j and (which it should not).
• Since every regular expression and every NFA can be represented by DFA, that language also cannot be expressed by NFAs or regular expressions.
• There are a host of other similar languages that cannot be expressed by finite automata.
• So, what do we do? We move on to a more powerful notation.
• Note that you may need pretty powerful notations. For example, our equal numbers of `a`'s and `b`'s may not be implementable on any machine with finite memory.

## An Introduction to Grammars

• In order to express more complicated (or more sophisticated) languages, we need a more powerful tool.
• Preferably, this tool will support some form of recursion or looping. For example, a string has equal numbers of `a`'s and `b`'s if
• the string is empty or
• the string has at least one `a` and at least one `b` and, if we remove an `a` and a `b` from the string, we end up with a string with equal numbers of `a`'s and `b`'s.
• The most common tool used is the Backus-Naur Form (BNF) grammar.
• A grammar is a formal set of rules that specify the legal utterances in the language.

## BNF grammars

• Formally, BNF grammars are four-tuples that contain
• Sigma, the alphabet from which strings are built (note that this alphabet may have been generated from the original program via a lexer). These are also called the terminal symbols of the grammar.
• N, the nonterminals in the language. You can think of these as names for the structures in the language or as names for groups of strings.
• S in N, the start symbol. In BNF grammars, all the valid utterances are of a single type.
• P, the productions that present rules for generating valid utterances.
• The set of productions is the core part of a grammar. Often, language definitions do not formally specify the other parts because they can be derived from the grammar.
• A production is, in effect, a statement that one sequence of symbols (a string of terminals and nonterminals) can be replaced by another sequence of terminals and nonterminals. You can also view a production as an equivalence: you can represent the left-hand-side by the right-hand-side.
• What do productions look like? It depends on the notation one uses for nonterminals and terminals, which may depend on the available character set.
• We'll use words that begin with capital letters to indicate nonterminals, words that begin with lowercase letters to indicate terminals, and stuff in quotation marks to indicate particular phrases. We'll use `::=` to separate the two parts of a rule.
• We might indicate the standard form of a Pascal program with
```Program ::= PROGAM
identifier
OPEN_PAREN
Identifier-List
CLOSE_PAREN
Declaration-List
Compound-Statement
END_PROGRAM
```
• Similarly, we might indicate a typical compound statement in Pascal with
```Compound-Statement ::= BEGIN
Statement-List
END
```
• Lists are often defined recursively, using multiple definitions. Here's one for non-empty statement lists. It says, ``a statement list can be a statement; it can also be a statement followed by a semicolon and another statement list''.
```Statement-List ::= Statement
Statement-List ::= Statement SEMICOLON Statement-List
```
• The statements may then be defined with
```Statement ::= Assignment-Statement
Statement ::= Compound-Statement
...
Assignment-Statement ::= identifier ASSIGN Expression
...
```
• How do we use BNF grammars? We show derivations by starting with the start symbol and repeatedly applying productions. Eventually, we end up with a sequence of terminals.
• Any string that can be produced in this way is in the language.
• The attempt to show that a string is in a language given by a BNF grammar is called parsing that string.
• We'll soon see a more formal way of parsing.

## Context-Free Grammars

• The most common form of BNF grammars used are the so-called context-free grammars.
• In context-free grammars, the left-hand-sides of productions are always single nonterminals.
• When parsing with context-free grammars, we can build parse trees that prove that the string is in the language.
• The root of a parse tree is the start symbol
• The leaves of a parse tree are the tokens of the string (in order)
• Each node and its down edges correspond to a production.
• You can build the tree starting with the start symbol and working your way down, or starting with the tokens and working your way up.
• Can you write a context-free grammar for ``strings of a's and b's with equal numbers of a's and b's''?

## History

Monday, 22 January 2001

• Created as a blank outline.

Friday, 9 February 2001

• Filled in the details, primarily from outline 8 and outline 9 of CSC362 98F.
• Removed some material from that outline (the material had been covered earlier in this class).
• Reformatted.

Back to From NFA to Optimal DFA. On to Ambiguous Grammars.

Disclaimer: I usually create these pages on the fly. This means that they are rarely proofread and may contain bad grammar and incorrect details. It also means that I may update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This page was generated by Siteweaver on Mon Apr 30 10:51:49 2001.
This page may be found at `http://www.cs.grinnell.edu/~rebelsky/Courses/CS362/2001S/outline.09.html`.