# Class 12: Introduction to Grammars and Parsing

Back to Cancelled. On to Ambiguous Grammars.

Held: Monday, 16 February 2004

Summary: Today we move from lexing to parsing. In particular, we consider the design of more complex languages through BNF grammars.

Related Pages:

Due

Assignments

Notes:

• Someone asked whether the midterm and the final will be take-home or in-class, open book or closed book. The midterm is likely to be take-home, open-book. The final is likely to be in-class, closed-book.

Overview:

• Limits of regular expressions
• BNF Grammars: Form
• Examples

## Limits of Regular Expressions and Finite Automata

• There are some limits of regular expressions and finite automata as mechanisms for defining languages.
• In particular, there are some simple languages that we cannot express.
• Consider the language strings of `a`'s and `b`'s with equal numbers of `a`'s and `b`'s
• This language cannot be expressed by a finite automaton.
• The proof is a proof by contradiction. Suppose it could be expressed by a deterministic finite automaton, A.
• A has a finite number of states, n
• Record the sequence of states the automaton goes through given `a`(n+1) as input.
• At least one state, q, must be duplicated in that sequence. Suppose it happens after `a`i and `a`j, with i and j different.
• If the automaton accepts `a`i`b`i and `a`j`b`j (which it should), then it will also accept `a`j`b`i and `a`i`b`j and (which it should not).
• Since every regular expression and every NFA can be represented by a DFA, that language also cannot be expressed by NFAs or regular expressions.
• There are a host of other similar languages that cannot be expressed by finite automata.
• anbn
• palindromes
• balanced parentheses
• ...
• So, what do we do? We move on to a more powerful notation.
• Note that you may need pretty powerful notations.
• For example, our equal numbers of `a`'s and `b`'s may not be implementable on any machine with finite memory.

## An Introduction to Grammars

• In order to express more complicated (or more sophisticated) languages, we need a more powerful tool.
• Preferably, this tool will support some form of recursion or looping.
• For example, a string has equal numbers of `a`'s and `b`'s if
• the string is empty or
• the string has at least one `a` and at least one `b` and, if we remove an `a` and a `b` from the string, we end up with a string with equal numbers of `a`'s and `b`'s.
• Alternatively, a string has equal numbers of `a`'s and `b`'s if
• the string is empty or
• the string is built by adding an `a` and a `b` to a string with an equal number of `a`'s and `b`'s.
• The most common tool used for these recursive definisions is the Backus-Naur Form (BNF) grammar.
• A grammar is a formal set of rules that specify the legal utterances in the language.

## BNF grammars

• Formally, BNF grammars are four-tuples that contain
• Sigma, the alphabet from which strings are built (note that this alphabet may have been generated from the original program via a lexer). These are also called the terminal symbols of the grammar.
• N, the nonterminals in the language. You can think of these as names for the structures in the language or as names for groups of strings.
• S in N, the start symbol. In BNF grammars, all the valid utterances are of a single type.
• P, the productions that present rules for generating valid utterances.
• The set of productions is the core part of a grammar. Often, language definitions do not formally specify the other parts because they can be derived from the grammar.
• A production is, in effect, a statement that one sequence of symbols (a string of terminals and nonterminals) can be replaced by another sequence of terminals and nonterminals. You can also view a production as an equivalence: you can represent the left-hand-side by the right-hand-side.
• If we read productions from left to right, we have a mechanism for building strings in the language.
• If we read productions from right to left, we have a mechanism for testing language membership.
• What do productions look like? It depends on the notation one uses for nonterminals and terminals, which may depend on the available character set.
• We'll use words in all caps to indicate terminals and words in mixed-case (or all lower case) to represent nonterminals, and stuff in quotation marks to indicate particular phrases. We'll use `::=` to separate the two parts of a rule.
• We might indicate the standard form of a Pascal program with
```Program ::= PROGAM
IDENTIFIER
OPEN_PAREN
identifier-list
CLOSE_PAREN
declaration-list
compound-statement
END_PROGRAM
```
• Similarly, we might indicate a typical compound statement in Pascal with
```compound-statement ::= BEGIN
statement-list
END
```
• Lists of things are often defined recursively, using multiple definitions. Here's one for non-empty statement lists. It says, a statement list can be a statement; it can also be a statement followed by a semicolon and another statement list.
```statement-sist ::= statement
statement-list ::= statement SEMICOLON statement-list
```
• The statements may then be defined with
```statement ::= assignment-statement
statement ::= compound-statement
...
assignment-statement ::= IDENTIFIER BECOMES Expression
...
```
• How do we use BNF grammars? We show derivations by starting with the start symbol and repeatedly applying productions. Eventually, we end up with a sequence of terminals.
• Any string that can be produced in this way is in the language.
• The attempt to show that a string is in a language given by a BNF grammar is called parsing that string.
• We'll soon see a more formal way of parsing.

Back to Cancelled. On to Ambiguous Grammars.

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Wed May 5 11:47:01 2004.
The source to the document was last modified on Tue Jan 20 23:06:45 2004.
This document may be found at `http://www.cs.grinnell.edu/~rebelsky/Courses/CS362/2004S/Outlines/outline.12.html`.

You may wish to validate this document's HTML ; ; Check with Bobby

Samuel A. Rebelsky, rebelsky@grinnell.edu