Held: Monday, 16 February 2004
Today we move from lexing to parsing. In particular, we consider the design of more complex languages through BNF grammars.
- Someone asked whether the midterm and the final will be take-home or in-class, open book or closed book. The midterm is likely to be take-home, open-book. The final is likely to be in-class, closed-book.
- Limits of regular expressions
- BNF Grammars: Form
- There are some limits of regular expressions and finite automata as mechanisms for defining languages.
- In particular, there are some
simple languages that we cannot
- Consider the language
b's with equal numbers of
- This language cannot be expressed by a finite automaton.
- The proof is a proof by contradiction. Suppose it could be expressed
by a deterministic finite automaton, A.
- A has a finite number of states, n
- Record the sequence of states the automaton goes through given
a(n+1) as input.
- At least one state, q, must be duplicated in that sequence.
Suppose it happens after
aj, with i
and j different.
- If the automaton accepts
(which it should), then it will also accept
(which it should not).
- Since every regular expression and every NFA can be represented by
a DFA, that language also cannot be expressed by NFAs or regular
- There are a host of other similar languages that cannot be expressed
by finite automata.
- balanced parentheses
- So, what do we do? We move on to a more powerful notation.
- Note that you may need pretty powerful notations.
- For example, our equal numbers of
b's may not be implementable on any machine with finite memory.
- In order to express more complicated (or more sophisticated) languages, we need a more powerful tool.
- Preferably, this tool will support some form of recursion or looping.
- For example, a string has equal numbers of
- the string is empty or
- the string has at least one
a and at least one
b and, if we remove an
a and a
b from the string, we end up with a string with equal numbers of
- Alternatively, a string has equal numbers of
- the string is empty or
- the string is built by adding an
a and a
b to a string with an equal number of
- The most common tool used for these recursive definisions is the Backus-Naur Form (BNF) grammar.
- A grammar is a formal set of rules that specify the legal utterances in the language.
- Formally, BNF grammars are four-tuples that contain
- Sigma, the alphabet from which strings are built
(note that this alphabet may have been generated from the original
program via a lexer). These are also called the terminal symbols
of the grammar.
- N, the nonterminals in the language. You
can think of these as names for the structures in the language or as
names for groups of strings.
- S in N, the start symbol. In BNF grammars,
all the valid utterances are of a single type.
- P, the productions that present rules for
generating valid utterances.
- The set of productions is the core part of a grammar. Often, language definitions do not formally specify the other parts because they can be derived from the grammar.
- A production is, in effect, a statement that one sequence of symbols (a string of terminals and nonterminals) can be replaced by another sequence of terminals and nonterminals. You can also view a production as an equivalence: you can represent the left-hand-side by the right-hand-side.
- If we read productions from left to right, we have a mechanism for building strings in the language.
- If we read productions from right to left, we have a mechanism for testing language membership.
- What do productions look like? It depends on the notation one uses for nonterminals and terminals, which may depend on the available character set.
- We'll use words in all caps to indicate terminals and words in mixed-case (or all lower case) to represent nonterminals, and stuff in quotation marks to indicate particular phrases. We'll use
::= to separate the two parts of a rule.
- We might indicate the standard form of a Pascal program with
Program ::= PROGAM
- Similarly, we might indicate a typical compound statement in Pascal
compound-statement ::= BEGIN
- Lists of things are often defined recursively, using multiple definitions. Here's one for non-empty statement lists. It says,
a statement list can be a statement; it can also be a statement followed by a semicolon and another statement list.
statement-sist ::= statement
statement-list ::= statement SEMICOLON statement-list
- The statements may then be defined with
statement ::= assignment-statement
statement ::= compound-statement
assignment-statement ::= IDENTIFIER BECOMES Expression
- How do we use BNF grammars? We show derivations by starting with the start symbol and repeatedly applying productions. Eventually, we end up with a sequence of terminals.
- Any string that can be produced in this way is in the language.
- The attempt to show that a string is in a language given by a
BNF grammar is called parsing that string.
- We'll soon see a more formal way of parsing.