Held Friday, February 9, 2001
Summary
Today we move from lexing to parsing.
Notes
- For the second week in a row: Lab Attendance is Required!.
Attendance includes showing up on time.
- I'd like to know who each of you is planning to work with on
phase 1 of the project.
Overview
- Limits of regular expressions
- BNF Grammars
- Context-Free Grammars
- As we've mentioned before, there are some limits of regular expressions
and finite automata.
- In particular, there are some ``simple'' languages that we cannot
express.
- Consider the language ``strings of
a
's and
b
's with equal numbers of
a
's and b
's''
- This language cannot be expressed by a finite automaton.
- The proof is a proof by contradiction. Suppose it could be expressed
by a deterministic finite automaton, A.
- A has a finite number of states, n
- Record the sequence of states the automaton goes through given
a
^{(n+1)} as input.
- At least one state, q, must be duplicated in that sequence.
Suppose it happens after
a
^{i} and
a
^{j}, with i
and j different.
- If the automaton accepts
a
^{i}b
^{i} and
a
^{j}b
^{j}
(which it should), then it will also accept
a
^{j}b
^{i} and
a
^{i}b
^{j} and
(which it should not).
- Since every regular expression and every NFA can be represented by
DFA, that language also cannot be expressed by NFAs or regular
expressions.
- There are a host of other similar languages that cannot be expressed
by finite automata.
- So, what do we do? We move on to a more powerful notation.
- Note that you may need pretty powerful notations. For example,
our equal numbers of
a
's and b
's may not
be implementable on any machine with finite memory.
- In order to express more complicated (or more sophisticated) languages,
we need a more powerful tool.
- Preferably, this tool will support some form of recursion or looping.
For example, a string has equal numbers of
a
's and
b
's if
- the string is empty or
- the string has at least one
a
and
at least one b
and, if we remove an
a
and a b
from the
string, we end up with a string with equal numbers of
a
's and b
's.
- The most common tool used is the Backus-Naur Form (BNF) grammar.
- A grammar is a formal set of rules that specify the legal utterances
in the language.
- Formally, BNF grammars are four-tuples that contain
- Sigma, the alphabet from which strings are built
(note that this alphabet may have been generated from the original
program via a lexer). These are also called the terminal symbols
of the grammar.
- N, the nonterminals in the language. You
can think of these as names for the structures in the language or as
names for groups of strings.
- S in N, the start symbol. In BNF grammars,
all the valid utterances are of a single type.
- P, the productions that present rules for
generating valid utterances.
- The set of productions is the core part of a grammar. Often, language
definitions do not formally specify the other parts because they can
be derived from the grammar.
- A production is, in effect, a statement that one sequence of symbols
(a string of terminals and nonterminals) can be replaced by another
sequence of terminals and nonterminals. You can also view a production
as an equivalence: you can represent the left-hand-side by the
right-hand-side.
- What do productions look like? It depends on the notation one uses
for nonterminals and terminals, which may depend on the available
character set.
- We'll use words that begin with capital letters to indicate
nonterminals, words that begin with lowercase letters to indicate
terminals, and stuff in quotation marks to indicate particular
phrases. We'll use
::=
to separate the two parts of a rule.
- We might indicate the standard form of a Pascal program with
Program ::= PROGAM
identifier
OPEN_PAREN
Identifier-List
CLOSE_PAREN
Declaration-List
Compound-Statement
END_PROGRAM
- Similarly, we might indicate a typical compound statement in Pascal
with
Compound-Statement ::= BEGIN
Statement-List
END
- Lists are often defined recursively, using multiple definitions. Here's
one for non-empty statement lists. It says, ``a statement list can
be a statement; it can also be a statement followed by a semicolon
and another statement list''.
Statement-List ::= Statement
Statement-List ::= Statement SEMICOLON Statement-List
- The statements may then be defined with
Statement ::= Assignment-Statement
Statement ::= Compound-Statement
...
Assignment-Statement ::= identifier ASSIGN Expression
...
- How do we use BNF grammars? We show derivations by starting
with the start symbol and repeatedly applying productions. Eventually,
we end up with a sequence of terminals.
- Any string that can be produced in this way is in the language.
- The attempt to show that a string is in a language given by a
BNF grammar is called parsing that string.
- We'll soon see a more formal way of parsing.
- The most common form of BNF grammars used are the so-called
context-free grammars.
- In context-free grammars, the left-hand-sides of productions are
always single nonterminals.
- When parsing with context-free grammars, we can
build parse trees that
prove that the string is in the language.
- The root of a parse tree is the start symbol
- The leaves of a parse tree are the tokens of the string (in order)
- Each node and its down edges correspond to a production.
- You can build the tree starting with the start symbol and working your
way down, or starting with the tokens and working your way up.
- Can you write a context-free grammar for ``strings of a's and b's
with equal numbers of a's and b's''?
Monday, 22 January 2001
- Created as a blank outline.
Friday, 9 February 2001
- Filled in the details, primarily from
outline 8 and
outline 9 of
CSC362 98F.
- Removed some material from that outline (the material had been
covered earlier in this class).
- Reformatted.