Today in CS362: Lexical Analysis

Admin

* Surveys
* Pascal programs
* Lab
* ECA

Topics

* What is a language?
* About lexical analysis
* Hand-coding lexical analysis
* Generating lexical analyzers
* Regular expressions

What is a language?

* Something you can use to express ...?
* NOT a description of a set of tasks
* PERHAPS a set of kinds of tasks the computer can accomplish
* NOT JUST something accepted by a finite automaton
* A language IS a set
* Usually we build languages from alphabets (usually denoted Sigma).
  + A language is a set of strings built from the "letters" in Sigma.
* Examples:
  + Sigma1 = { a, b }
  + Lang1 = { a, b }, "all one-character strings from Sigma1"
  + Lang2 = "all strings from Sigma1 with equal numbers of a's and b's"
    Lang2 = { empty-string, ab, ba, abba, baab, abab, baba, aabb, ... }
  + Lang3 = "all palindromes over Sigma1"
* Different ways to describe languages:
  + English text
  + List all the elements
  + Present representative elements and hope the "reader" intuits the rest:
    1, 1, 2, 3, 5, 8, 13, ...
  + We need formal mechanisms for describing these sets of strings.
* We also need to be able to do something interesting with the descriptions:
  + Check membership (see the Lang2 sketch at the end of these notes)
  + Generate elements of the language (see the L((a*)) sketch at the end
    of these notes)
  + Associate meaning with elements of the language
  + Translate into other languages
* Might also do set operations

About Lexical Analysis

* Deals with simple languages.
* The first phase of a compiler.
* Goal: group individual characters into useful and interesting tokens for
  later parts of the compiler.
* Normal model: read a sequence of characters and produce a sequence of
  tokens. That is, a function from Sigma* to Token*.
* Tends to be done in a demand-driven way: the parser asks for one token
  at a time (see the tokenizer sketch at the end of these notes).
* Three basic functions:
  + nextToken() -- get the next token
  + initialize()
  + cleanup()
* How do we get tokenizers?
  + Hand code them:

      Token nextToken(Stream str) {
        // Skip over whitespace and comments
        ...
        // Dispatch on the next character to read the next token
        char ch = str.peekChar();
        if (isAlphabetic(ch)) {
          // Read characters until you hit a non-alphabetic
          // character; group them into a word token
        } else if (isDigit(ch)) {
          // Read characters until you hit a non-digit
          // character; group them into a number token
        } // if/else
      } // nextToken(Stream)

  + Describe the tokens in a formal language and build the tokenizer from
    that formal description:
    - By hand: use standard instructions for turning formal descriptions
      into tokenizers
    - Automatically: use tokenizer generators

We need a formal mechanism for describing tokens

* Regular expressions!
* A regular expression is a formal mechanism for describing a language.
* The simplest regular expression is the regular expression for the empty
  string. We write it epsilon.
  + L(epsilon) = { "" }
* For each s in Sigma, s is a regular expression.
  + L(s) = { "s" }
* Concatenation: if r1 and r2 are regular expressions, (r1.r2) is a
  regular expression.
  + L((r1.r2)) = { concat(s1,s2) | s1 in L(r1), s2 in L(r2) }
  + Suppose L(r1) = "strings starting with a" and
    L(r2) = "strings ending with b".
    Then L((r1.r2)) = "strings starting with a and ending with b".
  + L(((a).(b))) = { "ab" }
* Alternation: if r1 and r2 are regular expressions, (r1|r2) is a regular
  expression.
  + L((r1|r2)) = L(r1) union L(r2)
  + L((a|b)) = { "a", "b" }
  + L(((a|d).c)) = { "ac", "dc" }
* We also need a mechanism for describing infinite languages.
* The Kleene star provides this mechanism. If r is a regular expression,
  (r*) is a regular expression.
  + L((r*)) = L(epsilon) union L(r) union L((r.r)) union L(((r.r).r))
    union ...
  + L((a*)) = { "", "a", "aa", "aaa", "aaaa", ... }
* Shorthand: (r+) = (r.(r*))
* Numbers = ([0123456789]+)
  (See the regular-expression sketch at the end of these notes for
  checking strings against such descriptions.)
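
Sketches in Java

The bullets above promised a few sketches. First, checking membership: a
minimal sketch of a test for Lang2 (equal numbers of a's and b's over
Sigma1 = { a, b }). The class and method names here are just for
illustration; they are not fixed by anything above.

    // A minimal sketch of a membership test for Lang2, the strings over
    // { a, b } with equal numbers of a's and b's.
    public class Lang2 {
      /** Determine whether str is in Lang2. */
      public static boolean contains(String str) {
        int balance = 0; // (number of a's seen) - (number of b's seen)
        for (char ch : str.toCharArray()) {
          if (ch == 'a') {
            balance = balance + 1;
          } else if (ch == 'b') {
            balance = balance - 1;
          } else {
            return false; // not even a string over Sigma1
          }
        }
        return balance == 0;
      }

      public static void main(String[] args) {
        System.out.println(contains(""));     // true: the empty string
        System.out.println(contains("abba")); // true
        System.out.println(contains("aab"));  // false
      }
    }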
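
Second, generating elements: a minimal sketch that prints the first few
elements of L((a*)) in order of length. (Generating an infinite language
means generating it lazily or up to some bound; the bound of five here is
an arbitrary choice.)

    // A minimal sketch of generating elements of L((a*)):
    // "", "a", "aa", "aaa", ...
    public class AStar {
      public static void main(String[] args) {
        String s = ""; // start with the empty string, the epsilon case
        for (int i = 0; i < 5; i++) {
          System.out.println("\"" + s + "\"");
          s = s + "a"; // one more a: the next element of L((a*))
        }
      }
    }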
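
Next, a minimal sketch of the demand-driven model: initialize, then call
nextToken() repeatedly until it reports end of input. The design choices
here (a PushbackReader so we can "peek" at the next character, tokens
represented as plain strings, only word and number tokens) are simplifying
assumptions, not anything fixed by the notes above.

    import java.io.IOException;
    import java.io.PushbackReader;
    import java.io.StringReader;

    public class Tokenizer {
      private PushbackReader input;

      /** initialize */
      public Tokenizer(String text) {
        this.input = new PushbackReader(new StringReader(text));
      }

      /** nextToken: return the next token, or null at end of input. */
      public String nextToken() throws IOException {
        int ch = input.read();
        // Skip over whitespace.
        while (ch != -1 && Character.isWhitespace((char) ch)) {
          ch = input.read();
        }
        if (ch == -1) {
          return null; // end of input
        }
        StringBuilder token = new StringBuilder();
        if (Character.isLetter((char) ch)) {
          // Read characters until we hit a non-alphabetic character.
          while (ch != -1 && Character.isLetter((char) ch)) {
            token.append((char) ch);
            ch = input.read();
          }
          if (ch != -1) input.unread(ch); // put back the lookahead
        } else if (Character.isDigit((char) ch)) {
          // Read characters until we hit a non-digit character.
          while (ch != -1 && Character.isDigit((char) ch)) {
            token.append((char) ch);
            ch = input.read();
          }
          if (ch != -1) input.unread(ch); // put back the lookahead
        } else {
          token.append((char) ch); // any other character stands alone
        }
        return token.toString();
      }

      /** cleanup */
      public void close() throws IOException {
        input.close();
      }
    }

For example, on the input "count = 42", successive calls to nextToken()
return "count", "=", "42", and then null.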
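
Finally, checking strings against regular expressions with a library.
This sketch uses Java's java.util.regex, whose notation extends the
notation above (no explicit concatenation dot, fewer parentheses); the
notes do not prescribe this or any particular tool.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexDemo {
      public static void main(String[] args) {
        // Alternation then concatenation: ((a|d).c)
        Pattern adc = Pattern.compile("(a|d)c");
        System.out.println(adc.matcher("ac").matches()); // true
        System.out.println(adc.matcher("dc").matches()); // true
        System.out.println(adc.matcher("bc").matches()); // false

        // Kleene star: (a*)
        Pattern aStar = Pattern.compile("a*");
        System.out.println(aStar.matcher("").matches());    // true
        System.out.println(aStar.matcher("aaa").matches()); // true

        // Numbers = ([0123456789]+)
        Pattern numbers = Pattern.compile("[0123456789]+");
        System.out.println(numbers.matcher("2024").matches()); // true
        System.out.println(numbers.matcher("").matches());     // false: + needs a digit

        // A matcher can also pull number tokens out of a longer string,
        // much as a tokenizer would.
        Matcher m = numbers.matcher("x = 42 + 7");
        while (m.find()) {
          System.out.println(m.group()); // prints 42, then 7
        }
      }
    }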