Today in CS362 from RegExp to DFA, Continued

Admin
  Q on HW1?
  Project, Phase 1 coming soon
  Visitor

Remind you: Conversion from programmer/design
friendly reg-exp to computer-friendly DFA

Problem:
  The resultant DFA is big when you convert an NFA to
  a DFA.
Solution:
  Look at ways to make DFA's smallera
  Example on board
  Observation: If qi and qj have the same edges
   with the same labels and the same finality, 
   you can join them.

Revised Solution:
  Go overboard on joining: 
    All non-final states are the same
    All final states are the same
  Given that that's wrong, refine
    Whenever we notice a mistake (e.g., an edge that
      should go to two different sets), split a set

  // Initialization
  S0 = { }
  S1 = { }
  for each q in Q
    if q.isFinal, add q to S0
    else add q to S1
  repeat almost forever
    for each state, Si
      for all qj in Si
        for all qk in Si
          for all s in Sigma
            if delta(qj,s) is in a different S than
               delta(qk,s)
            split Si
   if you didn't split, exit repeat

How to build your lexer in N easy steps
  (1) Write your regular expressions
  (2) Convert them to NFAs
  (3) Merge all the NFAs together, augmenting their
      final states with a notation as to symbol type.
  (4) Convert it to a DFA
  (5) Optimize the DFA
      Hack when a final state might represent different
      symbols
  (6) Convert the DFA to working codee or interpret
      the DFA indirectly
OR
  (1) Write your regular expressions
  (2) Write a regular-expression compiler
  (3) Run your regular-expression compiler on your regular
      expressions
OR
  (1) Write your regular expressions
  (2) Use someone else's regular expression compiler
OR
  (1) Find someone else's regular expressions
  (2) Convert them to a lexer in an ad-hoc format
OR
  (1) Find someone else's lexer

Problem: Once we have this DFA, augmented with notations as
to the token type, how do we use it to tokenize rather than
to identify membership?

CHALLENGE: IS THERE A PLACE IN PASCAL IN WHICH YOU CAN'T
TELL WHETHER YOU'VE ENDED ONE TOKEN AND STARTED ANOTHER
(EXCEPT WITH CONTEXT?)