CSC 151.02 2019S, Class 17: Processing XML

Overview

  • Preliminaries
    • Notes and news
    • Upcoming work
    • Extra credit
    • Questions
  • XML, revisited
  • Representing XML in Racket
  • Expressing patterns in XML
  • Constructing new documents from old
  • Lab

Preliminaries

News / Etc.

  • Mentor sessions Wednesday 8-9 p.m., Thursday 8-9 p.m., Sunday 5-6 p.m.
  • Welcome to any prospective students we have. Thank you for bringing warmer weather with you.
  • I’m back! We hope that you had a good time without us.
  • I brought you conference swag. (One of each item per person.)

Upcoming work

  • Reading for Wednesday
    • [Forthcoming]
  • Assignment 6 due Tuesday.
  • Quiz Friday: Hash tables, structs, and searching XML

Extra Credit

I would certainly appreciate suggestions of other extra credit activities (preferably via email).

Extra credit (Academic/Artistic)

  • Grinnell Singers, Sunday at 2 p.m., with the Lyra Baroque Orchestra (professional musicians, period instruments). Really difficult pieces by Handel and others.
  • Twelfth Night this weekend

Extra credit (Peer)

Extra credit (Wellness)

Extra credit (Wellness, Regular)

  • 30 Minutes of Mindfulness at SHACS every Monday 4:15-4:45
  • Any organized exercise. (See previous eboards for a list.)
  • 60 minutes of some solitary self-care activity that is unrelated to academics or work. Your email reflection must explain how the activity contributed to your wellness.
  • 60 minutes of some shared self-care activity with friends. Your email reflection must explain how the activity contributed to your wellness.

Extra credit (Misc)

Other good things

Questions

Can we do overkill on the date time stuff, say paying attention to Julian/Gregorian switch?

You need not pay attention to the switch. But if it floats your boat, as they say, it’s fine.

Can we talk about 1a?

The goal on 1a is to build a hash table that tallies letters and then to extract the information from it in a systematic way.

  • Create a hash table.
  • Turn the string into a list of characters.
  • Iterate through the list, using for-each. If a character is a letter, count it in the hash table. If not, don’t.
  • Use the hash table to make a list of character/count lists.

Here’s an incomplete solution.

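(define tally-letters
  (lambda (str)
    (let ([uh (make-hash)])
      ; Initialize the tally for each lowercase letter to 0
      (for-each (section hash-set! uh <> 0)
                (string-split "abcdefghijklmnopqrstuvwxyz" ""))
      ; Add 1 to the tally for each lowercase letter in str.
      ; Anything that does not match [a-z] is skipped.
      (for-each
       (lambda (thing)
         (when (regexp-match? #px"[a-z]" thing)
           (hash-set! uh thing (+ 1 (hash-ref uh thing)))))
       (string-split str ""))
      ; Return the table. Building the list of character/count
      ; lists still remains to be done.
      uh)))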

See exercise 7 on the hash tables lab for more ideas.
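
One possible way to carry out all four steps is sketched below. It is not the intended solution. It uses only standard Racket procedures (string->list, char-alphabetic?, hash-update!, hash-map, sort) rather than the course library, and the name tally-letters-sketch is made up for illustration. Unlike the version above, it folds uppercase letters into lowercase and only lists letters that actually appear.

```racket
#lang racket

;; tally-letters-sketch : string -> list of (character count) lists
;; A hypothetical name; a sketch, not the official solution.
(define tally-letters-sketch
  (lambda (str)
    (let ([tallies (make-hash)])
      ; Count each letter, treating uppercase and lowercase alike.
      (for-each
       (lambda (ch)
         (when (char-alphabetic? ch)
           ; Add 1 to the tally, starting from 0 for a new letter.
           (hash-update! tallies (char-downcase ch) add1 0)))
       (string->list str))
      ; Turn the table into character/count lists, in alphabetical order.
      (sort (hash-map tallies list)
            (lambda (a b) (char<? (car a) (car b)))))))

(tally-letters-sketch "Hello, World!")
```

Sorting is a design choice, not a requirement: hash tables do not promise any particular order, so sorting makes the result predictable.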

XML, revisited

  • A mechanism for marking up documents using tags. There are tags like <p> for opening paragraphs and </p> for closing paragraphs.
  • The following examples are HTML tags.
  • <h1></h1> through <h6></h6> mark headings.
  • <em></em> marks emphasized text.
  • <q></q> marks quotation.
  • <strong></strong> marks strongly emphasized text.
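  • <ul></ul> marks an unordered (unnumbered) list.
  • <ol></ol> marks an ordered (numbered) list.
  • <li></li> marks a list item.
  • Reminder: An attribute, as in <p class="article">, gives additional information about an element of the document.
  • XML is hierarchical: Elements can be nested inside other elements.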

We can write programs that transform and analyze text. (Not a surprise at this time.)

We should also be able to write programs that transform and analyze Web pages.

Write a regular expression that matches emphasized pieces of text so that we can count the number of emphasized pieces.

> (regexp-match* #px"<em>" example)
'("<em>" "<em>" "<em>" "<em>" "<em>")
> (length (regexp-match* #px"<em>" example))
5
> (regexp-match* #px"<em" "<em class='booktitle'>Alice in CSC151land</em>")
'("<em")
> (regexp-match* #px"<em[ >]" "<em class='booktitle'>Alice in CSC151land</em>")
'("<em ")
> (regexp-match* #px"</em>" "<em class='booktitle'>Alice in CSC151land</em>")
'("</em>")
> (length (regexp-match* #px"</em>" example))
5
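
The experiments above can be wrapped up in a small procedure. This is only a sketch. The name count-em-openings and the sample string are made up for illustration. It uses the "<em[ >]" pattern so that opening tags with attributes are counted while other tags that merely start with "em" are not.

```racket
#lang racket

;; count-em-openings : string -> integer
;; A hypothetical helper, not part of the course library.
;; Counts opening em tags, with or without attributes.
(define count-em-openings
  (lambda (str)
    (length (regexp-match* #px"<em[ >]" str))))

;; A made-up sample: two em elements and one unrelated embed tag.
(define sample
  "<em>one</em>, <em class='booktitle'>two</em>, <embed src='x'>")

(count-em-openings sample) ; 2, because <embed is not followed by a space or >
```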

Write a regular expression that matches nested emphasized pieces of text so that we can, say, replace the inner one with a tag. Example: <em>Sam thinks this is <em>very</em> important!</em>

Because XML/HTML are hierarchical and regular expressions are linear, regular expressions do poorly on this kind of problem.
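
A quick experiment with the nested example illustrates the trouble. It is a sketch and the variable names are made up. Neither a lazy nor a greedy pattern pairs each opening tag with its own closing tag.

```racket
#lang racket

(define nested "<em>Sam thinks this is <em>very</em> important!</em>")

;; Lazy matching stops at the first </em>, so the outer opening tag
;; gets paired with the inner closing tag.
(define lazy-matches (regexp-match* #px"<em>.*?</em>" nested))
;; '("<em>Sam thinks this is <em>very</em>")

;; Greedy matching runs to the last </em>, so we get the whole string
;; and never see the inner element on its own.
(define greedy-matches (regexp-match* #px"<em>.*</em>" nested))
;; '("<em>Sam thinks this is <em>very</em> important!</em>")
```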

There is no common pattern language for hierarchical structures in general.

However, there is a standard pattern language for XML documents, called XPath. We will consider a subset of XPath.

Representing XML in Racket

XML is hierarchical, strings are not. How might we represent a hierarchical document in Racket?

Options: Lists, Strings, Hash tables

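We can think about each HTML element as a list whose first item is the tag name (a symbol) and whose remaining items are the contents of the element. For example, <p>Hello world</p> might be represented as '(p "Hello world"), and <p>This is <em>more</em> complicated</p> as '(p "This is " (em "more") " complicated").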
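
Because elements are just lists, the usual list procedures work on them. The sketch below uses made-up names. It also borrows xexpr->string from Racket's standard xml library to turn the list back into text, on the assumption that the course's format matches the X-expressions that library expects. The course's own html->string may behave differently.

```racket
#lang racket
(require (only-in xml xexpr->string))

(define para '(p "This is " (em "more") " complicated"))

(define para-tag (car para))       ; 'p
(define para-children (cdr para))  ; '("This is " (em "more") " complicated")
(define para-em (caddr para))      ; '(em "more")

;; Convert the nested list back into markup text.
(define para-string (xexpr->string para))
;; "<p>This is <em>more</em> complicated</p>"
```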

The procedures file->html and string->html convert a file or a string to this format.

The procedures html->file and html->string convert from this format back to a file or a string.

Expressing patterns in XML

"//em" - Search for every em element, at any depth in the document.

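To give a sense of what a search for a tag at any depth involves, here is a rough sketch that walks the list representation recursively. The name find-elements is made up and this is not how the course library implements XPath. It also assumes elements without attribute lists.

```racket
#lang racket

;; find-elements : symbol element-or-string -> list of elements
;; A hypothetical helper. Finds every element with the given tag,
;; at any depth, outermost first.
;; Assumes elements look like (tag child ...) with no attribute list.
(define find-elements
  (lambda (tag doc)
    (cond
      ; Strings (and anything else that is not a pair) contain no elements.
      [(not (pair? doc)) '()]
      [else
       ; Search all of the children, then decide whether to include doc.
       (let ([inner (append-map (lambda (child) (find-elements tag child))
                                (cdr doc))])
         (if (eq? (car doc) tag)
             (cons doc inner)
             inner))])))

(find-elements 'em '(p "This is " (em "more") " complicated"))
; '((em "more"))
```

Unlike the regular expressions earlier, this handles nesting: searching '(em "Sam thinks this is " (em "very") " important!") finds both the outer element and the inner one.
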
Constructing new documents from old

Lab