---
title: Eboard 17  Processing XML
number: 17
section: eboards
held: 2019-03-04
link: true
---
CSC 151.02 2019S, Class 17:  Processing XML
========================================

_Overview_

* Preliminaries
    * Notes and news
    * Upcoming work
    * Extra credit
    * Questions
* XML, revisited
* Representing XML in Racket
* Expressing patterns in XML
* Constructing new documents from old
* Lab

Preliminaries
-------------

### News / Etc.

* Mentor sessions Wednesday 8-9 p.m., Thursday 8-9 p.m., Sunday 5-6 p.m.
* Welcome to any prospective students we have.  Thank you for bringing
  warmer weather with you.
* I'm back!  We hope that you had a good time without us.  
* I brought you conference swag.  (One of each item per person.)

### Upcoming work

* Reading for Wednesday
    * [Forthcoming]
* [Assignment 6](../assignments/assignment06) due Tuesday. 
* Quiz Friday: Hash tables, structs, and searching XML

### Extra Credit

_I would certainly appreciate suggestions of other extra credit activities
(preferably via email)._

#### Extra credit (Academic/Artistic)

* Grinnell Singers, Sunday at 2 p.m., with the Lyra Baroque Orchestra
  (professional musicians, period instruments).  Really difficult pieces
  by Handel and others.
* Twelfth Night this weekend

#### Extra credit (Peer)


#### Extra credit (Wellness)

#### Extra credit (Wellness, Regular)

* 30 Minutes of Mindfulness at SHACS every Monday 4:15-4:45
* Any organized exercise.  (See previous eboards for a list.)
* 60 minutes of some solitary self-care activities that are unrelated to
  academics or work.  Your email reflection must explain how the activity
  contributed to your wellness.
* 60 minutes of some shared self-care activity with friends. Your email
  reflection must explain how the activity contributed to your wellness.

#### Extra credit (Misc)

### Other good things 

### Questions

_Can we do overkill on the date time stuff, say paying attention to Julian/Gregorian switch?_

> You need not pay attention to the switch.  But if it floats your boat,
  as they say, it's fine.

_Can we talk about 1a?_

> The goal of 1a is to build a hash table that tallies letters and then
  to extract the information from that table in a systematic way.


* Create a hash table.
* Turn the string into a list of characters.
* Iterate through the list, using for-each. If a character is a letter, count it in the hash table. If not, don’t.
* Use the hash table to make a list of character/count lists.

Here's an incomplete solution.

```drracket
(define tally-letters
  (lambda (str)
    (let ([uh (make-hash)])
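      ; Initialize: give every lowercase letter a count of 0.
      (for-each (section hash-set! uh <> 0)
                (string-split "abcdefghijklmnopqrstuvwxyz" ""))
      ; Fill in values.  As written, only lowercase letters are tallied;
      ; uppercase letters and non-letters are skipped.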
      (for-each
       (lambda (thing)
         (when (regexp-match? #px"[a-z]" thing)
           (hash-set! uh thing (+ 1 (hash-ref uh thing)))))
       (string-split str ""))
      uh)))
```

See exercise 7 on the hash tables lab for more ideas.
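The four steps above, including the last one that the incomplete solution leaves out, can be sketched as follows.  This is one possible approach, not the expected answer: it uses only standard Racket (no `section`), the name `tally-letters-list` is made up for illustration, and folding uppercase letters into lowercase is an assumption that the assignment may not share.

```drracket
#lang racket

;;; Sketch only.  `tally-letters-list` is a hypothetical name, and the
;;; case folding is an assumption, not something the assignment requires.
(define tally-letters-list
  (lambda (str)
    (let ([counts (make-hash)])
      ; Steps 2 and 3: walk the characters, counting only the letters.
      (for-each
       (lambda (ch)
         (when (char-alphabetic? ch)
           ; Add 1 to the letter's count, starting from 0 if it is new.
           (hash-update! counts (char-downcase ch) add1 0)))
       (string->list str))
      ; Step 4: turn each key/value pair into a (letter count) list,
      ; sorted by letter so that the result is predictable.
      (sort (map (lambda (pair) (list (car pair) (cdr pair)))
                 (hash->list counts))
            char<?
            #:key car))))

(tally-letters-list "Hello, world!")
; '((#\d 1) (#\e 1) (#\h 1) (#\l 3) (#\o 2) (#\r 1) (#\w 1))
```

Because `hash-update!` takes a default value, this version does not need the initialization pass, so letters that never appear are simply absent from the result.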

XML, revisited
--------------

* A mechanism for marking up documents using tags.  There are tags like
  `<p>` for opening paragraphs and `</p>` for closing paragraphs.
* The following are HTML tags.
* `<uh>` marks universal headings?  (No.  HTML has no `<uh>` tag; headings
  are `<h1>` through `<h6>`.)
* `<em>` ... `</em>` marks emphasized text.
* `<q>` ... `</q>` marks quotation.
* `<strong>` ... `</strong>` marks strongly emphasized text.
* `<ul>` ... `</ul>` - unordered (bulleted) list
* `<ol>` ... `</ol>` - ordered (numbered) list
* `<li>` ... `</li>` - list items
* Reminder: `<p class="article">` gives additional information about
  an element of the document.
* Hierarchical.

We can write programs that transform and analyze text.  (Not a surprise
at this point.)

We should also be able to write programs that transform and analyze
Web pages.

_Write a regular expression that matches emphasized pieces of text
so that we can count the number of emphasized pieces._

```drracket
> (regexp-match* #px"<em>" example)
'("<em>" "<em>" "<em>" "<em>" "<em>")
> (length (regexp-match* #px"<em>" example))
5
> (regexp-match* #px"<em" "<em class='booktitle'>Alice in CSC151land</em>")
'("<em")
> (regexp-match* #px"<em[ >]" "<em class='booktitle'>Alice in CSC151land</em>")
'("<em ")
> (regexp-match* #px"</em>" "<em class='booktitle'>Alice in CSC151land</em>")
'("</em>")
> (length (regexp-match* #px"</em>" example))
5
```

_Write a regular expression that matches *nested* emphasized pieces
of text so that we can, say, replace the inner one with a `<strong>`
tag.  `<em>Sam thinks this is <em>very</em> important!</em>`_

Because XML and HTML are hierarchical and regular expressions are linear,
regular expressions do poorly on this kind of problem.
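A quick experiment with the nested example shows the trouble.  The names `nested`, `lazy-matches`, `greedy-matches`, and `innermost` are just for this illustration.

```drracket
#lang racket

(define nested "<em>Sam thinks this is <em>very</em> important!</em>")

; Lazy matching pairs the OUTER opening tag with the INNER closing tag.
(define lazy-matches (regexp-match* #px"<em>.*?</em>" nested))
lazy-matches
; '("<em>Sam thinks this is <em>very</em>")

; Greedy matching grabs everything, so the inner element is never
; seen on its own.
(define greedy-matches (regexp-match* #px"<em>.*</em>" nested))
greedy-matches
; '("<em>Sam thinks this is <em>very</em> important!</em>")

; Forbidding "<" inside the match finds the innermost element, but
; the pattern cannot tell us whether that element was nested.
(define innermost (regexp-match* #px"<em>[^<]*</em>" nested))
innermost
; '("<em>very</em>")
```

None of the three patterns reports "an `em` inside another `em`", which is what the exercise asks for.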

There is no common pattern language for hierarchical structures in general.

However, there is a standard pattern language for XML documents, called
XPath.  We will consider a subset of XPath.

Representing XML in Racket
--------------------------

XML is hierarchical; strings are not.  How might we represent a
hierarchical document in Racket?

Options: Lists, Strings, Hash tables

We can think about each HTML element as a list whose first value is the
tag and whose remaining values are the contents.  For example,
`<p>Hello world</p>` might be represented as `'(p "Hello world")` and
`<p>This is <em>more</em> complicated</p>` as
`'(p "This is " (em "more") " complicated")`.

`file->html` and `string->html` convert to this format.

`html->file` and `html->string` convert from this format.
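To see why the list form is convenient, here is a minimal sketch that turns such a list back into a string by recursion.  The name `element->string` is hypothetical and is not the library's `html->string`; the sketch assumes elements have no attributes and does not escape special characters such as `<` or `&`.

```drracket
#lang racket

;;; Sketch only: a hypothetical stand-in for illustration, assuming
;;; elements look like (tag content ...) with no attributes.
(define element->string
  (lambda (node)
    (if (string? node)
        ; Plain text is copied as is (no escaping in this sketch).
        node
        ; Otherwise the first value is the tag, the rest are contents.
        (let ([tag (symbol->string (car node))])
          (string-append
           "<" tag ">"
           (apply string-append (map element->string (cdr node)))
           "</" tag ">")))))

(element->string '(p "This is " (em "more") " complicated"))
; "<p>This is <em>more</em> complicated</p>"
```

The recursion follows the shape of the document: each nested list is handled by the same procedure that handles its parent.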

Expressing patterns in XML
--------------------------

`"//em"` - Search for every element with this tag (here, `em`), anywhere
in the document, at any depth.
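To make the meaning of `"//em"` concrete, here is a hand-written sketch of the same search over the list representation.  It is not how XPath is implemented, and it is not a library procedure: `find-all` is a hypothetical name, and the sketch assumes elements look like `(tag content ...)` with no attributes.

```drracket
#lang racket

;;; Sketch only.  `find-all` is a hypothetical helper that mimics what
;;; the pattern "//em" asks for: every element with the given tag.
(define find-all
  (lambda (tag node)
    (if (not (pair? node))
        ; Strings (and anything else that is not an element) match nothing.
        '()
        ; Search all of the contents, then add this element if it matches.
        (let ([below (append-map (lambda (child) (find-all tag child))
                                 (cdr node))])
          (if (eq? (car node) tag)
              (cons node below)
              below)))))

(find-all 'em '(em "Sam thinks this is " (em "very") " important!"))
; '((em "Sam thinks this is " (em "very") " important!") (em "very"))
```

Unlike the regular expressions from earlier, this search finds both the outer and the inner `em`, because it follows the nesting of the lists.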

Constructing new documents from old
-----------------------------------

Lab
---
