Lab: Processing XML

Held: Monday, 4 March 2019
Writeup due: Wednesday, 6 March 2019
Summary: We consider some techniques for processing XML documents, including ways to extract information from XML documents and to build new XML documents from old.

Useful notation

'(tag (@ (name1 val1) (name2 val2) ...) element1 element2 ...) - A list-based representation of an XML/HTML element. The attribute section is optional. The elements are either strings or themselves XML/HTML elements.

"//tag" - an XPath pattern to search for elements with the given tag.

"//tag0/tag1" - an XPath pattern to search for elements with tag tag1 that appear directly under elements with tag tag0.

"//tag0//tag1" - an XPath pattern to search for elements with tag tag1 that appear anywhere under elements with tag tag0.

"//tag[1]" - the first instance of the tag within an enclosing element. (Similarly, "//tag[2]" selects the second instance, and so on.)

"//tag[@class='name']" - all elements with the given tag whose class attribute is name.
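
To make the notation above a bit more concrete, here is a minimal sketch in plain Racket. It writes a small fragment in the list-based form, takes one element apart with ordinary list procedures, and mimics the idea behind "//tag" with a recursive search. The fragment and the names doc, xml-element?, has-attributes?, element-contents, and find-all are made up for this illustration. They are not part of loudhum, and sxpath-match is not necessarily implemented this way.

```racket
#lang racket

;; A hypothetical document fragment, written in the list-based form.
(define doc
  '(body (p (@ (class "intro")) "One " (em "two") " three")
         (p (q "Four " (em "five")))))

;; Is val an element, that is, a list that starts with a tag other than @?
(define (xml-element? val)
  (and (pair? val)
       (symbol? (car val))
       (not (eq? (car val) '@))))

;; Does elt have an attribute section, that is, a second item
;; that is a list starting with @?
(define (has-attributes? elt)
  (and (pair? (cdr elt))
       (pair? (cadr elt))
       (eq? (car (cadr elt)) '@)))

;; The contents of elt: everything after the tag and the optional
;; attribute section.
(define (element-contents elt)
  (if (has-attributes? elt)
      (cddr elt)
      (cdr elt)))

;; Collect every element with the given tag that appears anywhere in xml,
;; in document order. This is the idea behind the pattern "//tag".
(define (find-all tag xml)
  (if (xml-element? xml)
      (let ([below (append-map (lambda (child) (find-all tag child))
                               (element-contents xml))])
        (if (eq? (car xml) tag)
            (cons xml below)
            below))
      '()))
```

> (element-contents (cadr doc))
'("One " (em "two") " three")

> (find-all 'em doc)
'((em "two") (em "five"))

> (length (find-all 'p doc))
2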

Useful procedures

(file->xml fname) - Read an HTML or XML document and convert it to the list-based representation.

(xml->file xml fname) - Save the list-based representation of an HTML or XML document in a file.

(string->xml str) - Convert a string to the list-based representation.

(xml->string xml) - Convert the list-based representation to a string.

(sxpath-match pattern xml) - Search the XML document for elements that match the pattern.

Preparation

a. Start DrRacket.

b. Before running the examples, you need to install a variety of libraries, including mcfly, overeasy, html-parsing, html-writing, and sxml. You can open a terminal window and type /home/rebelsky/bin/csc151/update to install all of those packages. Alternately, you can select File > Install Package…, enter each name in turn, click Install for each, and then click the Close button when it becomes available.

c. Make sure that you have the latest version of the loudhum package by opening a terminal window and typing /home/rebelsky/bin/csc151/update. (You may have done that in a prior step. If so, there’s no need to do so again.) Alternately, select File > Install Package…, enter “https://github.com/grinnell-cs/loudhum.git” and follow the instructions.

d. Add (require loudhum) to the definitions pane.

e. If you did not set up a Web site at the start of the semester, set one up now by opening a terminal window and typing /home/rebelsky/bin/csc151/setup-web.

f. Verify that you can load one of the sample pages by directing your browser to https://www.cs.grinnell.edu/~username/excerpt.html, substituting your own user name.

g. If we did not review the self checks at the start of class, review the self checks with your partner. You should also check your answers within DrRacket. For example, if we asked you for the output of an expression, try the expression in DrRacket and see the result. Similarly, if we asked you to write code, check your code.

Exercises

Exercise 1: Some basic conversions

a. Write a string that describes a portion of an HTML page that contains at least two paragraphs, each of which has an emphasis tag and one of which has a strong tag.

b. Predict the result of converting that string to the list-based representation of HTML.

c. Write a list-based HTML representation of an unordered list (HTML-style) of three aphorisms, at least one of which contains a quotation.

d. Convert that back to a string.

e. Save the converted string to a file named exercise01.html in your public_html directory.

f. See what happens when you try to load that page in your Web browser. It should be at something like https://www.cs.grinnell.edu/~username/exercise01.html.

Exercise 2: Rewriting HTML files with regular expressions

You may recall that the file excerpt.html contains an excerpt from Through the Looking Glass.

a. Using regexp-replace*, write a set of instructions that makes a copy of excerpt.html with all of the spaces deleted. Save that copy as exercise02a.html.

b. Using regexp-replace*, write a set of instructions that make a copy of excerpt.html in which each emphasized entry becomes strongly emphasized.
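
As a reminder of how regexp-replace* behaves, here is a small sketch on a made-up string rather than on excerpt.html. It rewrites b tags as i tags, using a parenthesized group to remember whether each tag was an opening or a closing tag. The string and the names snippet and rewritten are invented for this example.

```racket
#lang racket

;; A hypothetical bit of HTML, not taken from excerpt.html.
(define snippet "a <b>bold</b> word and <b>another</b>")

;; The pattern "<(/?)b>" matches both <b> and </b>. The parenthesized
;; group captures the optional slash. In the replacement string, \\1
;; stands for whatever that group matched.
(define rewritten
  (regexp-replace* #rx"<(/?)b>" snippet "<\\1i>"))
```

> rewritten
"a <i>bold</i> word and <i>another</i>"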

Exercise 3: Exploring HTML files with regular expressions

a. Using regular expressions, determine how many times a quotation appears in excerpt.html.

b. Using regular expressions, determine how many times emphasized text appears in excerpt.html.

c. Using regular expressions, attempt to determine how many times emphasized text appears within a quotation in excerpt.html. (You may not succeed, but you should try.)

Exercise 4: Extracting information from HTML documents

a. Write a procedure to read excerpt.html into the internal, list-based representation.

b. Using sxpath-match, determine how many paragraphs are in the document.

c. Using sxpath-match, determine what elements are emphasized in the document.

d. Using sxpath-match, determine when emphasized text appears within quotations in the document.

Exercise 5: Summarizing documents

In the reading, we said that it should be possible to extract all of the emphasized text from a document and then put it into a new document by putting the list of elements in a paragraph, the paragraph in a body element, and the body in an html element.

Try doing that with all the quotations in the excerpt.html document.

That is, repeat the steps from the reading, using q rather than em, and making any other modifications you consider appropriate.

Exercise 6: From XML to HTML

When we studied XML and HTML, we saw that XML is much more expressive than HTML, but that HTML provides a bit more standardization on the components of a document. It is therefore often useful to store information in the more expressive XML form and then use a program to automatically convert the XML to HTML for display. XSLT can do this, but it’s sometimes a bit complicated, so we’ll do so using Racket.

Consider the following XML document, taken from the reading on XML, intended to represent a list of books.

<?xml version="1.0" encoding="UTF-8"?>
<collection>
  <name>Project Gutenberg</name>
  <bookinfo book-id="000001">
    <author><first>Thomas</first> <last>Jefferson</last></author>
    <title>The Declaration of Independence of the United States of America</title>
    <url>https://www.gutenberg.org/ebooks/1</url>
  </bookinfo>
  <bookinfo book-id="000002">
    <author>
      Anonymous
      <alternative>
        The United States of America
      </alternative>
    </author>
    <title>The United States Bill of Rights</title>
    <url>https://www.gutenberg.org/ebooks/2</url>
  </bookinfo>
  <bookinfo book-id="000011">
    <author author-id="412369">
      <first>Lewis</first> <last>Carroll</last>
      <alternative>
        <first>Charles</first> <middle>Lutwidge</middle> <last>Dodgson</last>
      </alternative>
    </author>
    <title>Alice's Adventures in Wonderland</title>
    <url>https://www.gutenberg.org/ebooks/11</url>
  </bookinfo>
  <bookinfo book-id="000012">
    <author author-id="412369"/>
    <title>Through the Looking-Glass</title>
    <url>https://www.gutenberg.org/ebooks/12</url>
  </bookinfo>
</collection>

We might want to turn that information into a Web page with quick links to each of the books mentioned. In particular, we’d like something like the following.

<html>
<head>
<title>Some online books at Project Gutenberg</title>
</head>
<body>
<ul>
  <li><a href="https://www.gutenberg.org/ebooks/1">The Declaration of Independence of the United States of America</a></li>
  <li><a href="https://www.gutenberg.org/ebooks/2">The United States Bill of Rights</a></li>
  <li><a href="https://www.gutenberg.org/ebooks/11">Alice's Adventures in Wonderland</a></li>
  <li><a href="https://www.gutenberg.org/ebooks/12">Through the Looking-Glass</a></li>
</ul>
</body>
</html>

In an abbreviated list representation, we might express that as follows.

'(html
  (head (title "Some online books at Project Gutenberg"))
  (body
   (ul
    (li (a (@ (href "https://www.gutenberg.org/ebooks/1")) "The Declaration of Independence of the United States of America"))
    (li (a (@ (href "https://www.gutenberg.org/ebooks/2")) "The United States Bill of Rights"))
    (li (a (@ (href "https://www.gutenberg.org/ebooks/11")) "Alice's Adventures in Wonderland"))
    (li (a (@ (href "https://www.gutenberg.org/ebooks/12")) "Through the Looking-Glass")))))

a. Write a procedure, book->list-item, that takes a book in list-based XML form and converts it to the corresponding HTML in the form listed above.

> (define bill-of-rights
    '(bookinfo (@ (book-id "000002")) 
               (author "Anonymous" 
                       (alternative "The United States of America")) 
               (title "The United States Bill of Rights") 
               (url "https://www.gutenberg.org/ebooks/2")))

> (book->list-item bill-of-rights)
'(li (a (@ (href "https://www.gutenberg.org/ebooks/2")) "The United States Bill of Rights"))

b. The file /home/rebelsky/Desktop/pg01.xml contains a collection of four books in the form described above. Using file->xml and sxpath-match, extract a list of the books from the document.

c. Using a technique similar to that of the prior problem, turn your list of books into a Web page (in list representation). That is,

Apply book->list-item to each element of the list.
Prepend the symbol ul to the resulting list with cons.
Wrap the new ul element in a body.
Create a head element with a title.
Wrap the head and body elements into an html element.
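
The wrapping steps above can be sketched as follows. This is only an illustration that assumes the list items have already been built. The two items are copied from the abbreviated page shown earlier, and the names items, ul-element, and page are invented for this example. In the exercise, the list of items would come from applying book->list-item to each book.

```racket
#lang racket

;; Two list items, copied from the abbreviated page shown earlier.
(define items
  '((li (a (@ (href "https://www.gutenberg.org/ebooks/1"))
           "The Declaration of Independence of the United States of America"))
    (li (a (@ (href "https://www.gutenberg.org/ebooks/2"))
           "The United States Bill of Rights"))))

;; Prepend the tag ul to the list of items to make a ul element.
(define ul-element
  (cons 'ul items))

;; Wrap the ul in a body, build a head with a title, and wrap
;; the head and body in an html element.
(define page
  (list 'html
        (list 'head (list 'title "Some online books at Project Gutenberg"))
        (list 'body ul-element)))
```

Using cons here, rather than list, keeps the items as direct children of the ul element instead of nesting them inside an extra list.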

d. Using xml->file, save your file in your public_html directory as pg01.html.

e. Preview the file in your Web browser at https://www.cs.grinnell.edu/~username/pg01.html.

For those with extra time

We do not anticipate that anyone will have extra time. If you do, start the next lab.

Acknowledgements

This lab was newly written in spring 2019.

The loudhum procedures that support these exercises build on the Racket SXML libraries and on Neil Van Dyke’s html-parsing and html-writing libraries.

Copyright © Charlie Curtsinger, Sarah Dahlby Albright, Janet Davis, Fahmida Hamid, Titus Klinge, Samuel A. Rebelsky, and Jerod Weinman. Selected materials are copyright by John David Stone or Henry Walker and are used with permission.

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
