Lab: Transforming XML

Held: Wednesday, 6 March 2019
Writeup due: Friday, 8 March 2019
Summary: We consider some additional techniques for transforming XML documents, especially ways to modify specific parts of a document.

Useful notation

'(tag (@ (name1 val1) (name2 val2) ...) element1 element2 ...) - A list-based representation of an XML/HTML element. The attribute section is optional. Each element is either a string or another XML/HTML element.

"//tag" - an XPath pattern to search for elements with the given tag.

"//tag0/tag1" - an XPath pattern to search for elements with tag tag1 that appear directly under elements with tag tag0.

"//tag0//tag1" - an XPath pattern to search for elements with tag tag1 that appear anywhere under elements with tag tag0.

"//tag[1]" - the first instance of the tag within an enclosing element. (Similarly, "//tag[2]" is the second instance, and so on.)

"//tag[@class='name']" - all tags with the given class.

"//text()" - all of the text in a document.

"//tag[contains(text(),'string')]" - all instances of the tag that contain that string in their text.

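To make the list-based notation concrete, here is a minimal sketch in plain Racket. The sample element and the helper names element-tag, attribute-section?, and element-contents are made up for illustration; they are not part of loudhum or sxml.

```racket
#lang racket

;; A hypothetical paragraph element with one attribute and mixed contents.
(define sample
  '(p (@ (class "intro"))
      "Some plain text, "
      (em "some emphasized text")
      ", and more plain text."))

;; The tag is the first item in the list.
(define (element-tag element)
  (car element))

;; The attribute section, when present, is a list that begins with the symbol @.
(define (attribute-section? item)
  (and (pair? item) (eq? (car item) '@)))

;; The contents are everything after the tag and the optional attribute section.
(define (element-contents element)
  (let ([after-tag (cdr element)])
    (if (and (pair? after-tag)
             (attribute-section? (car after-tag)))
        (cdr after-tag)
        after-tag)))
```

For example, (element-tag sample) gives the symbol p, and (element-contents sample) gives a list of two strings with an em element between them.
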
Useful procedures

(file->xml fname) - Read an HTML document and convert it to the list-based representation.

(xml->file html fname) - Save the list-based representation of an HTML document in a file.

(string->xml str) - Convert a string to the list-based representation.

(xml->string xml) - Convert the list-based representation to a string.

(sxpath-match pattern xml) - Search the HTML document for elements that match the pattern.

(sxpath-replace pattern xml proc) - Update any element matching the pattern by applying proc.

(sxpath-delete pattern xml) - Delete any element matching the pattern.

(sxpath-remove pattern xml) - Remove the tag in any element matching the pattern, moving any contents of the element up to the enclosing element.
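
The following sketch shows how these procedures might be called. It assumes the parameter orders listed above, that loudhum and sxml are installed as described in the preparation steps, and that a quoted list in the notation above is an acceptable document. The document doc and its contents are made up for illustration, and the procedure passed to sxpath-replace is assumed to take the matching element and return its replacement.

```racket
#lang racket
(require loudhum)
(require sxml)

;; A small hypothetical document in the list-based representation.
(define doc
  '(p "One "
      (em (@ (class "loud")) "two")
      " three "
      (em (@ (class "soft")) "four")
      " "
      (strong "five")))

;; Find every em element.
(sxpath-match "//em" doc)

;; Find only the em elements whose class is "loud".
(sxpath-match "//em[@class='loud']" doc)

;; Replace each "soft" em element with a fixed em element
;; (assumes proc receives the matching element).
(sxpath-replace "//em[@class='soft']"
                doc
                (lambda (element) '(em "REPLACED")))

;; Delete every strong element, contents and all.
(sxpath-delete "//strong" doc)

;; Remove the strong tag but keep its contents in the enclosing p.
(sxpath-remove "//strong" doc)
```

Each call is expected to produce a new value rather than change doc, so the results can be compared against the original in the interactions pane.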

Preparation

a. Start DrRacket.

b. Make sure that you have the latest version of the loudhum package by opening a terminal window and typing /home/rebelsky/bin/csc151/update. (Alternately, select File > Install Package…, enter “https://github.com/grinnell-cs/loudhum.git” and follow the instructions.)

c. Install the sxml package as follows: Select File > Install Package…. Enter “sxml”. Click Install. When a Close button appears, click it.

d. Add (require loudhum) and (require sxml) to the definitions pane.

e. If you did not set up a Web site in MathLAN at the start of the semester, set one up now by opening a terminal window and typing /home/rebelsky/bin/csc151/setup-web.

f. Verify that you can load one of the sample pages by directing your browser to https://www.cs.grinnell.edu/~username/excerpt.html, substituting your own user name.

Exercises

Exercise 1: Exploring quotations

As you may recall, excerpt.html contains a short excerpt from Through the Looking Glass.

a. Write an expression that identifies all of the quotations in that document.

b. Write an expression that identifies all of the quotations by the White Queen.

c. Write an expression that identifies all of the spoken quotations.

Exercise 2: Off with their words!

a. Write an expression that replaces every one of the White Queen’s quotations with the text “Off with their heads!”.

b. Write an expression that removes every one of the White Queen’s quotations.

Exercise 3: Reformatting

a. Write an expression that strongly emphasizes every spoken quotation. That is, put a strong tag around the quotation.

b. Write an expression that turns every spoken quotation into all caps. That is, identify the text within the quotation and call string-upcase on that text.

Exercise 4: Dequoting Alice

Write an expression that removes the q tag from any of Alice’s quotations.

Exercise 5: Inserting text

Write an expression that inserts the text "PAY ATTENTION:" at the start of every quotation.

Reminder: You can use append to join lists. In this case, you’ll want to join a list of the tag (and, possibly, the attributes), a list of the string, and the rest of the contents of the element.

Note that this process is complicated by the possible inclusion of attributes in the quotation. Fortunately, there’s a (has-attributes? element) procedure that checks whether or not there’s a set of attributes.
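
As a minimal sketch of the append idea, here is the simpler case of an element that has no attribute section, written in plain Racket. The name prepend-text and the li example are hypothetical and not part of loudhum; handling elements with attributes is left out on purpose.

```racket
#lang racket

;; Hypothetical helper: put str in front of the contents of an element
;; that has NO attribute section.
(define (prepend-text element str)
  (append (list (car element))   ; a list of the tag
          (list str)             ; a list of the new string
          (cdr element)))        ; the rest of the contents

;; Example use on a made-up list item.
(prepend-text '(li "buy milk " (em "today")) "NOTE: ")
```

If the element did have an attribute section, this version would put the string before the attributes, which is why the attribute check matters.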

For those with extra time

We do not anticipate that anyone will have extra time.

Acknowledgements

This lab was newly written in spring 2019.

The loudhum libraries that support these exercises build on the Racket SXML libraries and on Neil Van Dyke’s html-parsing and html-writing libraries.