Transforming XML

Due: Wednesday, 6 March 2019
Summary: We consider some additional techniques for transforming XML documents that build upon our work in extracting information from XML documents.
Prerequisites: XML basics. HTML and the Web. Regular expressions. Processing XML.
Note: Before running the examples, you need to install a variety of libraries, including mcfly, overeasy, html-parsing, html-writing, and sxml. You should also update loudhum. In MathLAN, running /home/rebelsky/bin/csc151/update should do the job.

In our initial explorations of ways to process XML and the corresponding lab, we considered some approaches to writing programs that take a Web page as input and generate a new Web page based on the information on the page.

While such approaches can be useful, they are not universally applicable. For example, what happens if we want to replace just one small portion of a page, keeping everything else the same? Or what if we want to remove rather than extract portions of a page?

The sxml:modify procedure can address many of these situations. However, it is complicated enough that we will instead use a variety of procedures from the loudhum library.

Replacing and rewriting elements

The (sxpath-replace path xml transform) procedure finds every element of xml that matches path and builds a new structure in which each match is replaced by the result of applying transform to it. The original xml is left unchanged.

Here’s a simple example of XML to get us started.

> (define example (string->xml "<div><p>I like <em>this</em> and <em>that</em>.</p><p>You may like <em>emphasized</em> text.</p></div>"))
> example
'(div (p "I like " (em "this") " and " (em "that") ".") (p "You may like " (em "emphasized") " text."))
> (sxpath-match "//em" example)
'((em "this") (em "that") (em "emphasized"))
> (sxpath-match "//p/em[1]" example)
'((em "this") (em "emphasized"))

Now, let’s build a procedure that changes an em element to a strong element.

> (define em->strong
    (lambda (elt)
      (cons 'strong (cdr elt))))
> (em->strong '(em "this"))
'(strong "this")
> (map em->strong (sxpath-match "//em" example))
'((strong "this") (strong "that") (strong "emphasized"))

We can use this to replace all of the emphasized text “in place”.

> (sxpath-replace "//em" example em->strong)
'((div (p "I like " (strong "this") " and " (strong "that") ".") (p "You may like " (strong "emphasized") " text.")))

We can also use it with more complex paths, such as the following one, which updates only the first piece of emphasized text in each paragraph.

> (sxpath-replace "//em[1]" example em->strong)
'((div (p "I like " (strong "this") " and " (em "that") ".") (p "You may like " (strong "emphasized") " text.")))
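
Because the transformation is an ordinary one-parameter procedure, em->strong can be generalized. The following is a minimal sketch in plain Racket. The name make-retagger is hypothetical (it is not part of loudhum), and the sketch assumes that an element is a list whose first item is the tag.

```racket
#lang racket

;; make-retagger is a hypothetical helper (not part of loudhum).
;; It builds a procedure that swaps an element's tag for new-tag
;; and leaves everything after the tag unchanged.
(define make-retagger
  (lambda (new-tag)
    (lambda (elt)
      (cons new-tag (cdr elt)))))

;; Behaves like the em->strong procedure defined earlier.
(define em->strong (make-retagger 'strong))

(em->strong '(em "this"))          ; '(strong "this")
((make-retagger 'b) '(em "that"))  ; '(b "that")
```

A procedure built this way should be usable as the third parameter to sxpath-replace, as in (sxpath-replace "//em" example (make-retagger 'strong)).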

Of course, we can also choose somewhat less sensible replacements.

> (sxpath-replace "//em" example (lambda (x) "FOO"))
'((div (p "I like " "FOO" " and " "FOO" ".") (p "You may like " "FOO" " text.")))
> (xml->string (sxpath-replace "//em" example (lambda (x) "FOO")))
"<div><p>I like FOO and FOO.</p><p>You may like FOO text.</p></div>"

What about replacing the text within the document? There’s an XPath expression that we have not mentioned yet, "//text()", that selects all the text (and only the text) in a document.

> (sxpath-match "//text()" example)
'("I like " " and " "." "this" "that" "You may like " " text." "emphasized")

We can therefore replace pieces of text (e.g., with regexp-replace*) by using that selector.

> (sxpath-replace "//text()" example (section regexp-replace* #px"[aeiou]" <> "-"))
'((div (p "I l-k- " (em "th-s") " -nd " (em "th-t") ".") (p "Y-- m-y l-k- " (em "-mph-s-z-d") " t-xt.")))
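
In case the section form is unfamiliar: it builds a procedure in which <> marks where the parameter goes. Assuming it behaves like the cut form of SRFI 26, the transformation above could also be written as an explicit lambda. This is a sketch in plain Racket, and replace-vowels is a name chosen only for illustration.

```racket
#lang racket

;; replace-vowels is an illustrative name, not a loudhum procedure.
;; It is intended to do the same job as
;;   (section regexp-replace* #px"[aeiou]" <> "-")
;; assuming section puts its argument where <> appears.
(define replace-vowels
  (lambda (str)
    (regexp-replace* #px"[aeiou]" str "-")))

(replace-vowels "I like ")     ; "I l-k- "
(replace-vowels "emphasized")  ; "-mph-s-z-d"
```

Note that the pattern matches only lowercase vowels, which is why the capital I in "I like " survives.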

What if we wanted to update only the text within certain tags? That may be a bit more complicated. First we’ll need to select the emphasized elements; then we’ll need to update the text within each of them. Let’s see ….

> (sxpath-replace "//em" 
                  example 
                  (lambda (element)
                    (sxpath-replace "//text()"
                                    element
                                    (section regexp-replace* 
                                             #px"[aeiou]"
                                             <>
                                             "-"))))
'((div (p "I like " (em "th-s") " and " (em "th-t") ".") (p "You may like " (em "-mph-s-z-d") " text.")))
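
To see what the inner call accomplishes for a single element, here is a sketch of a direct recursive version in plain Racket. The names map-text and hide-vowels are hypothetical (they are not loudhum procedures). The sketch assumes that elements are lists that begin with a tag symbol, and it leaves attribute lists (those that begin with @) untouched.

```racket
#lang racket

;; map-text is a hypothetical helper (not part of loudhum).
;; It applies fun to every string inside node, at any depth,
;; and keeps the tags and the nesting as they were.
(define map-text
  (lambda (fun node)
    (cond
      ;; A piece of text: transform it.
      [(string? node) (fun node)]
      ;; An attribute list such as (@ (href "...")): leave it alone.
      [(and (pair? node) (eq? (car node) '@)) node]
      ;; An element: keep the tag, recurse into the children.
      [(pair? node)
       (cons (car node)
             (map (lambda (child) (map-text fun child))
                  (cdr node)))]
      ;; Anything else is returned unchanged.
      [else node])))

(define hide-vowels
  (lambda (str)
    (regexp-replace* #px"[aeiou]" str "-")))

(map-text hide-vowels '(em "this"))
; '(em "th-s")
(map-text hide-vowels '(p "You may like " (em "emphasized") " text."))
; '(p "Y-- m-y l-k- " (em "-mph-s-z-d") " t-xt.")
```

Used as the transformation for "//em", a procedure such as (lambda (element) (map-text hide-vowels element)) would be expected to change the same text as the nested sxpath-replace call above.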

Deleting text

As you might expect, (sxpath-delete path xml) deletes all elements in an xml expression that are indicated by the given path.

> (sxpath-delete "//em" example)
'((div (p "I like " " and " ".") (p "You may like " " text.")))
> (sxpath-delete "//em[2]" example)
'((div (p "I like " (em "this") " and " ".") (p "You may like " (em "emphasized") " text.")))
> (sxpath-delete "//text()" example)
'((div (p (em) (em)) (p (em))))

Once in a while, you may find that you want to delete the enclosing tag, but not the enclosed text. While it may sometimes be possible to use sxpath-replace to achieve this goal, it turns out to be better to have a more general procedure, sxpath-remove, which removes only the tag at the top of any matching section.

> (sxpath-remove "//em" example)
'((div (p "I like " "this" " and " "that" ".") (p "You may like " "emphasized" " text.")))
> (sxpath-remove "//em[1]" example)
'((div (p "I like " "this" " and " (em "that") ".") (p "You may like " "emphasized" " text.")))

We’ll introduce one other pattern here … [contains(text(),'string')] can be used to select elements that contain a particular string. (No regular expressions here, just a string.)

> (sxpath-remove "//em[contains(text(),'this')]" example)
'((div (p "I like " "this" " and " (em "that") ".") (p "You may like " (em "emphasized") " text.")))
> (sxpath-remove "//em[contains(text(),'that')]" example)
'((div (p "I like " (em "this") " and " "that" ".") (p "You may like " (em "emphasized") " text.")))
> (sxpath-remove "//em[contains(text(),'a')]" example)
'((div (p "I like " (em "this") " and " "that" ".") (p "You may like " "emphasized" " text.")))

We mentioned that in some cases, we could use sxpath-replace to achieve the same goal. For example, here’s another way to replace all of the em tags in the example with their contents.

> (sxpath-replace "//em" example (section list-ref <> 1))
'((div (p "I like " "this" " and " "that" ".") (p "You may like " "emphasized" " text.")))

Unfortunately, this approach won’t quite work if a selected element has more than one child, since list-ref keeps only the first child and drops the rest.

> (sxpath-remove "//em" '(p (em "I'm " (strong "very") " confused.")))
'((p "I'm " (strong "very") " confused."))
> (sxpath-replace "//em" '(p (em "I'm " (strong "very") " confused.")) (section list-ref <> 1))
'((p "I'm "))

In the end, we’re better off just using sxpath-remove for cases like this.
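
The difference comes down to keeping one child versus keeping all of them. The following sketch in plain Racket shows the contrast and one way a tag could be removed by splicing every child into the parent. The name unwrap-tag is hypothetical and this is not how sxpath-remove is implemented. The sketch matches on the tag name only (no paths) and assumes that matching elements carry no attribute list.

```racket
#lang racket

(define confused '(em "I'm " (strong "very") " confused."))

;; list-ref keeps only the first child ...
(list-ref confused 1)  ; "I'm "
;; ... while cdr keeps all of the children.
(cdr confused)         ; '("I'm " (strong "very") " confused.")

;; unwrap-tag is a hypothetical helper (not part of loudhum).
;; It returns a list of nodes, because removing a tag can turn
;; one node into several.
(define unwrap-tag
  (lambda (tag node)
    (cond
      ;; Text and other non-elements stay as they are.
      [(not (pair? node)) (list node)]
      ;; A matching element: drop the tag, keep every child.
      [(eq? (car node) tag)
       (append-map (lambda (child) (unwrap-tag tag child))
                   (cdr node))]
      ;; Any other element: keep the tag, splice in the
      ;; processed children.
      [else
       (list (cons (car node)
                   (append-map (lambda (child) (unwrap-tag tag child))
                               (cdr node))))])))

(unwrap-tag 'em (list 'p confused))
; '((p "I'm " (strong "very") " confused."))
```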

Other possibilities

What else might we want to do with an XML document? We might want to insert elements, move elements, and perhaps even rearrange elements. However, we have covered enough for now and will leave those activities to another time.

Acknowledgements

This reading was newly written in spring 2019.

The loudhum procedures that support these exercises build on the Racket SXML libraries and on Neil Van Dyke’s html-parsing and html-writing libraries.

Copyright © Charlie Curtsinger, Sarah Dahlby Albright, Janet Davis, Fahmida Hamid, Titus Klinge, Samuel A. Rebelsky, and Jerod Weinman. Selected materials are copyright by John David Stone or Henry Walker and are used with permission.

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
