mcfly, overeasy, html-parsing, html-writing, and sxml. You should also update loudhum.
In MathLAN, running /home/rebelsky/bin/csc151/update should do the job.
In our initial explorations of ways to process XML and the corresponding lab, we considered some approaches to writing programs that take a Web page as input and generate a new Web page based on the information on the page.
While such approaches can be useful, they are not universally applicable. For example, what happens if we want to replace just one small portion of a page, keeping everything else the same? Or what if we want to remove rather than extract portions of a page?
The sxml:modify procedure can address many of these situations. However, it is complicated enough that we will instead use a variety of procedures from the loudhum library.
The (sxpath-replace path xml transform) procedure identifies all the entities described by path in the given xml list and replaces each one with the result of applying transform to it.
Here’s a simple example of XML to get us started.
> (define example (string->xml "<div><p>I like <em>this</em> and <em>that</em>.</p><p>You may like <em>emphasized</em> text.</p></div>"))
> example
'(div (p "I like " (em "this") " and " (em "that") ".") (p "You may like " (em "emphasized") " text."))
> (sxpath-match "//em" example)
'((em "this") (em "that") (em "emphasized"))
> (sxpath-match "//p/em[1]" example)
'((em "this") (em "emphasized"))
Now, let’s build a procedure that changes an em element to a strong element.
> (define em->strong
    (lambda (elt)
      (cons 'strong (cdr elt))))
> (em->strong '(em "this"))
'(strong "this")
> (map em->strong (sxpath-match "//em" example))
'((strong "this") (strong "that") (strong "emphasized"))
We can use this to replace all of the emphasized text “in place”.
> (sxpath-replace "//em" example em->strong)
'((div (p "I like " (strong "this") " and " (strong "that") ".") (p "You may like " (strong "emphasized") " text.")))
We can also use it with more complex paths, such as the following one, which updates only the first piece of emphasized text in each paragraph.
> (sxpath-replace "//em[1]" example em->strong)
'((div (p "I like " (strong "this") " and " (em "that") ".") (p "You may like " (strong "emphasized") " text.")))
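To build intuition for what a call like (sxpath-replace "//em" example em->strong) does, here is a minimal sketch in plain Racket. The replace-tagged procedure is hypothetical and is not part of loudhum; it matches on a single tag name rather than on a full XPath, and it assumes that every element is a list whose first item is a symbol.

```racket
#lang racket

;; replace-tagged is a hypothetical helper, NOT part of loudhum.
;; It walks an SXML-style list and replaces every element whose
;; tag is `tag` with the result of calling `transform` on it.
;; Assumption: elements are lists that start with a symbol;
;; attribute lists get no special treatment.
(define replace-tagged
  (lambda (tag xml transform)
    (cond
      ; Strings (and anything else that is not a pair) stay as they are.
      [(not (pair? xml))
       xml]
      ; A matching element is handed to the transformation.
      [(eq? (car xml) tag)
       (transform xml)]
      ; Otherwise, keep the tag and process each child.
      [else
       (cons (car xml)
             (map (lambda (child) (replace-tagged tag child transform))
                  (cdr xml)))])))

(define example
  '(div (p "I like " (em "this") " and " (em "that") ".")
        (p "You may like " (em "emphasized") " text.")))

(define em->strong
  (lambda (elt)
    (cons 'strong (cdr elt))))

(replace-tagged 'em example em->strong)
```

Unlike the sxpath-replace results shown above, this sketch returns the tree itself rather than a one-element list, so the last expression produces '(div (p "I like " (strong "this") " and " (strong "that") ".") (p "You may like " (strong "emphasized") " text.")).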
Of course, we can also choose somewhat less sensible replacements.
> (sxpath-replace "//em" example (lambda (x) "FOO"))
'((div (p "I like " "FOO" " and " "FOO" ".") (p "You may like " "FOO" " text.")))
> (xml->string (sxpath-replace "//em" example (lambda (x) "FOO")))
"<div><p>I like FOO and FOO.</p><p>You may like FOO text.</p></div>"
What about replacing the text within the document? There’s an XPath expression that we have not mentioned yet, "//text()", that selects all the text (and only the text) in a document.
> (sxpath-match "//text()" example)
'("I like " " and " "." "this" "that" "You may like " " text." "emphasized")
We can therefore replace pieces of text (e.g., with regexp-replace*) by using that selector.
> (sxpath-replace "//text()" example (section regexp-replace* #px"[aeiou]" <> "-"))
'((div (p "I l-k- " (em "th-s") " -nd " (em "th-t") ".") (p "Y-- m-y l-k- " (em "-mph-s-z-d") " t-xt.")))
What if we wanted to update only the text within certain tags? That may be a bit more complicated. First we’ll need to select the emphasized elements, then we’ll need to extract the text from them. Let’s see ….
> (sxpath-replace "//em"
                  example
                  (lambda (element)
                    (sxpath-replace "//text()"
                                    element
                                    (section regexp-replace*
                                             #px"[aeiou]"
                                             <>
                                             "-"))))
'((div (p "I like " (em "th-s") " and " (em "th-t") ".") (p "You may like " (em "-mph-s-z-d") " text.")))
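The nested call works because the inner replacement only ever sees one em element at a time. The following plain-Racket sketch illustrates that inner step on its own. Both map-text and hide-vowels are hypothetical names introduced for illustration; they are not loudhum procedures.

```racket
#lang racket

;; map-text is a hypothetical helper, NOT part of loudhum.
;; It applies string-transform to every string in an SXML-style
;; tree and leaves tags untouched.
(define map-text
  (lambda (xml string-transform)
    (cond
      [(string? xml)
       (string-transform xml)]
      [(pair? xml)
       (cons (car xml)
             (map (lambda (child) (map-text child string-transform))
                  (cdr xml)))]
      [else
       xml])))

;; Replace each lowercase vowel with a dash, as in the example above.
(define hide-vowels
  (lambda (str)
    (regexp-replace* #px"[aeiou]" str "-")))

(map-text '(em "emphasized") hide-vowels)
(map-text '(p "I like " (em "this") ".") hide-vowels)
```

The first call yields '(em "-mph-s-z-d"). The second yields '(p "I l-k- " (em "th-s") "."), which shows why the scope matters: whatever subtree we hand to the text transformation is changed in full, so restricting the change to emphasized text means handing it only the em elements.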
As you might expect, (sxpath-delete path xml) deletes all elements in an xml expression that are indicated by the given path.
> (sxpath-delete "//em" example)
'((div (p "I like " " and " ".") (p "You may like " " text.")))
> (sxpath-delete "//em[2]" example)
'((div (p "I like " (em "this") " and " ".") (p "You may like " (em "emphasized") " text.")))
> (sxpath-delete "//text()" example)
'((div (p (em) (em)) (p (em))))
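Deleting by tag can be pictured as filtering each element's list of children. The sketch below is one way to express that idea in plain Racket. The delete-tagged procedure is hypothetical, handles only a single tag name rather than a full XPath, and is not how loudhum necessarily implements sxpath-delete.

```racket
#lang racket

;; delete-tagged is a hypothetical helper, NOT part of loudhum.
;; It drops every child element whose tag is `tag`, at any depth.
;; Assumption: the root element itself is never deleted.
(define delete-tagged
  (lambda (tag xml)
    (if (pair? xml)
        (cons (car xml)
              (map (lambda (child) (delete-tagged tag child))
                   ; Keep only the children that do not carry the tag.
                   (filter (lambda (child)
                             (not (and (pair? child)
                                       (eq? (car child) tag))))
                           (cdr xml))))
        xml)))

(define example
  '(div (p "I like " (em "this") " and " (em "that") ".")
        (p "You may like " (em "emphasized") " text.")))

(delete-tagged 'em example)
```

The final expression produces '(div (p "I like " " and " ".") (p "You may like " " text.")), the same tree that (sxpath-delete "//em" example) returns inside its one-element list.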
Once in a while, you may find that you want to delete the enclosing tag, but not the enclosed text. While it may sometimes be possible to use sxpath-replace to achieve this goal, it turns out to be better to have a more general procedure, sxpath-remove, which removes only the tag at the top of any matching section.
> (sxpath-remove "//em" example)
'((div (p "I like " "this" " and " "that" ".") (p "You may like " "emphasized" " text.")))
> (sxpath-remove "//em[1]" example)
'((div (p "I like " "this" " and " (em "that") ".") (p "You may like " "emphasized" " text.")))
We’ll introduce one other pattern here: [contains(text(),'string')] can be used to select elements that contain a particular string. (No regular expressions here, just a string.)
> (sxpath-remove "//em[contains(text(),'this')]" example)
'((div (p "I like " "this" " and " (em "that") ".") (p "You may like " (em "emphasized") " text.")))
> (sxpath-remove "//em[contains(text(),'that')]" example)
'((div (p "I like " (em "this") " and " "that" ".") (p "You may like " (em "emphasized") " text.")))
> (sxpath-remove "//em[contains(text(),'a')]" example)
'((div (p "I like " (em "this") " and " "that" ".") (p "You may like " "emphasized" " text.")))
We mentioned that in some cases, we could use sxpath-replace to achieve the same goal. For example, here’s another way to replace all of the em tags in the example with their contents.
> (sxpath-replace "//em" example (section list-ref <> 1))
'((div (p "I like " "this" " and " "that" ".") (p "You may like " "emphasized" " text.")))
Unfortunately, this approach does not work when a selected element has more than one child, because (section list-ref <> 1) keeps only the first child.
> (sxpath-remove "//em" '(p (em "I'm " (strong "very") " confused.")))
'((p "I'm " (strong "very") " confused."))
> (sxpath-replace "//em" '(p (em "I'm " (strong "very") " confused.")) (section list-ref <> 1))
'((p "I'm "))
In the end, we’re better off just using sxpath-remove for cases like this.
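The difference between the two approaches comes down to splicing: removing a tag means putting all of the element's children into the parent's list, not just the first one. Here is a minimal plain-Racket sketch of that idea. The procedures remove-tag and splice-child are hypothetical names, not loudhum procedures, and they match a single tag name rather than a full XPath.

```racket
#lang racket

;; remove-tag and splice-child are hypothetical helpers, NOT part
;; of loudhum. remove-tag strips every `tag` wrapper below the root
;; of an SXML-style tree, keeping all of the wrapper's children.
(define remove-tag
  (lambda (tag xml)
    (if (pair? xml)
        (cons (car xml)
              ; append-map splices each list of replacement nodes
              ; into the parent's list of children.
              (append-map (lambda (child) (splice-child tag child))
                          (cdr xml)))
        xml)))

;; Returns the list of nodes that should take the place of child.
(define splice-child
  (lambda (tag child)
    (if (and (pair? child) (eq? (car child) tag))
        ; Matching element: keep every (processed) child, drop the tag.
        (cdr (remove-tag tag child))
        ; Anything else: keep the node itself, processed recursively.
        (list (remove-tag tag child)))))

(define confused
  '(p (em "I'm " (strong "very") " confused.")))

(remove-tag 'em confused)

;; For comparison, list-ref picks out only the first child.
(list-ref '(em "I'm " (strong "very") " confused.") 1)
```

The first expression produces '(p "I'm " (strong "very") " confused."), while the list-ref call produces only "I'm ", which is why the replacement-based version loses the rest of the sentence.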
What else might we want to do with an XML document? We might want to insert elements, move elements, and perhaps even rearrange elements. However, we have covered enough for now and will leave those activities to another time.
This reading was newly written in spring 2019.
The loudhum libraries that support these exercises rely on the Racket SXML libraries and on Neil Van Dyke’s html-parsing and html-writing libraries.