'(tag (@ (name1 val1) (name2 val2) ...) element1 element2 ...)
- a list-based representation of an XML/HTML element. The attribute
section is optional. The elements are either strings or themselves
XML/HTML elements.
"//tag"
- an XPath pattern to search for elements with the given tag.
"//tag0/tag1"
- an XPath pattern to search for elements with tag tag1
that appear directly under elements with tag tag0.
"//tag0//tag1"
- an XPath pattern to search for elements with tag tag1
that appear anywhere under elements with tag tag0.
"//tag[1]"
- the first instance of the tag within an enclosing
element. (Similarly, "//tag[2]" matches the second instance,
and so on.)
"//tag[@class='name']"
- all elements with the given tag and the given class.
(file->xml fname)
- Read an HTML or XML document and convert it
to the list-based representation.
(xml->file xml fname)
- Save the list-based representation of an HTML
or XML document in a file.
(string->xml str)
- Convert a string to the list-based representation.
(xml->string xml)
- Convert the list-based representation to a string.
(sxpath-match pattern xml)
- Search the XML document for elements that match
the pattern.
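To make the list-based representation more concrete, here is a minimal sketch that takes an element apart with plain list operations. The helper names (element-tag, has-attributes?, element-attributes, element-children) are hypothetical illustrations and are not part of loudhum. The sample element is made up.

```racket
#lang racket

;; A small element in the list-based form described above:
;; a tag, an optional (@ ...) attribute section, then the child elements.
(define sample
  '(p (@ (class "intro") (id "first"))
      "Hello, "
      (em "world")
      "!"))

;; Hypothetical helpers (not part of loudhum).

;; The tag is the first item in the list.
(define (element-tag elt)
  (car elt))

;; Is there an attribute section right after the tag?
(define (has-attributes? elt)
  (and (pair? (cdr elt))
       (pair? (cadr elt))
       (eq? (car (cadr elt)) '@)))

;; The attribute list, or the empty list when the section is absent.
(define (element-attributes elt)
  (if (has-attributes? elt)
      (cdr (cadr elt))
      '()))

;; The children: everything after the tag and the optional attribute section.
(define (element-children elt)
  (if (has-attributes? elt)
      (cddr elt)
      (cdr elt)))

(element-tag sample)             ; 'p
(element-attributes sample)      ; '((class "intro") (id "first"))
(element-children sample)        ; '("Hello, " (em "world") "!")
(element-children '(em "world")) ; '("world")
```

Because the attribute section is optional, each helper checks for it before deciding where the children start.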
a. Start DrRacket.
b. Before running the examples, you need to install a variety of libraries, including mcfly, overeasy, html-parsing, html-writing, and sxml. You can open a terminal window and type /home/rebelsky/bin/csc151/update to install all of those packages. Alternately, you can select File > Install Package…, enter each name in turn, click Install for each, and then click the Close button when it becomes available.
c. Make sure that you have the latest version of the loudhum package by opening a terminal window and typing /home/rebelsky/bin/csc151/update. (You may have done that in a prior step. If so, there’s no need to do so again.) Alternately, select File > Install Package…, enter “https://github.com/grinnell-cs/loudhum.git” and follow the instructions.
d. Add (require loudhum) to the definitions pane.
e. If you did not set up a Web site at the start of the semester, set one up now by opening a terminal window and typing /home/rebelsky/bin/csc151/setup-web.
f. Verify that you can load one of the sample pages by directing your browser to https://www.cs.grinnell.edu/~username/excerpt.html, substituting your own user name.
g. If we did not review the self checks at the start of class, review the self checks with your partner. You should also check your answers within DrRacket. For example, if we asked you for the output of an expression, try the expression in DrRacket and see the result. Similarly, if we asked you to write code, check your code.
a. Write a string that describes a portion of an HTML page that contains at least two paragraphs, each of which has an emphasis tag and one of which has a strong tag.
b. Predict the result of converting that string to the list-based representation of HTML.
c. Write a list-based HTML representation of an unordered list (HTML-style) of three aphorisms, at least one of which contains a quotation.
d. Convert that back to a string.
e. Save the converted string to a file named exercise01.html in your public_html directory.
f. See what happens when you try to load that page in your Web browser. It should be at something like https://www.cs.grinnell.edu/~username/exercise01.html.
You may recall that the file excerpt.html contains an excerpt from Through the Looking Glass.
a. Using regexp-replace*, write a set of instructions that make a copy of excerpt.html with all of the spaces deleted. Save that copy as exercise02a.html.
b. Using regexp-replace*, write a set of instructions that make a copy of excerpt.html in which each emphasized entry becomes strongly emphasized.
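As a reminder of how regexp-replace* behaves, here is a minimal sketch on a made-up fragment (not the contents of excerpt.html) that renames a different pair of tags. The helper copy-replacing and the file names in the usage comment are hypothetical. The file operations file->string and display-to-file come from standard Racket.

```racket
#lang racket

;; A made-up fragment, used only for illustration.
(define fragment "<p>See <i>Alice</i> and <i>Hatter</i>.</p>")

;; regexp-replace* replaces every match, not just the first.
;; The pattern <(/?)i> matches both <i> and </i>. In the replacement,
;; \\1 stands for whatever group 1 matched ("" or "/").
(define renamed
  (regexp-replace* #rx"<(/?)i>" fragment "<\\1cite>"))

renamed ; "<p>See <cite>Alice</cite> and <cite>Hatter</cite>.</p>"

;; Hypothetical helper: read the source file as one string, apply the
;; replacement everywhere, and write the result to the target file.
(define (copy-replacing pattern replacement source target)
  (display-to-file (regexp-replace* pattern (file->string source) replacement)
                   target
                   #:exists 'replace))

;; Possible use (the file names are placeholders):
;; (copy-replacing #rx"<(/?)i>" "<\\1cite>" "in.html" "out.html")
```

Capturing the optional slash lets one pattern handle both the opening and the closing tag. A pattern this simple assumes that the tags carry no attributes.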
a. Using regular expressions, determine how many times a quotation appears in excerpt.html.
b. Using regular expressions, determine how many times emphasized text appears in excerpt.html.
c. Using regular expressions, attempt to determine how many times emphasized text appears within a quotation in excerpt.html. (You may not succeed, but you should try.)
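One way to count occurrences is to collect all of the matches and take the length of the resulting list. The sketch below uses a made-up fragment and a hypothetical helper, count-matches. It assumes that the tags have no attributes and that quotations do not nest, which may not hold for a real page.

```racket
#lang racket

;; A made-up fragment, used only for illustration.
(define fragment
  "<p><q>Hello</q>, she said. <q>Oh, <em>really</em>?</q> <em>Yes.</em></p>")

;; regexp-match* returns a list of all non-overlapping matches,
;; so its length is the number of times the pattern appears.
(define (count-matches pattern str)
  (length (regexp-match* pattern str)))

(count-matches #rx"<q>" fragment)  ; 2
(count-matches #rx"<em>" fragment) ; 2

;; The non-greedy .*? stops at the first </q>, so each match
;; stays inside a single <q>...</q> pair.
(regexp-match* #rx"<q>.*?</q>" fragment)
; '("<q>Hello</q>" "<q>Oh, <em>really</em>?</q>")
```

A greedy .* would instead run from the first opening tag to the last closing tag, which is one reason nested structure is hard to handle with regular expressions alone.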
a. Write a procedure to read excerpt.html into the internal, list-based representation.
b. Using sxpath-match, determine how many paragraphs are in the document.
c. Using sxpath-match, determine what elements are emphasized in the document.
d. Using sxpath-match, determine when emphasized text appears within quotations in the document.
In the reading, we said that it should be possible to extract all of the emphasized text from a document and then put it into a new document by putting the list of elements in a paragraph, the paragraph in a body element, and the body in an html element.
Try doing that with all the quotations in the excerpt.html document. That is, repeat the steps from the reading, using q rather than em, and making any other modifications you consider appropriate.
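The wrapping step described above (elements in a paragraph, paragraph in a body, body in an html element) can be sketched with plain list operations. The list found below is made-up stand-in data, not the result of running sxpath-match, and elements->page is a hypothetical name.

```racket
#lang racket

;; Stand-in for the list of elements that a search might return.
(define found
  '((q "first quotation")
    (q "second quotation")))

;; Hypothetical helper: wrap a list of elements in p, body, and html.
;; cons puts the tag p in front of the list of children.
(define (elements->page elements)
  (list 'html
        (list 'body
              (cons 'p elements))))

(elements->page found)
; '(html (body (p (q "first quotation") (q "second quotation"))))
```

We use cons for the paragraph because its children are already in a list, and list for the outer elements because each has exactly one child.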
When we studied XML and HTML, we saw that XML is much more expressive than HTML, but that HTML provides a bit more standardization on the components of a document. It is therefore often useful to store information in the more expressive XML form and then use a program to automatically convert the XML to HTML for display. XSLT can do this, but it’s sometimes a bit complicated, so we’ll do so using Racket.
Consider the following XML document, taken from the reading on XML, intended to represent a list of books.
<?xml version="1.0" encoding="UTF-8"?>
<collection>
  <name>Project Gutenberg</name>
  <bookinfo book-id="000001">
    <author><first>Thomas</first> <last>Jefferson</last></author>
    <title>The Declaration of Independence of the United States of America</title>
    <url>https://www.gutenberg.org/ebooks/1</url>
  </bookinfo>
  <bookinfo book-id="000002">
    <author>
      Anonymous
      <alternative>
        The United States of America
      </alternative>
    </author>
    <title>The United States Bill of Rights</title>
    <url>https://www.gutenberg.org/ebooks/2</url>
  </bookinfo>
  <bookinfo book-id="000011">
    <author author-id="412369">
      <first>Lewis</first> <last>Carroll</last>
      <alternative>
        <first>Charles</first> <middle>Lutwidge</middle> <last>Dodgson</last>
      </alternative>
    </author>
    <title>Alice's Adventures in Wonderland</title>
    <url>https://www.gutenberg.org/ebooks/11</url>
  </bookinfo>
  <bookinfo book-id="000012">
    <author author-id="412369"/>
    <title>Through the Looking-Glass</title>
    <url>https://www.gutenberg.org/ebooks/12</url>
  </bookinfo>
</collection>
We might want to turn that information into a Web page with quick links to each of the books mentioned. In particular, we’d like something like the following.
<html>
  <head>
    <title>Some online books at Project Gutenberg</title>
  </head>
  <body>
    <ul>
      <li><a href="https://www.gutenberg.org/ebooks/1">The Declaration of Independence of the United States of America</a></li>
      <li><a href="https://www.gutenberg.org/ebooks/2">The United States Bill of Rights</a></li>
      <li><a href="https://www.gutenberg.org/ebooks/11">Alice's Adventures in Wonderland</a></li>
      <li><a href="https://www.gutenberg.org/ebooks/12">Through the Looking-Glass</a></li>
    </ul>
  </body>
</html>
In an abbreviated list representation, we might express that as follows.
'(html
  (head (title "Some online books at Project Gutenberg"))
  (body
    (ul
      (li (a (@ (href "https://www.gutenberg.org/ebooks/1")) "The Declaration of Independence of the United States of America"))
      (li (a (@ (href "https://www.gutenberg.org/ebooks/2")) "The United States Bill of Rights"))
      (li (a (@ (href "https://www.gutenberg.org/ebooks/11")) "Alice's Adventures in Wonderland"))
      (li (a (@ (href "https://www.gutenberg.org/ebooks/12")) "Through the Looking-Glass")))))
a. Write a procedure, book->list-item, that takes a book in list-based XML form and converts it to the corresponding HTML in the form listed above.
> (define bill-of-rights
    '(bookinfo (@ (book-id "000002"))
               (author "Anonymous"
                       (alternative "The United States of America"))
               (title "The United States Bill of Rights")
               (url "https://www.gutenberg.org/ebooks/2")))
> (book->list-item bill-of-rights)
'(li (a (@ (href "https://www.gutenberg.org/ebooks/2")) "The United States Bill of Rights"))
b. The file /home/rebelsky/Desktop/pg01.xml contains a collection of four books in the form described above. Using file->xml and sxpath-match, extract a list of the books from the document.
c. Using a technique similar to that of the prior problem, turn your list of books into a Web page (in list representation). That is,
   - Apply book->list-item to each element of the list.
   - Turn the resulting list into a ul with cons.
   - Put the ul element in a body.
   - Create a head element with a title.
   - Combine the head and body elements into an html element.
d. Using xml->file, save your file in your public_html directory as pg01.html.
e. Preview the file in your Web browser at https://www.cs.grinnell.edu/~username/pg01.html, substituting your own user name.
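If you get stuck on the conversion, the following is one possible sketch of the first steps, written with plain list operations rather than sxpath-match. The helpers child-text and books->ul are hypothetical names, and this version assumes that each book has a title and a url element that directly contain a string.

```racket
#lang racket

(define bill-of-rights
  '(bookinfo (@ (book-id "000002"))
             (author "Anonymous"
                     (alternative "The United States of America"))
             (title "The United States Bill of Rights")
             (url "https://www.gutenberg.org/ebooks/2")))

;; Hypothetical helper: the first string directly inside the first
;; child of elt whose tag is tag, or #f if there is no such child.
(define (child-text elt tag)
  (let ([child (assq tag (filter pair? (cdr elt)))])
    (and child
         (findf string? (cdr child)))))

;; Build (li (a (@ (href URL)) TITLE)) from a bookinfo element.
(define (book->list-item book)
  (list 'li
        (list 'a
              (list '@ (list 'href (child-text book 'url)))
              (child-text book 'title))))

(book->list-item bill-of-rights)
; '(li (a (@ (href "https://www.gutenberg.org/ebooks/2")) "The United States Bill of Rights"))

;; Hypothetical helper: convert each book, then cons the ul tag
;; onto the front of the list of list items.
(define (books->ul books)
  (cons 'ul (map book->list-item books)))
```

The remaining steps (body, head, and html) follow the same pattern of building a list whose first element is the tag.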
We do not anticipate that anyone will have extra time. If you do, start the next lab.
This lab was newly written in spring 2019.
The loudhum libraries that support these exercises build on the Racket SXML libraries and on Neil Van Dyke’s html-parsing and html-writing libraries.