#lang racket

(require csc151)
(require csc151www)
(require racket/undefined)

;; CSC 151-01 (Fall 2021)
;; Lab: Processing XML
;; Authors: YOUR NAMES HERE
;; Date: THE DATE HERE
;; Acknowledgements:
;;   ACKNOWLEDGEMENTS HERE

; +-------------------------+----------------------------------------
; | Exercise 0: Preparation |
; +-------------------------+

#|
We repeat the preparation instructions here as a reminder of what you
need to get done to make the lab work.
|#

#|
a. Have the traditional start-of-lab discussion. That is, introduce
yourselves; discuss working strategies, strengths, and weaknesses; and
review the reading.
|#

#|
b. You'll need a variety of packages for this lab. Install the
following packages in DrRacket using File -> Install Package....

* `html-parsing`
* `html-writing`
* `sxml`
* `https://github.com/grinnell-cs/csc151www.git#main`
|#

#|
c. Make a copy of [excerpt.html](../files/sample-web/excerpt.html),
which you may recall from a recent lab.
|#

#| A |#

; +--------------------------------------------+---------------------
; | Exercise 1: From HTML strings to XML lists |
; +--------------------------------------------+

#|
a. Write a string that describes a portion of an HTML page that
contains at least two paragraphs, each of which has an emphasis tag
and one of which has a strong tag.
|#

(define two-paragraphs "

...

")

#|
b. Predict the result of converting that string to an SXML structure
(the list-based representation of XML).

(string->xml two-paragraphs)
|#

#|
d. As you may have noticed, only the first of the two paragraphs was
converted. Why? Because `string->xml` expects *one* XML structure.
Hence, we'll need to group the two paragraphs into something, say a
`div`. Verify that the following instruction adds the `div` tags
appropriately. You need not enter anything.
|#

(define two-paragraph-div
  (string-append "<div>" two-paragraphs "</div>"))

#|
e. Predict the structure of the converted string.
|#

#|
f. Check your answer experimentally. Then enter any notes.

> (string->xml two-paragraph-div)
|#

#|
g. Consider `g1` and `g2` below. Will they be the same or different?
If they will be different, how?
|#

(define g1 (string->xml "

One

Two

"))

(define g2 (string->xml "

One

Two

"))

#|
h. Check your answer experimentally. Enter any notes as to what you
observe.
|#

#|
c. Check your answer experimentally. Then add any notes.

> (xml->string aphorisms)
|#

#|
d. Using `(string->file str fname)`, save the converted string to a
file named `pxml-02a.html` on your Desktop. (You may also save it
elsewhere.)

>
|#

#|
e. Check the contents of the file by opening the file with GEdit.
|#

#|
f. Load the file in your Web browser. It will be at something like
`file:///home/username/Desktop/pxml-02a.html`.
|#

#|
g. Using `(xml->file aphorisms)`, save the XML to a file named
`pxml-02b.html` on your Desktop. (You may also save it elsewhere.)
|#

#|
h. Check the contents of the file by opening the file with a text
editor.
|#

#|
i. Load the file in your Web browser. It will be at something like
`file:///home/username/Desktop/pxml-02b.html`.
|#

#| A |#

; +---------------------------------------+--------------------------
; | Exercise 3: Rewriting HTML as strings |
; +---------------------------------------+

#|
You may recall that the file `excerpt.html` contains an excerpt from
_Through the Looking Glass_.
|#

#|
a. Review the HTML document by opening it in a text editor.
|#

#|
b. Review the HTML document by opening it in your Web browser.
|#

#|
c. Write instructions to read in `excerpt.html`, delete all spaces,
and save it to `pxml-03a.html`. You will likely want to use
`file->string`, `string-replace`, and `string->file`.
|#

#|
d. What do you expect the new file to look like when you load it in
your Web browser?
|#

#|
e. Open the file in your Web browser to check your answer. Then enter
any notes you have. You may also want to open the file in a text
editor to check what is happening.
|#

#|
g. Write a set of instructions to read in `excerpt.html`, replace all
`em` tags with `strong` tags, and save the result back to
`pxml-03b.html`. Once again, you should stick with the string
representation, and use `file->string`, `string-replace`, and
`string->file`.
|#

#|
h.
Check your results as above.
|#

; +-----------------------------------------------------------------------+
; | Exercise 4: Extracting information from HTML with regular expressions |
; +-----------------------------------------------------------------------+

#|
a. Write instructions to load `excerpt.html` into a string, which we
will call `excerpt`.
|#

(define excerpt undefined)

#|
b. Using regular expressions, determine how many times a quotation
appears in `excerpt`. You will likely use `rex-find-matches` to
extract all of them and then `length` to count them.
|#

(define count-of-quotations-rex undefined)

#|
c. Using regular expressions, determine how many times emphasized text
appears in `excerpt`.
|#

(define count-of-emphasis-rex undefined)

#|
d. Using regular expressions, determine how many times emphasized text
appears within a quotation. Spend no more than three minutes on this
problem; it's okay if you don't get it quite right.
|#

(define count-of-emphasis-in-quotes-rex undefined)

#|
e. Write a procedure, `(count-paragraphs-rex str)`, that uses regular
expressions to count how many paragraph tags appear in `str`.
|#

(define count-paragraphs-rex
  (lambda (str)
    undefined))

#|
f. Using regular expressions, determine what portions of `excerpt` are
emphasized. Your result should be a list of strings. Do not spend
more than three minutes on this part of the exercise.
|#

(define emphasized-portions-rex undefined)

#| B |#

; +---------------------------------+--------------------------------
; | Exercise 5: Switching to SXPath |
; +---------------------------------+

#|
a. Store the contents of `excerpt.html` in an SXML structure.
|#

(define excerpt-sxpath undefined)

#|
b. Using `sxpath-match`, determine how many quotations are in
`excerpt-sxpath`.
|#

(define count-of-quotations-sxpath undefined)

#|
c. Using `sxpath-match`, determine how many times emphasized text
appears in `excerpt`.
|#

(define count-of-emphasis-sxpath undefined)

#|
d.
Using `sxpath-match`, determine how many times emphasized text appears
within a quotation. Spend no more than three minutes on this problem;
it's okay if you don't get it quite right.
|#

(define count-of-emphasis-in-quotes-sxpath undefined)

#|
e. Write a procedure, `(count-paragraphs-sxpath str)`, that uses
`sxpath-match` to determine how many paragraph tags appear in `str`.
|#

(define count-paragraphs-sxpath
  (lambda (str)
    undefined))

#|
f. Using `sxpath`, determine what portions of `excerpt.html` are
emphasized. Your result should be a list of strings. Do not spend
more than three minutes on this part of the exercise.
|#

(define emphasized-portions-sxpath undefined)

; +----------------------------------+-------------------------------
; | Exercise 6: Comparing approaches |
; +----------------------------------+

#|
In a paragraph or two, summarize the strengths and weaknesses of each
approach to analyzing HTML documents. (The approaches are regular
expressions/strings and `sxpath`.)
|#

#| A |#

; +-----------------------------------+------------------------------
; | Exercise 7: Summarizing documents |
; +-----------------------------------+

#|
In the reading, we said that it should be possible to extract all of
the emphasized text from a document and then put it into a new
document by putting the list of elements in a paragraph, the paragraph
in a body element, and the body in an html element. That is, repeat
the steps from the reading, using `q` rather than `em`, and making any
other modifications you consider appropriate.
|#

#|
a. Try doing that with all the quotations in the `excerpt.html`
document. You should save the result in `excerpt-quotations.html`.
|#

; +------------+-----------------------------------------------------
; | Submitting |
; +------------+

#|
You should only submit your processing-xml.rkt file.
|#

#| AB |#

; +---------------------------+--------------------------------------
; | For those with Extra Time |
; +---------------------------+

#|
If you find that you have extra time, you should read over the
following exercises and choose one or more to attempt.
|#

; +-------------------------+----------------------------------------
; | Extra 1: Inserting text |
; +-------------------------+

#|
a. Write a procedure that inserts the text `"PAY ATTENTION:"` at the
start of every quotation.

Reminder: You can use `append` to join lists. In this case, you'll
want to join a list of the tag (and, possibly, the attributes), a list
of the string, and the rest of the contents of the element.

Note that this process is complicated by the possible inclusion of
attributes in the quotation. Fortunately, there's a
`(has-attributes? element)` procedure that checks whether or not
there's a set of attributes.
|#

(define pay-attention
  (lambda (sxml)
    undefined))

; +---------------------------+--------------------------------------
; | Extra 2: From XML to HTML |
; +---------------------------+

#|
When we studied XML and HTML, we saw that XML is much more expressive
than HTML, but that HTML provides a bit more standardization on the
components of a document. It is therefore often useful to store
information in the more expressive XML form and then use a program to
automatically convert the XML to HTML for display. XSLT can do this,
but it's sometimes a bit complicated, so we'll do so using Racket.

Consider the following XML document, taken from the reading on XML,
intended to represent a list of books.

Project Gutenberg

Thomas Jefferson
The Declaration of Independence of the United States of America
https://www.gutenberg.org/ebooks/1

Anonymous
The United States of America
The United States Bill of Rights
https://www.gutenberg.org/ebooks/2

Lewis Carroll
Charles Lutwidge Dodgson
Alice's Adventures in Wonderland
https://www.gutenberg.org/ebooks/11

Through the Looking-Glass
https://www.gutenberg.org/ebooks/12

We might want to turn that information into a Web page with quick
links to each of the books mentioned. In particular, we'd like
something like the following.

Some online books at Project Gutenberg

In an abbreviated list representation, we might express that as
follows.

'(html
  (head
   (title "Some online books at Project Gutenberg"))
  (body
   (ul
    (li (a (@ (href "https://www.gutenberg.org/ebooks/1"))
           "The Declaration of Independence of the United States of America"))
    (li (a (@ (href "https://www.gutenberg.org/ebooks/2"))
           "The United States Bill of Rights"))
    (li (a (@ (href "https://www.gutenberg.org/ebooks/11"))
           "Alice's Adventures in Wonderland"))
    (li (a (@ (href "https://www.gutenberg.org/ebooks/12"))
           "Through the Looking-Glass")))))
|#

#|
a. Write a procedure, `book->list-item`, that takes a book in
list-based XML form and converts it to the corresponding HTML in the
form listed above.

> (define bill-of-rights
    '(bookinfo (@ (book-id "000002"))
               (author "Anonymous"
                       (alternative "The United States of America"))
               (title "The United States Bill of Rights")
               (url "https://www.gutenberg.org/ebooks/2")))
> (book->list-item bill-of-rights)
'(li (a (@ (href "https://www.gutenberg.org/ebooks/2"))
        "The United States Bill of Rights"))
|#

(define book->list-item
  (lambda (book)
    undefined))

#|
b. Save the XML from above into `books.xml`.
|#

#|
c. Using `file->xml` and `sxpath-match`, extract a list of the books
from `books.xml`.
|#

#|
d. Using a technique similar to that of earlier problems, turn your
list of books into a Web page (in SXML list representation). That is,

* Apply `book->list-item` to each element of the list.
* Prepend `ul` using `cons`.
* Wrap the new `ul` element in a `body`.
* Create a `head` element with a title.
* Wrap the `head` and `body` elements into an `html` element.
|#

#|
e. Save your result to `books.html` and load it into your Web browser.
|#
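
#|
A minimal sketch of the wrapping steps listed in part d of Extra 2,
using only `cons` and `list` from plain Racket. It is an illustration
under stated assumptions, not the intended solution: `sample-items`
and `sample-page` are hypothetical names, the example.com URLs and
titles are made up, and the list items are written out by hand rather
than produced by `book->list-item`.
|#

; Hypothetical, hand-written list items standing in for the result of
; applying `book->list-item` to each book.
(define sample-items
  (list '(li (a (@ (href "https://example.com/one")) "First title"))
        '(li (a (@ (href "https://example.com/two")) "Second title"))))

; Prepend `ul` with `cons`, wrap the `ul` in a `body`, build a `head`
; with a title, and wrap both in an `html` element.
(define sample-page
  (list 'html
        (list 'head (list 'title "Sample page"))
        (list 'body (cons 'ul sample-items))))

#|
Evaluating `sample-page` in the interactions pane shows the nested
list, which can be compared against the abbreviated list
representation given earlier in Extra 2.
|#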