
Assignment 2: Document markup

Assigned
Wednesday, 23 January 2019
Due
Tuesday, 29 January 2019 by 10:30pm
Summary
We explore issues involved in preparing archival materials for processing on the computer by creating digital representations of selected issues of Grinnell’s Scarlet and Black.
Collaboration
You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting
Email your answers to csc151-01-grader@grinnell.edu. The subject of your email should be [CSC151-01] Assignment 2 (Your Names), and the body of the message should contain your answers to all parts of the assignment. Scheme code should be in the body of the message, not in an attachment.

Introduction

In order to process materials on the computer, one must first convert them to digital form. One of the goals of the computer scientists who support the digital humanities is to provide tools to help automate that process. For example, once we have scanned a document to create an image of it, we would hope that optical character recognition (OCR) could automatically extract the text. Unfortunately, while OCR is a well-researched area, OCR of many scanned documents is still far from perfect and requires some manual work to make the text acceptable. In addition, most annotation must be done manually.

How much effort is required for these steps? We will explore that question as we create sample archival materials from the archive of Grinnell’s Scarlet and Black, available at http://usiagrc.arcasearch.com/Research.aspx.

Part one: Marking existing text

The excavation associated with the new HSSC uncovered a part of Grinnell history, a large rock, which some call the “Peace Rock”, that was buried in 1914. The Scarlet and Black of 22 April 1914 has a small piece on an assault on that rock, then termed the “Scrap Rock”, presumably in reference to the annual “Class Scrap”.

We have transcribed the page that contains that note and prepared a PDF of that same page.

Create an XML document that appropriately represents the page. You are responsible for inserting appropriate markup, choosing tags and attributes you consider appropriate. As you mark up the text, you should reflect on the different ways that others may use the page; for example, while some are likely interested in the main articles, others might be exploring the use of advertisements or even issues of layout.
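As a rough illustration of what "tags and attributes" might look like, the following Python sketch builds a tiny XML fragment and checks that it is well formed. Every tag name, attribute name, column number, and piece of placeholder text here is a hypothetical choice made for illustration, not a required or recommended scheme; only the issue date comes from the assignment. Your own markup should reflect the uses you expect for the page.

```python
import xml.etree.ElementTree as ET


def build_page_sketch():
    """Build a tiny page element. All tag and attribute names are hypothetical."""
    page = ET.Element("page", {"publication": "Scarlet and Black",
                               "date": "1914-04-22"})
    # An article with a headline and one paragraph (placeholder text only).
    article = ET.SubElement(page, "article", {"column": "1"})
    headline = ET.SubElement(article, "headline")
    headline.text = "PLACEHOLDER HEADLINE"
    paragraph = ET.SubElement(article, "paragraph")
    paragraph.text = "Placeholder body text."
    # An advertisement, marked separately so that readers studying ads can find it.
    ad = ET.SubElement(page, "advertisement", {"column": "4"})
    ad.text = "Placeholder advertisement text."
    return page


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


page = build_page_sketch()
xml_text = ET.tostring(page, encoding="unicode")
print(xml_text)
print(is_well_formed(xml_text))                  # every tag is closed
print(is_well_formed("<page><article></page>"))  # article is never closed
```

Separating articles from advertisements, and recording position in an attribute, is one way to serve readers with different interests. A well-formedness check like this only confirms that tags are properly nested and closed; it says nothing about whether the chosen tags are appropriate.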

Part two: Exploring OCR’d text

On Friday, 8 May 1970, Grinnell College decided that the semester would end on Wednesday, 13 May 1970, the middle of week 14, and that all subsequent activity, including finals and commencement, would be canceled. A variety of associated information is presented in the Scarlet & Black of 15 May 1970.

We have prepared a copy of the OCR’d text from the front page of that issue and a PDF of that same page. You may note that while the OCR’d text contains most of the words on the page, it does not necessarily present them in the best form.

Identify five different types of errors in the OCR’d text, provide representative examples of each, and suggest techniques that might help the designer of an OCR algorithm address each problem.
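To make the idea of a "technique" concrete, here is a minimal Python sketch of two post-processing steps that are sometimes applied to OCR output: rejoining words hyphenated across a line break, and flagging tokens that do not appear in a word list. The sample text and the word list are invented for illustration and are not taken from the Scarlet & Black page; the function names are likewise hypothetical. The errors you find in the actual text may call for quite different approaches.

```python
import re


def rejoin_hyphenated(text):
    """Join a word split across lines, e.g. 'seme-\\nster' becomes 'semester'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def flag_unknown_words(text, known_words):
    """Return the tokens whose lowercase form is not in known_words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [token for token in tokens if token.lower() not in known_words]


# Invented sample: a line-break hyphen and a digit '1' misread for the letter 'l'.
sample = "The seme-\nster will end ear1y."
known = {"the", "semester", "will", "end", "early"}

cleaned = rejoin_hyphenated(sample)
print(cleaned)
print(flag_unknown_words(cleaned, known))
```

Both steps have clear limits. Rejoining also removes the hyphen from a genuinely hyphenated compound that happens to break at a line end, and a word list flags proper nouns and unusual spellings that are in fact correct, so flagged tokens would still need human review.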

Part three: Transcribing text

In recent years, the appropriate response to the American flag and the national anthem has been debated, spurred in part by the decision of some to kneel during the anthem to express their dissatisfaction with the state of our country, particularly the treatment of members of minority groups by those in authority. Such conversations are not new. On 24 April 1969, a group of students flew the US flag upside-down to protest the Vietnam War.

We have prepared a PDF of the front page of the 2 May 1969 issue of the Scarlet and Black that reports on the controversy.

Create an XML document that appropriately represents the page. You are responsible for both transcribing the text and inserting markup. You may choose what tags and attributes you consider appropriate.