Before materials can be processed digitally, they must first be converted to digital form. One of the goals of the computer scientists who support the digital humanities is to provide tools that help automate this conversion. For example, once we have scanned a document to create an image of it, we would hope that optical character recognition (OCR) could automatically extract the text. Unfortunately, while OCR is a well-researched area, OCR of many scanned documents is still far from perfect and requires some manual work to make the text acceptable. In addition, most annotation must be done manually.
How much effort is required for these steps? We will explore that question as we create sample archival materials from the archive of Grinnell’s Scarlet and Black, available at http://usiagrc.arcasearch.com/Research.aspx.
The excavation associated with the new HSSC uncovered a part of Grinnell history, a large rock, which some call the “Peace Rock”, that was buried in 1914. The Scarlet and Black of 22 April 1914 has a small piece on an assault on that rock, then termed the “Scrap Rock”, presumably in reference to the annual “Class Scrap”.
We have transcribed the page that contains that note and prepared a PDF of that same page.
Create an XML document that appropriately represents the page. You are responsible for inserting appropriate markup, choosing tags and attributes you consider appropriate. As you mark up the text, you should reflect on the different ways that others may use the page; for example, while some are likely interested in the main articles, others might be exploring the use of advertisements or even issues of layout.
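As one way to think about the markup choices, the following Python sketch builds a small XML page programmatically. It is only an illustration: the tag names (`page`, `article`, `advertisement`, `headline`, `body`), the `column` attribute, and the `build_page` helper are all assumptions rather than a required schema, and the text content is placeholder text, not a transcription of the actual page.

```python
import xml.etree.ElementTree as ET


def build_page(publication, date, items):
    """Build a <page> element from (kind, column, headline, body) tuples.

    All tag and attribute names here are hypothetical choices.
    """
    page = ET.Element("page", {"publication": publication, "date": date})
    for kind, column, headline, body in items:
        # The element name records whether the item is an article or an
        # advertisement, so a reader interested in only one kind can filter.
        # The column attribute keeps a little layout information.
        item = ET.SubElement(page, kind, {"column": str(column)})
        if headline:
            ET.SubElement(item, "headline").text = headline
        ET.SubElement(item, "body").text = body
    return page


page = build_page(
    "Scarlet and Black",
    "1914-04-22",
    [
        ("article", 1, "PLACEHOLDER HEADLINE", "Placeholder article text."),
        ("advertisement", 2, None, "Placeholder advertisement text."),
    ],
)

# ET.indent is available in Python 3.9 and later; skip it on older versions.
if hasattr(ET, "indent"):
    ET.indent(page)

xml_text = ET.tostring(page, encoding="unicode")
print(xml_text)
```

Separating articles from advertisements at the element level, and recording position as an attribute, is one way to serve readers with different interests. Other designs, such as a single `item` tag with a `type` attribute, would work as well.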
On Friday, 8 May 1970, Grinnell College decided that the semester would end on Wednesday, 13 May 1970, the middle of week 14, and that all subsequent activity, including finals and commencement, would be canceled. A variety of associated information is presented in the Scarlet & Black of 15 May 1970.
We have prepared a copy of the OCR’d text from the front page of that issue and a PDF of that same page. You may note that while the OCR’d text contains most of the words on the page, it does not necessarily present them in the best form.
Identify five different types of errors in the OCR’d text, provide representative examples of each, and suggest techniques that might help the designer of an OCR algorithm address each problem.
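To give a sense of what such techniques might look like, here is a minimal Python sketch of two common post-processing ideas: rejoining words that were hyphenated across a line break, and replacing an unrecognized word with its closest match in a vocabulary. The function names, the tiny vocabulary, and the sample strings are hypothetical and are not taken from the OCR’d page. A real system would need a much larger word list and more careful handling.

```python
import difflib
import re


def rejoin_hyphenated(text):
    """Join a word split by a hyphen at a line break into one word."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def correct_word(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged if none is close.

    vocabulary is assumed to be a list of lowercase words.
    """
    if word.lower() in vocabulary:
        return word
    matches = difflib.get_close_matches(
        word.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else word


# Hypothetical vocabulary and inputs, for illustration only.
vocabulary = ["semester", "college", "finals"]

print(rejoin_hyphenated("com-\nmencement"))
print(correct_word("sernester", vocabulary))  # "rn" misread in place of "m"
```

Both techniques have limits. The hyphen rule also merges genuinely hyphenated compounds that happen to break across lines, and dictionary matching can "correct" proper nouns or unusual words that were recognized correctly.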
In recent years, the appropriate response to the American flag and the national anthem has been debated, spurred, in part, by the decision of some to kneel during the anthem to express their dissatisfaction with the state of our country, particularly the treatment of members of minority groups by those in authority. Such conversations are not new. On 24 April 1969, a group of students flew the US flag upside-down to protest the Vietnam War.
We have prepared a PDF of the front page of the 2 May 1969 issue of the Scarlet and Black that reports on the controversy.
Create an XML document that appropriately represents the page. You are responsible for both transcribing the text and inserting markup. You may choose what tags and attributes you consider appropriate.