Class project: Exploring a repository

Proposal Due
Monday, 22 April 2019 by 10:30pm
Implementation Due
Tuesday, 30 April 2019 by 10:30pm
Presentations
Monday and Wednesday, 6 and 8 May 2019, in class
Summary
At this point in your career, you’ve learned a number of techniques for working with textual data. This project is an opportunity for you to explore some techniques in greater depth.
Purposes
To explore some aspect of the digital humanities in depth. To emphasize the more creative components of this course. To encourage more purposeful reflection on algorithms. To investigate the different approaches one might take with the same data set.
Collaboration
We encourage you to work in groups of size four. You may, however, work alone or work in a group of size two or three. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting
Email your submissions to csc151-01-grader@grinnell.edu. The subject of your email should include [CSC 151] Project along with a list of all authors of the project. Your email should also include all appropriate attachments (e.g. your project proposal or all the required files for your final project submission).
Rubric
A draft grading rubric is available to give you a sense of what we will be looking for as we assess your projects.
Warning
So that this assignment is a learning experience for everyone, we will almost certainly spend class time publicly critiquing your work.

Assignment

Background: Specification

In this project, your group will consider a particular data set (a portion of the Scarlet and Black archives), design and implement a nontrivial algorithm or algorithms for manipulating the data set and extracting information, run the algorithm, and present the results of the algorithm. The primary intent is that you demonstrate your mastery of the concepts and skills introduced in the class in a novel way.

Data set

You will work with one decade (the 1960’s) of scanned and OCR’d issues of Grinnell’s Scarlet and Black. You can find the text files in /home/rebelsky/Desktop/SandB/ or on the WWW at https://www.cs.grinnell.edu/~rebelsky/SandB/. You can find the corresponding PDF files and other images of those issues at http://usiagrc.arcasearch.com/Research.aspx.
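As a starting point, here is a minimal sketch of reading the text files into memory. It is written in Python purely for illustration. The function name `load_issues` and the `*.txt` file pattern are assumptions; check how the files in the directory are actually named before relying on it.

```python
from pathlib import Path


def load_issues(directory, pattern="*.txt"):
    """Return a dict mapping each matching file name to its text.

    The "*.txt" pattern is an assumption about how the files are named.
    """
    issues = {}
    for path in sorted(Path(directory).glob(pattern)):
        # OCR output may contain stray bytes, so replace any that
        # cannot be decoded instead of raising an error.
        issues[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return issues
```

On a machine that has the files, a call such as `load_issues("/home/rebelsky/Desktop/SandB/")` would return one entry per matching file.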

General expectations

Reasonable size: Your project should be of a scope that it can be completed by your group with approximately eight hours of work per team member over a two-week period (four hours per week).

Nontrivial, novel algorithm: Your project should demonstrate the group’s ability to design and implement a nontrivial algorithm that differs reasonably from any of the algorithms we have already defined. You might combine ideas we have discussed in new ways or you may develop a completely new algorithm.

Alternative outcomes: As you’ve likely discovered, we tend to underestimate how much time it takes to complete a computer project. Hence, your project should have three targets: (a) an intended outcome - what you expect to be able to achieve; (b) a satisficing outcome - something not as complex as the intended outcome, but complex enough that it meets the general expectations for the project; and (c) a reach outcome - something that you can try to achieve if the intended goal is more straightforward than you expected.

Sample categories

Some of you will immediately identify an approach you would like to take. However, others may need a few suggestions to get started. Here are a few possible starting points.

Topic modeling. You wrote a simple topic modeling algorithm for a homework assignment. You might extend that algorithm, tune it for the particular data set, and apply it. You will likely need to find a few more extensions.
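One possible extension, sketched here under stated assumptions, is to weight words by TF-IDF so that words common to every document stop dominating the candidate topic words. This is not the homework algorithm; the function names `tokenize` and `distinctive_words` are hypothetical, and the example is in Python purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def distinctive_words(documents, top_n=3):
    """For each document, return its top_n words ranked by TF-IDF."""
    counts = [Counter(tokenize(doc)) for doc in documents]

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    n = len(documents)
    results = []
    for c in counts:
        total = sum(c.values())
        # Term frequency times log inverse document frequency.
        scores = {w: (k / total) * math.log(n / doc_freq[w])
                  for w, k in c.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_n])
    return results
```

A word that appears in every document scores zero, which is one simple way to push very common words out of the results without a hand-built stop list.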

Statistical analysis. You might develop tools that allow you to understand some broad features of individual elements, such as sentence length or word choice, and then apply them to unearth potential relationships between different elements. Are there structures or words that occur more frequently in some parts than in others, and what does that say about the relationships between those parts?
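The two features mentioned above, sentence length and word choice, can be sketched as follows. This is an illustration in Python under simple assumptions: sentences end with `.`, `!`, or `?` followed by whitespace, which OCR'd text will often violate, and all function names are hypothetical.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and return its words (apostrophes allowed inside)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def sentences(text):
    """Split after ., !, or ? followed by whitespace; a rough heuristic."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def average_sentence_length(text):
    """Mean number of words per sentence (0.0 for empty text)."""
    parts = sentences(text)
    if not parts:
        return 0.0
    return sum(len(words(s)) for s in parts) / len(parts)


def relative_frequencies(text):
    """Map each word to its share of all words in the text."""
    counts = Counter(words(text))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

Comparing `relative_frequencies` for two parts of the corpus, rather than raw counts, keeps a longer part from looking different merely because it has more words.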

Categorization. Both topic modeling and statistical analysis are tools that may help us identify new ways to categorize individual works. However, [audience] may also have some categories that they have identified, or you may have identified your own natural categories. You might develop an algorithm that takes a work and determines which of those categories it most naturally fits in.
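A very simple categorizer, offered only as a sketch, scores a work by how often each category's keywords appear in it. The categories and keyword sets below are hypothetical placeholders chosen for illustration; they do not come from the assignment or from the corpus. The example is in Python.

```python
import re
from collections import Counter

# Hypothetical categories and keywords, chosen only for illustration.
CATEGORIES = {
    "sports": {"game", "team", "coach", "season", "score"},
    "politics": {"election", "senate", "vote", "policy", "protest"},
    "arts": {"concert", "theatre", "exhibit", "film", "poetry"},
}


def categorize(text, categories=CATEGORIES):
    """Return the category whose keywords occur most often, or None.

    Ties go to whichever category is listed first.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {name: sum(counts[w] for w in keywords)
              for name, keywords in categories.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Returning `None` when no keyword matches makes it easy to see how much of the corpus the chosen categories fail to cover.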

Mapping. You might develop algorithms that select place names from the materials and visualize the use of those place names within the corpus. There are many ways you could treat those place names. You could, for example, show the sequence of the use of names, potentially finding meaning in how name use changes over the work. You could identify the parts each place name appears in and use that to reveal similarities and differences. You could look at nearby words and display those as a way of understanding the context in which the place names are used.
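The last idea above, looking at the words near each place name, can be sketched as a small keyword-in-context function. This Python example is illustrative only: the name `contexts` is hypothetical, it handles only single-word place names, and it assumes you already have a list of place names to look for.

```python
import re


def contexts(text, place, window=3):
    """Return the words within `window` positions of each use of `place`.

    Each result is a (words_before, words_after) pair.
    Only single-word place names are handled; matching ignores case.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    target = place.lower()
    found = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            found.append((before, after))
    return found
```

Because the results come back in order of appearance, the same function also gives the sequence in which a name is used across a text.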

Visualization. You might explore new ways to visualize the materials, exploring connections or changes based on subject, length, word choice, metadata, and such.

If the short examples above do not suffice, you might spend some time exploring the Web for other possibilities. For example, HyperCities provides a wide variety of projects that you might find of interest.

We will spend some time in class discussing possible approaches to the project and will form groups, in part, based on the approach that you wish to take.

Part One: Proposal

Your project proposal describes the core aspects of your project:

  • The general theme of the project. “We are attempting to understand the changing use of social-justice terms in the S&B in the 1960’s through word frequency analysis and careful visualization.” “We are attempting to understand the typical topics of different years of S&B articles using topic modeling.” “Given the horrible state of the S&B OCR’d text, we are working on improving that text.”
  • A high-level overview of the algorithm or algorithms you intend to implement.
  • A short description of the intended outcome, satisficing outcome, and reach outcome.

Your proposal should employ correct grammar and spelling. Approximately one or two pages should suffice.

We will do our best to respond to your proposal in a timely manner. However, given other constraints, we may not be able to do so.

Part Two: Project

After finishing your proposal, you should set to work on implementing your design, making sure to meet all of the specifications outlined above. Note that your project is likely to involve both computational analysis and “human” analysis. That is, once your algorithms “discover” an issue about the underlying text, you should look at some of the related texts directly to see what insight the results of the algorithms have begun to reveal.

Part Three: Report

A final project report should accompany your project code. In it, you will once again provide a non-technical overview of the project along with a more detailed description of the algorithms. You will also discuss your results. In particular, you should suggest ways in which what you have discovered (in using the algorithm and your “human” analysis) might provide a starting point for future study and suggest possible directions. Have you identified an opportunity for close reading? Do your results suggest that it would be valuable to look at additional materials? Are there ways in which it would be worthwhile to turn to the literature to gather a broader context? Do we perhaps need to extend algorithms further or try other approaches?

While the third section of your report will necessarily be new, you may use your project proposal as a starting point for the first two sections. Our experience suggests that both sections will need some revision. You will often find that once you’ve started to use preliminary algorithms on particular inputs, you will change your goals somewhat and even the ways you think about the design of your algorithm. In addition, when you implement an algorithm, you often discover additional issues and subtleties that may lead you to update your description of the algorithm.

Your final project should also be accompanied by a set of straightforward instructions for running the code.

Part Four: Presentation

You will give a quick (three-minute) presentation to your classmates and a select group of visitors during the designated presentation days. In your presentation, you should describe your goals, your algorithm, and your findings. We will also reserve approximately two minutes for questions and answers.

Questions

Can we reuse code from the assignments and labs?
You may certainly reuse code from the assignments and labs, provided you cite that code. However, you should make sure that your project goes beyond what you did for the assignment or lab. Hence, you will likely want to extend or otherwise rewrite that code. (Even if you extend or rewrite code, you should still cite its origin and influence.)