In this project, your group will consider a particular data set (a portion of the Scarlet and Black archives), design and implement a nontrivial algorithm or algorithms for manipulating the data set and extracting information, run the algorithm, and present the results. The primary intent is that you demonstrate, in a novel way, your mastery of the concepts and skills introduced in the class.
You will work with one decade (the 1960s) of scanned and OCR’d
issues of Grinnell’s Scarlet and Black. You can find the text
files in /home/rebelsky/Desktop/SandB/
or on the WWW at
https://www.cs.grinnell.edu/~rebelsky/SandB/. You can find the
corresponding PDF files and other images of those issues at
http://usiagrc.arcasearch.com/Research.aspx.
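If it helps to get started, here is a minimal Python sketch of one way to read the text files into memory. It is only a starting point, and it makes several assumptions you should check against the actual directory: that the issues are stored as files ending in `.txt`, that they may sit in subdirectories, and that they can be read as UTF-8. The function name `load_issues` and its `pattern` parameter are hypothetical, not part of any provided code.

```python
from pathlib import Path


def load_issues(root, pattern="*.txt"):
    """Return a dict mapping each matching file's relative path to its text.

    Assumption: issues are plain-text files matching `pattern` somewhere
    under `root`. Adjust the pattern if the archive uses another naming scheme.
    """
    root = Path(root)
    issues = {}
    # rglob searches root and all of its subdirectories.
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            # OCR output can contain stray bytes, so replace undecodable
            # characters instead of stopping with an error.
            issues[str(path.relative_to(root))] = path.read_text(
                encoding="utf-8", errors="replace"
            )
    return issues
```

You might call it as `issues = load_issues("/home/rebelsky/Desktop/SandB/")` and then check `len(issues)` to confirm that the number of files matches what you expect.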
Reasonable size: Your project should be of a scope that it can be completed by your group with approximately eight hours of work per team member over a two-week period (four hours per week).
Nontrivial, novel algorithm: Your project should demonstrate the group’s ability to design and implement a nontrivial algorithm that differs reasonably from any of the algorithms we have already defined. You might combine ideas we have discussed in new ways or you may develop a completely new algorithm.
Alternative outcomes: As you’ve likely discovered, we tend to underestimate how much time it takes to complete a computer project. Hence, your project should have three targets: (a) an intended outcome - what you expect to be able to achieve; (b) a satisficing outcome - something not as complex as the intended outcome, but complex enough that it meets the general expectations for the project; and (c) a reach outcome - something that you can try to achieve if the intended outcome is more straightforward than you expected.
Some of you will immediately identify an approach you would like to take. However, others may need a few suggestions to get started. Here are a few possible starting points.
Topic modeling. You wrote a simple topic modeling algorithm for a homework assignment. You might extend that algorithm, tune it for the particular data set, and apply it. You will likely need to find a few more extensions.
Statistical analysis. You might develop tools that allow you to understand some broad features of individual elements, such as sentence length or word choice, and then apply them to unearth potential relationships between different elements. Are there structures or words that occur more frequently in some parts than in others, and what does that say about the relationships between those parts?
Categorization. Both topic modeling and statistical analysis are tools that may help us identify new ways to categorize individual works. However, [audience] may also have some categories that they have identified, or you may have identified your own natural categories. You might develop an algorithm that takes a work and determines which of those categories it most naturally fits in.
Mapping. You might develop algorithms that select place names from the materials and visualize the use of those place names within the corpus. There are many ways you could treat those place names. You could, for example, show the sequence of the use of names, potentially finding meaning in how name use changes over the work. You could identify the parts each place name appears in and use that to reveal similarities and differences. You could look at nearby words and display those as a way of understanding the context in which the place names are used.
Visualization. You might explore new ways to visualize the materials, exploring connections or changes based on subject, length, word choice, metadata, and such. If the short examples above do not suffice, you might spend some time exploring the Web for other possibilities. For example, HyperCities provides a wide variety of projects that you might find of interest.
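To make a couple of the starting points above more concrete, here is a small Python sketch of the kinds of building blocks you might write for statistical analysis (sentence length, word frequency) and for mapping (the words that appear near a place name). It is an illustration under simplifying assumptions, not a required design: it treats a word as a run of letters with an optional apostrophe, it splits sentences on `.`, `!`, and `?`, and it looks up only single-word terms. All of the function names are hypothetical. OCR errors in the real texts will likely call for more careful cleaning.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally with one apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def words(text):
    """Return the lowercase words in text, in order."""
    return WORD_RE.findall(text.lower())


def mean_sentence_length(text):
    """Average number of words per sentence (0.0 if there are none)."""
    # Naive split: abbreviations such as "Dr." will end a sentence early.
    sentences = [s for s in re.split(r"[.!?]+", text) if words(s)]
    if not sentences:
        return 0.0
    return sum(len(words(s)) for s in sentences) / len(sentences)


def relative_frequencies(text):
    """Map each word to its share of all the words in text."""
    counts = Counter(words(text))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def contexts(text, term, window=3):
    """Return the words within `window` positions of each use of term."""
    tokens = words(text)
    term = term.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == term:
            hits.append(tokens[max(0, i - window): i + window + 1])
    return hits
```

Comparing `relative_frequencies` across two issues, or across two years, is one simple way to ask whether some words occur more often in one part of the corpus than in another. Collecting `contexts` for a place name gives you raw material to display or to read closely.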
We will spend some time in class discussing possible approaches to the project and will form groups, in part, based on the approach that you wish to take.
Your project proposal describes the core aspects of your project:
Your proposal should employ correct grammar and spelling. Approximately one or two pages should suffice.
We will do our best to respond to your proposal in a timely manner. However, given other constraints, we may not be able to do so.
After finishing your proposal, you should set to work on implementing your design, making sure to meet all of the specifications outlined above. Note that your project is likely to involve both computational analysis and “human” analysis. That is, once your algorithms “discover” an issue about the underlying text, you should look at some of the related texts directly to see what insight the results of the algorithms have begun to reveal.
A final project report should accompany your project code. In it, you will once again provide a non-technical overview of the project along with a more detailed description of the algorithms. You will also discuss your results. In particular, you should suggest ways in which what you have discovered (in using the algorithm and your “human” analysis) might provide a starting point for future study and suggest possible directions. Have you identified an opportunity for close reading? Do your results suggest that it would be valuable to look at additional materials? Are there ways in which it would be worthwhile to turn to the literature to gather a broader context? Do we perhaps need to extend algorithms further or try other approaches?
While the third section of your report will necessarily be new, you may use your project proposal as a starting point for the first two sections. Our experience suggests that both sections will need some revision. You will often find that once you’ve started to use preliminary algorithms on particular inputs, you will change your goals somewhat and even the ways you think about the design of your algorithm. In addition, when you implement an algorithm, you often discover additional issues and subtleties that may lead you to update your description of the algorithm.
Your final project should also be accompanied by a set of straightforward instructions for running the code.
You will give a quick (three-minute) presentation to your classmates and a select group of visitors during the designated presentation days. In your presentation, you should describe your goals, your algorithm, and your findings. We will also reserve approximately two minutes for questions and answers.