As you likely know from your experience with computers, when we want to keep information on the computer for reuse, we store it in what we usually refer to as a file. There are many different kinds of files. Some files store images (in many different possible formats). Some files store formatted text. Some store data. Some store what we often refer to as “plain text”, text without additional markup or formatting information.
As you might expect, digital humanists often work with both formatted and unformatted text files. However, they also work with a wide variety of other kinds of files, including images (there’s interesting work on the use of broadsides) and geographic data. Because text-processing algorithms are often more straightforward than image-processing algorithms, we will primarily focus on text-processing algorithms. However, we will touch upon some other kinds of humanistic data and algorithms later in the semester.
It is often much easier to work with formatted text. However, as you likely discovered in assignment two, creating formatted text is often a labor-intensive process. Formatting also introduces some additional complexities. In a few weeks, we will explore how we work with XML files, which, as you may recall, store hierarchically marked-up text. For now, we are going to start with plain text files.
Computer scientists and digital humanists work with text files in a variety of ways. They might, for example, search for particular words or attempt to rewrite the text in a file into a new form or a new language. They might look for some statistical properties of the text to try to gain some insight. We will consider similar issues.
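As a small taste of the "statistical properties" idea, here is a minimal sketch in plain Racket that counts how often a word appears in a list of words. The procedure name `count-word` and the sample sentence are made up for illustration; they are not part of any package we use.

```racket
#lang racket
;; Hypothetical helper: count how many strings in words match target,
;; ignoring differences in capitalization.
(define (count-word target words)
  (count (lambda (w) (string-ci=? w target)) words))

;; A made-up sample, split into words at whitespace.
(define sample-words
  (string-split "the cat saw the other cat near The Hat"))

(count-word "the" sample-words) ; 3
(count-word "cat" sample-words) ; 2
```

Note that "other" does not count as an occurrence of "the", because we compare whole words rather than searching inside them.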
The `loudhum` package provides four basic operations for working with
text files: `file->chars`, which reads the contents of a text file and
presents the contents as a list of characters; `file->words`, which
reads the contents of a text file and presents the contents as a list
of strings, each of which represents one “word” in the file (using a
simple metric for “word”); `file->lines`, which reads the contents
of a text file and presents the contents as a list of strings, each
of which represents one line of the input file; and `file->string`,
which reads the contents of a text file and presents the contents as
a single string.
Suppose we had the previous paragraph in a file (using the not-quite-plain-text format we tend to use for writing these readings). Here’s what we might get reading it each way.
> (take (file->chars "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 20)
'(#\T #\h #\e #\space #\` #\l #\o #\u #\d #\h #\u #\m #\` #\space #\p #\a #\c #\k #\a #\g)
> (take (file->words "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 10)
'("The" "loudhum" "package" "provides" "four" "basic" "operations" "for" "working" "with")
> (take (file->lines "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 3)
'("The `loudhum` package provides four basic operations for working with"
"text files: `file->chars`, which reads the contents of a text file and"
"presents the contents as a list of characters; `file->words`, which")
> (substring (file->string "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 20 120)
"e provides four basic operations for working with\ntext files: `file->chars`, which reads the content"
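If you are curious how procedures like these might be put together, here is a rough sketch using only standard Racket. It is not the `loudhum` implementation. The names `my-file->chars` and `my-file->words` are hypothetical, and splitting at whitespace is only one possible metric for "word" (it keeps punctuation attached, which `loudhum` may not). The `file->string` and `file->lines` used here are the standard ones from `racket/file`.

```racket
#lang racket
;; Sketch only: these helpers are NOT loudhum's implementation.

;; Hypothetical helper: the contents of a file as a list of characters.
(define (my-file->chars fname)
  (string->list (file->string fname)))

;; Hypothetical helper: the contents of a file as a list of "words",
;; where a word is any run of non-whitespace characters.
(define (my-file->words fname)
  (string-split (file->string fname)))

;; Try the helpers on a small temporary file.
(define sample (make-temporary-file))
(display-to-file "Hello there,\nplain text!\n" sample #:exists 'truncate)

(define sample-chars (my-file->chars sample))
(define sample-words (my-file->words sample))
(define sample-lines (file->lines sample))

(take sample-chars 5) ; '(#\H #\e #\l #\l #\o)
sample-words          ; '("Hello" "there," "plain" "text!")
sample-lines          ; '("Hello there," "plain text!")

;; Clean up the temporary file.
(delete-file sample)
```

Notice that all of these read the same characters from the file. They differ only in how they package the result.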
At times, we’ll want to save the text we create to a file. The `loudhum`
package currently provides only one procedure for writing text to a
file: `(string->file str fname)` writes the given string to the named
file.
> (file->string "/home/username/example.txt")
Error! . . ../../Applications/Racket v7.1/collects/racket/private/kw.rkt:1279:57: open-input-file: cannot open input file
Error! path: /home/username/example.txt
Error! system error: No such file or directory; errno=2
> (string->file "This is an example.\n" "/home/username/example.txt")
> (file->string "/home/username/example.txt")
"This is an example.\n"
Warning! The `string->file` procedure will overwrite an existing file,
completely eliminating any previous content.
> (take (file->lines "/home/username/exam1.txt") 3)
'("Exam 1" "Random J. Student" "Time required: 10 hours")
> (string->file "I am the 1337 h4x0r. Phear me!" "/home/username/exam1.txt")
> (file->string "/home/username/exam1.txt")
"I am the 1337 h4x0r. Phear me!"
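If losing a file that way worries you, one option is to check whether the file exists before writing. The sketch below uses standard Racket procedures (`file-exists?` and `display-to-file`) rather than `string->file`. The name `string->new-file` is hypothetical and is not part of `loudhum`.

```racket
#lang racket
;; Hypothetical helper, not part of loudhum: write str to fname only
;; if no file with that name exists yet. Otherwise, signal an error.
(define (string->new-file str fname)
  (if (file-exists? fname)
      (error 'string->new-file "refusing to overwrite ~a" fname)
      (display-to-file str fname)))

;; make-temporary-file creates the file, so the file already exists
;; and our helper should refuse to write to it.
(define target (make-temporary-file))

(define refused?
  (with-handlers ([exn:fail? (lambda (e) #t)])
    (string->new-file "I am the 1337 h4x0r. Phear me!" target)
    #f))

refused? ; #t

;; Clean up the temporary file.
(delete-file target)
```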
Racket is surprisingly clueless about finding files. We might say
“It’s right there.” But “there” is not clear to the computer.
Hence, we will generally identify files by their full path name.
As the example above suggested, when we are working on a Unix/Linux
system, we most typically use a path name of the form
`"/home/username/path/to/file"`. For example, if `exam1.rkt` is
sitting on your desktop, you would use
`"/home/username/Desktop/exam1.rkt"`.
If you are working on a Mac, you start the path with `/Users` rather
than `/home`, as in `"/Users/username/Desktop/exam1.rkt"`.
Windows path names are much more complicated. Read the Racket documentation if you want to deal with files on Windows.
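One way to avoid typing the system-specific start of the path is to let Racket build the path for us. The sketch below uses the standard procedures `find-system-path` and `build-path`. It assumes that the desktop folder is named `Desktop` and sits directly inside your home directory, which may not be true on every system.

```racket
#lang racket
;; build-path joins the pieces using the separator that the current
;; system expects. (find-system-path 'home-dir) gives the home
;; directory of the current user.
;; Assumption: the desktop folder is "Desktop" inside the home directory.
(define desktop-file
  (build-path (find-system-path 'home-dir) "Desktop" "exam1.rkt"))

;; The full path as a string. The result depends on your system and
;; username.
(path->string desktop-file)

;; Does that file exist? It's worth checking before we try to read it.
(file-exists? desktop-file)
```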
Suppose `scene.txt` contains the following lines.
Prof: Student, how are you today?
Student: Please don't address me in the generic.
Prof: Stu, how are you today?
Student: I'm pretty well. Thanks for asking.
What output do you expect to get if you call `file->chars`, `file->words`,
`file->lines`, and `file->string` on that file?
Note: You can only do these experiments if you’ve loaded the latest
version of the `loudhum` package. If you don’t know how to do so,
check with your instructor, or wait until class time.
a. Using gedit, create the file described in check 1.
b. Check your answer to check 1 experimentally.