Text and text files

Due
Wednesday, 6 February 2019
Summary
We consider some basic mechanisms for working with files that contain unformatted text.
Prerequisites
An abbreviated introduction to Racket. Characters and strings. List basics.

As you likely know from your experience with computers, when we want to keep information on the computer for reuse, we store it in what we usually refer to as a file. There are many different kinds of files. Some files store images (in many different possible formats). Some files store formatted text. Some store data. Some store what we often refer to as “plain text”, text without additional markup or formatting information.

As you might expect, digital humanists often work with both formatted and unformatted text files. However, they also work with a wide variety of other kinds of files, including images (there’s interesting work on the use of broadsides) and geographic data. Because text-processing algorithms are often more straightforward than image-processing algorithms, we will primarily focus on text-processing algorithms. However, we will touch upon some other kinds of humanistic data and algorithms later in the semester.

It is often much easier to work with formatted text. However, as you likely discovered in assignment two, creating formatted text is often a labor-intensive process. Formatting also introduces some additional complexities. In a few weeks, we will explore how we work with XML files, which, as you may recall, store hierarchically marked-up text. For now, we are going to start with plain text files.

Computer scientists and digital humanists work with text files in a variety of ways. They might, for example, search for particular words or attempt to rewrite the text in a file into a new form or a new language. They might look for some statistical properties of the text to try to gain some insight. We will consider similar issues.

Reading from plain text files

The loudhum package provides four basic operations for working with text files: file->chars, which reads the contents of a text file and presents the contents as a list of characters; file->words, which reads the contents of a text file and presents the contents as a list of strings, each of which represents one “word” in the file (using a simple metric for “word”); file->lines, which reads the contents of a text file and presents the contents as a list of strings, each of which represents one line of the input file; and file->string, which reads the contents of a text file and presents the contents as a single string.

Suppose we had the previous paragraph in a file (using the not-quite-plain-text format we tend to use for writing these readings). Here’s what we might get reading it each way.

> (take (file->chars "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 20)
'(#\T #\h #\e #\space #\` #\l #\o #\u #\d #\h #\u #\m #\` #\space #\p #\a #\c #\k #\a #\g)
> (take (file->words "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 10)
'("The" "loudhum" "package" "provides" "four" "basic" "operations" "for" "working" "with")
> (take (file->lines "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 3)
'("The `loudhum` package provides four basic operations for working with" 
  "text files: `file->chars`, which reads the contents of a text file and" 
  "presents the contents as a list of characters; `file->words`, which")
> (substring (file->string "/home/rebelsky/share/text/loudhum-textfile-procs.txt") 20 120)
"e provides four basic operations for working with\ntext files: `file->chars`, which reads the content"

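If you are curious how procedures like these might be built, here is a minimal sketch that uses only standard Racket (`file->string` and `file->lines` are also available from `racket/file`). The `approx-` names and the `demo` file are our own inventions for illustration, and we are not claiming this is how loudhum defines its versions. In particular, this sketch splits words at whitespace only, so punctuation stays attached, while the output above suggests that loudhum drops characters such as the backticks.

```racket
#lang racket
;; Rough stand-ins built from standard Racket procedures.
;; The approx- names are hypothetical; loudhum's definitions may differ.

;; Read the whole file as one string, then break it into characters.
(define (approx-file->chars fname)
  (string->list (file->string fname)))

;; Treat every run of non-whitespace characters as a "word".
(define (approx-file->words fname)
  (string-split (file->string fname)))

;; A small temporary file to experiment with.
(define demo (make-temporary-file))
(display-to-file "one two\nthree" demo #:exists 'truncate)

(take (approx-file->chars demo) 4) ; '(#\o #\n #\e #\space)
(approx-file->words demo)          ; '("one" "two" "three")
(file->lines demo)                 ; '("one two" "three")
```
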
Writing to plain text files

At times, we’ll want to save the text we create to a file. The loudhum package currently provides only one procedure for writing text to a file: (string->file str fname) writes the given string to the named file.

> (file->string "/home/username/example.txt")
Error! . . ../../Applications/Racket v7.1/collects/racket/private/kw.rkt:1279:57: open-input-file: cannot open input file
Error!   path: /home/username/example.txt
Error!   system error: No such file or directory; errno=2
> (string->file "This is an example.\n" "/home/username/example.txt")
> (file->string "/home/username/example.txt")
"This is an example.\n"

Warning! The string->file procedure will overwrite an existing file, completely eliminating any previous content.

> (take (file->lines "/home/username/exam1.txt") 3)
'("Exam 1" "Random J. Student" "Time required: 10 hours")
> (string->file "I am the 1337 h4x0r. Phear me!\n" "/home/username/exam1.txt")
> (file->string "/home/username/exam1.txt")
"I am the 1337 h4x0r. Phear me!\n"

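One way to guard against accidental overwriting is to check whether the file exists before writing. The sketch below uses only standard Racket procedures (`file-exists?` and `display-to-file`). The name `string->new-file` is hypothetical and is not part of loudhum, and the temporary directory is there only so the example does not touch your real files.

```racket
#lang racket
;; Hypothetical helper, not part of loudhum: write str to fname
;; only when no file with that name exists yet.
(define (string->new-file str fname)
  (if (file-exists? fname)
      (error 'string->new-file "refusing to overwrite ~a" fname)
      (display-to-file str fname)))

;; Try it out in a fresh temporary directory.
(define dir (make-temporary-file "demo~a" 'directory))
(define target (build-path dir "example.txt"))

(string->new-file "This is an example.\n" target) ; creates the file
(file->string target)                             ; "This is an example.\n"
;; Calling string->new-file on target again raises an error
;; and leaves the existing contents alone.
```
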
Naming files

Racket is surprisingly clueless about finding files. We might say “It’s right there.” But “there” is not clear to the computer. Hence, we will generally identify files by their full path name. As the examples above suggest, when we are working on a Unix/Linux system, we most typically use a path name of the form "/home/username/path/to/file". For example, if exam1.rkt is sitting on your desktop, you would use "/home/username/Desktop/exam1.rkt".

If you are working on a Mac, you start the path with /Users rather than /home, as in "/Users/username/Desktop/exam1.rkt".

Windows path names are much more complicated. Read the Racket documentation if you want to deal with files on Windows.
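
If you would rather not type the home directory by hand, standard Racket can assemble the path for you. The following is a minimal sketch using `find-system-path` and `build-path`. It assumes that your desktop folder is named "Desktop" and sits directly inside your home directory, which may not hold on every system. The name `desktop-exam` is ours.

```racket
#lang racket
;; Build a path to exam1.rkt on the current user's desktop.
;; Assumption: the desktop folder is called "Desktop" and lives
;; directly inside the home directory.
(define desktop-exam
  (build-path (find-system-path 'home-dir) "Desktop" "exam1.rkt"))

;; Convert to a string for procedures that expect a file name as a string.
(path->string desktop-exam)

;; Check whether the file is really there before trying to read it.
(file-exists? desktop-exam)
```

Because `build-path` inserts whatever separator the operating system uses, the same code should produce a sensible path on Linux, Mac, and Windows.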

Self checks

Check 1: Ways of reading files

Suppose scene.txt contains the following lines.

Prof: Student, how are you today?
Student: Please don't address me in the generic.
Prof: Stu, how are you today?
Student: I'm pretty well.  Thanks for asking.

What output do you expect to get if you call file->chars, file->words, file->lines, and file->string on that file?

Check 2: Experiments with reading files

Note: You can only do these experiments if you’ve loaded the latest version of the loudhum package. If you don’t know how to do so, check with your instructor or wait until class time.

a. Using gedit, create the file described in check 1.

b. Check your answer to check 1 experimentally.