Files

Summary: Files permit you to save values between invocations of programs and to provide information to programs without typing the information interactively. In this reading, we explore key ideas in the use of files within Scheme.

Introduction

When a Scheme program is designed to work with large volumes of data, it is often more convenient for the user to prepare its input in one or more separate files, using an appropriate tool (such as a text editor or a statistical package), than to type the data in as the program is running. The Scheme program itself finds the files containing the data and reads them, without user intervention.

Similarly, when a Scheme program generates a lot of output, it is often more convenient to have it store the output in one or more files, instead of displaying it in the window that the interactive interface is using. Other programs can recover the results from such files if further processing is needed.

Note that files let us store values between invocations of Scheme programs (and other programs). This permanence is another benefit of using files.

In this reading, we consider the techniques used in Scheme to read data from files and to write data to files.

Input Ports

Scheme provides two basic operations for reading from files, read and read-char. The read procedure reads the next Scheme value from the given file and returns it as a Scheme value. The read-char procedure reads the next character from the file and returns that character.

For example, if the file began with 23512 11 13, the first two values returned by the read procedure would be the integers 23512 and 11, while the first two values returned by the read-char procedure would be #\2 and #\3. (On subsequent calls, read-char would return #\5 #\1 #\2 #\space #\1 #\1 #\space #\1 #\3.)

These procedures take as their argument an “input port” through which the data will be read in. In theory, any kind of device that supplies data on demand can be on the other side of the input port, and some implementations of Scheme provide several ways of creating input ports. However, we'll consider only the “default input port”, through which data typed at the keyboard are transmitted to a Scheme program interactively, and “file input ports”, through which Scheme programs read data stored in files.

When DrScheme starts up, it automatically creates the default input port and connects the keyboard to it. This is the input port on which the read procedure normally operates. When the user exits from Scheme, this port is closed as part of the cleanup process.

To read data from a file, however, the programmer must explicitly open an input port and connect that file to it. There is a built-in Scheme procedure to do this: open-input-file takes one argument, a string, and returns an input port to which the file named by the string is connected. For instance, the call

(open-input-file "/home/rebelsky/glimmer/samples/sample.txt")

returns an input port to which the named file is connected.

Constructing the input port does you no good unless you can refer to it afterward, so the result of open-input-file is almost always either named explicitly (e.g., with define or some variant of let) or passed as an argument to a procedure call that expects a port.

(define source
  (open-input-file "_____"))

(let ((source (open-input-file "_____")))
  ...)

(define helper
  (lambda (source)
     ...))
...
  (helper (open-input-file "_____"))
...

When you're done with a port, you should make sure to close it again with close-input-port. To finish the examples above,

; Prepare to read from a file.
(define source
  (open-input-file "_____"))
; Read some parts of the file.
...
; We're done, so clean up.
(close-input-port source)

(let ((source (open-input-file "_____"))) ; Prepare to read from a file
  ; Read some or all of the file.
  ...
  ; We're done, so clean up.
  (close-input-port source))

(define helper
  (lambda (source)
     ...
     (close-input-port source)))

Reading One Character at a Time

As the example above suggests, an input port is often used as an argument to read-char, which reads in (and returns) one character from the file on the other side of the input port. It can also be used as an argument to peek-char, which looks through the input port to see what the next character in the file is, and returns that character, but does not actually read it in from the file. The difference is that you can peek at the next character as often as you like, and it remains accessible through the input port, but once you read in a character there is no way to “un-read” it -- the port advances inexorably to the next character in the file.

The file /home/rebelsky/glimmer/samples/hi.txt is a text file that contains one line, consisting of the cheerful greeting Hi there!. Let us see what happens when we read from this file using read-char and peek-char.

> (define source
    (open-input-file "/home/rebelsky/glimmer/samples/hi.txt"))
> (read-char source)
#\H
> (peek-char source)
#\i
> (peek-char source)
#\i
> (read-char source)
#\i
> (read-char source)
#\space
> (close-input-port source)

Notice that the peek-char procedure peeks through the port to see what the next available character of the file is, and returns the character it sees. The read-char procedure pulls that character in through the port and returns it, leaving the port open with the following character accessible through it.
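
To see why peeking is useful, consider the following sketch of a procedure that skips over any spaces at the front of the remaining input. The name skip-spaces is our own invention for this illustration; it is not a standard procedure. Because the procedure peeks before it reads, it never consumes the first non-space character. The char? test guards against the end-of-file object described in the next section, which is not a character.

;;; Procedure:
;;;   skip-spaces
;;; Parameters:
;;;   source, an input port
;;; Purpose:
;;;   Read and discard space characters until the next available
;;;   character is not a space (or nothing remains in the file).
;;; Produces:
;;;   [Nothing; called for its effect on source]
(define skip-spaces
  (lambda (source)
    ; Look at the next character without consuming it.
    (let ((next (peek-char source)))
      (cond
        ; Consume the character only if it is a space, then
        ; check the character that follows.
        ((and (char? next) (char=? next #\space))
         (read-char source)
         (skip-spaces source))))))

After a call to (skip-spaces source), the next call to read-char returns the first non-space character, if there is one.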

Finding the End of a File

Scheme automatically provides a sentinel for every file input port it opens. The sentinel is a special value known as the end-of-file object. It is returned by any of the input procedures when there is nothing left to be read from the file. MediaScript's default Scheme interpreter prints the end-of-file object as #<eof>. To continue the preceding example,

> (define source
    (open-input-file "/home/rebelsky/Web/Courses/CS151/2007S/Examples/hi.txt"))
> (read-char source)
#\H
> (read-char source)
#\i
> (read-char source)
#\space
> (read-char source)
#\t
> (read-char source)
#\h
> (read-char source)
#\e
> (read-char source)
#\r
> (read-char source)
#\e
> (read-char source)
#\!
> (read-char source)
#\newline
> (peek-char source)
#<eof>
> (read-char source)
#<eof>
> (read-char source)
#<eof>
> (close-input-port source)

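The end-of-file object is not a character, and there is no standard Scheme name for the end-of-file object, but there is a primitive predicate, eof-object?, that detects it. For example, had we evaluated the following expression after reaching the end of the file but before closing the port, we would have seen: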

> (eof-object? (read-char source))
#t
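
The eof-object? predicate gives us a natural stopping condition when we process a whole file. As a small sketch, here is a procedure that counts the characters in a file by reading until read-char returns the end-of-file object. The name count-chars is chosen for this illustration only. The procedure uses a named let, a form that also appears in the sum-of-file procedure later in this reading.

;;; Procedure:
;;;   count-chars
;;; Parameters:
;;;   file-name, a string that names a file
;;; Purpose:
;;;   Count the characters in the named file.
;;; Produces:
;;;   count, a non-negative integer
(define count-chars
  (lambda (file-name)
    (let ((source (open-input-file file-name)))
      ; count is the number of characters read so far.
      (let kernel ((count 0))
        (cond
          ; At the end of the file, clean up and report the count.
          ((eof-object? (read-char source))
           (close-input-port source)
           count)
          ; Otherwise, we just read one more character.
          (else
           (kernel (+ count 1))))))))

If hi.txt contains exactly the ten characters shown in the transcript above (including the newline), then applying count-chars to that file name should give 10.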

Reading One Line From a File

As an example of the use of read-char, here's the definition of a procedure called read-line, which reads in characters through a given input port until it reaches the end of the file or encounters a #\newline character, then returns a string containing all of the characters that it has read in:

;;; Procedure:
;;;   read-line
;;; Parameters:
;;;   source, an input port
;;; Purpose:
;;;   Read one line of input from a source and return that line
;;;   as a string.
;;; Produces:
;;;   line, a string
;;; Preconditions:
;;;   The source is open for reading. [Unverified]
;;; Postconditions:
;;;   Has read one line of characters from the source (thereby affecting
;;;     future calls to read-char and peek-char).
;;;   line represents the characters in the file from the "current" point 
;;;     at the time read-line was called until the first end-of-line 
;;;     or end-of-file character.
;;;   line does not contain a newline.
(define read-line
  (lambda (source)
    ; Read all the characters remaining on the line and
    ; then convert them to a string.
    (list->string (read-line-of-chars source))))

;;; Procedure:
;;;   read-line-of-chars
;;; Parameters:
;;;   source, an input port
;;; Purpose:
;;;   Read one line of input from a source and return that line
;;;   as a list of characters.
;;; Produces:
;;;   chars, a list of characters.
;;; Preconditions:
;;;   The source is open for reading. [Unverified]
;;; Postconditions:
;;;   Has read characters from the source (thereby affecting
;;;     future calls to read-char and peek-char).
;;;   chars represents the characters in the file from the
;;;     "current" point at the time read-line-of-chars was called
;;;     until the first end-of-line or end-of-file character.
;;;   chars does not contain a newline.
(define read-line-of-chars
  (lambda (source)
    ; If we're at the end of the line or the end of the file,
    ; then there are no more characters, so return the empty list.
    (cond
      ; If we're at the end of the file, there are no more characters,
      ; so return the empty list.
      ((eof-object? (peek-char source)) 
       null)
      ; If we're at the end of the line, we're done with the line, so
      ; skip over the end-of-line character and return the empty list.
      ((char=? (peek-char source) #\newline) 
       (read-char source) 
       null)
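      ; Otherwise, read the current character, read the remaining
      ; characters, and join them together.  We name the current
      ; character first because standard Scheme does not specify the
      ; order in which the arguments to cons are evaluated.
      (else 
       (let ((ch (read-char source)))
         (cons ch (read-line-of-chars source)))))))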

There are many things we can now do with these procedures. For example, here's a simple procedure that takes a file name as an argument and prints the first line of that file.

;;; Procedure:
;;;   first-line
;;; Parameters:
;;;   file-name, a string that names a file.
;;; Purpose:
;;;   Reads and displays the first line of the file.
;;; Produces:
;;;   Absolutely nothing.
;;; Preconditions:
;;;   There is a file by the given name.
;;;   It is possible to write to the standard output port.
;;; Postconditions:
;;;   Does not affect the file.
;;;   The first line of the named file has been written to
;;;     the standard output.
(define first-line
  (lambda (file-name)
    (let ((source (open-input-file file-name)))
      (display "The first line of '")
      (display file-name)
      (display "' is")
      (newline)
      (display (read-line source))
      (newline)
      (close-input-port source))))

Note that read-line, through its helper read-line-of-chars, provides an instance of file recursion. That is, we are using recursion (having a procedure call itself) but using attributes of the file to determine when we've reached the base case. Finding the end of the line is one typical base case. Another is the end of the file.
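
To illustrate the other base case, here is a sketch of a procedure that uses the end of the file, rather than the end of the line, to stop the recursion. It relies on the read-line procedure defined above, and the name read-lines is our own choice for this illustration. The let makes sure that the current line is read before the recursive call reads the rest, since standard Scheme does not promise to evaluate the arguments to cons from left to right.

;;; Procedure:
;;;   read-lines
;;; Parameters:
;;;   source, an input port
;;; Purpose:
;;;   Read all of the remaining lines available through source.
;;; Produces:
;;;   lines, a list of strings
(define read-lines
  (lambda (source)
    (cond
      ; If nothing remains in the file, there are no more lines.
      ((eof-object? (peek-char source))
       null)
      ; Otherwise, read one line and then read the rest.
      (else
       (let ((line (read-line source)))
         (cons line (read-lines source)))))))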

The read Procedure

It is also possible to read from a file using the one-argument form of the read procedure, which pulls a complete Scheme datum (instead of just one character) through a given input port. It also leaves the port open, with the next character or Scheme datum accessible through it.

Consider, again, the file described above with the form

23512 11 13

If we were to work with this file using read-char, we would see a sequence of values like the following:

> (define source
    (open-input-file "/home/rebelsky/glimmer/samples/sample.txt"))
> (read-char source)
#\2
> (read-char source)
#\3
> (read-char source)
#\5
> (read-char source)
#\1
> (read-char source)
#\2
> (read-char source)
#\space
> (read-char source)
#\1
> (read-char source)
#\1
> (read-char source)
#\space
> (read-char source)
#\1
> (read-char source)
#\3
> (read-char source)
#\newline
> (read-char source)
#<eof>
> (close-input-port source)

If, however, we were to use read, we would see the following sequence.

> (define source
    (open-input-file "/home/rebelsky/glimmer/samples/sample.txt"))
> (read source)
23512
> (read source)
11
> (read source)
13
> (read source)
#<eof>
> (close-input-port source)

Whether you use read or read-char depends on your particular application.
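
When a file holds a sequence of Scheme values, it is often handy to gather them all into a list. The following is one possible sketch; read-all is a name made up for this illustration. Note that each value is read and named before the recursive call is made.

;;; Procedure:
;;;   read-all
;;; Parameters:
;;;   source, an input port
;;; Purpose:
;;;   Read all of the remaining Scheme values available through source.
;;; Produces:
;;;   values, a list
(define read-all
  (lambda (source)
    (let ((val (read source)))
      (cond
        ; No values remain, so we have an empty list of values.
        ((eof-object? val)
         null)
        ; Otherwise, put this value at the front of the rest.
        (else
         (cons val (read-all source)))))))

Applied to a freshly opened port for the sample file above, read-all should produce the list (23512 11 13). The caller is still responsible for closing the port.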

Summing Files: Another Form of File Recursion

Here's another example of how to use Scheme's facilities for input from a file. The sum-of-file procedure takes one argument, a string that names a file full of numbers; the procedure opens that file, reads in the numbers it contains one by one, adds each one in turn to a running total, closes the file, and returns the total.

;;; Procedure: 
;;;   sum-of-file
;;; Parameters:
;;;   file-name, a string that names a file.
;;; Purpose:
;;;   Sums the values in the given file.
;;; Produces:
;;;   sum, a number.
;;; Preconditions:
;;;   file-name names a file. [Unverified]
;;;   That file contains only numbers. [Unverified]
;;; Postconditions:
;;;   Returns a number.
;;;   That number is the sum of all the numbers in the file.
;;;   Does not affect the file.
(define sum-of-file
  (lambda (file-name)
    (let kernel ((source (open-input-file file-name))) ; named let
      (let ((nextval (read source)))
        (cond
          ; Are we at the end of the file?
          ; Then stop and return 0 for "no numbers read".
          ; Here, we're taking advantage of 0 being the arithmetic identity.
          ((eof-object? nextval) (close-input-port source) 0)
          ; Have we just read a number?
          ; If so, add it to the sum of the remaining numbers.
          ((number? nextval) (+ nextval (kernel source)))
          ; Hmmm ... not a number.  Skip it.
          (else (kernel source)))))))

In the base case of the recursion, there are no numbers left in the file, and the call to the read procedure immediately returns the end-of-file object. The helper closes the file and returns 0.

If the value of (read source) is a number, it is added to the value of a recursive call to the helper, which is the sum of all the subsequent numbers in the file.

If the helper discovers a non-number in the file whose contents it is adding up, then we skip it. (We might also consider throwing an error, but then we'll also need to worry about cleaning up after ourselves, so skipping is the easiest strategy at this point.)
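
The version above builds its sum as the recursive calls return. For comparison, here is a sketch of a variant that carries the running total along as a parameter of the kernel, so that the total is complete when the end of the file is reached. The name sum-of-file-acc is ours, and the procedure is intended to compute the same result as sum-of-file.

;;; Procedure:
;;;   sum-of-file-acc
;;; Parameters:
;;;   file-name, a string that names a file.
;;; Purpose:
;;;   Sums the numbers in the given file, using a running total.
;;; Produces:
;;;   sum, a number.
(define sum-of-file-acc
  (lambda (file-name)
    (let ((source (open-input-file file-name)))
      ; total is the sum of the numbers read so far.
      (let kernel ((total 0))
        (let ((nextval (read source)))
          (cond
            ; At the end of the file, clean up and return the total.
            ((eof-object? nextval)
             (close-input-port source)
             total)
            ; A number gets added to the running total.
            ((number? nextval)
             (kernel (+ total nextval)))
            ; Anything else gets skipped.
            (else
             (kernel total))))))))

For the sample file that contains 23512 11 13, both versions should return 23536.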

Writing Data

Scheme provides four basic output operations: write, display, newline, and write-char. We'll start with the first three, and then turn to write-char a bit later.

The write procedure can take one or two arguments. If given one argument, that argument is the value to be written. If given two arguments, the first argument is the value to be written and the second is the port to which to write the value. In each case, it prints out a representation of the value. This value is either printed to the screen (one argument) or the file that corresponds to the port (two arguments). The nature of the value that write returns is unspecified. That is, the printing is a side effect of the evaluation of the call to write, not its result.

> (write 23)
23> (write #\a)
#\a> (write "hello world")
"hello world"> (write (list 23 #\a "hello" null))
(23 #\a "hello" ())

Why are the values immediately followed by the prompt, rather than having the prompt on a subsequent line? Because Scheme wants to permit you to write more than one value on a line. Hence, you need to explicitly tell it when to move to another line. You do so with the newline procedure. This procedure takes either zero parameters or one. In the first case, it writes an end-of-line to the screen. In the second, it writes an end-of-line to the given port.

> (write "hello") (newline)
"hello"
> (write 23) (newline)
23
> (write "hello") (write "goodbye") (write 23) (newline)
"hello""goodbye"23

As the preceding suggests, the values written by write seem more designed for the computer than the human. What if we don't want the quotation marks, hash marks, and their ilk? Fortunately, Scheme provides a similar procedure, display, that displays its argument in a more human-readable form.

> (display 23) (newline)
23
> (display #\a) (newline)
a
> (display "hello") (newline)
hello
> (display (list 23 #\a "hello" null)) (newline)
(23 a hello ())
> (display "hello") (display #\a) (display "goodbye") (newline)
helloagoodbye

Creating New Files

To provide for the possibility of having Scheme create files and write data to those files, each of Scheme's output procedures can be provided with a parameter that specifies the output port through which the data will be written. As before, we'll consider only the default output port -- the interaction box, under DrScheme -- and file output ports, through which Scheme programs write data to files.

If you followed the discussion of input ports, you should encounter few surprises about output ports. The default output port is created when the Scheme interactive interface starts up and closed when it shuts down; in between, Scheme uses this port for most calls to write, display, and newline. To write data to a file instead, the programmer must explicitly invoke open-output-file, which returns a file output port; once this output port is given a name, it can be used as an extra argument to any of the output procedures, with the effect that the values will be written to the file rather than to the interaction window. When no more output is to be written to the file, the programmer must explicitly close the port by invoking close-output-port.

As an example, here's a procedure that takes two arguments -- the first a positive integer, the second a string that names the output file to be created -- and writes the exact divisors of the positive integer into the specified output file:

;;; Procedure:
;;;   integer-store-divisors
;;; Parameters:
;;;   dividend, a natural number
;;;   file-name, a string that names a file
;;; Purpose:
;;;   Compute all the divisors of dividend and store them
;;;   to the named file.
;;; Produces:
;;;   [None; Called for the side effect of creating a file]
;;; Preconditions:
;;;   It must be possible to open the desired output file.
;;;   dividend must be a non-negative, exact integer.
;;; Postconditions: 
;;;   The file with name file-name now contains many integers.
;;;   All the values in that file evenly divide dividend.
(define integer-store-divisors
  (lambda (dividend file-name)
    (integer-store-divisors-kernel dividend (open-output-file file-name) 1)))

;;; Helper:
;;;   integer-store-divisors-kernel
;;; Parameters:
;;;   dividend, the number we're working with
;;;   target, an output port
;;;   trial-divisor, the smallest divisor we should try
;;; Purpose:
;;;   Stores all divisors of dividend that are at least as
;;;     large as trial-divisor to target.
;;; Produces:
;;;   Nothing.
;;; Preconditions:
;;;   It is possible to write to the target port.
;;;   Both trial-divisor and dividend are natural numbers.
;;; Postconditions:
;;;   All divisors of dividend that are at least as large as
;;;     trial-divisor have been added to target.
;;;   target has been closed.
(define integer-store-divisors-kernel
  (lambda (dividend target trial-divisor)
    ; We only continue to work when the trial-divisor is not
    ; larger than the dividend.  Note that I'm using cond because
    ; cond permits multiple operations when the test succeeds.
    (cond ((<= trial-divisor dividend)
           ; Okay, does the current trial-divisor evenly divide
           ; dividend?
           (cond ((zero? (remainder dividend trial-divisor))
                  ; It does!  Write it to the file
                  (write trial-divisor target)
                  (newline target)))
           ; Continue with any other potential divisors
           (integer-store-divisors-kernel dividend target (+ 1 trial-divisor)))
          ; If the trial divisor is bigger than the dividend, then we're
          ; done, so close the port and stop.
          (else (close-output-port target)))))
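
As a quick check of the procedure, we might store the divisors of 12 and then add them back up with the sum-of-file procedure from earlier in this reading. The file name below is hypothetical, and the sketch assumes that no file with that name already exists.

> (integer-store-divisors 12 "divisors-of-12.txt")
> (sum-of-file "divisors-of-12.txt")
28

The file should contain the six lines 1, 2, 3, 4, 6, and 12, whose sum is 28.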

What should happen if open-output-file is called using an existing file? It is actually up to the implementer. Some implementations refuse to overwrite a file and throw an error, making life difficult for those who expect to be able to reuse the file name, particularly during testing. Other implementations blithely go about their business, potentially overwriting important data. The GIMP's default Scheme (which we don't use) takes the second approach. The Scheme we use by default in MediaScript takes the first approach. Fortunately, both implementations supply a file-exists? predicate, which takes a string as a parameter and determines whether a corresponding file exists.

If you can't overwrite an existing file, the language should provide some support for getting rid of those files, so that programmers can reuse file names when they want to. The default Scheme implementation in MediaScript provides a delete-file procedure to do just that.

Neither file-exists? nor delete-file is a standard procedure. Hence, when you start using a new version of Scheme and need to use files, one of the first things you must do is check the documentation to see what additional file operations it supports.
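
In an implementation that does provide both file-exists? and delete-file, one way to reuse a file name is sketched below. The name open-fresh-output-file is hypothetical. Be careful with a procedure like this one: it discards whatever the named file previously contained.

;;; Procedure:
;;;   open-fresh-output-file
;;; Parameters:
;;;   file-name, a string that names a file
;;; Purpose:
;;;   Open the named file for writing, deleting any existing
;;;   file of the same name first.
;;; Produces:
;;;   target, an output port
;;; Preconditions:
;;;   The implementation provides file-exists? and delete-file.
(define open-fresh-output-file
  (lambda (file-name)
    ; Remove any existing file so that open-output-file
    ; does not encounter a file that already exists.
    (cond
      ((file-exists? file-name)
       (delete-file file-name)))
    (open-output-file file-name)))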

Writing One Character At A Time

Besides write, display, and newline, Scheme provides a primitive procedure write-char that is used to create an output file one character at a time. It takes two arguments, the character to be written and the output port through which it is to be sent.
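
As a sketch of how write-char and read-char can work together, the following procedure copies every remaining character from an input port to an output port. The name copy-chars is ours, and the procedure leaves both ports open so that the caller can decide when to close them.

;;; Procedure:
;;;   copy-chars
;;; Parameters:
;;;   source, an input port
;;;   target, an output port
;;; Purpose:
;;;   Copy all of the remaining characters from source to target.
;;; Produces:
;;;   [Nothing; called for the side effect of writing to target]
(define copy-chars
  (lambda (source target)
    (let ((ch (read-char source)))
      (cond
        ; If we read a character, write it and continue.
        ; If we read the end-of-file object, we are done.
        ((not (eof-object? ch))
         (write-char ch target)
         (copy-chars source target))))))

To copy a file, open an input port for the original and an output port for the copy, call copy-chars on the two ports, and then close both ports.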

Miscellaneous Facilities

Scheme provides the type predicate input-port?, which can be applied to any object to determine whether it is an input port. It also provides the analogous output-port? predicate.
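
These predicates are useful when a procedure might receive either kind of port. For instance, here is a sketch of a procedure, with the made-up name close-any-port, that chooses the appropriate closing procedure for the port it is given.

;;; Procedure:
;;;   close-any-port
;;; Parameters:
;;;   port, an input port or an output port
;;; Purpose:
;;;   Close the port with the procedure appropriate to its type.
;;; Produces:
;;;   [Nothing; called for the side effect of closing the port]
(define close-any-port
  (lambda (port)
    (cond
      ((input-port? port)
       (close-input-port port))
      ((output-port? port)
       (close-output-port port)))))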

Short Reference

(close-input-port input-port): Close an open input port.
(close-output-port output-port): Close an open output port.
(display value): Print a representation of the value on the screen.
(display value output-port): Print a representation of the value to the specified port.
(eof-object? value): Determine if the given value is the end-of-file object, which read, read-char, and peek-char return to indicate the end of the file.
(file-exists? filename): Determine whether the specified file exists.
(input-port? value): Determine whether the given value is an input port.
(newline): Write a newline to the screen.
(newline port): Write a newline to the specified port.
(open-input-file filename): Open the specified file for reading. Returns an input port.
(open-output-file filename): Open the specified file for writing. Returns an output port.
(output-port? value): Determine whether the given value is an output port.
(peek-char input-port): Determine the next character available on the specified port.
(read input-port): Read the next value available on the specified port.
(read-char input-port): Read the next character available on the specified port.
(write value): Print the verbose representation of the value on the screen.
(write value output-port): Print the verbose representation of the value to the specified port.