Parsing integers

Part of an ongoing series of essays tentatively entitled Don’t embarrass me, Don’t embarrass yourself: Notes on thinking in C and Unix.

Frequently, when we write programs that take input (from the command line, from a file, or interactively), we need to parse that input to convert it into another form. For example, in a program that does computation, we might need to parse the textual input into an integer. What’s the best way to do that? Unfortunately, the answer may be “it depends.” What does it depend on? Let’s see.

For context, let’s think about a simple program that reads two integers and prints out their sum, such as one might write in the first few weeks of an introductory CS course. If I were called upon to write such a program, I’d probably come up with something like the following:

/**
 * sum.c
 *   Compute the sum of two values entered on the command line.
 *
 * <insert appropriate open source license>
 */

// +---------+-------------------------------------------------------
// | Headers |
// +---------+

#include <stdio.h>
#include <stdlib.h>

// +---------+-------------------------------------------------------
// | Helpers |
// +---------+

/**
 * Report on usage.
 */
void
usage ()
{
  fprintf (stderr, "Usage: sum INT1 INT2\n");
} // usage

// +------+----------------------------------------------------------
// | Main |
// +------+

int
main (int argc, char *argv[])
{
  // Sanity check
  if (argc != 3)
    {
      fprintf (stderr,
               "Incorrect number of parameters: Expected 2 params, got %d\n",
               argc - 1);
      usage ();
      return 1;
    }

  // Convert to integers
  int val1 = ??? (argv[1]);
  int val2 = ??? (argv[2]);

  // Do the real work
  printf ("%d\n", val1+val2);

  // And we're done
  return 0;
} // main

But how should I convert the strings (argv[1] and argv[2]) to integers? One approach is to use atoi [1]. We will, of course, need to include <stdlib.h> in order to make sure that the type signature is right.

  int val1 = atoi (argv[1]);
  int val2 = atoi (argv[2]);

Let’s see how well that works.

$ ./sum 2 3
5
$ ./sum 100 2
102
$ ./sum -5 11
6
$ ./sum
Incorrect number of parameters: Expected 2 params, got 0
Usage: sum INT1 INT2
$ ./sum 3 4 5
Incorrect number of parameters: Expected 2 params, got 3
Usage: sum INT1 INT2

It looks like it works relatively well. But wait! What if the user enters something other than an integer, or something that starts with digits but doesn’t end with digits?

$ ./sum one two
0
$ ./sum 312 five
312
$ ./sum 123twenty 2000four
2123
$ ./sum 1234567890000 2
1912276050

No, I don’t like those results at all. My program should issue an error message when the inputs are wrong. Let’s check the man page.

atoi() does not detect errors

Oh. I guess I shouldn’t use atoi if I want to write sensibly robust code. So, what other options are there?
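Before moving on, here is a minimal sketch of why no check after the fact can rescue atoi: a string that really is zero and a string that is not a number at all produce the same return value, so the caller has nothing left to test. The helper name sameToAtoi is hypothetical, made up for this illustration.

```c
#include <stdlib.h>

/**
 * Hypothetical helper: returns 1 if atoi gives the same result for
 * both strings, and 0 otherwise.
 */
int
sameToAtoi (const char *s1, const char *s2)
{
  return atoi (s1) == atoi (s2);
} // sameToAtoi
```

With this helper, sameToAtoi ("0", "zero") and sameToAtoi ("312", "312five") both report a match, which is exactly the ambiguity we saw in the transcript.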

Well, at least one of my colleagues, when presented with the problem of writing a program that reads in input, would write an interactive program, rather than a command-line program [3]. Now, there are a lot of ways to read values, but we’ll use scanf for our first version.

/**
 * isum.c
 *   Prompt the user for two integers and print their sum.
 *
 * <insert appropriate open source license>
 */

// +---------+-------------------------------------------------------
// | Headers |
// +---------+

#include <stdio.h>

// +------+----------------------------------------------------------
// | Main |
// +------+

int
main ()
{
  int val1;
  int val2;

  printf ("Please enter an integer: ");
  fflush (stdout);
  scanf ("%d", &val1);
  printf ("Please enter another integer: ");
  fflush (stdout);
  scanf ("%d", &val2);

  // Do the real work
  printf ("%d + %d = %d\n", val1, val2, val1+val2);

  // And we're done
  return 0;
} // main

Let’s see how well it works.

$ ./isum
Please enter an integer: 3
Please enter another integer: 11
3 + 11 = 14
$ ./isum
Please enter an integer: -12
Please enter another integer: 25
-12 + 25 = 13
$ ./isum
Please enter an integer: -3
Please enter another integer: -4
-3 + -4 = -7

It looks pretty good, so far. But, once again, what happens when we enter values other than integers?

$ ./isum
Please enter an integer: five
Please enter another integer: 0 + 0 = 0

It’s not possible to tell from this transcript, but the program actually printed the sum and exited immediately, without giving me the chance to enter a second value. Can you tell why [4]? Let’s try a few more examples.

$ ./isum
Please enter an integer: 3 5
Please enter another integer: 3 + 5 = 8

Here, we see similar behavior, but it almost makes sense.

$ ./isum
Please enter an integer: 200one
Please enter another integer: 200 + 0 = 200

It strikes me that the program should be friendlier. So, let’s see what other information we can get from scanf.

These functions return the number of input items successfully matched and assigned, which can be fewer than provided for, or even zero in the event of an early matching failure.

Okay, I guess we can check that.

  printf ("Please enter an integer: ");
  fflush (stdout);
  if (scanf ("%d", &val1) < 1)
    {
      fprintf (stderr, "That's not an integer!\n");
      return 1;
    }
  printf ("Please enter another integer: ");
  fflush (stdout);
  if (scanf ("%d", &val2) < 1)
    {
      fprintf (stderr, "That's not an integer!\n");
      return 1;
    }

Let’s see.

$ ./isum
Please enter an integer: one
That's not an integer!
$ ./isum
Please enter an integer: 1
Please enter another integer: two
That's not an integer!

Well, we’re getting some errors. Let’s try some more complex inputs.

$ ./isum
Please enter an integer: 3 5
Please enter another integer: 3 + 5 = 8
$ ./isum
Please enter an integer: 31hello
Please enter another integer: That's not an integer!
$ ./isum
Please enter an integer: 123456789012
Please enter another integer: 213
-1097262572 + 213 = -1097262359
$ ./isum
Please enter an integer: 23.34
Please enter another integer: That's not an integer!

I guess we still have some more work to do. Let’s see … we need to chop off the extraneous input (as in 31hello) and we somehow need to convince scanf to deal with things that look like integers, but won’t fit into an int. I suppose we could try to solve the first issue first, and see where that gets us. But it doesn’t look like there’s anything in the man page about dealing with integers with too many digits. So maybe we should try something different.
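For the record, the first issue does have a fairly common workaround, which we won’t pursue further here. The sketch below assumes we read a whole line ourselves (say, with fgets) and then hand it to sscanf, using the %n directive to find out how much of the line the integer used up. The name line2int is hypothetical. Note that this only catches trailing junk: an integer with too many digits is still a problem for %d.

```c
#include <stdio.h>

/**
 * Hypothetical helper: parse a whole line as a single int, which it
 * stores in *ip.  Returns 1 if the line holds exactly one integer
 * (optionally surrounded by whitespace) and 0 otherwise.  *ip may be
 * modified even when the result is 0.
 *
 * Caveat: like scanf, sscanf does not promise anything sensible when
 * the digits do not fit in an int.
 */
int
line2int (const char *line, int *ip)
{
  int used = 0;
  // %d reads the integer, the space skips trailing whitespace
  // (including the newline), and %n records how many characters
  // have been consumed so far.
  if (sscanf (line, "%d %n", ip, &used) < 1)
    return 0;
  // Anything left over means there was extra text.
  return line[used] == '\0';
} // line2int
```

In isum, each scanf call would become an fgets into a buffer followed by a call to line2int on that buffer. Because fgets consumes the whole line, a bad first answer no longer spills over into the second prompt.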

I’m going to return to the first program, since it’s more in my style of programming. Let’s see … the man page for atoi suggests strtol detects errors. Let’s give that a try. The type signature of strtol is

long int strtol(const char *nptr, char **endptr, int base);

We’d wanted to read integers, not long integers, but I guess that’s a start. How do we know if we’ve encountered something other than a sequence of digits? Well, if there’s something after the digits, endptr should point to that thing. So, I guess we need to pass in an extra pointer and then see whether it points to the null character ('\0') that ends the string. If not, there was extra text. If so, then we’re okay. Since I am going to have to do the same check more than once, it’s probably better to write a helper procedure. Here’s my first attempt.

/**
 * Convert a string to an integer, which it stores in *ip.  Returns 1 if
 * it succeeds and 0 otherwise.
 */
int
str2int (char *str, int *ip)
{
  char *extra;
  long result = strtol (str, &extra, 10);
  if (*extra != '\0')
    return 0;
  *ip = (int) result;
  return 1;
} // str2int

Here’s how I’m going to use it.

  // Convert to integers
  if (! str2int (argv[1], &val1))
    {
      fprintf (stderr,
               "For parameter one, expected an integer, received '%s'.\n",
               argv[1]);
      return 1;
    }
  if (! str2int (argv[2], &val2))
    {
      fprintf (stderr,
               "For parameter two, expected an integer, received '%s'.\n",
               argv[2]);
      return 1;
    }

Let’s see if it works.

$ ./sum 2 6
8
$ ./sum 23hello 15
For parameter one, expected an integer, received '23hello'.
$ ./sum 23 two
For parameter two, expected an integer, received 'two'.

Yay! That’s much better. But what about too-large integers?

$ ./sum 1 10000000000
1410065409

Nope. That didn’t work. But we didn’t really expect it to. We were converting a long to an int without checking for safety. Let’s do that now. Note that we’ll need to include <limits.h> to get INT_MAX and INT_MIN.

/**
 * Convert a string to an integer, which it stores in *ip.  Returns 1 if
 * it succeeds and 0 otherwise.
 */
int
str2int (char *str, int *ip)
{
  char *extra;
  long result = strtol (str, &extra, 10);
  if (*extra != '\0')
    return 0;
  if (result < INT_MIN)
    return 0;
  if (result > INT_MAX)
    return 0;
  *ip = (int) result;
  return 1;
} // str2int

Now, let’s try it out.

$ ./sum 1 10000000000
For parameter two, expected an integer, received '10000000000'.

Eh. That’s not the best error message. But at least it’s an error message. I’m sure that you can figure something out to improve it [5]. But let’s try something different: let’s make our program work with long values, rather than with int values. I think there are only a few lines to change, including eliminating those new lines we just added.

Here’s the new version of str2int, now renamed str2long.

/**
 * Convert a string to a long, which it stores in *lp.  Returns 1 if
 * it succeeds and 0 otherwise.
 */
int
str2long (char *str, long *lp)
{
  char *extra;
  *lp = strtol (str, &extra, 10);
  if (*extra != '\0')
    return 0;
  return 1;
} // str2long

We also have to change the declarations of val1 and val2. And we have to change the final print statement. That’s not so bad. Does it make a difference? Let’s see ….

$ ./sum 10000000000 1
10000000001

Okay, we can now deal with something slightly larger than the largest int. But what happens if the input is larger than the largest long?

$ ./sum 10000000000000000000 0
9223372036854775807
$ ./sum 10000000000000000000 1
-9223372036854775808
$ ./sum 1000000000000000000023 0
9223372036854775807
$ ./sum -1000000000000000000023 0
-9223372036854775808

Hmmm … it appears that the designers of strtol decided that it should return LONG_MAX for any input that represents an integer larger than LONG_MAX (and LONG_MIN for any input that represents an integer smaller than LONG_MIN). The one disadvantage of that approach is that the return value alone can’t tell us the difference between reading one of those extreme values and reading something outside of the bounds. (strtol also sets errno to ERANGE in the out-of-range case, so a caller who clears errno before the call can tell the two apart, but that is an easy step to forget.) What next? Well, I suppose we could write our own [6].

At the core, converting a string to an integer is straightforward. You read each digit in turn. You multiply your prior result by ten, and add the next digit. If we just add a bit of error-checking code, we should be fine. Here’s the revised procedure, in the context of the slightly updated full program.

/**
 * sum.c
 *   Compute the sum of two values entered on the command line.
 *
 * <insert appropriate open source license>
 */

// +---------+-------------------------------------------------------
// | Headers |
// +---------+

#include <stdio.h>      // For printf and fprintf
#include <stdlib.h>     // Not used in this version; left over from strtol
#include <limits.h>     // For LONG_MAX and LONG_MIN
#include <ctype.h>      // For isdigit and such.

// +---------+-------------------------------------------------------
// | Helpers |
// +---------+

/**
 * Convert a digit to an integer.  Does not do any sanity checking!
 */
int
convertDigit (char digit)
{
  return digit - '0';
} // convertDigit

/**
 * Convert a string to a long, which it stores in *lp.  
 *
 * Returns 
 *   0, upon success
 *   1, if the string ends before any digits appear
 *   2, if the string includes non-digit characters other than initial 
 *      whitespace and a leading sign
 *   3, if the string represents a value outside the bounds of longs
 *   4, for other errors
 */
int
str2long (char *str, long *lp)
{
  long result = 0;
  long sign = 1;

  // Skip over whitespace
  while ((*str != '\0') && (isspace ((unsigned char) *str)))
    str++;

  // Check for the sign
  if (*str == '-') 
    {
      str++;
      sign = -1;
    }
  else if (*str == '+')
    {
      str++;
    }

  // Sanity check
  if (!*str)
    return 1;

  // Read all of the digits
  while (isdigit ((unsigned char) *str))
    {
      long increment = sign * convertDigit (*str);
      // Upper-bound check
      if ((sign == 1) && (result > (LONG_MAX - increment) / 10))
        return 3;
      // Lower-bound check
      if ((sign == -1) && (result < (LONG_MIN - increment) / 10))
        return 3;
      // Update the result
      result = result*10 + increment;
      // And move on to the next character
      str++;
    } // while

  // Sanity check
  if (*str != '\0')
    return 2;

  // I think that's it.
  *lp = result;
  return 0;
} // str2long

/**
 * Report on usage.
 */
void
usage ()
{
  fprintf (stderr, "Usage: sum INT1 INT2\n");
} // usage

// +------+----------------------------------------------------------
// | Main |
// +------+

int
main (int argc, char *argv[])
{
  long val1;
  long val2;
  int err;

  // Sanity check
  if (argc != 3)
    {
      fprintf (stderr,
               "Incorrect number of parameters: Expected 2 params, got %d\n",
               argc - 1);
      usage ();
      return 1;
    }

  // Convert to integers
  if ((err = str2long (argv[1], &val1)) != 0)
    {
      fprintf (stderr, 
               "Error %d: "
               "For parameter one, "
               "expected an integer in the range %ld to %ld, "
               "received '%s'.\n",
               err, LONG_MIN, LONG_MAX, argv[1]);
      return 1;
    }
  if ((err = str2long (argv[2], &val2)) != 0)
    {
      fprintf (stderr, 
               "Error %d: "
               "For parameter two, "
               "expected an integer in the range %ld to %ld, "
               "received '%s'.\n",
               err, LONG_MIN, LONG_MAX, argv[2]);
      return 1;
    }

  // Do the real work
  printf ("%ld\n", val1+val2);

  // And we're done
  return 0;
} // main
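The two bound checks inside the digit loop are worth a second look. Computing result*10 + increment and then comparing it to LONG_MAX would be too late, since the overflow would already have happened. So the checks rearrange result*10 + increment <= LONG_MAX into result <= (LONG_MAX - increment) / 10, whose right-hand side cannot overflow, and do the mirror image for LONG_MIN. One way to gain some confidence in that rearrangement is to apply the same test to a tiny range, where every case can be compared against direct arithmetic. The sketch below does that. The names wouldOverflow and countDisagreements are made up for this illustration, and it assumes lo <= -9 and hi >= 9.

```c
/**
 * Hypothetical helper: the same test str2long applies before it
 * computes result*10 + increment, but with explicit bounds lo and hi
 * in place of LONG_MIN and LONG_MAX.  Returns 1 if the update would
 * leave the range lo..hi and 0 otherwise.
 */
int
wouldOverflow (long result, long increment, long sign, long lo, long hi)
{
  if ((sign == 1) && (result > (hi - increment) / 10))
    return 1;
  if ((sign == -1) && (result < (lo - increment) / 10))
    return 1;
  return 0;
} // wouldOverflow

/**
 * Compare wouldOverflow to direct arithmetic for every in-range
 * result and every digit.  Assumes lo <= -9, hi >= 9, and bounds
 * small enough that result*10 + increment fits in a long.  Returns
 * the number of cases in which the two disagree.
 */
int
countDisagreements (long lo, long hi)
{
  int disagreements = 0;
  for (long sign = -1; sign <= 1; sign += 2)
    {
      // str2long keeps result >= 0 when sign is 1 and <= 0 when it is -1
      long first = (sign == 1) ? 0 : lo;
      long last = (sign == 1) ? hi : 0;
      for (long result = first; result <= last; result++)
        for (long digit = 0; digit <= 9; digit++)
          {
            long increment = sign * digit;
            long next = result * 10 + increment;
            int actual = (next < lo) || (next > hi);
            if (actual != wouldOverflow (result, increment, sign, lo, hi))
              disagreements++;
          } // for
    } // for
  return disagreements;
} // countDisagreements
```

Calling countDisagreements (-128, 127) should return 0, which suggests that the same checks behave correctly at the much larger bounds of a long.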

How well does it work? Let’s see.

$ sum 2 111
113
$ sum -15 23
8
$ sum 41 -100
-59
$ sum -10 -10
-20
$ sum 9223372036854775807 1
-9223372036854775808
$ sum 9223372036854775807 0
9223372036854775807
$ sum 9223372036854775808 0
Error 3: For parameter one, expected an integer in the range -9223372036854775808 to 9223372036854775807, received '9223372036854775808'.
$ sum 0 9223372036854775808 
Error 3: For parameter two, expected an integer in the range -9223372036854775808 to 9223372036854775807, received '9223372036854775808'.
$ sum 0 -9223372036854775808 
-9223372036854775808
$ sum -9223372036854775808 0
-9223372036854775808
$ sum -9223372036854775809 0
Error 3: For parameter one, expected an integer in the range -9223372036854775808 to 9223372036854775807, received '-9223372036854775809'.
$ sum 32.4 5
Error 2: For parameter one, expected an integer in the range -9223372036854775808 to 9223372036854775807, received '32.4'.
$ sum five six
Error 2: For parameter one, expected an integer in the range -9223372036854775808 to 9223372036854775807, received 'five'.
$ sum 5 six
Error 2: For parameter two, expected an integer in the range -9223372036854775808 to 9223372036854775807, received 'six'.

That seems pretty good. Of course, we may not quite be done. What if we want to support bases other than 10? What if we want to support other locales, with different sets of digits [7]? I suppose those are topics for future sections.

What morals should you take from this section? Well, one is that C libraries aren’t necessarily as well designed as one might hope. Another is that you have a responsibility to write robust code, even for simple programs. A third is that you should know the limits and capabilities of your utility functions. Unfortunately, many of the utility functions written in the early days of C are not particularly robust. A fourth, which you should have encountered already, is that it’s important to think about edge cases. In this instance, we wanted to check what happened when the value was too large (or too small). A fifth may be that when you write your own utility functions, it’s best to pass in pointers for the return values, and simply return success or failure.
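That last moral pairs naturally with the error codes that our final str2long returns. As a small sketch of where one might go next, here is a helper that turns those codes into text, so that the error message can say what actually went wrong instead of always quoting the range. The name str2longMessage and the wording of the messages are made up for this illustration.

```c
/**
 * Hypothetical helper: describe an error code returned by str2long.
 * The codes are the ones listed in the comment on str2long.
 */
const char *
str2longMessage (int code)
{
  switch (code)
    {
    case 0:
      return "no error";
    case 1:
      return "no digits found";
    case 2:
      return "unexpected non-digit character";
    case 3:
      return "value outside the bounds of longs";
    default:
      return "unknown error";
    } // switch
} // str2longMessage
```

In main, the call to fprintf could then print str2longMessage (err) alongside (or instead of) the numeric code.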


Exercises

  1. Rewrite the interactive program to use our new procedure.

  2. Although we can now successfully read integers, we have not yet made either sum or isum sufficiently robust. In particular, we have not accounted for what happens if the sum is larger than LONG_MAX or smaller than LONG_MIN. Revise both programs to deal with that issue.

  3. Rewrite str2long to take a base as a parameter.


[1] I believe that the name comes from ASCII to integer [2].

[2] ASCII is the American Standard Code for Information Interchange. It is pronounced as key.

[3] More precisely, my colleague would have students write an interactive program.

[4] scanf does not consume input that fails to match. Hence, when it failed to find an integer, it left the read head at the start of five. So when we called scanf again, it looked at five once more and again failed to read an integer.

[5] We could, for example, return a code for the type of error.

[6] I had considered giving this as a homework assignment, but I think it’s useful to look at one possibility.

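[7] Internationalization, or i18n, as many refer to it, is an important issue for the modern programmer. Note that isdigit, unlike most of the functions in <ctype.h>, does not depend on the locale: it accepts only the characters 0 through 9. Supporting other sets of digits would mean replacing both isdigit and convertDigit.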


Version 1.0 of 2017-01-08.