The joy of code: Finding repeated words
In a recent musing, I wrote about the problem of identifying repeated words in my writing. I claimed that it took me about ten minutes to write a short script to do so. Let’s see how.
I started by reflecting on the steps I’d have to use.
I know from experience that my life will be easier if I have one word per line. That means changing any sequences of non-word characters into newlines.
That technique is likely to create some blank lines. I also have blank lines in my original essay. I should delete those lines.
I don’t want to differentiate between "The" and "the", so I should turn all uppercase letters into lowercase.
It is likely easier to count if I put the words into alphabetical order. Doing so ensures that I can look for neighboring lines that are the same.
I should then do the central goal of this project: counting the number of times each word appears, and presenting a list of (word/count) pairs [1].
So that I can tell which words I use most, I should then put that list in order by count.
I also came up with a few other steps that are likely to be useful, but not strictly necessary [2].
So, let’s see how we do all of those steps, using this musing as an example.
I didn’t mention that I need to start with the contents of the musing, but that’s really where I have to start. So, let’s see …
$ cat joc-counting-repeats.md
The joy of code: Finding repeated words
=======================================
In [a recent musing](grammarly-repeated-words), I wrote about the problem of identifying repeated words in my writing. I claimed that it took me about ten minutes to write a short script to do so. Let's see how.
I started by thinking about the steps I'd have to use.
1. I know from experience that my life will be easier if I have one word
...
Next, I need to insert a newline for the non-word characters. I tend to use sed, the stream editor, for non-interactive editing. I’ll start by identifying word characters. Those will be the lowercase letters, the uppercase letters [3], and the apostrophe [4,5]. In sed, I identify a set of characters by surrounding it with square brackets. I negate that set by putting a caret after the open bracket. So I’ll use [^a-zA-Z']. I want one or more copies, so I add a + at the end. I want to replace that pattern with a newline, which is represented as \n. Okay, here goes.
$ cat joc-counting-repeats.md \
    | sed -re "s/[^a-zA-Z']+/\n/g"
The
joy
of
code
Finding
repeated
words


In
a
recent
musing
grammarly
repeated
words
I
wrote
...
As I predicted, I ended up with a bunch of blank lines. I didn’t think to get rid of the link, so the words in the link [6] appear in this list. I should fix that problem. I’ll do so after I get the basic set of instructions working.
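One way the link problem could be handled, sketched here under assumptions rather than as the fix this musing ends up using: delete the parenthesized target of each inline Markdown link before splitting into words. The helper name strip_link_targets is made up for this example, and the pattern only covers simple links of the form [text](target).

```shell
# Hypothetical helper: drop the "(target)" part of inline Markdown links,
# keeping the bracketed link text.
strip_link_targets() {
  sed -E 's/\]\([^)]*\)/]/g'
}

# Try it on the first line of the musing's opening paragraph.
printf '%s\n' 'In [a recent musing](grammarly-repeated-words), I wrote' \
  | strip_link_targets
```

On that sample line, the output is "In [a recent musing], I wrote". In the pipeline, such a step would sit between cat and the sed command that splits words.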
Let’s get rid of the blank lines. The grep program extracts lines from a file. The -v flag means "anything but": it keeps the lines that do not match. The pattern ^$ matches a blank line [7].
$ cat joc-counting-repeats.md \
    | sed -re "s/[^a-zA-Z']+/\n/g" \
    | grep -v '^$'
The
joy
of
code
Finding
repeated
words
In
a
recent
musing
grammarly
repeated
words
I
wrote
about
the
problem
of
...
Next, I want to turn uppercase letters into lowercase. While I could use sed, the tr program works nicely for single-letter substitutions. I’ll use that.
$ cat joc-counting-repeats.md \
    | sed -re "s/[^a-zA-Z']+/\n/g" \
    | grep -v '^$' \
    | tr [:upper:] [:lower:]
the
joy
of
code
finding
repeated
words
in
...
That worked as well as I expected. Getting the words in alphabetical order is simple, as *nix has a built-in sort command.
$ cat joc-counting-repeats.md \
    | sed -re "s/[^a-zA-Z']+/\n/g" \
    | grep -v '^$' \
    | tr [:upper:] [:lower:] \
    | sort
'
'
'
'
'
a
a
a
a
a
...
a
about
about
about
about
about
about
about
about
add
after
after
all
all
all
alphabetical
alphabetical
...
Whoops! It appears that I left in some single quotation marks [8]. I wonder why [9]. In any case, I don’t really want them. I’ll add removing them to my list of future work.
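A minimal sketch of how the stray ticks might be removed, offered as an assumption about one workable approach: trim apostrophes from the start and end of every line, then drop whatever ends up empty. The helper name trim_ticks is invented for this example. One known trade-off is that a plural possessive such as "writers'" would lose its trailing apostrophe.

```shell
# Hypothetical helper: trim apostrophes from the start and end of each
# line, so a lone "'" becomes an empty line while "it's" is untouched.
trim_ticks() {
  sed -E "s/^'+//; s/'+\$//"
}

# A lone tick, two contractions, and a word wrapped in single quotes.
printf '%s\n' "'" "it's" "'quoted'" "wouldn't" \
  | trim_ticks \
  | grep -v '^$'
```

For that input, the output is the three lines it's, quoted, and wouldn't. In the pipeline, the step would go after the word-splitting sed command and before grep -v '^$', so the emptied lines are discarded with the other blank ones.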
Now I want to count. We’ve now hit the first tool that I don’t regularly use. Let’s see … a Web search tells me that uniq -c counts repeated lines, but only if they appear in sequence. Let’s try.
$ cat joc-counting-repeats.md \
    | sed -re "s/[^a-zA-Z']+/\n/g" \
    | grep -v '^$' \
    | tr [:upper:] [:lower:] \
    | sort \
    | uniq -c
  12 '
  32 a
  16 about
   3 add
   4 after
   6 all
   4 alphabetical
   3 also
   1 an
   7 and
   2 any
   2 anything
   2 apostrophe
   2 appear
   2 appears
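The "only if they appear in sequence" caveat is the reason sort has to come before uniq -c. A small illustration on made-up input follows. The helper count_runs is hypothetical and exists only to strip the leading padding that uniq -c adds, since the amount of padding differs between systems.

```shell
# Hypothetical helper: normalize `uniq -c` output to "count word".
count_runs() {
  uniq -c | awk '{ print $1, $2 }'
}

# Unsorted: the two "the" lines are not adjacent, so each is counted alone.
printf '%s\n' the joy the | count_runs

# Sorted first: equal lines are adjacent, so they are counted together.
printf '%s\n' the joy the | sort | count_runs
```

The first command prints "1 the", "1 joy", "1 the"; the second prints "1 joy", "2 the".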
Getting rid of the tick is definitely on my goal list! What next? I want these in order from most frequently occurring to least frequently occurring. That sounds like another job for sort. This time, I’m sorting numbers, and want them from largest to smallest, so I add the -nr flags [10].
$ cat joc-counting-repeats.md \
    | sed -re "s/[^a-zA-Z']+/\n/g" \
    | grep -v '^$' \
    | tr [:upper:] [:lower:] \
    | sort \
    | uniq -c \
    | sort -nr
  58 i
  48 the
  34 a
  26 to
  23 of
  22 that
  19 in
  17 about
  16 words
  15 '
  14 pre
...
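To see why the n matters and not just the r, here is a small comparison on three made-up count/word lines, written without the leading padding that uniq -c adds. The function name sample is an assumption for this sketch only.

```shell
# Three made-up count/word lines (unpadded, unlike real `uniq -c` output).
sample() {
  printf '%s\n' '9 words' '58 i' '100 the'
}

# -r alone compares the lines as text, so "9" sorts ahead of "58" and "100".
sample | sort -r

# -nr compares the leading numbers, largest first.
sample | sort -nr
```

The text sort prints 9, 58, 100, which is not largest-first; the numeric sort prints 100, 58, 9.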
You know what I just realized? By putting in the sample output, I’m changing my word counts significantly. Oh well, I guess that’s how things go when you’re writing an example late at night [11].
In any case, I am now done with the base program. I’ve gotten the counts of the individual words that appear in this musing [12]. If I were taking my editing seriously, I could now see where they appear to decide if I want to make changes to avoid too much unnecessary repetition [14].
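Seeing where a frequent word appears could be done with grep as well. The sketch below is an assumption about one convenient approach, not part of the ten-minute script: the helper name where_word and the sample file are invented, and the -w (whole word) flag, while available in both GNU and BSD grep, is not required by POSIX.

```shell
# Hypothetical helper: print "line-number:line" for each line of FILE
# that contains WORD as a whole word, ignoring case.
where_word() {
  grep -n -i -w -- "$1" "$2"
}

# Small made-up sample file so the example runs on its own.
sample=$(mktemp)
printf '%s\n' 'About ten minutes.' 'Nothing here.' 'I wrote about it.' > "$sample"
where_word about "$sample"
rm -f "$sample"
```

For the sample file, this prints "1:About ten minutes." and "3:I wrote about it.", which is enough to jump to each occurrence in an editor.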
I’ll leave the other minor [15] extensions to another day.
[1] I might also do count/word pairs, depending on how I feel.
[2] If I were releasing this program as a tool for others, I would work on making it a bit more precise.
[3] Maybe I should convert all uppercase to lowercase before I do anything else. Oh well, I did say that I wanted to do this quickly, and that I could improve it before releasing it to others.
[4] I’d like to support words like "wouldn’t" and "it’s".
[5] Some people call the ' symbol "tick". That’s certainly easier to pronounce than "apostrophe".
[6] "grammarly", "repeated", and "words".
[7] In this case, the caret represents the start of the line and the dollar sign represents the end of the line.
[8] a.k.a. "ticks".
[9] Upon reflection, it’s because the term '^$' appears in multiple examples.
[10] The n is for "numeric". The r is for "reverse".
[11] It could be an interesting exercise to see how long it would take me to reach a fixed point.
[12] More precisely, in a sequence of versions of this musing.
[14] You should read the prior musing for my thoughts on repeated words.
[15] Or perhaps not so minor.