Counting Words

This shell script counts total and unique words:

#!/bin/bash
set -e

# total word count
zcat ./warandpeace.txt.gz \
  | tr '[:upper:]' '[:lower:]' \
  | grep -oE "(\\w|'|’|-)+" \
  | wc -l 

# count unique words
zcat ./warandpeace.txt.gz \
  | tr '[:upper:]' '[:lower:]' \
  | grep -oE "(\\w|'|’|-)+" \
  | sort \
  | uniq \
  | wc -l

There are multiple ways to do the same thing in Deno using proc. You can run it in-process as a pure TypeScript/JavaScript solution, run it as an embedded shell script, or translate each command of the shell script into run methods.

⚠️ The tr used to convert to lowercase is not fully Unicode-aware. Expect the counts to differ a little between this code and the code that uses JavaScript's .toLocaleLowerCase(), which handles Unicode correctly.
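
For example, JavaScript case conversion understands letters outside ASCII, while a byte-oriented tr will typically pass them through unchanged. A quick illustration (not part of the word-count script):

console.log("ÉMIGRÉ".toLocaleLowerCase());      // émigré
console.log("ВОЙНА И МИР".toLocaleLowerCase()); // война и мир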

Direct Translation from Bash

This is the equivalent of the shell script using proc methods. It substitutes gunzip for zcat, parses each output into a number, and runs the two pipelines concurrently (and in parallel), since that is easy to do. Otherwise it is a direct translation: proc just controls the streaming from process to process, and all the same child processes are being launched.

const [total, unique] = await Promise.all([
  read(fromFileUrl(import.meta.resolve("./warandpeace.txt.gz")))
    .run("gunzip")
    .run("tr", "[:upper:]", "[:lower:]")
    .run("grep", "-oE", "(\\w|'|’|-)+")
    .run("wc", "-l")
    .lines
    .map((n) => parseInt(n, 10))
    .first,

  read(fromFileUrl(import.meta.resolve("./warandpeace.txt.gz")))
    .run("gunzip")
    .run("tr", "[:upper:]", "[:lower:]")
    .run("grep", "-oE", "(\\w|'|’|-)+")
    .run("sort")
    .run("uniq")
    .run("wc", "-l")
    .lines
    .map((n) => parseInt(n, 10))
    .first,
]);

console.log(total);
console.log(unique);

Embedding a Shell Script

Another approach is to embed the shell script. No translation is required: the bash script is run as-is using /bin/bash, which moves the entire workload and its management into other processes. Consider this solution if your application is doing lots of other things concurrently.

Note that you give up some control over error handling with this approach, so be sure to test for the kinds of errors you expect to encounter (a basic error-handling sketch follows the example below). Shell scripts are notorious for edge-case bugs, which is why we reach for a "real" programming language when things start to get complex.

This is also a simple example of a generated script. We are injecting the full path of our text file as determined by the Deno script.

This example shows the total count.

await run(
  "/bin/bash",
  "-c",
  ` set -e
    zcat "${fromFileUrl(import.meta.resolve("./warandpeace.txt.gz"))}" \
      | tr '[:upper:]' '[:lower:]' \
      | grep -oE "(\\w|'|’|-)+" \
      | wc -l
  `,
)
  .lines
  .forEach((line) => console.log(line));
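
As noted above, error handling is coarser with an embedded script, so it is worth seeing what a failure looks like. This is a minimal sketch, assuming proc's default behavior of throwing an error when the child process exits with a non-zero status; the missing file is deliberate.

try {
  await run(
    "/bin/bash",
    "-c",
    // `pipefail` matters here: without it, a failing `zcat` is masked
    // because `wc -l` (the last command in the pipeline) still exits 0.
    ` set -e -o pipefail
      zcat "./no-such-file.txt.gz" | wc -l
    `,
  )
    .lines
    .forEach((line) => console.log(line));
} catch (error) {
  console.error("The embedded script failed:", error);
}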

Doing All the Work in Deno

This is a streaming solution that stays fully in Deno, in a single TypeScript/JavaScript VM (no child processes at all). It avoids (most of) the memory overhead that would be needed to process the document in memory (non-streaming), and it is fast.

This demonstrates transformer composition in proc. Because transformers are just functions of iterable collections, you can compose them into logical units the same way you would any other code.
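
Concretely, a transformer is just a function from one (async) iterable to another. The type alias and compose helper below are not part of proc; they are only a sketch of the idea:

type Transformer<A, B> = (input: AsyncIterable<A>) => AsyncIterable<B>;

// Hypothetical helper: feed the output of one transformer into the next
// to build a larger logical unit. Chaining `.transform()` calls does
// essentially this, one step at a time.
function compose<A, B, C>(
  first: Transformer<A, B>,
  second: Transformer<B, C>,
): Transformer<A, C> {
  return (input) => second(first(input));
}

The split and distinct transformers defined below have exactly this shape, which is why they can be handed straight to .transform().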

Transformer for Unique Words

We could shell out to sort and uniq, but this way is much faster. It only needs a little extra memory. It dumps the words, one at a time, into a Set. Then it yields the contents of the Set.

The set of unique words is much smaller than the original document, so the memory required is quite small.

export async function* distinct(words: AsyncIterable<string>) {
  // Collect every word into a Set (which drops duplicates), then yield
  // the unique words.
  const uniqueWords = new Set<string>();
  for await (const word of words) {
    uniqueWords.add(word);
  }
  yield* uniqueWords;
}
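
As a quick sanity check, distinct can be exercised on its own with a small in-memory list. This assumes enumerate will wrap a plain array and that collect gathers the results into an array:

const sample = enumerate(["war", "and", "peace", "and", "war"]);
console.log(await sample.transform(distinct).collect()); // [ "war", "and", "peace" ]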

Transformer to Split into Words

Convert each line to lower case. Use a regular expression to split the line into words. Remove anything that contains no letters (bare symbols), anything that contains a number, and "CHAPTER" titles. The symbol characters in the regular expression are specific to the test document and probably won't work generally.

The document we are targeting, ./warandpeace.txt.gz, uses extended Unicode letters and a few Unicode symbols as well. We know that the TypeScript solution below works correctly with Unicode characters (note the u flag on the regular expression). Some of the *nix utilities were written a long time ago and still do not support Unicode. In particular, tr does not translate case correctly all of the time, and I am not sure what grep is doing - it sort of works, but its regular expression language has subtle differences from what I am used to. A benefit of working in a tightly spec'd language like TypeScript is that you know what your code should be doing at all times. The counts are very close, but they are not exactly the same, so we know something is a little bit off with tr and/or grep.

export function split(lines: AsyncIterable<string>) {
  return enumerate(lines)
    .map((it) => it.toLocaleLowerCase())
    .flatMap((it) =>
      [...it.matchAll(/(\p{L}|\p{N}|['’-])+/gu)]
        .map((a) => a[0])
    )
    .filterNot((it) =>
      /^['’-]+$/.test(it) ||
      /[0-9]/.test(it) ||
      /CHAPTER/.test(it)
    );
}

Putting It All Together

Read the file. Uncompress it and convert it to lines (strings). Then use the transformer function we created earlier, split, to break the lines into words.

const words = read(
  fromFileUrl(import.meta.resolve("./warandpeace.txt.gz")),
)
  .transform(gunzip)
  .transform(toLines)
  .transform(split);

Now we need (1) a count of all words and (2) a count of unique words. We can use tee to create two copies of the stream, since we have to count twice. This gets around the limitation that an iterable can only be consumed once, and it means we don't have to do the work of splitting the document into words twice.

const [w1, w2] = words.tee();

We can count the words in the first copy directly. For the second copy, we use the distinct transformer before counting.

const [count, unique] = await Promise.all([
  w1.count(),
  w2.transform(distinct).count(),
]);

console.log(`Total word count:  ${count.toLocaleString()}`);
console.log(`Unique word count: ${unique.toLocaleString()}`);

The results:

Total word count:  563,977
Unique word count: 18,609

Clean, readable code. Understandable error handling. Fast. The only downside is that the processing is done in-process (we only have one thread to work with in JavaScript). If you are doing other things at the same time, this will slow them down.