
Counting Words

Word counting demonstrates the elegance of process pipelines, showing how complex text analysis can be built from simple Unix tools chained together.

Basic Word Counting

The simplest approach uses the wc command to count total words in a file:

import { run } from "jsr:@j50n/proc@0.24.6";

const output = await run("wc", "-w", "book.txt").lines.first;
// `wc -w file` prints the count followed by the filename, so keep only the first field
const wordCount = output?.trim().split(/\s+/)[0];
console.log(`Total words: ${wordCount}`);

Finding Unique Words

To count unique words, you need to extract individual words, normalize their case, and eliminate duplicates. This pipeline breaks text into words, converts everything to lowercase, sorts the results, and removes duplicates:

const uniqueWords = await run("cat", "book.txt")
  .run("tr", "-cs", "A-Za-z", "\n") // Extract words
  .run("tr", "A-Z", "a-z") // Lowercase
  .run("sort") // Sort
  .run("uniq") // Unique
  .lines
  .count();

console.log(`Unique words: ${uniqueWords}`);

Analyzing Word Frequency

For more sophisticated analysis, you can find the most frequently used words by adding frequency counting and sorting by occurrence:

const topWords = await run("cat", "book.txt")
  .run("tr", "-cs", "A-Za-z", "\n")
  .run("tr", "A-Z", "a-z")
  .run("sort")
  .run("uniq", "-c")
  .run("sort", "-rn")
  .run("head", "-10")
  .lines
  .collect();

console.log("Top 10 words:");
topWords.forEach((line) => console.log(line));
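Each line of `uniq -c` output has the shape `   1234 word`, with the count padded by spaces. If you want structured data rather than raw lines, a small parser helps. This is an illustrative sketch; `parseUniqCount` is a hypothetical helper, not part of proc:

```typescript
// Hypothetical helper (not part of proc): parse one line of `uniq -c`
// output, e.g. "   1234 the", into a { count, word } pair.
function parseUniqCount(line: string): { count: number; word: string } {
  const match = line.trim().match(/^(\d+)\s+(.*)$/);
  if (match === null) {
    throw new Error(`unexpected uniq -c line: ${JSON.stringify(line)}`);
  }
  return { count: Number(match[1]), word: match[2] };
}
```

With this in place, `topWords.map(parseUniqCount)` yields objects you can feed into further analysis instead of printing raw lines.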

Pure JavaScript Version

The same analysis can be done entirely in JavaScript, with no external processes:

import { read } from "jsr:@j50n/proc@0.24.6";

const wordCounts = await read("book.txt")
  .lines
  .flatMap(line => line.toLowerCase().match(/\w+/g) || [])
  .reduce((acc, word) => {
    acc[word] = (acc[word] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);

const topWords = Object.entries(wordCounts)
  .sort((a, b) => b[1] - a[1])
  .slice(0, 10);

console.log("Top 10 words:");
topWords.forEach(([word, count]) => {
  console.log(`${count} ${word}`);
});

Compressed Files

Count words in a compressed file:

const wordCount = await read("book.txt.gz")
  .transform(new DecompressionStream("gzip"))
  .lines
  .flatMap((line) => line.match(/\w+/g) || [])
  .count();

console.log(`Total words: ${wordCount}`);

Multiple Files

Count words across multiple files:

import { enumerate, read } from "jsr:@j50n/proc@0.24.6";

const files = ["book1.txt", "book2.txt", "book3.txt"];

const results = await enumerate(files)
  .concurrentMap(async (file) => {
    const words = await read(file)
      .lines
      .flatMap((line) => line.match(/\w+/g) || [])
      .count();
    return { file, words };
  }, { concurrency: 3 })
  .collect();

results.forEach(({ file, words }) => {
  console.log(`${file}: ${words} words`);
});
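Because `results` is an ordinary array of `{ file, words }` objects, summing a grand total across files is a plain reduce. A minimal sketch; the sample data below is made up to stand in for the `concurrentMap` output:

```typescript
// Sum per-file word counts into a grand total. The data here is
// illustrative; in the example above it comes from concurrentMap.
const results = [
  { file: "book1.txt", words: 1200 },
  { file: "book2.txt", words: 800 },
  { file: "book3.txt", words: 500 },
];

const totalWords = results.reduce((sum, { words }) => sum + words, 0);
console.log(`Total across ${results.length} files: ${totalWords} words`);
```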

Filter Stop Words

Exclude common words:

const stopWords = new Set([
  "the",
  "a",
  "an",
  "and",
  "or",
  "but",
  "in",
  "on",
  "at",
  "to",
  "for",
]);

const meaningfulWords = await read("book.txt")
  .lines
  .flatMap((line) => line.toLowerCase().match(/\w+/g) || [])
  .filter((word) => !stopWords.has(word))
  .reduce((acc, word) => {
    acc[word] = (acc[word] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);
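The resulting record maps each remaining word to its count; to display it, sort the entries by count. A small sketch with inline sample data; `topN` is a hypothetical helper, and the sample counts are invented:

```typescript
// Hypothetical helper: return the n highest-count entries of a
// word -> count record, most frequent first.
function topN(counts: Record<string, number>, n: number): [string, number][] {
  return Object.entries(counts)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}

// Illustrative data; in the example above this comes from the reduce.
const sample = { war: 300, peace: 250, prince: 120, moscow: 90 };
console.log(topN(sample, 2)); // the two most frequent entries
```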

Word Length Distribution

Analyze word lengths:

const lengthDist = await read("book.txt")
  .lines
  .flatMap((line) => line.match(/\w+/g) || [])
  .reduce((acc, word) => {
    const len = word.length;
    acc[len] = (acc[len] || 0) + 1;
    return acc;
  }, {} as Record<number, number>);

console.log("Word length distribution:");
Object.entries(lengthDist)
  .sort((a, b) => parseInt(a[0]) - parseInt(b[0]))
  .forEach(([len, count]) => {
    console.log(`${len} letters: ${count} words`);
  });

Real-World Example: War and Peace

Analyze Tolstoy’s War and Peace:

const [totalWords, uniqueWords] = await Promise.all([
  // Total words
  read("warandpeace.txt.gz")
    .transform(new DecompressionStream("gzip"))
    .lines
    .flatMap((line) => line.match(/\w+/g) || [])
    .count(),

  // Unique words
  read("warandpeace.txt.gz")
    .transform(new DecompressionStream("gzip"))
    .lines
    .flatMap((line) => line.toLowerCase().match(/\w+/g) || [])
    .reduce((acc, word) => {
      acc.add(word);
      return acc;
    }, new Set<string>())
    .then((set) => set.size),
]);

console.log(`Total words: ${totalWords.toLocaleString()}`);
console.log(`Unique words: ${uniqueWords.toLocaleString()}`);
console.log(
  `Vocabulary richness: ${(uniqueWords / totalWords * 100).toFixed(1)}%`,
);
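Vocabulary richness here is simply unique words as a percentage of total words. As a tiny pure function (`vocabularyRichness` is an illustrative name, not part of proc):

```typescript
// Unique-to-total word ratio, formatted as a percentage with one decimal.
function vocabularyRichness(unique: number, total: number): string {
  return `${(unique / total * 100).toFixed(1)}%`;
}

console.log(vocabularyRichness(500, 10000)); // prints "5.0%"
```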

Performance Comparison

Shell Pipeline (fast)

// Uses native Unix tools
const count = await run("cat", "book.txt")
  .run("wc", "-w")
  .lines.first;

JavaScript (flexible)

// More control, type-safe
const count = await read("book.txt")
  .lines
  .flatMap((line) => line.match(/\w+/g) || [])
  .count();

Hybrid (best of both)

// Use Unix tools for heavy lifting, JavaScript for logic
const words = await run("cat", "book.txt")
  .run("tr", "-cs", "A-Za-z", "\n")
  .lines
  .filter((word) => word.length > 5) // JavaScript filter
  .count();

Next Steps