Counting Words

A classic example that shows the power of process pipelines.

Simple Word Count

Count total words in a file:

import { run } from "jsr:@j50n/proc@0.23.3";

const wordCount = await run("wc", "-w", "book.txt").lines.first;
console.log(`Total words: ${wordCount}`);

Unique Words

Count unique words:

const uniqueWords = await run("cat", "book.txt")
  .run("tr", "-cs", "A-Za-z", "\n")  // Extract words
  .run("tr", "A-Z", "a-z")            // Lowercase
  .run("sort")                         // Sort
  .run("uniq")                         // Unique
  .lines
  .count();

console.log(`Unique words: ${uniqueWords}`);

Word Frequency

Find most common words:

const topWords = await run("cat", "book.txt")
  .run("tr", "-cs", "A-Za-z", "\n")
  .run("tr", "A-Z", "a-z")
  .run("sort")
  .run("uniq", "-c")
  .run("sort", "-rn")
  .run("head", "-10")
  .lines
  .collect();

console.log("Top 10 words:");
topWords.forEach(line => console.log(line));

Pure JavaScript Version

Do it all in JavaScript:

import { read } from "jsr:@j50n/proc@0.23.3";

const wordCounts = await read("book.txt")
  .lines
  .flatMap(line => line.toLowerCase().match(/\w+/g) || [])
  .reduce((acc, word) => {
    acc[word] = (acc[word] || 0) + 1;
    return acc;
  }, {});

const topWords = Object.entries(wordCounts)
  .sort((a, b) => b[1] - a[1])
  .slice(0, 10);

console.log("Top 10 words:");
topWords.forEach(([word, count]) => {
  console.log(`${count} ${word}`);
});

Compressed Files

Count words in a compressed file:

const wordCount = await read("book.txt.gz")
  .transform(new DecompressionStream("gzip"))
  .lines
  .flatMap(line => line.match(/\w+/g) || [])
  .count();

console.log(`Total words: ${wordCount}`);

Multiple Files

Count words across multiple files:

import { enumerate } from "jsr:@j50n/proc@0.23.3";

const files = ["book1.txt", "book2.txt", "book3.txt"];

const results = await enumerate(files)
  .concurrentMap(async (file) => {
    const words = await read(file)
      .lines
      .flatMap(line => line.match(/\w+/g) || [])
      .count();
    return { file, words };
  }, { concurrency: 3 })
  .collect();

results.forEach(({ file, words }) => {
  console.log(`${file}: ${words} words`);
});

Filter Stop Words

Exclude common words:

const stopWords = new Set([
  "the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for"
]);

const meaningfulWords = await read("book.txt")
  .lines
  .flatMap(line => line.toLowerCase().match(/\w+/g) || [])
  .filter(word => !stopWords.has(word))
  .reduce((acc, word) => {
    acc[word] = (acc[word] || 0) + 1;
    return acc;
  }, {});

Word Length Distribution

Analyze word lengths:

const lengthDist = await read("book.txt")
  .lines
  .flatMap(line => line.match(/\w+/g) || [])
  .reduce((acc, word) => {
    const len = word.length;
    acc[len] = (acc[len] || 0) + 1;
    return acc;
  }, {});

console.log("Word length distribution:");
Object.entries(lengthDist)
  .sort((a, b) => parseInt(a[0]) - parseInt(b[0]))
  .forEach(([len, count]) => {
    console.log(`${len} letters: ${count} words`);
  });

Real-World Example: War and Peace

Analyze Tolstoy's War and Peace:

const [totalWords, uniqueWords] = await Promise.all([
  // Total words
  read("warandpeace.txt.gz")
    .transform(new DecompressionStream("gzip"))
    .lines
    .flatMap(line => line.match(/\w+/g) || [])
    .count(),
  
  // Unique words
  read("warandpeace.txt.gz")
    .transform(new DecompressionStream("gzip"))
    .lines
    .flatMap(line => line.toLowerCase().match(/\w+/g) || [])
    .reduce((acc, word) => {
      acc.add(word);
      return acc;
    }, new Set())
    .then(set => set.size)
]);

console.log(`Total words: ${totalWords.toLocaleString()}`);
console.log(`Unique words: ${uniqueWords.toLocaleString()}`);
console.log(`Vocabulary richness: ${(uniqueWords / totalWords * 100).toFixed(1)}%`);

Performance Comparison

Shell Pipeline (fast)

// Uses native Unix tools
const count = await run("cat", "book.txt")
  .run("wc", "-w")
  .lines.first;

JavaScript (flexible)

// More control, type-safe
const count = await read("book.txt")
  .lines
  .flatMap(line => line.match(/\w+/g) || [])
  .count();

Hybrid (best of both)

// Use Unix tools for heavy lifting, JavaScript for logic
const words = await run("cat", "book.txt")
  .run("tr", "-cs", "A-Za-z", "\n")
  .lines
  .filter(word => word.length > 5)  // JavaScript filter
  .count();

Next Steps

Process Pipelines - Chain commands together
Concurrent Processing - Process multiple files
Streaming Large Files - Handle huge files

Keyboard shortcuts

proc