Data Transforms

Transform structured data between formats with streaming support and high performance.

⚠️ Experimental API (v0.24.0+): Data transforms are new and under active development. While tests pass and performance is reasonable, expect API changes and edge cases as we improve correctness and streaming performance. Production use should include thorough testing of your specific data patterns.

Choosing Your Approach

proc offers several ways to process data. Here’s how to choose:

Approach          | Best For                                     | Performance
------------------|----------------------------------------------|--------------
flatdata CLI      | Large files (100MB+), batch processing      | Highest
Data Transforms   | In-process conversion, filtering, enrichment | Good to High
Process Pipelines | Shell-like operations, text processing      | Varies
Async Iterables   | Custom logic, API data, any async source    | Varies

Decision guide:

  • Converting CSV/TSV/JSON files? → Data Transforms (this chapter)
  • Processing 100MB+ files for maximum speed? → flatdata CLI
  • Running shell commands and piping output? → Process Pipelines
  • Working with API responses or custom data? → Async Iterables (see the sketch after this list)
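
Transforms are not tied to files: anything that can be enumerated, including in-memory arrays or data pulled from an API, can run through the same pipeline via enumerate() from the core library. A minimal sketch, assuming enumerate() accepts a plain array of objects; the sample data and output filename are illustrative:

import { enumerate } from "jsr:@j50n/proc";
import { toJson } from "jsr:@j50n/proc/transforms";

// Illustrative in-memory data standing in for an API response.
const events = [
  { id: 1, severity: "high", message: "disk full" },
  { id: 2, severity: "low", message: "cache miss" },
];

// Filter in process, then serialize to JSON Lines on disk.
await enumerate(events)
  .filter((event) => event.severity === "high")
  .transform(toJson())
  .writeTo("high-severity.jsonl");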

Overview

The data transforms module converts between CSV, TSV, JSON, and Record formats. All transforms stream data without loading everything into memory.

Import

Data transforms are a separate import to keep the core library lightweight:

// Core library
import { enumerate, read, run } from "jsr:@j50n/proc";

// Data transforms (separate import)
import {
  fromCsvToRows,
  fromJsonToRows,
  fromRecordToRows,
  fromTsvToRows,
  toJson,
  toRecord,
  toTsv,
} from "jsr:@j50n/proc/transforms";

Quick Start

import { read } from "jsr:@j50n/proc";
import { fromCsvToRows, toTsv } from "jsr:@j50n/proc/transforms";

// Convert CSV to TSV
await read("data.csv")
  .transform(fromCsvToRows())
  .transform(toTsv())
  .writeTo("data.tsv");

Key Benefits

🚀 Streaming & Performance

  • Streaming design: Constant memory usage regardless of file size
  • LazyRow optimization: Faster CSV/TSV parsing when only a subset of fields is accessed
  • flatdata CLI: WASM-powered tool for very large files

📊 Format Support

  • CSV: Universal compatibility with proper RFC 4180 compliance
  • TSV: Fast, simple tab-separated format
  • JSON Lines: Full object structure preservation
  • Record: High-performance binary-safe format

🔄 Flexible Data Types

  • Row arrays: string[][] for simple tabular data
  • LazyRow: Optimized read-only access with lazy conversion
  • Objects: Full JSON object support with optional validation (all three shapes are sketched after this list)
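
A minimal sketch of how the three shapes look in practice, using the transforms shown later in this chapter. collect() is used only to keep the example short, and the element types follow the descriptions above; the library's exact generic signatures may differ in detail:

import { read } from "jsr:@j50n/proc";
import {
  fromCsvToLazyRows,
  fromCsvToRows,
  fromJsonToRows,
} from "jsr:@j50n/proc/transforms";

// Row arrays: each row is a string[], indexed by column position.
const rows = await read("data.csv")
  .transform(fromCsvToRows())
  .collect();
console.log(rows[0][1]);

// LazyRow: read-only access through getField(index).
const lazy = await read("data.csv")
  .transform(fromCsvToLazyRows())
  .collect();
console.log(lazy[0].getField(1));

// Objects: one JSON value per line of the input.
const objects = await read("events.jsonl")
  .transform(fromJsonToRows())
  .collect();
console.log(objects[0]);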

When to Use Each Format

CSV - Universal Compatibility

// Best for: Compatibility, Excel integration, legacy systems
await read("legacy-data.csv")
  .transform(fromCsvToRows())
  .transform(toRecord()) // Convert to faster format
  .writeTo("optimized.record");

TSV - Speed + Readability

// Best for: Fast processing, human-readable data
await read("logs.tsv")
  .transform(fromTsvToRows())
  .filter((row) => row[2] === "ERROR")
  .transform(toTsv())
  .writeTo("errors.tsv");

JSON Lines - Rich Objects

// Best for: Complex nested data, APIs, configuration
await read("events.jsonl")
  .transform(fromJsonToRows())
  .filter((event) => event.severity === "high")
  .transform(toJson())
  .writeTo("alerts.jsonl");

Record - Maximum Performance

// Best for: High-throughput processing, internal formats
await read("big-data.record")
  .transform(fromRecordToRows())
  .map((row) => [row[0], processValue(row[1]), row[2]])
  .transform(toRecord())
  .writeTo("processed.record");

LazyRow: Optimized Data Access

LazyRow provides a read-only interface optimized for field access without upfront parsing costs:

import { read } from "jsr:@j50n/proc";
import { fromCsvToLazyRows } from "jsr:@j50n/proc/transforms";

// Parse CSV into LazyRow format
const lazyRows = await read("data.csv")
  .transform(fromCsvToLazyRows())
  .collect();

// Efficient field access
for (const row of lazyRows) {
  const name = row.getField(0); // Fast field access
  const age = row.getField(1); // No parsing until needed

  if (parseInt(age) > 18) {
    console.log(`Adult: ${name}`);
  }
}

LazyRow Benefits

  • Zero conversion cost: The backing representation is chosen to match the source, so no up-front conversion is needed
  • Lazy evaluation: Fields are parsed only when accessed (see the sketch after this list)
  • Caching: Repeated access to the same field reuses the cached result
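
For example, filtering a wide CSV on a single column means rows that fail the filter never pay to parse the remaining columns. A sketch assuming the lazy-parsing behavior described above; the filenames and column indexes are illustrative:

import { read } from "jsr:@j50n/proc";
import { fromCsvToLazyRows, toTsv } from "jsr:@j50n/proc/transforms";

// Keep only "active" rows, emitting just two of the many columns.
// Rows rejected by the filter only ever parse column 4, and the
// getField(4) call in map() should reuse the value cached by the filter.
await read("wide-table.csv")
  .transform(fromCsvToLazyRows())
  .filter((row) => row.getField(4) === "active")
  .map((row) => [row.getField(0), row.getField(4)])
  .transform(toTsv())
  .writeTo("active.tsv");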

Real-World Examples

Data Pipeline

// Process sales data: CSV → filter → enrich → JSON
await read("sales.csv")
  .transform(fromCsvToLazyRows())
  .filter((row) => parseFloat(row.getField(3)) > 1000) // Amount > $1000
  .map((row) => ({
    id: row.getField(0),
    customer: row.getField(1),
    amount: parseFloat(row.getField(3)),
    processed: new Date().toISOString(),
  }))
  .transform(toJson())
  .writeTo("high-value-sales.jsonl");

Format Conversion

// Convert legacy CSV to Record format for efficient processing
await read("legacy.csv")
  .transform(fromCsvToRows())
  .transform(toRecord())
  .writeTo("optimized.record");

Log Processing

// Parse structured logs and extract errors
await read("app.log.tsv")
  .transform(fromTsvToRows())
  .filter((row) => row[2] === "ERROR")
  .map((row) => ({
    timestamp: row[0],
    service: row[1],
    level: row[2],
    message: row[3],
  }))
  .transform(toJson())
  .writeTo("errors.jsonl");

Memory Efficiency

All transforms use streaming processing:

// ✅ Processes a 10 GB file with constant ~128 KB memory usage
await read("huge-dataset.csv")
  .transform(fromCsvToRows())
  .filter((row) => row[0].startsWith("2024"))
  .transform(toTsv())
  .writeTo("filtered.tsv");

// ❌ Don't do this - loads everything into memory
const allRows = await read("huge-dataset.csv")
  .transform(fromCsvToRows())
  .collect(); // Memory explosion!

Error Handling

Transforms use strict error handling and throw on malformed input:

try {
  await read("data.csv")
    .transform(fromCsvToRows())
    .transform(toJson())
    .writeTo("output.jsonl");
} catch (error) {
  // In strict TypeScript the catch variable is `unknown`, so narrow it first.
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes("Invalid UTF-8")) {
    console.error("File encoding issue");
  } else if (message.includes("CSV")) {
    console.error("Malformed CSV data");
  }
}

See Also