Data Transforms
Transform structured data between formats with streaming support and high performance.
⚠️ Experimental API (v0.24.0+): Data transforms are new and under active development. While tests pass and performance is reasonable, expect API changes and edge cases as we improve correctness and streaming performance. Production use should include thorough testing of your specific data patterns.
Choosing Your Approach
proc offers several ways to process data. Here’s how to choose:
| Approach | Best For | Performance |
|---|---|---|
| flatdata CLI | Large files (100MB+), batch processing | Highest |
| Data Transforms | In-process conversion, filtering, enrichment | Good to High |
| Process Pipelines | Shell-like operations, text processing | Varies |
| Async Iterables | Custom logic, API data, any async source | Varies |
Decision guide:
- Converting CSV/TSV/JSON files? → Data Transforms (this chapter)
- Processing 100MB+ files for maximum speed? → flatdata CLI
- Running shell commands and piping output? → Process Pipelines
- Working with API responses or custom data? → Async Iterables (see the sketch below)
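For the last case, the transform steps compose with any async source wrapped by enumerate. The following is a minimal sketch: fetchEvents is a hypothetical generator standing in for paged API calls, and it assumes writeTo accepts a path here just as it does in the file-based examples below.

import { enumerate } from "jsr:@j50n/proc";
import { toJson } from "jsr:@j50n/proc/transforms";

// Hypothetical async source standing in for paged API calls.
async function* fetchEvents() {
  yield { id: 1, severity: "high" };
  yield { id: 2, severity: "low" };
}

// Wrap the async iterable, then reuse the same transform steps.
await enumerate(fetchEvents())
  .filter((event) => event.severity === "high")
  .transform(toJson())
  .writeTo("alerts.jsonl");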
Overview
The data transforms module converts between CSV, TSV, JSON, and Record formats. All transforms stream data without loading everything into memory.
Import
Data transforms are a separate import to keep the core library lightweight:
// Core library
import { enumerate, read, run } from "jsr:@j50n/proc";
// Data transforms (separate import)
import {
fromCsvToRows,
fromJsonToRows,
fromTsvToRows,
toJson,
toRecord,
toTsv,
} from "jsr:@j50n/proc/transforms";
Quick Start
import { read } from "jsr:@j50n/proc";
import { fromCsvToRows, toTsv } from "jsr:@j50n/proc/transforms";
// Convert CSV to TSV
await read("data.csv")
.transform(fromCsvToRows())
.transform(toTsv())
.writeTo("data.tsv");
Key Benefits
🚀 Streaming & Performance
- Streaming design: Constant memory usage regardless of file size
- LazyRow optimization: Faster CSV/TSV parsing when only a subset of fields is accessed
- flatdata CLI: WASM-powered tool for very large files
📊 Format Support
- CSV: Universal compatibility with proper RFC 4180 compliance
- TSV: Fast, simple tab-separated format
- JSON Lines: Full object structure preservation
- Record: High-performance binary-safe format
🔄 Flexible Data Types
- Row arrays: string[][] for simple tabular data
- LazyRow: Optimized read-only access with lazy conversion
- Objects: Full JSON object support with optional validation (see the sketch below)
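The "optional validation" mentioned above is not demonstrated elsewhere in this chapter, so here is a minimal sketch of one way to do it in user code: a type guard applied right after fromJsonToRows(). The Event shape and the isEvent guard are illustrative assumptions, not part of the transforms API.

import { read } from "jsr:@j50n/proc";
import { fromJsonToRows } from "jsr:@j50n/proc/transforms";

// Hypothetical shape used for this example.
interface Event {
  severity: string;
  message: string;
}

// Runtime check so downstream code can rely on the shape.
function isEvent(value: unknown): value is Event {
  return typeof value === "object" && value !== null &&
    typeof (value as Event).severity === "string" &&
    typeof (value as Event).message === "string";
}

const events = await read("events.jsonl")
  .transform(fromJsonToRows())
  .filter(isEvent) // drop records that do not match the expected shape
  .collect(); // fine for small files; keep streaming for large ones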
When to Use Each Format
CSV - Universal Compatibility
// Best for: Compatibility, Excel integration, legacy systems
await read("legacy-data.csv")
.transform(fromCsvToRows())
.transform(toRecord()) // Convert to faster format
.writeTo("optimized.record");
TSV - Speed + Readability
// Best for: Fast processing, human-readable data
await read("logs.tsv")
.transform(fromTsvToRows())
.filter((row) => row[2] === "ERROR")
.transform(toTsv())
.writeTo("errors.tsv");
JSON Lines - Rich Objects
// Best for: Complex nested data, APIs, configuration
await read("events.jsonl")
.transform(fromJsonToRows())
.filter((event) => event.severity === "high")
.transform(toJson())
.writeTo("alerts.jsonl");
Record - Maximum Performance
// Best for: High-throughput processing, internal formats
await read("big-data.record")
.transform(fromRecordToRows())
.map((row) => [row[0], processValue(row[1]), row[2]])
.transform(toRecord())
.writeTo("processed.record");
LazyRow: Optimized Data Access
LazyRow provides a read-only interface optimized for field access without upfront parsing costs:
import { fromCsvToLazyRows } from "jsr:@j50n/proc/transforms";
// Parse CSV into LazyRow format
const lazyRows = await read("data.csv")
.transform(fromCsvToLazyRows())
.collect();
// Efficient field access
for (const row of lazyRows) {
const name = row.getField(0); // Fast field access
const age = row.getField(1); // No parsing until needed
if (parseInt(age, 10) >= 18) {
console.log(`Adult: ${name}`);
}
}
LazyRow Benefits
- Zero upfront conversion: the backing representation is chosen to match the source
- Lazy evaluation: fields are parsed only when they are accessed
- Caching: repeated access to the same field reuses the cached result (illustrated below)
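A rough illustration of the last two points (the file name and field index are arbitrary): the first getField call parses the field, and repeated access returns the cached value.

import { read } from "jsr:@j50n/proc";
import { fromCsvToLazyRows } from "jsr:@j50n/proc/transforms";

const rows = await read("data.csv")
  .transform(fromCsvToLazyRows())
  .collect(); // small file; see Memory Efficiency below before collecting large ones

const first = rows[0];
const name = first.getField(0); // parsed on first access
const again = first.getField(0); // repeated access reuses the cached result
console.log(name === again); // true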
Real-World Examples
Data Pipeline
// Process sales data: CSV → filter → enrich → JSON
await read("sales.csv")
.transform(fromCsvToLazyRows())
.filter((row) => parseFloat(row.getField(3)) > 1000) // Amount > $1000
.map((row) => ({
id: row.getField(0),
customer: row.getField(1),
amount: parseFloat(row.getField(3)),
processed: new Date().toISOString(),
}))
.transform(toJson())
.writeTo("high-value-sales.jsonl");
Format Conversion
// Convert legacy CSV to Record format for efficient processing
await read("legacy.csv")
.transform(fromCsvToRows())
.transform(toRecord())
.writeTo("optimized.record");
Log Processing
// Parse structured logs and extract errors
await read("app.log.tsv")
.transform(fromTsvToRows())
.filter((row) => row[2] === "ERROR")
.map((row) => ({
timestamp: row[0],
service: row[1],
level: row[2],
message: row[3],
}))
.transform(toJson())
.writeTo("errors.jsonl");
Memory Efficiency
All transforms use streaming processing:
// ✅ Processes 10GB file with constant ~128KB memory usage
await read("huge-dataset.csv")
.transform(fromCsvToRows())
.filter((row) => row[0].startsWith("2024"))
.transform(toTsv())
.writeTo("filtered.tsv");
// ❌ Don't do this - loads everything into memory
const allRows = await read("huge-dataset.csv")
.transform(fromCsvToRows())
.collect(); // Memory explosion!
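When you do need an aggregate over a huge file, you can still avoid collect() by consuming the stream incrementally. This sketch assumes the transformed result can be consumed with for await like any other async iterable, and keeps only a running count in memory.

import { read } from "jsr:@j50n/proc";
import { fromCsvToRows } from "jsr:@j50n/proc/transforms";

// Count 2024 rows without materializing the whole file.
let count = 0;
for await (const row of read("huge-dataset.csv").transform(fromCsvToRows())) {
  if (row[0].startsWith("2024")) {
    count++;
  }
}
console.log(`Rows from 2024: ${count}`);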
Error Handling
Transforms use strict error handling:
try {
  await read("data.csv")
    .transform(fromCsvToRows())
    .transform(toJson())
    .writeTo("output.jsonl");
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes("Invalid UTF-8")) {
    console.error("File encoding issue");
  } else if (message.includes("CSV")) {
    console.error("Malformed CSV data");
  }
}
See Also
- CSV Transforms — Detailed CSV parsing and generation
- TSV Transforms — Tab-separated value processing
- JSON Transforms — JSON Lines with validation
- Record Format — High-performance binary format
- LazyRow Optimization — Optimized data access patterns
- Performance Guide — Benchmarks and optimization tips
- flatdata CLI — WASM-powered processing at 330 MB/s