Working with Text Data

Streaming data doesn't have to be line-delimited text, but it probably will be most of the time. Many *nix tools work with this type of data or some variation of it.

Line-delimited text data is simply:

  • UTF-8 encoded bytes
  • logically separated into lines with \n or, alternatively, \r\n (Windows-style) characters

Here is how you process text data in proc.

UTF-8 Lines

This is the "normal" way to work with line-delimited text. It should be a good solution most of the time.

The lines method converts the output to text, one line at a time.

await run("ls", "-la")
  .lines
  .forEach((it) => console.log(it));

Alternatively, you can use transform with the toLines transformer function.

await read(resolve("./warandpeace.txt.gz"))
  .transform(gunzip)
  .transform(toLines)
  .forEach((it) => console.log(it));

The Enumerable.run method automatically treats string values as lines, appending \n to each and converting them back into UTF-8 encoded bytes.

Note that this always assumes the string data passed to it is line-delimited. If that isn't the case (you may be working with buffered text data that is not delimited at all, for example), you must convert the text back to Uint8Array yourself, or \n characters will be added.
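
A minimal sketch of doing that conversion yourself with the standard TextEncoder (nothing proc-specific here), so that no \n is appended:

// Buffered text that is not line-delimited (hypothetical example data).
const buffered = "partial output with no trailing newline";
// Encode it to UTF-8 bytes directly; no \n is added.
const bytes = new TextEncoder().encode(buffered);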

Traditional Text and Lines

Deno provides a native TextDecoderStream to bulk-convert Uint8Array data into text. The chunk boundaries are arbitrary: the characters are always decoded correctly, but a chunk can end in the middle of a word or a control-character sequence. TextDecoderStream supports many standard character encodings.
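
For example, here is a minimal sketch (standard web streams only; ReadableStream.from requires a recent Deno) that decodes ISO-8859-1 data:

// "été" encoded as ISO-8859-1 (Latin-1) bytes.
const latin1 = ReadableStream.from([new Uint8Array([0xe9, 0x74, 0xe9])]);
const text = latin1.pipeThrough(new TextDecoderStream("iso-8859-1"));

for await (const chunk of text) {
  console.log(chunk); // "été"
}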

To parse this data into lines, Deno provides TextLineStream. This splits the data into lines on \n and optionally \r.

These are meant to be used together: convert to text first, then split it into lines.

The traditional streams are a little slower than the UTF-8-specialized transformers, but they support different character encodings and allow some flexibility in defining the split.

await read(resolve("./warandpeace.txt.gz"))
  .transform(gunzip)
  .transform(new TextDecoderStream())
  .transform(new TextLineStream())
  .map((line) => line.toLowerCase())
  .forEach((line) => console.log(line));

Note that most of the library assumes strings and arrays of strings represent line data. For text that is not divided into lines, you can use TextEncoderStream to convert back to UTF-8 bytes. Unlike TextDecoderStream, it does not support multiple encodings; this is in line with the official specification.
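
A minimal sketch of that conversion (standard web streams only; ReadableStream.from requires a recent Deno):

// A stream of strings that is not split on lines.
const text = ReadableStream.from(["some text, ", "not divided into lines"]);
// TextEncoderStream always produces UTF-8 bytes and adds no \n characters.
const bytes = text.pipeThrough(new TextEncoderStream());

for await (const chunk of bytes) {
  console.log(chunk); // Uint8Array chunks
}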

Not All Text Data is Text Data

Many command-line utilities use ANSI color and positioning sequences, as well as raw carriage returns (\r), to enhance the user experience at the console. These codes are normally interpreted by the terminal, but if you dump them to a file, you can see they make a mess. You've probably seen this in log files before.

This type of streamed text data can't strictly be interpreted as lines, but you may be able to hack around the fluff. Use stripColor (from the Deno std library) to remove ANSI escape codes from strings. If the utility uses raw \r, you may have to deal with that as well.
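
Here is a hedged sketch of that approach. It assumes stripColor is imported from the std library's fmt/colors module; newer std releases rename it stripAnsiCode, so adjust the import to match your version:

// Assumption: stripColor lives in the std fmt/colors module.
import { stripColor } from "https://deno.land/std/fmt/colors.ts";

await run("ls", "--color=always", "-la") // force color even when piped (GNU ls)
  .lines
  .map((line) => stripColor(line))
  .forEach((line) => console.log(line));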

The best solution is to turn off color and progress output for the command-line utilities you use for processing. This is not always possible (Debian's apt is a famous example).
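
When the utility supports it, this is as simple as passing a flag. A sketch with GNU ls, which accepts --color=never:

await run("ls", "--color=never", "-la")
  .lines
  .forEach((it) => console.log(it));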

For reference, see the ANSI escape code wiki page.

You can always get around this problem by never attempting to split on lines.

await run("apt", "install", "build-essential")
  .writeTo(Deno.stdout.writable, { noclose: true });