-
-
Notifications
You must be signed in to change notification settings - Fork 162
TSV2 Proposal
Now named QTT: Quoted, Typed Tables
QTT is an interchange format for tables that's a graceful upgrade of TSV. A TSV file is a valid QTT file.
It uses QSN (Quoted String Notation) as a building block.
TSV2 is a textual interchange format for tables, like JSON is for nested data. It's most easily described as a TSV file with some nice properties:
- The header row is required. It may or may not have type annotations.
- Every row is exactly one line (unlike CSV and TSV).
- It can represent any byte or character in a field. This is done with single quoted strings like
'foo\n'
and'nul byte: \x00 !'
and'\u1234'
. Most fields should not be quoted.
In canonical form, a TSV2 file is a valid UTF-8 text file. There may be reasons to use non-canonical forms (e.g. for compatibility with old TSV readers).
NOTE: → are tab characters (written in HTML as →
).
name → age:int → checksum:bytes → ratio:float → child:boolean 'Homer Simpson' → 10 → '\xab\xcd' → 0.12 → false Marge Simpson → 20 → '\xef\xff' → 0.34 → false Bart Simpson → 30 → '\xdd\xee' → 0.56789 → true 'Lisa Simpson' → 40 → '\xdd\xee' → 0.56789 → true
The default column type is string
.
This example is inconsistent to illustrate the principle that single quotes are optional in some cases but required in others. They're recommended when a field contains spaces, but required when a field contains newlines or tabs.
- Oil's xargs "dialect" (each Keyword in Oil)
- It provides xargs with
rows
, rather than faking it with-n 4
for a row of 4 items.- xargs can similarly use the names. We could provide some alternative to the
-I {}
string.
- xargs can similarly use the names. We could provide some alternative to the
- It provides xargs with
- awk "dialect" (See http://www.oilshell.org/blog/tags.html?tag=awk#awk)
- It provides awk with named columns instead of
$1 $2 $3
. - It provides awk with types.
- It provides awk with named columns instead of
- Scripts that scrape coreutils. The output of
ps
andls
can be massaged into Tables. - Oil's augmentation of the
find
command (TBD) - Oil's history mechanism (TODO: issue #320)
- ... any other tool that can write text to stdout :)
A line is not a record in CSV, e.g.
"record
with
spaces",10
is a single record. This doesn't work well with line-oriented shell tools.
Also, CSV in practice tends to be ill-specified. (e.g. hardware outputs really broken CSV files.)
- You can't have tabs in TSV fields.
- If you have newlines, you no longer have the property that a record is a line (i.e. the same problem that CSV has).
- TSV has no printable representation for
NUL
bytes like'\x00'
.
Because JSON uses double quotes, and JSON represents unicode strings. TSV2 represents byte strings, which can be utf-8 encoded.
Strictly speaking, no. Hence the "major version number bump". However:
- A TSV2 reader can always read a TSV file correctly. Spaces are allowed in raw TSV2 fields, even though a more canonical representation is:
-
'spaces and single quotes'<TAB>'another field'
, rather than -
no single quotes<TAB>another field
.
-
- If a TSV reader consumes a TSV2 file, then it will "work", except some fields may have leading
'
characters. This only happens for "unprintable" characters and tabs, so in the common case there are NO corrections necessary.
Based on JSON primitive types, but less tied to JavaScript:
-
string
(default) -
bytes
(can represent arbitrary bytes like invalid UTF-8) - "number" is split into
int
andfloat
, because most languages have that distinction boolean
Nothing as far as the TSV2 format is concerned. string
is an indication to the language's library to try to decode utf-8 and escape sequences into a string data structure, if applicable. (e.g. Python, Java, and JavaScript provide random code point access; Go and Rust use UTF-8 internally.)
See the "regular expressions" at json.org. We use true
, false
, and null
.
- The file must have valid header.
- Column names should match
[a-zA-Z_][a-zA-Z0-9_]+
. (Should we allow spaces? Language can convert to _ ?)
- Column names should match
- The number of columns in the TSV2 stream is determined the header. If any row has an invalid number of cells, a fatal error must be raised.
- Invalid string syntax, e.g.
- a field starts with
'
but doesn't end with one. - invalid escape sequences
- a field starts with
- Invalid syntax in any other column:
- int
- float
- boolean
- The TSV2 file is valid UTF-8.
-
NUL
bytes are written'\x00'
(only allow\xAB
inbytes
columns?)
-
-
'fields with spaces have single quotes'
-- this makes TSV readers less likely to work on TSV2 files, but that's OK. You have the option to emit non-canonical TSV2. - empty strings should be
''
!
- Python: if you do repr() and the string does not contain a single quote, Python will use single quotes. And then it will be a valid TSV2 string.
- Oil: TODO
- JavaScript: if it doesn't contain
\xFF
codes, it should work? Well there's also the "bell" character and such.
-
uxy, very similar, and scrapes existing tools. Discussed on lobste.rs.
align
andtrim
are useful.
- String and Bytes columns can't contain
null
-
int
andboolean
columns can't either
-
- What about
float
columns?null
could be useful forNA
. It can be represented by IEEE floats.