Feature request: add Unicode separated values (USV) #245

joelparkerhenderson · 2019-05-20T07:51:47Z

Miller is great! I would like to donate to you or your favorite charity to help encourage a new feature: Unicode separated values (USV) which uses Unicode unit separator U+241F and Unicode record separator U+241E.

Unicode separated values (USV) are much like comma separated values (CSV), tab separated values (TSV) a.k.a. tab delimited format (TDF), and ASCII separated values (ASV) a.k.a. DEL (Delimited ASCII) a.k.a. ASCII 30-31.

For me, the advantages of USV are that it handles text that happens to contain commas, tabs, or newlines, and that its separators have a visible character representation.

For example, USV works well for me within typical source code, such as Unix scripts, because the characters show up, are easy to copy/paste, and are easy to use within various kinds of editor search boxes.

USV uses a typical backslash to escape.

When data are solely for machines, then for me the choice of characters doesn't matter. When data are potentially for reading or editing, such as by a programmer, then I prefer typically-visible characters (U+241F & U+241E) over typically-invisible zero-width characters (ASCII 30 & 31).

In addition, Unicode U+241F & U+241E are semantically meaningful, and use an international standard, and are able to work well in any typical Unicode language and any typical Unicode font.

Thank you for your consideration.

johnkerl · 2019-05-21T00:27:19Z

Great feature request!! Donate to any charity of your choosing; this is open source. :) I'll have more cycles this summer. USV should be an easy codemod away from CSV. :)

johnkerl · 2019-05-21T01:13:48Z

cc @sjackman

sjackman · 2019-05-21T18:23:16Z

Sounds interesting! I had trouble finding a description of the difference between unicode characters
Record Separator U+001E
Unit Separator U+001F
and
Unicode Character 'SYMBOL FOR RECORD SEPARATOR' (U+241E)
Unicode Character 'SYMBOL FOR UNIT SEPARATOR' (U+241F)
https://unicode.org/charts/PDF/U2400.pdf

This Wikipedia article says: https://en.wikipedia.org/wiki/Unicode_control_characters#Control_pictures

> Unicode provides graphic characters for representing C0 control codes and other control characters in the Control Pictures block. They are visual representations, not the actual control codes themselves.

Which gives me the impression that the files should contain U+001E and U+001F and your text editor should represent them using the symbols from U+241E and U+241F. I tested MacVim and Visual Studio Code, and neither editor displays U+001E and U+001F using the symbols unfortunately.
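One way to confirm that the control codes and the control pictures are distinct code points, rather than two renderings of the same character, is to dump their bytes. This is a minimal sketch: `hex_bytes` is a hypothetical helper name (not part of Miller), and the octal escapes assume UTF-8 encoding.

```shell
# hex_bytes is a hypothetical helper, not part of Miller.
# It prints the bytes of its argument as space-separated hex.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -s ' \n' '  ' | sed 's/^ //; s/ $//'
}

ctrl_rs=$(printf '\036')          # U+001E RECORD SEPARATOR (control code)
ctrl_us=$(printf '\037')          # U+001F UNIT SEPARATOR (control code)
pic_rs=$(printf '\342\220\236')   # U+241E SYMBOL FOR RECORD SEPARATOR (UTF-8)
pic_us=$(printf '\342\220\237')   # U+241F SYMBOL FOR UNIT SEPARATOR (UTF-8)

for sep in "$ctrl_rs" "$ctrl_us" "$pic_rs" "$pic_us"; do
  printf '%s\n' "$(hex_bytes "$sep")"
done
# 1e
# 1f
# e2 90 9e
# e2 90 9f
```

The control codes are one byte each, while the visible symbols are three bytes each in UTF-8, so a file written with one pair will not parse with the other.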

joelparkerhenderson · 2019-05-21T21:30:12Z

> Which gives me the impression that the files should contain U+001E and U+001F and your text editor should represent them using the symbols from U+241E and U+241F.

You're right. That's the theory. :)

> I tested MacVim and Visual Studio Code, and neither editor displays

You're right. That's the blocker for me. USV is a pragmatic, immediate solution to the lack of editor display support, and also to the related problem of how to copy/paste a non-visible character, such as into a search box.

sjackman · 2019-05-21T22:18:10Z

Miller could support both I suppose. There's a good chance it could already work by setting --rs and --fs to the appropriate values. The change to Miller would be only adding convenient command line options. @joelparkerhenderson Have you tested setting --rs and --fs to u001x?

@johnkerl Could --rs and --fs accept unicode codes to make that a bit easier? For example: mlr --rs=u241e --fs=u241f

Does the + character in U+241E mean anything in particular? I'm not the Unicode guru.

joelparkerhenderson · 2019-05-21T23:11:30Z

Great idea-- yes you're right, Miller is able to do this.

$ printf "a␟b␟c␞d␟e␟f␞g␟h␟i" > example.usv
$ mlr --tsv --fs '␟' --rs '␞' cut -f a example.usv
a␞d␞g␞

So you're right, the feature request is just the convenient command line option:

$ mlr --usv cut -f a example.usv
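Without Miller, the same first-column extraction can be approximated with POSIX sed and awk. This is only a sketch: `usv_first_field` is a hypothetical name, and it assumes UTF-8 input whose fields contain no newlines, since it rewrites each ␞ as a newline before splitting on ␟.

```shell
# usv_first_field is a hypothetical helper, not a Miller command.
# Assumes UTF-8 input and no newlines inside fields.
usv_first_field() {
  # Turn each record separator into a newline, then split on the unit separator.
  sed 's/␞/\
/g' | awk -F '␟' '{ print $1 }'
}

printf 'a␟b␟c␞d␟e␟f␞g␟h␟i' | usv_first_field
```

Unlike the Miller command above, this treats the first record as data rather than as a header, and it emits newline-terminated lines instead of ␞-terminated records.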

> Does the + character in U+241E mean anything in particular?

Metaphorically yes. The original Unicode creators liked the idea of "a union of all character sets", and the mathematical symbol for union of sets looks like a letter U with a plus inside. So the creators decided to use "U+" as a prefix.

sjackman · 2019-05-21T23:24:50Z

I know this union of sets symbol ∪ https://www.fileformat.info/info/unicode/char/222a/index.htm but not one with a plus sign inside.

sjackman · 2019-05-21T23:25:42Z

Glad to hear that works out of the box! Perhaps then --asv for ASCII separated values that uses 1E and 1F and --usv for unicode separated values that uses 241E and 241F?
@johnkerl What do you think?

johnkerl · 2019-05-21T23:45:20Z

I love it! Thanks guys for figuring all this out -- the next commit will be all the simpler. :D

sjackman · 2019-05-21T23:51:34Z

> USV uses a typical backslash to escape.

@joelparkerhenderson You mean if the content itself contains the delimiters 1E and 1F or 241E and 241F they are escaped with backslash? The downside of using backslash as the escape character is that backslashes need to be escaped with backslashes \\. The data that I work with would never contain 1E and 1F or 241E and 241F but it may contain a backslash. Could a Unicode character be used as the escape character (perhaps U+001B Escape) rather than the common backslash? Or even better for me, just say that it's a hard error if the data contains the delimiter sequences.

@joelparkerhenderson @johnkerl What do you both think?

johnkerl · 2019-05-21T23:53:03Z

Glurk, the main attraction of special characters as delimiters is they only appear as delimiters, not data ...

sjackman · 2019-05-22T00:00:47Z

That's my opinion too. So you're a fan of the hard error if the data contains the delimiter sequence?

joelparkerhenderson · 2019-05-22T01:10:39Z

I mean the content delimiters are 241E & 241F, and not 1E & 1F.

My use case is content fields that I am certain never use 241E & 241F. I'm fine with a hard error.

You make a good point about the backslash. I rescind that aspect because I am reading about your point, and looking at the existing IANA TSV spec, which disallows tabs in the content. I like the idea of using IANA's approach, and disallowing 241E & 241F in the content.

https://www.iana.org/assignments/media-types/text/tab-separated-values
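The IANA-style rule, where the delimiters are simply forbidden in content, can be enforced on the writing side. Below is a minimal sketch in POSIX shell, assuming UTF-8; `usv_record` is a hypothetical helper and says nothing about how Miller itself handles this case.

```shell
# usv_record is a hypothetical helper, not part of Miller.
# It writes its arguments as one USV record, or fails (hard error)
# if any argument already contains U+241F or U+241E.
usv_record() {
  # First pass: reject bad fields before writing anything.
  for field in "$@"; do
    case $field in
      *␟*|*␞*)
        echo "error: field contains a USV delimiter: $field" >&2
        return 1 ;;
    esac
  done
  # Second pass: join fields with the unit separator.
  sep=''
  for field in "$@"; do
    printf '%s%s' "$sep" "$field"
    sep='␟'
  done
  # Terminate the record.
  printf '␞'
}

usv_record 'Smith, J.' 'back\slash' 'ok'      # commas and backslashes pass through untouched
usv_record 'bad␟value' 'ok' || echo 'rejected'
```

Because nothing is escaped, a backslash in the data stays a plain backslash, which is the property asked for earlier in the thread.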

johnkerl · 2019-09-02T21:04:49Z

f8cf06d is the feature; on-line help & docs up next.

johnkerl · 2019-09-03T02:42:27Z

90e0759 is the rest. This will go out in 5.6.0.

Thanks for the GREAT feature request! And, sorry for the very long delay on this easy change. :^/

johnkerl · 2019-09-13T01:29:35Z

Thank you @joelparkerhenderson!

johnkerl changed the title ~~Feature request: $50 donation to help add Unicode separated values (USV)~~ Feature request: add Unicode separated values (USV) May 21, 2019

johnkerl added the on deck label Jun 2, 2019

johnkerl added active and removed on deck labels Sep 2, 2019

johnkerl closed this as completed Sep 3, 2019

johnkerl removed the active label Sep 13, 2019

frosencrantz mentioned this issue May 21, 2022

[usv] Unicode Separators swapped saulpw/visidata#1383

Closed
