Feature request: add Unicode separated values (USV) #245

Closed
joelparkerhenderson opened this issue May 20, 2019 · 16 comments

Comments

@joelparkerhenderson

Miller is great! I would like to donate to you or your favorite charity to help encourage a new feature: Unicode separated values (USV) which uses Unicode unit separator U+241F and Unicode record separator U+241E.

Unicode separated values (USV) are much like comma separated values (CSV), tab separated values (TSV) a.k.a. tab delimited format (TDF), and ASCII separated values (ASV) a.k.a. DEL (Delimited ASCII) a.k.a. ASCII 30-31.

For me, the advantages of USV are that it handles text that happens to contain commas, tabs, or newlines, and that its separators have a visible character representation.

For example, USV works well for me within typical source code, such as Unix scripts, because the characters show up, are easy to copy/paste, and are easy to use in various kinds of editor search boxes.

USV uses a typical backslash to escape.

When data are solely for machines, then for me the choice of characters doesn't matter. When data are potentially for reading or editing, such as by a programmer, then I prefer typically-visible characters (U+241E & U+241F) over typically-invisible zero-width characters (ASCII 30 & 31).

In addition, Unicode U+241F & U+241E are semantically meaningful, and use an international standard, and are able to work well in any typical Unicode language and any typical Unicode font.

Thank you for your consideration.

@johnkerl
Owner

Great feature request!! Donate to any charity of your choosing; this is open source. :) I'll have more cycles this summer. USV should be an easy codemod away from CSV. :)

@johnkerl
Owner

cc @sjackman

@sjackman
Contributor

Sounds interesting! I had trouble finding a description of the difference between unicode characters
Record Separator U+001E
Unit Separator U+001F
and
Unicode Character 'SYMBOL FOR RECORD SEPARATOR' (U+241E)
Unicode Character 'SYMBOL FOR UNIT SEPARATOR' (U+241F)
https://unicode.org/charts/PDF/U2400.pdf

This Wikipedia article says: https://en.wikipedia.org/wiki/Unicode_control_characters#Control_pictures

Unicode provides graphic characters for representing C0 control codes and other control characters in the Control Pictures block. They are visual representations, not the actual control codes themselves.

Which gives me the impression that the files should contain U+001E and U+001F and your text editor should represent them using the symbols from U+241E and U+241F. I tested MacVim and Visual Studio Code, and neither editor displays U+001E and U+001F using the symbols unfortunately.
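
To make the distinction concrete, here is a minimal Python sketch (standard library only) that compares the control code U+001F with its Control Pictures counterpart U+241F. The helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Summarize one character: code point, general category, UTF-8 bytes."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "category": unicodedata.category(ch),  # 'Cc' = control, 'So' = other symbol
        "utf8": ch.encode("utf-8"),
    }

control_us = describe("\u001f")  # UNIT SEPARATOR, an actual C0 control code
picture_us = describe("\u241f")  # SYMBOL FOR UNIT SEPARATOR, a printable glyph

print(control_us)
print(picture_us)
```

The two are distinct characters: the control code is a single byte in UTF-8, while the symbol takes three bytes, so a parser that splits on one will not split on the other.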

@joelparkerhenderson
Author

Which gives me the impression that the files should contain U+001E and U+001F and your text editor should represent them using the symbols from U+241E and U+241F.

You're right. That's the theory. :)

I tested MacVim and Visual Studio Code, and neither editor displays

You're right. That's the blocker for me. USV is a pragmatic, immediate solution to the lack of editor display support, and also to the related problem of how to copy/paste a non-visible character, such as into a search box.

@sjackman
Contributor

Miller could support both I suppose. There's a good chance it could already work by setting --rs and --fs to the appropriate values. The change to Miller would be only adding convenient command line options. @joelparkerhenderson Have you tested setting --rs and --fs to u001x?

@johnkerl Could --rs and --fs accept unicode codes to make that a bit easier? For example: mlr --rs=u241e --fs=u241f

Does the + character in U+241E mean anything in particular? I'm not the Unicode guru.

@joelparkerhenderson
Author

joelparkerhenderson commented May 21, 2019

Great idea-- yes you're right, Miller is able to do this.

$ printf "a␟b␟c␞d␟e␟f␞g␟h␟i" > example.usv
$ mlr --tsv --fs '␟' --rs '␞' cut -f a example.usv
a␞d␞g␞

So you're right, the feature request is just the convenient command line option:

$ mlr --usv cut -f a example.usv
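
Independent of Miller, the splitting that this relies on can be sketched in a few lines of Python. This is only an illustration under the assumptions in this thread (first record is a header, no escaping), not how Miller implements it; `parse_usv` is a hypothetical name.

```python
US = "\u241f"  # SYMBOL FOR UNIT SEPARATOR, used here as the field separator
RS = "\u241e"  # SYMBOL FOR RECORD SEPARATOR, used here as the record separator

def parse_usv(text: str) -> list[dict[str, str]]:
    """Parse header-first USV text into a list of records keyed by field name."""
    # Drop empty pieces so a trailing record separator does not yield a blank row.
    rows = [line.split(US) for line in text.split(RS) if line]
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

records = parse_usv("a␟b␟c␞d␟e␟f␞g␟h␟i")
column_a = [rec["a"] for rec in records]
print(column_a)
```

With the sample data above, selecting field `a` picks out the first value of each data record, which matches what `cut -f a` is asked to do.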

Does the + character in U+241E mean anything in particular?

Metaphorically yes. The original Unicode creators liked the idea of "a union of all character sets", and the mathematical symbol for union of sets looks like a letter U with a plus inside. So the creators decided to use "U+" as a prefix.

@sjackman
Contributor

I know this union of sets symbol https://www.fileformat.info/info/unicode/char/222a/index.htm but not one with a plus sign inside.

@sjackman
Contributor

Glad to hear that works out of the box! Perhaps then --asv for ASCII separated values that uses 1E and 1F and --usv for unicode separated values that uses 241E and 241F?
@johnkerl What do you think?
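
Since the two proposed formats differ only in which pair of characters acts as the separators, converting between them is a character-for-character translation. A minimal Python sketch, assuming the data itself contains none of the four separator characters (the function names are hypothetical):

```python
# Map the C0 control separators to their visible Control Pictures counterparts,
# and back again.
ASV_TO_USV = str.maketrans({"\u001f": "\u241f", "\u001e": "\u241e"})
USV_TO_ASV = str.maketrans({"\u241f": "\u001f", "\u241e": "\u001e"})

def asv_to_usv(text: str) -> str:
    """Replace U+001F/U+001E with U+241F/U+241E."""
    return text.translate(ASV_TO_USV)

def usv_to_asv(text: str) -> str:
    """Replace U+241F/U+241E with U+001F/U+001E."""
    return text.translate(USV_TO_ASV)

asv_text = "a\u001fb\u001ec\u001fd"
usv_text = asv_to_usv(asv_text)
print(usv_text)
```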

@johnkerl
Owner

I love it! Thanks guys for figuring all this out -- the next commit will be all the simpler. :D

@johnkerl changed the title from "Feature request: $50 donation to help add Unicode separated values (USV)" to "Feature request: add Unicode separated values (USV)" on May 21, 2019

@sjackman
Contributor

USV uses a typical backslash to escape.

@joelparkerhenderson You mean if the content itself contains the delimiters 1E and 1F or 241E and 241F they are escaped with backslash? The downside of using backslash as the escape character is that backslashes need to be escaped with backslashes \\. The data that I work with would never contain 1E and 1F or 241E and 241F but it may contain a backslash. Could a Unicode character be used as the escape character (perhaps U+001B Escape) rather than the common backslash? Or even better for me, just say that it's a hard error if the data contains the delimiter sequences.

@joelparkerhenderson @johnkerl What do you both think?
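
The cost of backslash escaping can be seen in a small Python sketch. This is a hypothetical escaping scheme written only to illustrate the point, not something specified anywhere in this thread; `escape_field` is a made-up name.

```python
US = "\u241f"  # SYMBOL FOR UNIT SEPARATOR
RS = "\u241e"  # SYMBOL FOR RECORD SEPARATOR
ESC = "\\"     # a single backslash

def escape_field(field: str) -> str:
    """Backslash-escape the escape character and both separators in one field."""
    # The escape character must be doubled first, otherwise the backslashes
    # added for the separators would themselves get doubled.
    out = field.replace(ESC, ESC + ESC)
    return out.replace(US, ESC + US).replace(RS, ESC + RS)

windows_path = escape_field("C:\\temp\\new")
print(windows_path)
```

Data that never contains the separators but does contain backslashes, such as the path above, still has to be rewritten on every write and read.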

@johnkerl
Owner

Glurk, the main attraction of special characters as delimiters is they only appear as delimiters, not data ...

@sjackman
Contributor

That's my opinion too. So you're a fan of the hard error if the data contains the delimiter sequence?

@joelparkerhenderson
Author

I mean the content delimiters are 241E & 241F, and not 1E & 1F.

My use case is content fields that I am certain never use 241E & 241F. I'm fine with a hard error.

You make a good point about the backslash. I rescind that aspect because I am reading about your point, and looking at the existing IANA TSV spec, which disallows tabs in the content. I like the idea of using IANA's approach, and disallowing 241E & 241F in the content.

https://www.iana.org/assignments/media-types/text/tab-separated-values
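
The "disallow the separators in content" rule could look like the following on the writing side. This is a minimal Python sketch of the idea, not Miller's behavior; `format_usv` is a hypothetical name.

```python
US = "\u241f"  # SYMBOL FOR UNIT SEPARATOR
RS = "\u241e"  # SYMBOL FOR RECORD SEPARATOR

def format_usv(rows: list[list[str]]) -> str:
    """Join rows into USV text, raising ValueError if a field contains a separator."""
    out = []
    for r, row in enumerate(rows):
        for c, field in enumerate(row):
            if US in field or RS in field:
                # Hard error: no escaping, the data is simply rejected.
                raise ValueError(f"row {r}, field {c}: separator character in data")
        out.append(US.join(row))
    return RS.join(out)

text = format_usv([["a", "b"], ["1", "2"]])
print(text)
```

Rejecting such data keeps the format free of escape rules, so backslashes in the content pass through untouched.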

@johnkerl
Owner

johnkerl commented Sep 2, 2019

f8cf06d is the feature; on-line help & docs up next.

@johnkerl added the "active" label and removed the "on deck" label on Sep 2, 2019

@johnkerl
Owner

johnkerl commented Sep 3, 2019

90e0759 is the rest. This will go out in 5.6.0.

Thanks for the GREAT feature request! And, sorry for the very long delay on this easy change. :^/

@johnkerl closed this as completed on Sep 3, 2019
@johnkerl removed the "active" label on Sep 13, 2019

@johnkerl
Owner

Thank you @joelparkerhenderson!
