Feature request: add Unicode separated values (USV) #245
Comments
Great feature request!! Donate to any charity of your choosing; this is open source. :) I'll have more cycles this summer. USV should be an easy codemod away from CSV. :)
cc @sjackman
Sounds interesting! I had trouble finding a description of the difference between unicode characters This Wikipedia article says: https://en.wikipedia.org/wiki/Unicode_control_characters#Control_pictures
Which gives me the impression that the files should contain |
You're right. That's the theory. :)
You're right. That's the blocker for me. USV is a pragmatic immediate solution to the lack of displays, and also to the counterpoint, which is how to copy/paste a non-visible character such as into a search box. |
Miller could support both I suppose. There's a good chance it could already work by setting @johnkerl Could Does the |
Great idea-- yes you're right, Miller is able to do this.
So you're right, the feature request is just the convenient command line option:
Metaphorically yes. The original Unicode creators liked the idea of "a union of all character sets", and the mathematical symbol for union of sets looks like a letter U with a plus inside. So the creators decided to use "U+" as a prefix.
I know this union of sets symbol |
Glad to hear that works out of the box! Perhaps then |
I love it! Thanks guys for figuring all this out -- the next commit will be all the simpler. :D
@joelparkerhenderson You mean that if the content itself contains the delimiters 1E and 1F, or 241E and 241F, they are escaped with a backslash? The downside of using backslash as the escape character is that backslashes themselves then need to be escaped with backslashes. @joelparkerhenderson @johnkerl What do you both think?
Glurk, the main attraction of special characters as delimiters is they only appear as delimiters, not data ...
That's my opinion too. So you're a fan of the hard error if the data contains the delimiter sequence?
I mean the content delimiters are 241E & 241F, and not 1E & 1F. My use case is content fields that I am certain never use 241E & 241F. I'm fine with a hard error. You make a good point about the backslash. I rescind that aspect, having read your point and looked at the existing IANA TSV spec, which disallows tabs in the content. I like the idea of using IANA's approach and disallowing 241E & 241F in the content. https://www.iana.org/assignments/media-types/text/tab-separated-values
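The "disallow the delimiters, hard error otherwise" approach discussed above can be sketched in a few lines of Python. This is only an illustration, not Miller's implementation: the function name `encode_usv` is hypothetical, and ending every record with U+241E (rather than placing it only between records) is an assumption the thread does not settle.

```python
UNIT_SEP = "\u241f"    # U+241F SYMBOL FOR UNIT SEPARATOR, placed between fields
RECORD_SEP = "\u241e"  # U+241E SYMBOL FOR RECORD SEPARATOR, placed after each record


def encode_usv(records):
    """Encode a list of records (lists of strings) as one USV string.

    Raises ValueError (the "hard error") if any field contains a delimiter,
    mirroring how the IANA TSV spec disallows tabs in field content.
    """
    encoded = []
    for row_number, fields in enumerate(records, start=1):
        for field in fields:
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError(
                    f"record {row_number}: field contains a USV delimiter: {field!r}"
                )
        # Assumption: every record, including the last, ends with RECORD_SEP.
        encoded.append(UNIT_SEP.join(fields) + RECORD_SEP)
    return "".join(encoded)
```

With this shape no escape character is needed at all, so backslashes in the data pass through untouched.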
f8cf06d is the feature; on-line help & docs up next.
90e0759 is the rest. This will go out in 5.6.0. Thanks for the GREAT feature request! And, sorry for the very long delay on this easy change. :^/
Thank you @joelparkerhenderson!
Miller is great! I would like to donate to you or your favorite charity to help encourage a new feature: Unicode separated values (USV) which uses Unicode unit separator U+241F and Unicode record separator U+241E.
Unicode separated values (USV) are much like comma separated values (CSV), tab separated values (TSV) a.k.a. tab delimited format (TDF), and ASCII separated values (ASV) a.k.a. DEL (Delimited ASCII) a.k.a. ASCII 30-31.
The advantages of USV for me are that it handles text that happens to contain commas and/or tabs and/or newlines, and that its separators have a visible character representation.
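To make the first advantage concrete, here is a minimal Python sketch of reading USV text. The name `parse_usv` is hypothetical, and the sketch assumes fields never contain the separators and that a trailing U+241E after the last record is optional. Commas, tabs, and newlines inside fields survive untouched because none of them is a delimiter.

```python
UNIT_SEP = "\u241f"    # U+241F SYMBOL FOR UNIT SEPARATOR, placed between fields
RECORD_SEP = "\u241e"  # U+241E SYMBOL FOR RECORD SEPARATOR, placed between records


def parse_usv(text):
    """Split USV text into a list of records, each a list of field strings."""
    # Assumption: one trailing record separator is allowed and ignored.
    if text.endswith(RECORD_SEP):
        text = text[: -len(RECORD_SEP)]
    if text == "":
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


sample = "a,b\u241fc\td\u241eline1\nline2\u241fz\u241e"
print(parse_usv(sample))
```

No quoting rules are involved, which is the main contrast with CSV.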
For example, USV is great for me within typical source code, such as Unix scripts, because the characters show up, are easy to copy/paste, and are easy to use within various kinds of editor search boxes.
USV uses a typical backslash to escape.
When data are solely for machines, then for me the choice of characters doesn't matter. When data are potentially for reading or editing, such as by a programmer, then I prefer typically-visible characters (U+241F & U+241E) over typically-invisible zero-width characters (ASCII 30 & 31).
In addition, Unicode U+241F & U+241E are semantically meaningful, and use an international standard, and are able to work well in any typical Unicode language and any typical Unicode font.
Thank you for your consideration.