mlr --otsv does not handle broken quotes correctly compared to --ocsv #1533
Comments
Hi @SpikyClip is there an RFC for TSV? If yes, could you share the URL? I think there is only the one for CSV, in fact maybe the way to have a TSV that behaves like a CSV is the solution. I'm not sure I understand your goal, but I'll try. If you run
you get
I get your point that RFC4180 formally applies to CSV files and not necessarily TSV files. But from my experience the two are essentially treated the same except that the delimiter is different. To expand on my use case, a common package used in my field is Besides, it's unusual considering Thanks for the
In my experience RFC4180 is rightly and usually taken into account for CSVs, but not for TSVs.
@SpikyClip there's the rub! I had agreed with your statement here, and this is what Miller once did, but as of #923 this is no longer the case. Formerly, Miller's TSV was RFC4180 CSV with commas replaced by tabs. Now, while TSV does not have an RFC, as of #923 I follow the behavior as described at https://miller.readthedocs.io/en/6.12.0/file-formats/#csvtsvasvusvetc. Namely: embedded newlines and tabs are encoded as
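To make the two conventions discussed here concrete, the following is a minimal Python sketch, not Miller's implementation. It contrasts RFC4180-style quoting with tab as the delimiter against a backslash-escaping style. The escape sequences `\t` and `\n` are an assumption about the escaped style, and the function names `rfc4180_tsv_line` and `escaped_tsv_line` are hypothetical.

```python
import csv
import io


def rfc4180_tsv_line(fields):
    """CSV-style quoting with tab as delimiter (the 'CSV with tabs' view)."""
    buf = io.StringIO()
    writer = csv.writer(
        buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerow(fields)
    return buf.getvalue()


def escaped_tsv_line(fields):
    """Backslash-escape embedded tabs and newlines (assumed sequences).

    Double quotes are left untouched: in this style they are ordinary data.
    """
    def escape(value):
        return value.replace("\t", "\\t").replace("\n", "\\n")

    return "\t".join(escape(f) for f in fields) + "\n"


if __name__ == "__main__":
    row = ['say "hi', "x"]
    print(repr(rfc4180_tsv_line(row)))  # stray quote gets quoted and doubled
    print(repr(escaped_tsv_line(row)))  # stray quote passes through unchanged
```

Under the first convention a stray double quote forces the whole field to be quoted; under the second it is written as-is, which is consistent with the `--otsv` output reported in this issue.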
Thanks John, I came across the issue to add I have no issues with this behaviour, though I think it would be a good idea to mention in the docs that
For my purposes I am simply going to Am happy to close this issue if you are.
Thanks @SpikyClip !! And thanks for reminding me of the name "IANA" for the (pseudo-)spec which I ended up following on #923. Let's leave this issue open as a doc issue. I'll incorporate your suggestions above, and I'll also use the term "IANA" for clarity, etc.
mlr --otsv does not handle broken quotes correctly compared to --ocsv
version: mlr 6.12.0
I have a TSV file that contains unbalanced double quotes ("), which often causes downstream tools that expect RFC-compliant TSVs to drop rows silently (e.g. R's read_tsv). This is a problem because it may not be immediately obvious that rows are getting dropped. Here is a simulated dataset (broken.tsv.txt):
It is expected that the following would make the TSV RFC compliant:
However, it does not, though --ocsv works:
I tried the same with the same error in CSV format:
So this definitely feels like a bug. I was under the impression that --tsv is essentially --csv with tab as the field separator, so it should follow all the standard RFC compliance rules around double quotes. I tried to see if I could trick it by converting to CSV and then to TSV, but it didn't work:
I think this could be fixed by transferring the --csv behaviour around stray double quotes so that it also works for --tsv. --tsvlite could remain unaffected in case this behaviour is not desired.
broken.csv
broken.tsv.txt
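As an illustration of the behaviour requested above, here is a minimal Python sketch that re-quotes a TSV using RFC4180-style rules with tab as the delimiter, and flags lines with an odd number of double quotes. It is a possible workaround outside Miller, not a description of any mlr option. The sample data is made up (the contents of broken.tsv.txt are not reproduced here), and the function names are hypothetical.

```python
import csv
import io


def lines_with_unbalanced_quotes(text):
    """Return 1-based line numbers containing an odd number of double quotes."""
    return [
        n
        for n, line in enumerate(text.splitlines(), start=1)
        if line.count('"') % 2 == 1
    ]


def requote_tsv(text):
    """Read TSV treating quotes as plain data, write it with CSV-style quoting."""
    # QUOTE_NONE: the reader does no special processing of double quotes.
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    out = io.StringIO(newline="")
    # QUOTE_MINIMAL: fields containing a quote or tab are quoted, quotes doubled.
    writer = csv.writer(
        out, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerows(reader)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical sample with one stray double quote on line 2.
    sample = 'id\tnote\n1\t5" pipe\n2\tok\n'
    print(lines_with_unbalanced_quotes(sample))
    print(requote_tsv(sample), end="")
```

Reading with quote processing disabled keeps every input line as exactly one record, so no rows are merged or dropped before the output is re-quoted.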