Delimiter Discovery #400
Comments
Hey @cawoodm,
While I am not against the idea, I can't say that I fully support it. However, if you come up with a clean
@wdavidw I looked on GitHub to see what other people were doing. The node-csv-string detect function basically looks for the first occurrence of one of the delimiters, which I guess is fine for most cases. A more advanced implementation is the detect-csv determineMost function, which looks at a sample and returns the delimiter with the highest occurrence count. What do you think?
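For illustration, the two strategies described above could look roughly like this. It is a minimal sketch, not code taken from node-csv-string or detect-csv: the function names `detectFirst` and `detectMostFrequent` and the candidate list are assumptions made for the example.

```javascript
// Candidate delimiters: an assumption of this sketch, not a list from either library.
const CANDIDATES = [",", ";", "\t", "|"];

// Strategy 1: return the candidate that appears earliest in the text.
function detectFirst(text, candidates = CANDIDATES) {
  let best = null;
  let bestIndex = Infinity;
  for (const candidate of candidates) {
    const index = text.indexOf(candidate);
    if (index !== -1 && index < bestIndex) {
      best = candidate;
      bestIndex = index;
    }
  }
  return best; // null when no candidate occurs at all
}

// Strategy 2: return the candidate that occurs most often in a sample.
function detectMostFrequent(sample, candidates = CANDIDATES) {
  let best = null;
  let bestCount = 0;
  for (const candidate of candidates) {
    const count = sample.split(candidate).length - 1;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best; // null when no candidate occurs at all
}
```

The two can disagree: for a line such as `Hello, world;1;2;3`, the first-occurrence strategy picks the comma inside the text value, while the count-based one picks the semicolon.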
I would tend to discover the character, as in the second method, after filtering out any character already used in the options (e.g. quotes, row delimiters, ...) and general alphanumeric characters ([a-zA-Z0-9], including accented characters).
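A rough sketch of that filtering idea, under stated assumptions: `discoverDelimiter` and its `reserved` parameter are hypothetical names, the default reserved characters stand in for whatever quote and row delimiter options are configured, and the Unicode property classes `\p{L}`, `\p{M}` and `\p{N}` are used here to cover accented letters and digits.

```javascript
// Hypothetical helper: count every character of the sample that is neither
// reserved by the parser options nor alphanumeric, and return the most frequent.
function discoverDelimiter(sample, reserved = ['"', "\n", "\r"]) {
  const alphanumeric = /[\p{L}\p{M}\p{N}]/u; // letters (accented included), marks, digits
  const counts = new Map();
  for (const char of sample) {
    if (reserved.includes(char)) continue; // already used by another option
    if (alphanumeric.test(char)) continue; // ordinary field content
    counts.set(char, (counts.get(char) ?? 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  for (const [char, count] of counts) {
    if (count > bestCount) {
      best = char;
      bestCount = count;
    }
  }
  return best; // null when nothing but reserved or alphanumeric characters remain
}
```

Note that spaces and punctuation inside field values are counted too, so a caller may want to add characters such as the space to `reserved` when the data contains free text.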
@wdavidw master...hadyrashwan:node-csv:patch-1 When running tests, the below happens, not sure why:
Question:
Missing parts:
Appreciate your feedback :)
I'll take some time to review later. In the meantime, what do you mean by "We are committing the dist files"?
A few notes for now:
When I'm working, I always see the build files in the dist folders added to git and not ignored. Some projects list those build files in the .gitignore file. I just want to make sure that I'm not adding those files by mistake.
A couple of comments on the method of detecting delimiters:
Python has a great example of handling these in its implementation. The
Are there any plans to open a PR for that? As far as I can see the current changes are only present on a branch. I would definitely love to see that feature.
Here's another algorithm for detecting the delimiter that seems like a good idea:
I haven't had the time yet. The solution needs to deal with the streaming nature of the parser. It would extract a limited number of bytes from the stream, apply a detection algorithm such as the one proposed above, then replay the bytes stored on the side with the detected delimiter. I need some time to do it correctly.
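The buffer-then-replay approach described above can be sketched with Node.js streams as follows. This is only an illustration of the idea, not the parser's implementation: `sniffDelimiter`, its `detect` callback and the `sampleSize` default are hypothetical, and the detection function itself is supplied by the caller.

```javascript
const { Readable } = require("stream");

// Hypothetical helper: read up to `sampleSize` bytes from `source`, run the
// caller-supplied `detect(sampleText)` on them, then return a new stream that
// replays the buffered bytes followed by the rest of the source.
async function sniffDelimiter(source, detect, sampleSize = 1024) {
  const iterator = source[Symbol.asyncIterator]();
  const buffered = [];
  let size = 0;
  let ended = false;

  // Pull whole chunks until enough bytes are stored or the source ends.
  while (size < sampleSize) {
    const { value, done } = await iterator.next();
    if (done) {
      ended = true;
      break;
    }
    const chunk = Buffer.from(value);
    buffered.push(chunk);
    size += chunk.length;
  }

  // Only the first `sampleSize` bytes are shown to the detection function.
  const sample = Buffer.concat(buffered).subarray(0, sampleSize).toString("utf8");
  const delimiter = detect(sample);

  // Replay the stored chunks first, then continue with the same iterator.
  async function* replay() {
    yield* buffered;
    while (!ended) {
      const { value, done } = await iterator.next();
      if (done) return;
      yield value;
    }
  }

  return { delimiter, stream: Readable.from(replay(), { objectMode: false }) };
}
```

The returned `stream` carries exactly the same bytes as the original source, so it can be piped into the parser once the `delimiter` option is known. A sample that ends in the middle of a multi-byte character is not handled specially here.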
Also highly interested in that. My current workaround is to clone the original stream, read the first chunk on a duplicate, detect the delimiter, then get back and handle the first stream. So basically:
const [mainStream, delimiterStream] = stream.tee();
const reader = delimiterStream.getReader();
const { value } = await reader.read();
const delimiter = discoverDelimiter(value);
reader.cancel();
return { delimiter, stream: mainStream };
Looks somewhat messy, but it works. I have to handle different types of files dynamically, avoiding static configuration as much as possible.
Summary
The library should detect common delimiters "," or ";".
Motivation
Many international users will be working with commas or semi-colons on a daily basis depending on who generated their CSV. Currently we need to manually parse the first line and see which delimiter is used.
Alternative
Draft
Additional context
Since it may be difficult to detect "just any" delimiter the developer can pass an array of common delimiters which they expect.