Delimiter Discovery #400

cawoodm · 2023-08-31T09:07:40Z

Summary

The library should detect common delimiters such as , or ;.

Motivation

Many international users will be working with commas or semi-colons on a daily basis depending on who generated their CSV. Currently we need to manually parse the first line and see which delimiter is used.

Alternative

let delimiter = csvString.match(/[,;]/)?.[0];
let data = csv.parse(csvString, {delimiter});

Draft

let data = csv.parse(csvString, {
  detect_delimiters: [',', ';'],
});

Additional context

Since it may be difficult to detect "just any" delimiter, the developer can pass an array of the common delimiters they expect.


hadyrashwan · 2024-01-12T07:18:45Z

Hey @cawoodm,
Is anyone working on this one? I can take it on.

wdavidw · 2024-01-12T08:04:43Z

While I am not against the idea, I can't say that I fully support the idea. However, if you come up with a clean delimiter_auto option, I'll probably merge it.

hadyrashwan · 2024-01-15T08:51:39Z

@wdavidw I looked on GitHub to see what other people were doing.

The node-csv-string detect function basically looks for the first occurrence of one of the delimiters, which I guess is fine for most cases.

Another, more advanced implementation is the detect-csv determineMost function, which looks at a sample and returns the delimiter with the highest occurrence count.
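As a rough illustration of the second approach, here is a minimal sketch of count-based detection. The function name `detectByCount` and the default candidate list are assumptions made for this example; this is not the code of detect-csv or of csv-parse.

```javascript
// Hypothetical helper, not part of any library mentioned in this thread.
// Returns the candidate that occurs most often in the sample.
function detectByCount(sample, candidates = [",", ";", "\t", "|"]) {
  let best;
  let bestCount = 0;
  for (const candidate of candidates) {
    // split() yields one more piece than there are occurrences.
    const count = sample.split(candidate).length - 1;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best; // undefined when no candidate appears in the sample
}

const priceList = "name;price\nbeer;2,50\nwine;7,90\n";
console.log(detectByCount(priceList)); // prints ;
```

The sample above contains three semicolons and two commas, so the semicolon wins. Note that a plain count can be fooled by data that contains many decimal commas.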

What do you think?

wdavidw · 2024-01-15T09:29:51Z

I would tend to discover the character, as in the second method, after filtering out any character already used in the options (e.g. quotes, row delimiters, ...) and general ASCII characters ([a-zA-Z0-9]), including accented characters.
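A minimal sketch of that filtering idea, working on raw bytes rather than on a string. The function name `discoverDelimiterByte`, the `reserved` parameter, and the assumption that the input is UTF-8 (so that accented characters show up as bytes at or above 0x80) are all illustrative choices, not the csv-parse implementation.

```javascript
// Hypothetical sketch; the name and parameters are assumptions.
// `reserved` stands in for characters already used by other options.
function discoverDelimiterByte(bytes, reserved = ['"', "\n", "\r"]) {
  const excluded = new Set(reserved.map((ch) => ch.charCodeAt(0)));
  const counts = new Map();
  for (const byte of bytes) {
    const isAlphaNum =
      (byte >= 0x30 && byte <= 0x39) || // 0-9
      (byte >= 0x41 && byte <= 0x5a) || // A-Z
      (byte >= 0x61 && byte <= 0x7a); // a-z
    // In UTF-8, bytes >= 0x80 belong to multi-byte sequences
    // (accented characters), so they are skipped as well.
    if (isAlphaNum || byte >= 0x80 || excluded.has(byte)) continue;
    counts.set(byte, (counts.get(byte) ?? 0) + 1);
  }
  let best;
  let bestCount = 0;
  for (const [byte, count] of counts) {
    if (count > bestCount) {
      best = byte;
      bestCount = count;
    }
  }
  return best === undefined ? undefined : String.fromCharCode(best);
}

const people = new TextEncoder().encode('id;name\n1;"José"\n2;"Zoë"\n');
console.log(discoverDelimiterByte(people)); // prints ;
```

This sketch still treats spaces and punctuation as candidates, so free-text fields can skew the result; a real implementation would likely need a further filter or a consistency check.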

hadyrashwan · 2024-01-29T08:50:39Z

@wdavidw
I created a small proof of concept for the auto_delimiter option.

master...hadyrashwan:node-csv:patch-1

When running the tests, the following happens and I'm not sure why:

All tests pass when I run the test script, except the encoding test with the BOM option.
When I run the encoding tests (packages/csv-parse/test/option.encoding.coffee) on their own, they pass.
I added a small test for \t, based on the delimiter tests, to see how the logic runs. It detected the delimiter successfully; however, the test did not pass.

Question:

We are committing the dist files. Is this expected?

Missing parts:

Handling of characters coming from escape, quote, and record delimiter options.
Add more tests.
Add a reference in the TS definitions.
Add a new page about the auto_delimiter in the docs.

Appreciate your feedback :)

wdavidw · 2024-01-29T13:19:00Z

I'll take some time to review later. In the meantime, what do you mean by "We are committing the dist files"?

wdavidw · 2024-01-29T13:56:10Z

A few notes for now:

Use delimiter_auto and not auto_delimiter, __discoverDelimiterAuto and not __autoDiscoverDelimiter
Disabled by default, the default value is false
When normalizing the option, add consistency checks, for example it cannot equal the values of record_delimiter (all those rules require tests)
Don't convert to a string; compare values directly inside the buffer
Write more unit tests, in particular one which writes data one byte at a time (see https://github.com/adaltas/node-csv/blob/master/packages/csv-parse/test/api.stream.events.coffee#L53 as an example)
My strategy would be to discover the delimiter before any parsing is done. Here is how I would start my experiment:
1. Work around __needMoreData: if delimiter_auto is activated, and only for the first line, allocate a safe buffer size dedicated to discovery
2. Start discovery (your __autoDiscoverDelimiter function) just after BOM handling and before the actual parsing (https://github.com/adaltas/node-csv/blob/master/packages/csv-parse/lib/api/index.js#L109)
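The consistency check mentioned in the notes above could look roughly like the sketch below. It assumes, purely for illustration, that `delimiter_auto` accepts either `false` or an array of candidate strings; the thread does not settle the option's type, and `normalizeDelimiterAuto` is a hypothetical name, not an existing csv-parse function.

```javascript
// Hypothetical normalization step. The accepted shape of delimiter_auto
// (false or an array of candidates) is an assumption of this sketch.
function normalizeDelimiterAuto(options) {
  const value = options.delimiter_auto ?? false; // disabled by default
  if (value === false) return false;
  if (!Array.isArray(value) || value.length === 0) {
    throw new Error("delimiter_auto must be false or a non-empty array");
  }
  // Characters already claimed by other options cannot be delimiters.
  // record_delimiter may be a single string or an array of strings.
  const reserved = [options.quote, options.escape]
    .concat(options.record_delimiter ?? [])
    .filter((v) => typeof v === "string" && v.length > 0);
  for (const candidate of value) {
    if (reserved.includes(candidate)) {
      throw new Error(
        `delimiter_auto candidate ${JSON.stringify(candidate)} conflicts with another option`
      );
    }
  }
  return value;
}
```

Each rule in such a function (default value, type check, conflict with quote, escape, or record_delimiter) would need its own unit test, as requested in the notes.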

hadyrashwan · 2024-01-30T07:16:01Z

> I'll take some time to review later. In the meantime, what do you mean by "We are committing the dist files"?

When I'm working I always see the build files in the dist folders added to git and not ignored.

Some projects add those build files in the git ignore file.

Just want to make sure that I'm not adding those files by mistake.

NoahCzelusta · 2024-03-07T16:52:11Z

A couple of comments on the method of detecting delimiters:

We cannot safely assume that the most common character is THE CSV delimiter. The CSV delimiter is the character that consistently splits each row into the same number of columns.
CSVs can safely store strings that contain the delimiter, so parsing has to be a little more intelligent (either considering quotes or allowing for a small degree of inconsistency in the column count per row).

Python has a great example of handling these in their implementation. The pandas library uses this implementation but only reads from the first line of the file (here).
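A minimal JavaScript sketch of the two points in this comment: a candidate is accepted only if it splits every sampled row into the same number of fields, and delimiters inside double quotes are ignored. The names `countFields` and `detectByConsistency` are hypothetical, and the sketch splits rows on newlines, so it does not handle line breaks inside quoted fields. It is not a port of the Python or pandas code.

```javascript
// Count the fields of one line, ignoring delimiters inside double quotes.
function countFields(line, delimiter, quote = '"') {
  let fields = 1;
  let inQuotes = false;
  for (const ch of line) {
    // An escaped quote ("") toggles twice, leaving the state unchanged.
    if (ch === quote) inQuotes = !inQuotes;
    else if (ch === delimiter && !inQuotes) fields++;
  }
  return fields;
}

// Hypothetical helper: picks the candidate that yields the same field
// count (greater than one) on every non-empty line of the sample.
function detectByConsistency(sample, candidates = [",", ";", "\t", "|"]) {
  const lines = sample.split(/\r?\n/).filter((line) => line.length > 0);
  let best;
  let bestFields = 1;
  for (const candidate of candidates) {
    const counts = lines.map((line) => countFields(line, candidate));
    const consistent = counts.every((n) => n === counts[0]);
    if (consistent && counts[0] > bestFields) {
      best = candidate;
      bestFields = counts[0];
    }
  }
  return best;
}

// Commas outnumber semicolons here, yet only ";" splits rows consistently.
console.log(detectByConsistency("a;b\n1,2,3,4;x\n")); // prints ;
```

Allowing a small degree of inconsistency, as suggested above, would mean replacing the strict `every` check with a tolerance on how many rows may deviate.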

carlbleick · 2024-04-10T09:04:36Z

Are there any plans to open a PR for that? As far as I can see the current changes are only present on a branch.

I would definitely love to see that feature.

vincerubinetti · 2024-05-21T19:04:32Z

Here's another algorithm for detecting the delimiter that seems like a good idea:
https://stackoverflow.com/a/19070276/2180570

wdavidw · 2024-05-22T09:55:34Z

I haven't had the time yet. The solution needs to deal with the streaming nature of the parser. The idea would be to extract a limited number of bytes from the stream, apply a detection algorithm such as the one proposed above, then replay the bytes stored on the side with the detected delimiter. I need some time to do it correctly.
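The buffer-then-replay idea can be sketched independently of the parser. The class below is hypothetical: `DelimiterSniffer`, its method names, and the default sample size are assumptions, and it works on string chunks for brevity, whereas the actual parser works on buffers.

```javascript
// Hypothetical sketch of "hold back, detect, replay".
// `detect` is any function that maps a text sample to a delimiter.
class DelimiterSniffer {
  constructor(detect, sampleSize = 1024) {
    this.detect = detect;
    this.sampleSize = sampleSize;
    this.pending = []; // chunks held back until detection has run
    this.pendingLength = 0;
    this.delimiter = undefined;
    this.detected = false;
  }

  // Returns the chunks that may now be forwarded to the parser.
  write(chunk) {
    if (this.detected) return [chunk];
    this.pending.push(chunk);
    this.pendingLength += chunk.length;
    return this.pendingLength >= this.sampleSize ? this.flush() : [];
  }

  // Call at end of input so that short inputs are detected too.
  end() {
    return this.detected ? [] : this.flush();
  }

  flush() {
    this.delimiter = this.detect(this.pending.join(""));
    this.detected = true;
    const replay = this.pending;
    this.pending = [];
    return replay; // the stored chunks, in their original order
  }
}

const sniffer = new DelimiterSniffer((text) => text.match(/[,;]/)?.[0], 8);
const forwarded = [
  ...sniffer.write("a;b"),
  ...sniffer.write("\n1;2\n"),
  ...sniffer.end(),
];
console.log(sniffer.delimiter); // prints ;
console.log(forwarded.length); // prints 2
```

Nothing is forwarded until the sample is large enough or the input ends, so the parser never sees data before the delimiter is known.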

schibrikov · 2024-10-31T00:11:16Z

Also highly interested in that. My current workaround is to clone the original stream, read the first chunk on a duplicate, detect the delimiter, then get back and handle the first stream.

So basically:

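// Assumes `stream` is a WHATWG ReadableStream and `discoverDelimiter` is your own helper.
async function detectDelimiter(stream) {
  // Duplicate the stream so the sample read does not consume the data to parse.
  const [mainStream, delimiterStream] = stream.tee();
  const reader = delimiterStream.getReader();
  // Only the first chunk is inspected.
  const { value } = await reader.read();
  const delimiter = discoverDelimiter(value);
  // Release the duplicate branch; the main branch stays readable.
  await reader.cancel();

  return { delimiter, stream: mainStream };
}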

Looks somewhat messy, but it works. I have to dynamically handle different types of files avoiding static configuration as much as possible.

cawoodm added the enhancement label Aug 31, 2023

loucadufault mentioned this issue May 22, 2024

Support multiple quote characters when parsing #431

Open

missinglink mentioned this issue Nov 8, 2024

Defining delimiter inside CSV files to import pelias/csv-importer#111

Open
