Support CSV dialects #3864

Open
dafeder opened this issue Oct 28, 2022 · 6 comments

Comments

@dafeder
Member

dafeder commented Oct 28, 2022

There are a lot of CSV variants out there, from TSV to semicolon-delimited files to different escaping methods. Even though both DKAN's native CSV parser and the MySQL LOAD DATA importer can be configured to support most of these variants, there is no easy way to do this in DKAN, on either a per-resource or a system-wide level.

The Frictionless Data project has a spec designed to address just this issue: CSV Dialect. We should explore ways to support different dialects in importers, and figure out the most efficient way to communicate which dialect to use to the importer on a per-resource basis.
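
As a rough illustration of what dialect support could look like, the sketch below translates a CSV Dialect descriptor into parser settings and parses one semicolon-delimited line with PHP's built-in `str_getcsv()`. The property names `delimiter`, `quoteChar` and `escapeChar` come from the CSV Dialect spec; the function name `dialectToParserConfig()` and the returned keys are assumptions made for illustration, not DKAN's API.

```php
<?php

/**
 * Translate a CSV Dialect descriptor into flat parser settings.
 *
 * Hypothetical helper: the function name and the returned keys are
 * illustrative and not part of DKAN.
 */
function dialectToParserConfig(array $dialect): array {
  return [
    // Fall back to the defaults of PHP's own CSV functions when a
    // property is absent from the descriptor (an assumption of this sketch).
    'delimiter' => $dialect['delimiter'] ?? ',',
    'quote' => $dialect['quoteChar'] ?? '"',
    'escape' => $dialect['escapeChar'] ?? '\\',
  ];
}

// A descriptor for a semicolon-delimited file.
$dialect = ['delimiter' => ';', 'quoteChar' => '"'];
$config = dialectToParserConfig($dialect);

// Parse one line with PHP's built-in CSV parser using those settings.
$row = str_getcsv('Berlin;"3,7";"a;b"', $config['delimiter'], $config['quote'], $config['escape']);
print_r($row);
```

With the semicolon dialect, the comma inside `"3,7"` and the quoted semicolon inside `"a;b"` both stay within their fields, which is the behavior a per-resource dialect setting would need to guarantee.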

@stefan-korn
Contributor

@dafeder: This is of interest to us. In Germany the delimiter is usually a semicolon, and even MS Excel and the like use the semicolon delimiter by default in German versions.

Am I right that currently the delimiter is hardcoded in this place: https://github.com/GetDKAN/dkan/blob/2.x/modules/datastore/src/Service/ImportService.php#L166

And right now there is no configuration option for this?

Regarding

> figure out the most efficient way to communicate which dialect to use to the importer on a per-resource basis.

Do you already have something in mind? Would extending the distribution schema with an optional field to define the CSV dialect be a viable option from your point of view?

@dafeder
Member Author

dafeder commented Apr 25, 2024

We are in a tricky spot because we are trying to stay as close to DCAT as possible, but this is kind of outside the scope of DCAT. I think as a stopgap we should figure out some relatively straightforward way to override that hardcoded value, but it may be that a better solution is to have a system outside of the metastore completely for storing file resources, perhaps as part of the datastore, and decouple that as much as possible from the metastore. This is sort of already the case but Resources are basically just a URL and a timestamp at the moment.

@dafeder
Member Author

dafeder commented Apr 25, 2024

Also, there is a way to do this, sort of, with event listeners. The ImportService::EVENT_CONFIGURE_PARSER event would allow you to change the delimiter character, but you would need to define all your conditional logic there. Will make a note to document this in a recipe, but something like:

$events[ImportService::EVENT_CONFIGURE_PARSER][] = ['set'];

[...]

public function set(Event $event) {
    $parserConfiguration = $event->getData();
    $parserConfiguration['delimiter'] = ';';
    $event->setData($parserConfiguration);
}

h/t @janette
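
For anyone who wants to try the handler logic outside Drupal, here is a minimal, self-contained sketch. It is not a drop-in module: `StandInEvent` is a hypothetical stand-in for the event object DKAN dispatches (only `getData()` and `setData()` are assumed, as in the snippet above), `SemicolonDelimiterHandler` is an illustrative name, and the configuration key is assumed to be spelled `delimiter`. In a real module the method would live on a Symfony event subscriber registered as a service.

```php
<?php

/**
 * Hypothetical stand-in for the event DKAN dispatches; it only mimics
 * the getData()/setData() pair used in the snippet above.
 */
final class StandInEvent {

  public function __construct(private mixed $data) {}

  public function getData(): mixed {
    return $this->data;
  }

  public function setData(mixed $data): void {
    $this->data = $data;
  }

}

/**
 * Illustrative handler that forces a semicolon delimiter.
 */
final class SemicolonDelimiterHandler {

  public function set(StandInEvent $event): void {
    $parserConfiguration = $event->getData();
    // Only the delimiter is changed; other settings pass through untouched.
    $parserConfiguration['delimiter'] = ';';
    $event->setData($parserConfiguration);
  }

}

// Simulate a dispatch with a comma-delimited configuration (assumed shape).
$event = new StandInEvent(['delimiter' => ',', 'quote' => '"']);
(new SemicolonDelimiterHandler())->set($event);
print_r($event->getData());
```

The sketch needs PHP 8.0 or later because it uses constructor property promotion and the `mixed` type.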

@stefan-korn
Contributor

@dafeder: Thanks a lot for the hint. This works nicely to change the delimiter to a semicolon globally. I had missed that. I still often look only for the good old hooks and forget about the new Symfony events ... By the way, and off-topic: is there a documentation standard for events, like there is for hooks? I searched a bit and could not find anything useful.

Regarding conditional logic: the event gets only the parser configuration as data? So I have no information about the resource being parsed here? Or am I missing something again?

@dafeder
Member Author

dafeder commented Apr 29, 2024

I think you're right, it's basically all or nothing, sorry to lead you astray there. And yeah, documenting those events has been on our to-do list for a long time now, this is a good reminder.

@stefan-korn
Contributor

fyi #4176
