Change delimiter for CSV parsing #4176
-
Very cool @stefan-korn. Hopefully we soon won't need this, but for now it could be helpful to people. One thing: I imagine this has no effect on the MySQL importer? If it doesn't, I think that should be noted on the module page or in the README. It may not be obvious to people who have not gotten too deep into the code that the MySQL importer (which I think at one point we clocked at 60x better performance) does not use the CSV parser at all.

I think we need to add an event to this part of the mysql import module so that there is an opportunity to change the delimiter. Note that in the mysql importer we do inspect the mimetype and change the delimiter to the tab character if it's tsv. This is obviously not the best method, because these files are often just labeled as txt files or even csv.

Still, it's interesting that with your module you can use semicolon delimiters but not tabs in the CSV parser, while we can use tabs but not semicolons in the mysql importer.
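To illustrate why looking at the file content can be more robust than trusting the mimetype, here is a minimal Python sketch. It is not DKAN code (DKAN is PHP); the function name `guess_delimiter` and the candidate list are hypothetical, and the heuristic deliberately ignores quoting, so it is only a rough starting point.

```python
def guess_delimiter(sample: str, candidates=(",", ";", "\t")) -> str:
    """Guess a field delimiter from the first lines of a text sample.

    Hypothetical heuristic: pick the candidate that appears the same,
    non-zero number of times on every inspected line. Quoted fields
    that contain a delimiter are NOT handled.
    """
    # Look only at the first 10 non-empty lines.
    lines = [line for line in sample.splitlines() if line.strip()][:10]
    best, best_count = ",", 0  # fall back to comma when nothing matches
    for cand in candidates:
        counts = {line.count(cand) for line in lines}
        if len(counts) == 1:  # same count on every line
            count = counts.pop()
            if count > best_count:
                best, best_count = cand, count
    return best


# A file labeled .txt or .csv may still be tab-separated.
print(repr(guess_delimiter("a\tb\tc\n1\t2\t3\n")))  # prints '\t'
print(repr(guess_delimiter("a;b\n1;2\n")))          # prints ';'
```

A check like this could complement the mimetype inspection rather than replace it, for example by only running when the mimetype is ambiguous.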
-
@dafeder: Thanks for pointing this out. I hadn't looked at the MySQL importer yet, and was maybe wishfully thinking it relied on the CSV parser as well :-) but then it probably could not be 60x faster ... I have added a submodule that allows choosing the delimiter for the MySQL importer too (the submodule is named datastore_mysql_import_tweak). It also decorates dkan.datastore.service.factory.import, and with decoration_priority I could get it called at the right time, I suppose. One more thing to note: And one more general question:
-
It seems very strange to me that allow_delimiter_in_query would be necessary to set the delimiter in LOAD DATA. We're talking about two completely unrelated kinds of delimiters here: the one allow_delimiter_in_query is supposed to refer to is simply a semicolon at the end of a MySQL query string. Are you saying that a
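To make the two kinds of delimiter concrete, here is a small Python sketch that builds a MySQL `LOAD DATA` statement as a string. The helper `build_load_data`, the table name, and the file path are hypothetical and not taken from DKAN. The field delimiter lives inside the `FIELDS TERMINATED BY` clause, so choosing a semicolon there puts a `;` into the statement text even though no statement terminator is present. Whether any guard in the database layer actually reacts to that is an assumption, not something confirmed in this thread.

```python
# Hypothetical helper: illustrates LOAD DATA syntax only, not DKAN's importer.
# Maps each supported delimiter to how it is written inside a SQL string literal.
SQL_LITERALS = {",": ",", ";": ";", "\t": "\\t"}


def build_load_data(path: str, table: str, delimiter: str = ",") -> str:
    """Return a LOAD DATA statement using the given field delimiter.

    Values are interpolated directly for readability; real code must
    validate or escape the path and table name.
    """
    if delimiter not in SQL_LITERALS:
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    return (
        f"LOAD DATA LOCAL INFILE '{path}' "
        f"INTO TABLE {table} "
        f"FIELDS TERMINATED BY '{SQL_LITERALS[delimiter]}' "
        "OPTIONALLY ENCLOSED BY '\"' "
        "IGNORE 1 LINES"
    )


comma_stmt = build_load_data("/tmp/data.csv", "datastore_example", ",")
semi_stmt = build_load_data("/tmp/data.csv", "datastore_example", ";")

# Neither statement ends with a terminator, yet one of them contains ';'.
print(";" in comma_stmt)  # prints False
print(";" in semi_stmt)   # prints True
```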
-
I made a small module that allows changing the delimiter (and the quoting character) for CSV parsing (see #3864). It provides a config form for setting the delimiter.
Maybe later I'll add the possibility to change this per resource rather than for all resources.
Objections:
I also tried a tab delimiter but could not get it to work properly, so I left it out for the moment (comma, semicolon and whitespace only at the moment).
The quoting character does not seem to have much influence, as far as I can tell: parsing seems to work with quoted strings in the CSV as well as without.
Regarding the escape and record-end characters, I was unsure how to provide these correctly and what to offer as alternatives, so I left them out for the moment.
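For readers unfamiliar with how the delimiter and quoting character interact, here is a minimal sketch using Python's standard `csv` module. It is not the module's code; the function name `parse_delimited` and the sample data are made up for illustration. It shows the behavior described above: the same settings handle quoted and unquoted fields, and quoting only matters when a field contains the delimiter itself.

```python
import csv
import io


def parse_delimited(text: str, delimiter: str = ",", quotechar: str = '"'):
    """Parse delimited text into a list of rows (each a list of strings)."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
    )
    return list(reader)


# Semicolon-delimited sample; one field is quoted because it contains ';'.
sample = 'name;city\n"Doe; Jane";Berlin\nSmith;Paris\n'

for row in parse_delimited(sample, delimiter=";"):
    print(row)
# prints:
# ['name', 'city']
# ['Doe; Jane', 'Berlin']
# ['Smith', 'Paris']
```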