Refactor and rethink the tabular ingest subsystem in v6. #8526

landreev · 2022-03-22T19:09:03Z

[a placeholder for now, mostly]

I would like us to very seriously reconsider, and possibly reimplement and/or refactor, our system for ingesting and storing tabular data. It's a complicated setup with a lot of technical debt. Some of the complexity exists for legacy reasons that may not be very relevant anymore.

I'm creating this as an umbrella issue for keeping track of such ingest-related issues.

An example of an ingest-related implementation issue:

Add mechanism for storing ingested tabular data files WITHOUT the variable name header #8524 (brand new)

Numerous issues where an option for skipping ingest is requested:

I would definitely like to finally have this implemented. The reasons it hasn't been implemented yet are kind of legacy too by now.

An example of an issue where a specific ingestable format is discussed, and the need for a specific service offered on a tabular file is questioned:

Is RData/Rds download and ingest necessary? #7249

[I'll keep updating this list]

rhijmans · 2023-02-21T04:40:53Z

I would like to add my voice to the RData/RDS debate.

To add to @reikoch's points (#6678): RData is not a typical file format. It contains an R environment with data, so you really have no idea what you might be getting when you load it. The standard way of opening an RData file will just dump the objects into your working environment. That is, if you can load it. Loading may fail because R packages that are needed are not available, or because the data objects in the session have pointers to, e.g., a C object or a database that you do not have. It is a reproducibility nightmare that should not have been included as an option.

RDS is much better. One object per file; when you read it, you assign it to a variable name. But even that format ought to be discouraged. It would be much better to use .csv for tabular data because it is a plain, well-known format that is software agnostic. The only good reason for using RDS would be to preserve, e.g., a fitted model or some other complex data type that cannot be written to a .csv (and in many cases there are other good formats for those, such as GeoTIFF for raster data).

While you should of course support RData that has been uploaded, you would do the world a big favor if you did not make it easy to upload new RData files. I suppose people could always work around that via a zip file, but that is fine, that would be on them. At least the sites would not be encouraging bad behavior.

pdurbin · 2023-02-21T06:44:06Z

@rhijmans thanks for your thoughts on this. @siacus and I just had a quick chat about it. I believe @landreev knows the most about how we create RData files (and tabular files) as a more preservation-friendly format than proprietary formats (Stata, SPSS, etc.), so I'll chat with him next.

amberleahey · 2023-05-25T00:45:22Z

I'll also chime in to say that this refactor and rethink of ingest would be helpful. While I'm here, I would like to suggest we think about increasing the size limit for tabular data ingest to support files in excess of 500MB (the current limit), and expanding CSV support to include CSVW-format JSON, which provides metadata for CSV files: https://csvw.org/
We would also like to improve adding/editing value and category labels using the Data Curation Tool, so maybe this work could kick start some new features in DV as well!
I'm sure there are more things we could improve/support with a refactor/redesign of this important DV feature...
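
To make the CSVW suggestion a bit more concrete, here is a minimal sketch of what a CSVW metadata document for one CSV file could look like. The file name, column names, and datatypes are hypothetical, chosen only for illustration; the `@context`, `url`, and `tableSchema` keys come from the CSVW vocabulary. Nothing here reflects how Dataverse would actually generate or store such metadata.

```python
import json


def build_csvw_metadata(csv_name, columns):
    """Return a CSVW metadata dict describing a single CSV file.

    `columns` is a list of (name, title, datatype) tuples. All values
    passed in below are illustrative, not taken from a real dataset.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_name,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }


# Hypothetical table: three columns with explicit datatypes.
metadata = build_csvw_metadata(
    "survey.csv",
    [
        ("respondent_id", "Respondent ID", "integer"),
        ("interview_date", "Interview date", "date"),
        ("region", "Region", "string"),
    ],
)

# Serialized form that would travel alongside the CSV file.
metadata_json = json.dumps(metadata, indent=2)
```

The point of the sidecar JSON is that column types and human-readable titles are declared once, outside the CSV, so a consumer does not have to guess them from the values.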

jggautier · 2024-11-06T14:44:51Z

A user used the feedback form on Harvard Dataverse today to leave feedback related to how Dataverse handles tabular files. I'm including it in this GitHub issue since it's the most recent issue about the handling of tabular data.

The file formats do not conform to common standards, so off-the-shelf tools like Pandas, Excel, Google Sheets, Qlik, PowerBI and Tableau don't work without a conversion with R or Python first. You could have educated your clients with some basic standard on syntax, or offer a csv download option.

Later the user wrote:

I have also included a python example on how the data could also have been saved, using standards. I illustrate the simple zipped csv standard and the advanced parquet. But options include metadata into the file itself, and do not rely on dedicated one-off code stored elsewhere. Both standards do lead to larger data file sizes.

I put the python example they included in this zip file: dataverse-glopops.ipynb.zip
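
The user's notebook is not reproduced here. As a rough, independent sketch of the "zipped CSV" idea they describe, the following uses only the Python standard library (not pandas) to write rows with a header line into a ZIP archive and read them back. The function names, the member name `data.csv`, and the sample rows are all hypothetical.

```python
import csv
import io
import zipfile


def write_zipped_csv(rows, fieldnames, member_name="data.csv"):
    """Serialize dict rows to CSV (with a header) and return ZIP bytes."""
    text = io.StringIO(newline="")
    writer = csv.DictWriter(text, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, text.getvalue().encode("utf-8"))
    return archive.getvalue()


def read_zipped_csv(zip_bytes, member_name="data.csv"):
    """Read the CSV member back as a list of dicts keyed by the header."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Made-up sample rows, for illustration only.
rows = [
    {"site": "A", "year": 2020, "count": 12},
    {"site": "B", "year": 2021, "count": 7},
]
archive_bytes = write_zipped_csv(rows, ["site", "year", "count"])
restored = read_zipped_csv(archive_bytes)
```

Note that the values come back as strings: plain CSV carries the header but no type information, which is one reason the user also mentions Parquet.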

amberleahey · 2024-11-06T14:52:21Z

Many thanks, yes, we get this feedback a lot. We suggested an alternative to Tab using CSV through the R conversion packages developed by others in the R community. It could easily be added to DV. We can build out options for both CSV and Tab, as they serve two different needs. Other formats are also possible. I'm not sure where the ticket went, but we already have a proof of concept built in our development environment. Happy to share more details, and would love to see this go forward!

reikoch · 2024-11-06T15:02:49Z

CSV and TSV are generally pretty common. It would be nice if a clearly defined variant of these formats were used: which separator, which quote/unquote character, stability against embedded newlines, etc. Another drawback of these formats is their lack of type safety: for instance, is 1DEC just a string or a date? There are options to address both issues; see, for instance, https://datapackage.org/.
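
A small sketch of both points, using Python's standard `csv` module. The dialect settings and sample rows are assumptions chosen for illustration, not a description of what Dataverse writes. The first half pins down one explicit variant and shows that embedded newlines and quotes survive a round trip; the second half shows that a naive line split breaks, and that every value comes back as a string, so the format itself cannot say whether `1DEC` is a date.

```python
import csv
import io

# One explicit, fully specified variant: comma separator, double-quote
# quoting, quotes escaped by doubling, CRLF record terminator.
DIALECT = dict(
    delimiter=",",
    quotechar='"',
    doublequote=True,
    quoting=csv.QUOTE_MINIMAL,
    lineterminator="\r\n",
)

# Made-up rows with an embedded newline, embedded quotes, and an
# ambiguous value ("1DEC").
rows = [
    ["id", "note", "code"],
    ["1", "line one\nline two", "1DEC"],
    ["2", 'says "hi", twice', "2"],
]

buffer = io.StringIO(newline="")
csv.writer(buffer, **DIALECT).writerows(rows)
encoded = buffer.getvalue()

# A dialect-aware reader recovers the three records exactly.
parsed = list(csv.reader(io.StringIO(encoded, newline=""), **DIALECT))

# A naive line split sees four "lines" because of the embedded newline.
naive_lines = encoded.splitlines()

# No type information survives: every parsed value is a str.
all_strings = all(isinstance(value, str) for row in parsed for value in row)
```

Schemas such as the one linked above address the second problem by declaring each column's type outside the data file.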

pdurbin · 2024-11-06T16:41:43Z

@jggautier interesting. I can't find the ticket in RT. Please forward it to me when you get a chance.

Is the problem the ".tab" extension? For a long, long, long time I've wanted us to switch to ".tsv". Here's someone on Twitter saying they couldn't open a ".tab" file but renaming it to ".tsv" fixed it: #2720 (comment)

@amberleahey sure, offering both CSV and TSV sounds fine. @reikoch seems to suggest that there are variants of these formats. To me, the separator for TSV is the tab character. I'm not sure what the variations would be. From what I've heard, CSV can have variants, with various quoting and escaping conventions, perhaps, but I haven't looked into it.
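
One place where tab-separated variants can diverge is a field that itself contains a tab. The sketch below, using Python's standard `csv` module with made-up data, writes the same row two ways: once with CSV-style quoting and once with quoting disabled and a backslash escape. Both are "tab-separated", but a reader has to know which convention was used. This is only an illustration of the general problem, not a statement about what Dataverse's ".tab" files contain.

```python
import csv
import io

# Made-up row whose second field contains a literal tab character.
row = ["42", "left\tright"]

# Variant 1: tab delimiter with CSV-style quoting of awkward fields.
quoted_buf = io.StringIO(newline="")
csv.writer(quoted_buf, delimiter="\t", lineterminator="\n").writerow(row)
quoted = quoted_buf.getvalue()

# Variant 2: tab delimiter, no quoting, backslash as escape character.
ESCAPED = dict(delimiter="\t", quoting=csv.QUOTE_NONE, escapechar="\\",
               lineterminator="\n")
escaped_buf = io.StringIO(newline="")
csv.writer(escaped_buf, **ESCAPED).writerow(row)
escaped = escaped_buf.getvalue()

# Each variant round-trips only when read with matching settings.
from_quoted = next(csv.reader(io.StringIO(quoted, newline=""), delimiter="\t"))
from_escaped = next(csv.reader(io.StringIO(escaped, newline=""), **ESCAPED))

# Splitting on tabs alone yields three fields instead of two.
naive_fields = quoted.rstrip("\n").split("\t")
```

If fields never contain tabs or newlines, these variants collapse into one and a plain split on the tab character is enough.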

@amberleahey you might be remembering how @lubitchv pointed us toward https://github.com/dusadrian/DDIwR at https://groups.google.com/g/dataverse-community/c/JG6NkJ7ZQW0/m/pRvbS4YVAQAJ

@reikoch Data Package looks interesting. Thanks.

@landreev sorry for hijacking your issue. Maybe we can move the discussion to https://dataverse.zulipchat.com or https://groups.google.com/g/dataverse-community 😅

jggautier · 2024-11-06T16:51:06Z

Hey @pdurbin. There's no RT ticket. The feedback form I mentioned is the orange "Feedback" link we see at the bottom of Harvard Dataverse, which leads to the Google Form at https://docs.google.com/forms/d/e/1FAIpQLSf4VinfXacN2fsZxWn1_ITCAlUqzdBT7cNk5C5DjvKBRf6wtQ/viewform.

It's not clear to me what the problem is exactly, and I might not be able to learn more from the user. If I do, I'll update that comment. But at the least, the user's comments could help with figuring out what questions need answers. I don't think this conversation is hijacking this GitHub issue.

pdurbin · 2024-11-12T16:27:48Z

@amberleahey I assume this new issue is related:

Feature Request: Integrate data conversion that allows additional data file formats to be offered in download options #11015

amberleahey · 2024-11-21T16:54:48Z

yes, let's think more about data conversion. Glad it is cross linked to this ticket now, thanks!

Another long-standing request has been to relabel the ingested derivative format as 'Preservation copy' or something similar, to prevent it from being the default download and, as much as possible, to promote the original file as the access copy (the TSV is not usable in many cases).

pdurbin · 2024-11-21T17:41:20Z

@amberleahey sure. Please feel free to create a dedicated issue about calling those files "preservation copy" or whatever.

Also, this issue is somewhat related:

Display deposited (rather than ingested) copy of tabular files #7956

pdurbin mentioned this issue Oct 1, 2022

semicolon separated values file with .csv extension fails ingest #8990

Open

This was referenced Oct 8, 2022

RData ingest #6985

Closed

replace download option "RData" with RDS #6678

Open

Is RData/Rds download and ingest necessary? #7249

Closed

rdata ingest defaults #3999

Closed

DS-INRAE added this to Recherche Data Gouv Jul 10, 2024

DS-INRAE moved this to 🔍 Interest in Recherche Data Gouv Jul 10, 2024

pdurbin mentioned this issue Nov 12, 2024

Feature Request: Integrate data conversion that allows additional data file formats to be offered in download options #11015

Open

vkush mentioned this issue Dec 5, 2024

Tabular files: semicolon is not supported as a separator by ".csv" files nfdi4cat/repo4cat#37

Open
