Refactor and rethink the tabular ingest subsystem in v6. #8526

Open
landreev opened this issue Mar 22, 2022 · 11 comments

Comments

@landreev
Contributor

landreev commented Mar 22, 2022

[a placeholder for now, mostly]

I would like us to very seriously reconsider and possibly reimplement and/or refactor our system of ingesting and storing tabular data. It's a complicated setup, with a lot of technical debt. Some of the complexity exists for legacy reasons that may no longer be relevant.

I'm creating this as an umbrella issue for keeping track of such ingest-related issues.

An example of an ingest-related implementation issue:

Numerous issues where an option for skipping ingest is requested:

I would definitely like to finally have this implemented. The reasons it hasn't been implemented yet are kind of legacy too by now.

An example of an issue where a specific ingestable format is discussed, and the need for a specific service offered on a tabular file is questioned:

[I'll keep updating this list]

@rhijmans

I would like to add my voice to the RData/RDS debate.

To add to @reikoch's points (#6678): RData is not a typical file format. It contains an R environment with data, so you really have no idea what you might be getting when you load it. The standard way of opening an RData file will just dump the objects into your working environment. That is, if you can load it. Loading may fail because R packages that are needed are not available, or because the data objects in the session have pointers to, e.g., a C object or a database that you do not have. It is a reproducibility nightmare that should not have been included as an option.

RDS is much better. One object per file, and when you read it you assign it to a variable name. But even that format ought to be discouraged. It would be much better to use .csv for tabular data because it is a plain, well-known format that is software agnostic. The only good reason for using an RDS file would be to preserve, e.g., a fitted model or some other complex data type that cannot be written to a .csv (and even then there are good alternative formats in many cases, such as GeoTIFF for raster data).

While you should of course support RData that has been uploaded, you would do the world a big favor if you did not make it easy to upload new RData files. I suppose people could always work around that via a zip file, but that is fine, that would be on them. At least the sites would not be encouraging bad behavior.

@pdurbin
Member

pdurbin commented Feb 21, 2023

@rhijmans thanks for your thoughts on this. @siacus and I just had a quick chat about it. I believe @landreev knows the most about how we create RData files (and tabular files) as a more preservation-friendly format than proprietary formats (Stata, SPSS, etc.), so I'll chat with him next.

@amberleahey

I'll also chime in to say that this refactor and rethink of ingest would be helpful. While I'm here, I would like to suggest we think about increasing the size limit for tabular data ingest to support files in excess of 500MB (the current limit), and expanding CSV support to include CSVW, a JSON format that provides metadata for CSV files: https://csvw.org/
We would also like to improve adding/editing value and category labels using the Data Curation Tool, so maybe this work could kick start some new features in DV as well!
I'm sure there are more things we could improve/support with a refactor/redesign of this important DV feature...
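
For readers unfamiliar with CSVW, here is a minimal sketch of the kind of JSON sidecar it describes. The file name `survey.csv`, the column names, and the helper `csvw_metadata` are hypothetical and only for illustration; this is not something Dataverse produces today. By convention the document would be saved next to the CSV as `survey.csv-metadata.json`.

```python
import json


def csvw_metadata(csv_url, columns):
    """Build a minimal CSVW metadata document for one CSV file.

    `columns` is a list of (name, title, datatype) tuples.
    The helper itself is hypothetical; the keys follow the CSVW vocabulary.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }


# Hypothetical example table: three columns with declared types.
metadata = csvw_metadata(
    "survey.csv",
    [
        ("id", "Respondent ID", "integer"),
        ("dob", "Date of birth", "date"),
        ("region", "Region", "string"),
    ],
)
print(json.dumps(metadata, indent=2))
```

The point of the sidecar is that column titles and datatypes travel with the CSV instead of living only in the repository's database.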

@DS-INRAE DS-INRAE moved this to 🔍 Interest in Recherche Data Gouv Jul 10, 2024
@jggautier
Contributor

jggautier commented Nov 6, 2024

A user used the feedback form on Harvard Dataverse today to leave feedback related to how Dataverse handles tabular files. I'm including it in this GitHub issue since it is the most recent issue about the handling of tabular data.

The file formats do not conform to common standards, so off the shelf tools like Pandas, Excel, Google Sheets, Qlik, PowerBI and Tableau don't work without a conversion with R or Python first. You could have educated your clients with some basic standard on syntax, or offer a csv download option.

Later the user wrote:

I have also included a python example on how the data could also have been saved, using standards. I illustrate the simple zipped csv standard and the advanced parquet. But options include metadata into the file itself, and do not rely on dedicated one-off code stored elsewhere. Both standards do lead to larger data file sizes.

I put the python example they included in this zip file: dataverse-glopops.ipynb.zip
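
The notebook itself is not reproduced here. As a rough, independent illustration of the "zipped csv" idea the user mentions, the following sketch uses only the Python standard library; the function names and the member name `data.csv` are assumptions, and Parquet is left out because it needs a third-party library.

```python
import csv
import io
import zipfile


def write_zipped_csv(rows, header, member_name="data.csv"):
    """Return the bytes of a zip archive holding one UTF-8 CSV file."""
    text = io.StringIO(newline="")
    writer = csv.writer(text)
    writer.writerow(header)
    writer.writerows(rows)
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, text.getvalue().encode("utf-8"))
    return archive.getvalue()


def read_zipped_csv(data, member_name="data.csv"):
    """Read the CSV member back as a list of rows (all values are strings)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        with zf.open(member_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.reader(text))


# Hypothetical sample data, written to a zip archive and read back.
payload = write_zipped_csv([[1, "north"], [2, "south"]], ["id", "region"])
table = read_zipped_csv(payload)
```

Note that the values come back as strings; plain CSV does not carry column types, which is part of what the user's Parquet suggestion would address.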

@amberleahey

> A user used the feedback form on Harvard Dataverse today to leave feedback related to how Dataverse handles tabular files. I'm including it in this GitHub issue since it's the most recent issue about the handling of tabular data.
>
> The file formats do not conform to common standards, so off the shelf tools like Pandas, Excel, Google Sheets, Qlik, PowerBI and Tableau don't work without a conversion with R or Python first. You could have educated your clients with some basic standard on syntax, or offer a csv download option.
>
> I'll update this comment if I'm able to learn more from the user.

Many thanks, yes we get this feedback a lot. We suggested an alternative to Tab using CSV through the R software conversion packages developed by others in the R community. These could easily be added to DV. We can build out options for CSV and Tab as they serve two different needs. Other formats are also possible. I'm not sure where the ticket went, but we already have a proof-of-concept built in our development environment. Happy to share more details, and would love to see this go forward!

@reikoch

reikoch commented Nov 6, 2024

csv and tsv are generally pretty common. It would be nice if a clearly defined variant of these formats were used: which separator, which quote/unquote character, stability against embedded newlines, etc. Another drawback of these formats is their lack of type safety; for instance, is 1DEC just a string or a date? There are options to waterproof both issues, for instance see https://datapackage.org/.
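
To make both concerns concrete, here is a small sketch using Python's standard `csv` module. The dialect class `StrictTsv`, the sample rows, and the `CASTS` mapping are hypothetical; the `schema` dictionary is only modelled on the `fields`/`name`/`type` layout used by Table Schema and is not a complete Data Package descriptor.

```python
import csv
import io


class StrictTsv(csv.Dialect):
    """One explicitly pinned TSV variant (an assumption, not a standard)."""
    delimiter = "\t"
    quotechar = '"'
    doublequote = True          # a literal quote is written as ""
    quoting = csv.QUOTE_MINIMAL  # quote only fields that need it
    lineterminator = "\n"
    skipinitialspace = False
    strict = True               # raise on malformed input instead of guessing


rows = [
    ["id", "note", "code"],
    ["1", "line one\nline two\twith a tab", "1DEC"],
]

# Round trip: embedded newline and tab survive because the field is quoted.
buf = io.StringIO(newline="")
csv.writer(buf, dialect=StrictTsv).writerows(rows)
buf.seek(0)
parsed = list(csv.reader(buf, dialect=StrictTsv))

# Types are declared outside the data, so "1DEC" is read as a string by
# declaration rather than guessed to be a date.
schema = {"fields": [
    {"name": "id", "type": "integer"},
    {"name": "note", "type": "string"},
    {"name": "code", "type": "string"},
]}
CASTS = {"integer": int, "string": str}
header, *body = parsed
typed = [
    {f["name"]: CASTS[f["type"]](value) for f, value in zip(schema["fields"], row)}
    for row in body
]
```

The dialect settles the separator and quoting questions once, and the schema settles the typing question without relying on each reader's inference rules.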

@pdurbin
Member

pdurbin commented Nov 6, 2024

@jggautier interesting. I can't find the ticket in RT. Please forward it to me when you get a chance.

Is the problem the ".tab" extension? For a long, long, long time I've wanted us to switch to ".tsv". Here's someone on Twitter saying they couldn't open a ".tab" file but renaming it to ".tsv" fixed it: #2720 (comment)

@amberleahey sure, offering both CSV and TSV sounds fine. @reikoch seems to suggest that there are variants of these formats. To me, the separator for TSV is the tab character. I'm not sure what the variations would be. From what I've heard CSV can have variants, various quoting and escaping conventions, perhaps, but I haven't looked into it.
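
As a rough illustration of what offering both could look like, here is a sketch that converts tab-delimited text to CSV with Python's standard `csv` module. The function `tab_to_csv` is hypothetical, and the sketch assumes the ".tab" content is plain tab-separated text with double-quoted strings; it is not a description of how Dataverse writes or serves these files.

```python
import csv
import io


def tab_to_csv(tab_text):
    """Convert tab-delimited text to comma-delimited text.

    Assumes double quotes are the quote character in the input.
    """
    reader = csv.reader(io.StringIO(tab_text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    # The writer re-quotes any field that contains a comma, quote or newline.
    csv.writer(out, lineterminator="\n").writerows(reader)
    return out.getvalue()


# Hypothetical sample: the name contains a comma, so it must be quoted in CSV.
sample = 'id\tname\n1\t"Smith, J."\n'
converted = tab_to_csv(sample)
```

The comma inside the name shows where the variants come from: the same value needs quoting in CSV, while a tab-separated file could in principle carry it unquoted.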

@amberleahey you might be remembering how @lubitchv pointed us toward https://github.com/dusadrian/DDIwR at https://groups.google.com/g/dataverse-community/c/JG6NkJ7ZQW0/m/pRvbS4YVAQAJ

@reikoch Data Package looks interesting. Thanks.

@landreev sorry for hijacking your issue. Maybe we can move the discussion to https://dataverse.zulipchat.com or https://groups.google.com/g/dataverse-community 😅

@jggautier
Contributor

Hey @pdurbin. There's no RT ticket. The feedback form I mentioned is the orange "Feedback" link we see at the bottom of Harvard Dataverse, which leads to the Google Form at https://docs.google.com/forms/d/e/1FAIpQLSf4VinfXacN2fsZxWn1_ITCAlUqzdBT7cNk5C5DjvKBRf6wtQ/viewform.

It's not clear to me what the problem is exactly, and I might not be able to learn more from the user. If I do, I'll update that comment. But at the least, the user's comments could help with figuring out what questions need answers. I don't think this conversation is hijacking this GitHub issue.

@pdurbin
Member

pdurbin commented Nov 12, 2024

> Many thanks, yes we get this feedback a lot. We suggested an alternative to Tab using CSV through the R software conversion packages developed by others in the R community. These could easily be added to DV. We can build out options for CSV and Tab as they serve two different needs. Other formats are also possible. I'm not sure where the ticket went, but we already have a proof-of-concept built in our development environment. Happy to share more details, and would love to see this go forward!

@amberleahey I assume this new issue is related:

@amberleahey

yes, let's think more about data conversion. Glad it is cross linked to this ticket now, thanks!

Another long-standing request has been to relabel the ingested derivative format as 'Preservation copy' or something similar, so that it is not the default download, and to promote the original file as the access copy as much as possible (the tsv is not usable in many cases).

@pdurbin
Member

pdurbin commented Nov 21, 2024

@amberleahey sure. Please feel free to create a dedicated issue about calling those files "preservation copy" or whatever.

Also, this issue is somewhat related:
