Refactor and rethink the tabular ingest subsystem in v6. #8526
Comments
I would like to add my voice to the RData/RDS debate. To add to @reikoch's points (#6678): RData is not a typical file format. It contains an R environment with data, so you really have no idea what you might be getting when you load it. The standard way of opening an RData file will just dump the objects into your working environment. That is, if you can load it. Loading may fail because R packages that are needed are not available, or because the data objects in the session have pointers to, e.g., a C object or a database that you do not have. It is a reproducibility nightmare that should not have been included as an option. RDS is much better: one object per file, and when you read it you assign it to a variable name. But even that format ought to be discouraged. It would be much better to use .csv for tabular data because it is a plain, well-known format that is software agnostic. The only good reason for using an RDS file would be to preserve, e.g., a fitted model or some other complex data type that cannot be written to a .csv (and in many cases there are other good formats for those, such as GeoTIFF for raster data). While you should of course support RData that has already been uploaded, you would do the world a big favor if you did not make it easy to upload new RData files. I suppose people could always work around that via a zip file, but that is fine; that would be on them. At least the sites would not be encouraging bad behavior.
I'll also chime in to say that this refactor and rethink of ingest would be helpful. While I'm here, I would like to suggest we think about increasing the size limit for tabular data ingest to support files larger than 500MB (the current limit), and expanding CSV support to include CSVW-format JSON to provide metadata for CSV files: https://csvw.org/
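To make the CSVW suggestion a bit more concrete, here is a rough sketch of what a minimal metadata document for one CSV file might look like, built in Python. The helper `build_csvw_metadata`, the file name `survey.csv`, and the column names are all made up for illustration; nothing here reflects how Dataverse would actually generate or store such metadata.

```python
import json


def build_csvw_metadata(csv_url, columns):
    """Return a minimal CSVW-style metadata document for one CSV file.

    `columns` is a list of (name, datatype) pairs. This helper and the
    example values below are hypothetical, not part of Dataverse.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "datatype": datatype}
                for name, datatype in columns
            ]
        },
    }


# Hypothetical file and columns, for illustration only.
metadata = build_csvw_metadata(
    "survey.csv",
    [
        ("respondent_id", "integer"),
        ("interview_date", "date"),
        ("region", "string"),
    ],
)

metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

The point of the sketch is that column names and datatypes live in a small JSON sidecar next to the CSV, so the CSV itself stays plain.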
A user used the feedback form on Harvard Dataverse today to leave feedback related to how Dataverse handles tabular files. I'm including it here since this GitHub issue is the most recent one about the handling of tabular data.
Later the user wrote:
I put the Python example they included in this zip file: dataverse-glopops.ipynb.zip
Many thanks. Yes, we get this feedback a lot. We suggested an alternative to Tab using CSV through the R conversion packages developed by others in the R community. This could easily be added to DV. We can build out options for both CSV and Tab, as they serve two different needs. Other formats are also possible. I'm not sure where the ticket went, but we already have a proof of concept built in our development environment. Happy to share more details, and we would love to see this go forward!
csv and tsv are generally pretty common. It would be nice if a clearly defined variant of these formats were used: which separator, which quote/unquote character, stability against embedded newlines, etc. Another drawback of these formats is their lack of type safety. For instance, is 1DEC just a string or a date? There are options to address both issues; for instance, see https://datapackage.org/.
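As a small illustration of the points above, here is a sketch using Python's standard `csv` module with every dialect choice spelled out (tab separator, double-quote quoting, `\n` line terminator). The sample rows and the helper names `write_tsv` and `read_tsv` are invented for the example; this is not how Dataverse writes its tab files.

```python
import csv
import io

# Made-up sample data: an ambiguous value, an embedded newline,
# embedded quotes, and an embedded tab.
rows = [
    ["id", "label", "note"],
    ["1", "1DEC", "first line\nsecond line"],
    ["2", 'say "hi"', "tab\there"],
]


def write_tsv(rows):
    """Serialize rows with an explicitly pinned dialect."""
    buf = io.StringIO(newline="")
    writer = csv.writer(
        buf,
        delimiter="\t",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


def read_tsv(text):
    """Parse text using the same delimiter and quote character."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t", quotechar='"')
    return list(reader)


text = write_tsv(rows)
round_tripped = read_tsv(text)

# With a pinned dialect the embedded newline, quotes, and tab survive.
print(round_tripped == rows)

# A naive line split sees more physical lines than logical rows.
print(len(text.splitlines()), "physical lines for", len(rows), "rows")

# The format carries no types: "1DEC" comes back as a plain string.
print(type(round_tripped[1][1]).__name__)
```

A reader that does not know the quoting convention, or that splits on newlines first, would break on the second row. The type question is not solved by the dialect at all, which is where a sidecar schema such as Data Package comes in.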
@jggautier interesting. I can't find the ticket in RT. Please forward it to me when you get a chance. Is the problem the ".tab" extension? For a long, long, long time I've wanted us to switch to ".tsv". Here's someone on Twitter saying they couldn't open a ".tab" file but renaming it to ".tsv" fixed it: #2720 (comment) @amberleahey sure, offering both CSV and TSV sounds fine. @reikoch seems to suggest that there are variants of these formats. To me, the separator for TSV is the tab character. I'm not sure what the variations would be. From what I've heard, CSV can have variants (various quoting and escaping conventions, perhaps), but I haven't looked into it. @amberleahey you might be remembering how @lubitchv pointed us toward https://github.com/dusadrian/DDIwR at https://groups.google.com/g/dataverse-community/c/JG6NkJ7ZQW0/m/pRvbS4YVAQAJ @reikoch Data Package looks interesting. Thanks. @landreev sorry for hijacking your issue. Maybe we can move the discussion to https://dataverse.zulipchat.com or https://groups.google.com/g/dataverse-community 😅
Hey @pdurbin. There's no RT ticket. The feedback form I mentioned is the orange "Feedback" link we see at the bottom of Harvard Dataverse, which leads to the Google Form at https://docs.google.com/forms/d/e/1FAIpQLSf4VinfXacN2fsZxWn1_ITCAlUqzdBT7cNk5C5DjvKBRf6wtQ/viewform. It's not clear to me what the problem is exactly, and I might not be able to learn more from the user. If I do, I'll update that comment. But at the least, the user's comments could help with figuring out what questions need answers. I don't think this conversation is hijacking this GitHub issue.
@amberleahey I assume this new issue is related:
Yes, let's think more about data conversion. Glad it is cross-linked to this ticket now, thanks! Another long-standing request has been to relabel the ingested derivative format as 'Preservation copy' or something similar, to prevent it from being the default download and, as much as possible, to promote the original file as the access copy (the tsv is not usable in many cases).
@amberleahey sure. Please feel free to create a dedicated issue about calling those files "preservation copy" or whatever. Also, this issue is somewhat related:
[a placeholder for now, mostly]
I would like us to very seriously reconsider and possibly reimplement and/or refactor our system of ingesting and storing tabular data. It's a complicated setup, with a lot of technical debt. Some of the complexity exists for legacy reasons that may not be very relevant anymore.
I'm creating this as an umbrella issue for keeping track of such ingest-related issues.
An example of an ingest-related implementation issue:
Numerous issues where an option for skipping ingest is requested:
I would definitely like to finally have this implemented. The reasons it hasn't been implemented yet are kind of legacy too by now.
An example of an issue where a specific ingestable format is discussed, and the need for a specific service offered on a tabular file is questioned:
[I'll keep updating this list]