
Files being converted automatically from .csv to .tab #6385

Closed
tainguyenbui opened this issue Nov 20, 2019 · 8 comments
@tainguyenbui (Contributor) commented Nov 20, 2019

Hi Dataverse Team!

Description
As part of data ingestion, when uploading a file or replacing an existing file in a dataset, Dataverse currently converts light-weight .csv files into .tab files.

Additionally, this conversion locks the dataset until the tabular file is created, preventing the dataset from being published automatically.

The problem we have
The platform we are developing, which uses Dataverse as a data repository, does not need .csv files to be converted. In fact, it is currently built to cater for .csv files only, and it is likely to stay that way.

Due to the change of extension, when we attempt to replace the file we initially uploaded with a newer version, we run into content-type mismatch errors: one file is comma-separated values and the other is tab-separated values.

Additionally, in our current implementation we would like to replace the file and, if that succeeds with a 200 OK response, publish the dataset automatically with the new version of the file. This is currently not possible because the dataset is locked while the file is converted to .tab.

Desired behaviour
We would love the ability to prevent .csv files from being converted into other formats. This would also avoid locking the dataset while converting .csv into .tab, so we would be able to publish the dataset straight away.

Possible solutions

  • To avoid breaking current implementations, keep the current behaviour as the default, and add a flag that skips ingest, as is already done for large files.
  • Create a lightweight version of the replace endpoint

We are open to discussion, as we are very interested in keeping the original .csv files untouched in the dataset.

Thanks a lot in advance for your help

Regards,
Tai

@djbrooke (Contributor)

@tainguyenbui thanks for the detailed writeup. Since this is a core part of the application, we've been resistant to change it (see #2199 (comment)) but new use cases are always helpful for revisiting functionality. You should be able to get a .csv file out by specifying "original" as part of the API call, but we had not considered the locking delay implications.
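The "original" option mentioned here is exposed through the Data Access API (per the Dataverse guides, a `format=original` query parameter on the datafile access endpoint). A minimal sketch of building such a request; the server URL and file id are placeholders:

```python
# Build the Data Access API URL that returns the originally uploaded
# file (e.g. the .csv) instead of the ingested .tab derivative.
# Endpoint per the Dataverse API guides; values below are placeholders.

def original_download_url(server: str, file_id: int) -> str:
    return f"{server}/api/access/datafile/{file_id}?format=original"

url = original_download_url("https://demo.dataverse.org", 395776)
# Fetch with e.g. urllib.request.urlopen(url); for restricted files,
# send the API token in an X-Dataverse-key header.
```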

@landreev @scolapasta (and whoever else is interested) let's catch up about this sometime over the next few days. Community comments welcome here as well!

djbrooke self-assigned this Nov 20, 2019
@tainguyenbui (Contributor, Author)

@djbrooke, the workflow we are following is a bit different from just retrieving the .csv: we are also very interested in uploading original versions of files to datasets, as well as replacing existing files.

The application retrieves information about a dataset, including file ids, file names, etc. We have also created a tool that reads a .csv file and displays information about it. The user can modify that .csv, and we then upload the new version of the file and publish the new dataset version.

Since the .tab and .csv files have different content types, and ingest to .tab does not work in all scenarios, we end up with errors that are difficult to handle, such as content-type mismatches.

For that reason, we thought that dealing only with .csv files could simplify the process and reduce errors.

@djbrooke (Contributor) commented Nov 22, 2019

@tainguyenbui and @MYF95 thanks for talking about this earlier.

For the issues with the dataset locking, it's good to hear that you're using a queue to minimize the impacts of the delays.
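For readers landing here later: one way to cope with the ingest lock is to poll the dataset's locks endpoint (`/api/datasets/$ID/locks` in the native API) before publishing. A hedged sketch; `fetch_locks` stands in for that GET call, and all names are illustrative:

```python
import time

# Sketch: wait for ingest locks to clear before attempting to publish.
# `fetch_locks` is a placeholder for a GET to /api/datasets/$ID/locks
# that returns the list of current locks (empty when unlocked).

def wait_until_unlocked(fetch_locks, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll until the dataset reports no locks; return True if it
    unlocked within `timeout` seconds, False otherwise."""
    waited = 0.0
    while waited <= timeout:
        if not fetch_locks():        # empty list -> no locks, safe to publish
            return True
        sleep(interval)
        waited += interval
    return False
```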

For the replace issue, @landreev just mentioned that we have "forceReplace", which should allow you to upload a .csv even though the endpoint expects a .tab content type. Editing the example from the docs (http://guides.dataverse.org/en/latest/api/native-api.html#replacing-files), it would look like:

curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F 'file=@data.csv' \
  -F 'jsonData={"description":"My description.","categories":["Data"],"forceReplace":true}' \
  "https://demo.dataverse.org/api/files/$FILE_ID/replace"

Hopefully this does the trick and this is just a documentation issue where we can do better. Let me know!
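For completeness, a small sketch of assembling that jsonData form field programmatically; the field names are taken from the curl example in this thread, everything else is illustrative:

```python
import json

# Build the jsonData payload for the file-replace call.
# Setting forceReplace lets the new .csv replace the ingested .tab even
# though their content types differ (field names per the native-API docs).

json_data = json.dumps({
    "description": "My description.",
    "categories": ["Data"],
    "forceReplace": True,
})
# POST this string as the `jsonData` form field, together with the file
# part, to /api/files/$FILE_ID/replace.
```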

@tainguyenbui (Contributor, Author)

@djbrooke I've tried the forceReplace parameter within jsonData and it seems to work as expected. For now, we can make use of this functionality, together with making sure that we always retrieve the original files.

Additionally, the two properties below in the dataset information let us know that, despite the file being a .tab, there is an original file with a different format:

"originalFileFormat": "text/csv",
"originalFormatLabel": "Comma Separated Values",

original response:

{
    "status": "OK",
    "data": {
        "id": 389605,
        "identifier": "FK2/B3DD2U",
        "persistentUrl": "https://doi.org/10.70122/FK2/B3DD2U",
        "protocol": "doi",
        "authority": "10.70122",
        "publisher": "Demo Dataverse",
        "publicationDate": "2019-09-05",
        "storageIdentifier": "file://10.70122/FK2/B3DD2U",
        "latestVersion": {
            "id": 52897,
            "storageIdentifier": "file://10.70122/FK2/B3DD2U",
            "versionState": "DRAFT",
            "productionDate": "Production Date",
            "UNF": "UNF:6:lBZpHhKmpLnF4RfvE5yKcg==",
            "lastUpdateTime": "2019-11-22T16:15:51Z",
            "createTime": "2019-11-22T16:15:51Z",
            "license": "CC0",
            "termsOfUse": "CC0 Waiver",
            "fileAccessRequest": false,
            "metadataBlocks": {
                "citation": {
                    "displayName": "Citation Metadata",
                    "fields": [
                        {
                            "typeName": "title",
                            "multiple": false,
                            "typeClass": "primitive",
                            "value": "Test"
                        },
                        {
                            "typeName": "author",
                            "multiple": true,
                            "typeClass": "compound",
                            "value": [
                                {
                                    "authorName": {
                                        "typeName": "authorName",
                                        "multiple": false,
                                        "typeClass": "primitive",
                                        "value": "Nguyen, Tai"
                                    }
                                }
                            ]
                        },
                        {
                            "typeName": "datasetContact",
                            "multiple": true,
                            "typeClass": "compound",
                            "value": [
                                {
                                    "datasetContactName": {
                                        "typeName": "datasetContactName",
                                        "multiple": false,
                                        "typeClass": "primitive",
                                        "value": "Nguyen, Tai"
                                    },
                                    "datasetContactEmail": {
                                        "typeName": "datasetContactEmail",
                                        "multiple": false,
                                        "typeClass": "primitive",
                                        "value": "[email protected]"
                                    }
                                }
                            ]
                        },
                        {
                            "typeName": "dsDescription",
                            "multiple": true,
                            "typeClass": "compound",
                            "value": [
                                {
                                    "dsDescriptionValue": {
                                        "typeName": "dsDescriptionValue",
                                        "multiple": false,
                                        "typeClass": "primitive",
                                        "value": "Some test"
                                    }
                                }
                            ]
                        },
                        {
                            "typeName": "subject",
                            "multiple": true,
                            "typeClass": "controlledVocabulary",
                            "value": [
                                "Other"
                            ]
                        },
                        {
                            "typeName": "depositor",
                            "multiple": false,
                            "typeClass": "primitive",
                            "value": "Nguyen, Tai"
                        },
                        {
                            "typeName": "dateOfDeposit",
                            "multiple": false,
                            "typeClass": "primitive",
                            "value": "2019-09-03"
                        }
                    ]
                }
            },
            "files": [
                {
                    "description": "",
                    "label": "test.tab",
                    "restricted": false,
                    "version": 2,
                    "datasetVersionId": 52897,
                    "dataFile": {
                        "id": 395776,
                        "persistentId": "",
                        "pidURL": "",
                        "filename": "test.tab",
                        "contentType": "text/tab-separated-values",
                        "filesize": 3135,
                        "description": "",
                        "storageIdentifier": "16e93e5f42f-64d5bc454468",
                        "originalFileFormat": "text/csv",
                        "originalFormatLabel": "Comma Separated Values",
                        "originalFileSize": 3182,
                        "UNF": "UNF:6:lBZpHhKmpLnF4RfvE5yKcg==",
                        "rootDataFileId": 395739,
                        "previousDataFileId": 395739,
                        "md5": "1650524475138646bb44371fbb18adcd",
                        "checksum": {
                            "type": "MD5",
                            "value": "1650524475138646bb44371fbb18adcd"
                        },
                        "creationDate": "2019-11-22"
                    }
                }
            ]
        }
    }
}
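Given a response shaped like the one above, a client can decide per file whether an ingested derivative exists and what the original content type was. A minimal sketch (the function name is made up; the field names are those shown in the response):

```python
import json

# Map each file id in a dataset-version response to its original
# content type, falling back to the stored contentType when the file
# was never ingested (i.e. has no originalFileFormat).

def original_formats(dataset_json: str) -> dict:
    data = json.loads(dataset_json)
    files = data["data"]["latestVersion"]["files"]
    return {
        f["dataFile"]["id"]: f["dataFile"].get(
            "originalFileFormat", f["dataFile"]["contentType"]
        )
        for f in files
    }
```

A client could use this map to request `format=original` only for files that were actually ingested.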

Thanks a lot @djbrooke, @landreev, and everyone involved!

@pdurbin (Member) commented Nov 25, 2019

> there is an original file

@tainguyenbui you might be interested in the discussion at #4000 which talks about original files and how CSV is just as good of an archival format as TSV.

The other day @djbrooke and I talked about this issue and how we might want to clarify and separate two different goals of ingest (both in terms of code/behavior and docs, I'd say).

@tainguyenbui (Contributor, Author)

@pdurbin, thanks for the links.

I previously had a look at issue #4000. Unfortunately, our files are produced by an existing application, and that is the format platform users will be working with for now, which is why we wouldn't want to break what the users already have in place.

It sounds great that there are discussions on how to enhance the existing data ingestion and data navigation.

@djbrooke (Contributor)

Hi @tainguyenbui, good to hear that forceReplace is working for you. I'm going to close this issue for now, since we won't be making any changes to the ingest process. @pdurbin, thanks also for summarizing the info from our conversation the other day!

@pdurbin (Member) commented Mar 22, 2022
