
Files being converted automatically from .csv to .tab #6385

Closed
tainguyenbui opened this issue Nov 20, 2019 · 8 comments
@tainguyenbui (Contributor) commented Nov 20, 2019

Hi Dataverse Team!

Description
As part of data ingestion, when uploading a file or replacing an existing file in a dataset, Dataverse currently converts light-weight .csv files into .tab files.

Additionally, this conversion locks the dataset until the tabular file is created, preventing the dataset from being published automatically.

The problem we have
The platform we are developing, which uses Dataverse as a data repository, does not need .csv files to be converted. In fact, it is currently built to cater for .csv files only, and it is likely to stay that way.

Due to the change of extension, when we attempt to replace the file we initially uploaded with a newer version, we run into content-type mismatch errors: one file is comma-separated values and the other is tab-separated values.

Additionally, in our current implementation we would like to replace the file and, if that succeeds with a 200 OK response, publish the dataset automatically with the new version of the file. This is currently not possible because the dataset is locked while the file is converted to .tab.

Desired behaviour
We would love the ability to prevent .csv files from being converted into other formats. This would also avoid locking the dataset while converting .csv into .tab, so we would be able to publish the dataset straight away.

Possible solutions

  • To avoid breaking current implementations, keep the current behaviour as the default, and add a flag that skips ingest, as is already done for large files.
  • Create a lightweight version of the replace endpoint

We are open to discussion, as we are very interested in keeping the original .csv files untouched in the dataset.

Thanks a lot in advance for your help

Regards,
Tai

@djbrooke (Contributor)

@tainguyenbui thanks for the detailed writeup. Since this is a core part of the application, we've been resistant to change it (see #2199 (comment)) but new use cases are always helpful for revisiting functionality. You should be able to get a .csv file out by specifying "original" as part of the API call, but we had not considered the locking delay implications.
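The "original" option mentioned here is exposed through the Data Access API (per the Dataverse guides, a `format=original` query parameter on the datafile access endpoint). A minimal sketch of building such a request; the server URL and file id are placeholders:

```python
# Build the Data Access API URL that returns the originally uploaded
# file (e.g. the .csv) instead of the ingested .tab derivative.
# Endpoint per the Dataverse API guides; values below are placeholders.

def original_download_url(server: str, file_id: int) -> str:
    return f"{server}/api/access/datafile/{file_id}?format=original"

url = original_download_url("https://demo.dataverse.org", 395776)
# Fetch with e.g. urllib.request.urlopen(url); for restricted files,
# send the API token in an X-Dataverse-key header.
```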

@landreev @scolapasta (and whoever else is interested) let's catch up about this sometime over the next few days. Community comments welcome here as well!

djbrooke self-assigned this Nov 20, 2019
@tainguyenbui (Contributor, Author)

@djbrooke, the workflow we are following is a bit different from just retrieving the .csv: we are also very interested in uploading original versions of files to datasets, as well as replacing existing files.

The application retrieves information about a dataset, including file ids, file names, etc. We have also created a tool that reads a .csv file and displays information about it. The user can modify that .csv, and we then upload the new version of the file and publish the new dataset version.

Since the .tab and .csv files have different content types, and ingest to .tab does not work in all scenarios, we end up with errors that are difficult to handle, such as content-type mismatches.

For that reason, we thought that dealing only with .csv files could simplify the process and reduce errors.

@djbrooke (Contributor) commented Nov 22, 2019

@tainguyenbui and @MYF95 thanks for talking about this earlier.

For the issues with the dataset locking, it's good to hear that you're using a queue to minimize the impacts of the delays.
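For readers landing here later: one way to cope with the ingest lock is to poll the dataset's locks endpoint (`/api/datasets/$ID/locks` in the native API) before publishing. A hedged sketch; `fetch_locks` stands in for that GET call, and all names are illustrative:

```python
import time

# Sketch: wait for ingest locks to clear before attempting to publish.
# `fetch_locks` is a placeholder for a GET to /api/datasets/$ID/locks
# that returns the list of current locks (empty when unlocked).

def wait_until_unlocked(fetch_locks, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll until the dataset reports no locks; return True if it
    unlocked within `timeout` seconds, False otherwise."""
    waited = 0.0
    while waited <= timeout:
        if not fetch_locks():        # empty list -> no locks, safe to publish
            return True
        sleep(interval)
        waited += interval
    return False
```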

For the replace issue, @landreev just mentioned that we have "forceReplace", which should allow you to upload a .csv even though the endpoint expects a .tab content type. Editing the example from the docs (http://guides.dataverse.org/en/latest/api/native-api.html#replacing-files), it would look like:

curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F 'file=@data.csv' \
  -F 'jsonData={"description":"My description.","categories":["Data"],"forceReplace":true}' \
  "https://demo.dataverse.org/api/files/$FILE_ID/replace"

Hopefully this does the trick and this is just a documentation issue where we can do better. Let me know!
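For completeness, a small sketch of assembling that jsonData form field programmatically; the field names are taken from the curl example in this thread, everything else is illustrative:

```python
import json

# Build the jsonData payload for the file-replace call.
# Setting forceReplace lets the new .csv replace the ingested .tab even
# though their content types differ (field names per the native-API docs).

json_data = json.dumps({
    "description": "My description.",
    "categories": ["Data"],
    "forceReplace": True,
})
# POST this string as the `jsonData` form field, together with the file
# part, to /api/files/$FILE_ID/replace.
```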

@tainguyenbui (Contributor, Author)

@djbrooke I've tried the forceReplace parameter within jsonData and it seems to work as expected. For now, we can make use of this functionality, together with making sure that we always retrieve the original files.

Additionally, the two properties below in the dataset information let us know that, despite the file being a .tab, there is an original file with a different format:

"originalFileFormat": "text/csv",
"originalFormatLabel": "Comma Separated Values",

original response:

{
    "status": "OK",
    "data": {
        "id": 389605,
        "identifier": "FK2/B3DD2U",
        "persistentUrl": "https://doi.org/10.70122/FK2/B3DD2U",
        "protocol": "doi",
        "authority": "10.70122",
        "publisher": "Demo Dataverse",
        "publicationDate": "2019-09-05",
        "storageIdentifier": "file://10.70122/FK2/B3DD2U",
        "latestVersion": {
            "id": 52897,
            "storageIdentifier": "file://10.70122/FK2/B3DD2U",
            "versionState": "DRAFT",
            "productionDate": "Production Date",
            "UNF": "UNF:6:lBZpHhKmpLnF4RfvE5yKcg==",
            "lastUpdateTime": "2019-11-22T16:15:51Z",
            "createTime": "2019-11-22T16:15:51Z",
            "license": "CC0",
            "termsOfUse": "CC0 Waiver",
            "fileAccessRequest": false,
            "metadataBlocks": {
                "citation": {
                    "displayName": "Citation Metadata",
                    "fields": [
                        {
                            "typeName": "title",
                            "multiple": false,
                            "typeClass": "primitive",
                            "value": "Test"
                        },
                        {
                            "typeName": "author",
                            "multiple": true,
                            "typeClass": "compound",
                            "value": [
                                {
                                    "authorName": {
                                        "typeName": "authorName",
                                        "multiple": false,
                                        "typeClass": "primitive",
                                        "value": "Nguyen, Tai"
                                    }
                                }
                            ]
                        },
                        {
                            "typeName": "datasetContact",
                            "multiple": true,
                            "typeClass": "compound",
                            "value": [
                                {
                                    "datasetContactName": {
                                        "typeName": "datasetContactName",
                                        "multiple": false,
                                        "typeClass": "primitive",
                                        "value": "Nguyen, Tai"
                                    },
                                    "datasetContactEmail": {
                                        "typeName": "datasetContactEmail",
                                        "multiple": false,
                                        "typeClass": "primitive",
                                        "value": "[email protected]"
                                    }
                                }
                            ]
                        },
                        {
                            "typeName": "dsDescription",
                            "multiple": true,
                            "typeClass": "compound",
                            "value": [
                                {
                                    "dsDescriptionValue": {
                                        "typeName": "dsDescriptionValue",
                                        "multiple": false,
                                        "typeClass": "primitive",
                                        "value": "Some test"
                                    }
                                }
                            ]
                        },
                        {
                            "typeName": "subject",
                            "multiple": true,
                            "typeClass": "controlledVocabulary",
                            "value": [
                                "Other"
                            ]
                        },
                        {
                            "typeName": "depositor",
                            "multiple": false,
                            "typeClass": "primitive",
                            "value": "Nguyen, Tai"
                        },
                        {
                            "typeName": "dateOfDeposit",
                            "multiple": false,
                            "typeClass": "primitive",
                            "value": "2019-09-03"
                        }
                    ]
                }
            },
            "files": [
                {
                    "description": "",
                    "label": "test.tab",
                    "restricted": false,
                    "version": 2,
                    "datasetVersionId": 52897,
                    "dataFile": {
                        "id": 395776,
                        "persistentId": "",
                        "pidURL": "",
                        "filename": "test.tab",
                        "contentType": "text/tab-separated-values",
                        "filesize": 3135,
                        "description": "",
                        "storageIdentifier": "16e93e5f42f-64d5bc454468",
                        "originalFileFormat": "text/csv",
                        "originalFormatLabel": "Comma Separated Values",
                        "originalFileSize": 3182,
                        "UNF": "UNF:6:lBZpHhKmpLnF4RfvE5yKcg==",
                        "rootDataFileId": 395739,
                        "previousDataFileId": 395739,
                        "md5": "1650524475138646bb44371fbb18adcd",
                        "checksum": {
                            "type": "MD5",
                            "value": "1650524475138646bb44371fbb18adcd"
                        },
                        "creationDate": "2019-11-22"
                    }
                }
            ]
        }
    }
}
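Given a response shaped like the one above, a client can decide per file whether an ingested derivative exists and what the original content type was. A minimal sketch (the function name is made up; the field names are those shown in the response):

```python
import json

# Map each file id in a dataset-version response to its original
# content type, falling back to the stored contentType when the file
# was never ingested (i.e. has no originalFileFormat).

def original_formats(dataset_json: str) -> dict:
    data = json.loads(dataset_json)
    files = data["data"]["latestVersion"]["files"]
    return {
        f["dataFile"]["id"]: f["dataFile"].get(
            "originalFileFormat", f["dataFile"]["contentType"]
        )
        for f in files
    }
```

A client could use this map to request `format=original` only for files that were actually ingested.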

Thanks a lot @djbrooke, @landreev, and everyone involved!

@pdurbin (Member) commented Nov 25, 2019

> there is an original file

@tainguyenbui you might be interested in the discussion at #4000 which talks about original files and how CSV is just as good of an archival format as TSV.

The other day @djbrooke and I talked about this issue and how we might want to clarify and separate two different goals of ingest (both in terms of code/behavior and docs, I'd say).

@tainguyenbui (Contributor, Author)

@pdurbin, thanks for the links.

I previously had a look at issue #4000. Unfortunately, our files are produced by an existing application, and that is the format platform users will be working with for now, which is why we wouldn't want to break what the users already have in place.

It sounds great that there are discussions on how to enhance the existing data ingestion and data navigation.

@djbrooke (Contributor)

Hi @tainguyenbui, good to hear that forceReplace is working for you. I'm going to close this issue for now, since we won't be making any changes to the ingest process. @pdurbin, thanks also for summarizing the info from our conversation the other day!

@pdurbin (Member) commented Mar 22, 2022
