Duplicate upload of the same file #357

eaquigley · 2014-07-09T15:37:42Z

Author Name: Elizabeth Quigley (@eaquigley)
Original Redmine Issue: 3772, https://redmine.hmdc.harvard.edu/issues/3772
Original Date: 2014-03-25
Original Assignee: Leonid Andreev

Noticed during testing that a user can upload a file multiple times at once without issues.

eaquigley · 2014-07-09T15:37:42Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-06-02T18:48:06Z

Elizabeth Quigley wrote:

Noticed during testing that a user can upload a file multiple times at once without issues.

I'm curious what we want users to see. "Copy of file1"? Or should they get an error?

pdurbin · 2014-07-14T14:57:09Z

See also some discussion with @jwhitney and @posixeleni about duplicate files in the context of the Data Deposit API as it's implemented in DVN 3.x: http://irclog.iq.harvard.edu/dataverse/2014-07-14

eaquigley · 2014-07-14T16:20:00Z

What do we want this to do? Should we have the system check for:
- Duplicate file names
- Duplicate content (i.e., the MD5 is checked to see if it is the same)
- Or both
After this check has been done and there are duplicates in either the file name or the content, we need to have a warning pop up asking the user whether they want to allow another copy of the file or cancel.

Example of a way Spotify does something like this when they recognize a duplicate file:

landreev · 2014-09-08T19:54:12Z

OK, in its current form, the ticket still doesn't say how exactly this should be handled.
There's a discussion of different approaches; but it leaves it as an open question.
I'd like to get this out of the way; so let's try and finalize it quickly:

Elizabeth, per your last comment:

Yes, I feel we should definitely check if the content (md5) is the same as another file;
Yes, I like your analogy with the "song's already in playlist" popup from Spotify. I'll implement a similar warning.
Not sure about duplicate filenames. I.e., I agree we shouldn't allow them. But not sure if we should bother showing a warning popup for that. In DVN 3.* we allow the upload to proceed, but modify the filename quietly, making it unique. If a user is uploading README.txt, and there's already a file with this name, we add it to the study as README-1.txt. (and if that already exists also, we try README-2.txt, etc.)

What do you think? If you would rather just have a warning, I could do that too. (but that would give them a possibility to ignore the warning and add a file with the same name but different content... - kinda sounds like a mess - ?)

eaquigley · 2014-09-08T19:59:52Z

@landreev The way duplicate filenames are done in DVN 3.* works for me. Since a user can edit a file name after it's uploaded, they can always change a duplicate filename then, so there's no need for a warning.

landreev · 2014-09-08T20:00:49Z

Great, thanks.
I'll go ahead and implement this asap.

pdurbin · 2014-09-09T12:15:22Z

What do we want to happen if users attempt to upload the same files via SWORD?

As https://redmine.hmdc.harvard.edu/issues/3301 explains, right now an error states, "Filename 50by1000.tab already exists."

akio-sone · 2014-09-09T12:49:43Z

Phil,
your question sounds like the Deposit-API might have a processing route
different from the one used by the web-GUI.

Are you seeking a separate processing route for the Deposit API for some
design reasons rather than using the unified, core deposit-call that
avoids duplicated code?


Akio Sone
Odum Inst.
UNC at Chapel Hill

pdurbin · 2014-09-09T13:02:32Z

@akio-sone in DVN 3.x SWORD code duplicates code elsewhere in the system, unfortunately. Refactoring to use common code would have been too much effort.

In Dataverse 4.0, as much as possible, I would like SWORD to use the same code path as the GUI. @landreev and I have already talked about how I should switch the SWORD code to his new back end method (#611 I think) for expanding zip files.

mercecrosas · 2014-09-09T13:58:19Z

The aim (if not now, later) should be to use the same API. Same applies to
the data and metadata API - it should be the same as what the web UI uses
to access/download data and metadata.


sbarbosadataverse · 2014-09-09T14:09:12Z

While I agree with having distinct file names, please note that contributors who have multiple data waves will use identical names, and they usually have documentation noting those names. The change then isn't just one on the data file page but also one in their documentation, which becomes a hassle for them. The readme files usually have the file names in them. They may not be hip on being "forced" to change names.
Cheers


If an uploaded file appears to be a duplicate of an existing file, *by content* (i.e., by md5), a warning message will be displayed, and the matching md5s highlighted. This way the user has the option of canceling the entire upload, of checking the delete checkboxes next to the files they don't want before they hit 'save', or of just proceeding with adding the files as they are, if they have some kind of a weird reason to have multiple identical files in the dataset...

landreev · 2014-09-09T20:34:20Z

OK, for this week's beta push, this is going to be implemented as agreed upon yesterday:

duplicate file names - un-duplicated automatically on upload (similarly to DVN 3.*; README.txt will become README-1.txt; if that already exists, README-2.txt, etc.). The user then has a chance to change the names back - if they have some kind of a weird reason to do so.
files that are duplicate by content, i.e. same md5: the Add Files page will display a warning and the matching md5s will be highlighted. So the user gets a chance to delete the duplicates (through the delete checkboxes) or cancel the deposit altogether.
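
For illustration only, here is a rough sketch of the content check described above. It is written in Python rather than taken from the Dataverse code base, and the `existing` and `uploads` name-to-bytes mappings and both function names are assumptions made up for the example. It reports, for each uploaded file, the earlier files that share its md5, which is the information a warning on the Add Files page would need.

```python
import hashlib
from collections import defaultdict


def md5_hex(content):
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(content).hexdigest()


def find_content_duplicates(existing, uploads):
    """Map each uploaded name to earlier files that have the same MD5.

    `existing` and `uploads` are hypothetical name -> bytes dicts standing in
    for files already in the dataset and files in the current upload.
    """
    seen = defaultdict(list)  # digest -> names seen so far
    for name, content in existing.items():
        seen[md5_hex(content)].append(name)

    flagged = {}
    for name, content in uploads.items():
        digest = md5_hex(content)
        if seen[digest]:
            # Copy the list so later appends do not change the report.
            flagged[name] = list(seen[digest])
        seen[digest].append(name)
    return flagged


if __name__ == "__main__":
    in_dataset = {"a.txt": b"hello"}
    new_files = {"b.txt": b"hello", "c.txt": b"other"}
    print(find_content_duplicates(in_dataset, new_files))
```

Files that are not in the returned mapping have no content match, so only the flagged ones would need highlighting.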

Phil, Akio: answering your question - in 4.0 the part above, where identical/already existing file names are modified until unique, is done in the IngestService (not in the Dataset page). Both the page and SWORD deposit call the service to process the files that are being uploaded. So the Deposit API will match the default behavior of the page - modified until unique.
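
A minimal sketch of the "modified until unique" rule, again in Python and not the actual IngestService code. The function name `make_unique_name` and the `taken` set are hypothetical; the only behavior taken from this thread is that README.txt becomes README-1.txt, then README-2.txt, and so on.

```python
import os


def make_unique_name(name, taken):
    """Return `name` if unused, else the first free `stem-N.ext` variant.

    `taken` is a set of file names already present in the dataset.
    """
    if name not in taken:
        return name
    stem, ext = os.path.splitext(name)
    counter = 1
    while f"{stem}-{counter}{ext}" in taken:
        counter += 1
    return f"{stem}-{counter}{ext}"


if __name__ == "__main__":
    names = {"README.txt", "README-1.txt"}
    print(make_unique_name("README.txt", names))  # README-2.txt
```

Because the page and the SWORD deposit would both call the same helper, the two routes end up with the same names for the same input.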

If anybody has suggestions/ideas for the GUI part of this, we may revisit this in Beta 8, when/if the dataset page is rebuilt to switch to using search for file display.

landreev · 2014-09-09T20:40:05Z

Phil: yes, you should definitely switch to the ingest service method that supports the "file spawning model" - where an uploaded file may result in several datafiles created (currently supported cases are zip files and geo shape files).
However, even if you continue using the deprecated single-file method, everything still goes through the same Ingest Service pipeline; so the file name rule above will still be applied.

kcondon · 2014-09-09T22:12:32Z

Basically works but doesn't detect duplicate filename if file was previously ingested as subsettable:

ingest 50by1000.dta, then upload again. Name isn't automatically changed to -1. We think this is because subsettable filename changes to .tab after ingest when it should compare subsequent uploads to original filename.

landreev · 2014-09-09T22:54:00Z

Good find - yeah, the filename check wasn't working on tabular data files (because foobar.dta becomes foobar.tab once ingested!).
should be working now, please retest.
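
One way the tabular case could be handled, sketched in Python under the assumption that each stored file keeps both its post-ingest name and the name it was uploaded with. The `StoredFile` class and its field names are invented for this example and are not taken from the commit referenced in this issue.

```python
from dataclasses import dataclass


@dataclass
class StoredFile:
    stored_name: str    # name after ingest, e.g. "50by1000.tab"
    original_name: str  # name as uploaded, e.g. "50by1000.dta"


def taken_names(files):
    """Collect every name a new upload could collide with.

    Both the post-ingest name and the original upload name count as taken,
    so re-uploading foobar.dta is caught even after it became foobar.tab.
    """
    names = set()
    for f in files:
        names.add(f.stored_name)
        names.add(f.original_name)
    return names


if __name__ == "__main__":
    dataset = [StoredFile("50by1000.tab", "50by1000.dta")]
    print("50by1000.dta" in taken_names(dataset))  # True
```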

eaquigley · 2014-09-10T15:42:33Z

@sbarbosadataverse Have you gotten complaints about this? According to @landreev, this is how it is being done in 3.* already so we won't be implementing a change.

landreev · 2014-09-10T18:45:12Z

@eaquigley, @sbarbosadataverse:
we are implementing a change;
in 3.* we do not allow duplicate names at all; even as an option.
so, did you ever get any complaints, from people who really wanted to have files with identical names in the same study?

kcondon · 2014-09-10T20:07:07Z

ok, seems to be working now. Closing

eaquigley added this to the Dataverse 4.0: In Review milestone Jul 9, 2014

eaquigley assigned landreev Jul 9, 2014

eaquigley modified the milestones: Dataverse 4.0: Beta 5, Dataverse 4.0: In Review Jul 15, 2014

eaquigley modified the milestones: Beta 5 - Dataverse 4.0, Beta 6 - Dataverse 4.0 Aug 27, 2014

landreev added a commit that referenced this issue Sep 9, 2014

Added code to automatically resolve duplicate file names, per #357.

b63d7f9

landreev added Status: QA and removed Status: Design labels Sep 9, 2014

landreev assigned kcondon and unassigned landreev Sep 9, 2014

landreev assigned landreev and kcondon and unassigned kcondon and landreev Sep 9, 2014

kcondon closed this as completed Sep 10, 2014

pdurbin mentioned this issue Mar 28, 2016

File Upload: Detection of duplicate files via md5 found in an existing file is confusing to users. #2955

Closed

pdurbin mentioned this issue Jan 13, 2017

The same filename with the same checksum (i.e. MD5) shouldn't be able to appear twice in the same dataset #3571

Closed
