Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate upload of the same file #357

Closed
eaquigley opened this issue Jul 9, 2014 · 18 comments
Closed

Duplicate upload of the same file #357

eaquigley opened this issue Jul 9, 2014 · 18 comments
Assignees
Labels
Type: Feature a feature request

Comments

@eaquigley
Copy link
Contributor


Author Name: Elizabeth Quigley (@eaquigley)
Original Redmine Issue: 3772, https://redmine.hmdc.harvard.edu/issues/3772
Original Date: 2014-03-25
Original Assignee: Leonid Andreev


Noticed during testing that a user can upload a file multiple times at once without issues.

@eaquigley
Copy link
Contributor Author


Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-06-02T18:48:06Z


Elizabeth Quigley wrote:

Noticed during testing that a user can upload a file multiple times at once without issues.

I'm curious what we want users to see. "Copy of file1"? Or should they get an error?

@eaquigley eaquigley added this to the Dataverse 4.0: In Review milestone Jul 9, 2014
@pdurbin
Copy link
Member

pdurbin commented Jul 14, 2014

See also some discussion with @jwhitney and @posixeleni about duplicate files in the context of the Data Deposit API as its implemented in DVN 3.x: http://irclog.iq.harvard.edu/dataverse/2014-07-14

@eaquigley
Copy link
Contributor Author

What do we want this do? Should we have the system check for:
-Duplicate file names
-Duplicate content (i.e.-MD5 is checked to see if it is the same)
-Or both
After this check has been done and there are duplicates in either the title or the content, we need to have a warning pop up asking the user if they want to allow another copy of the file or cancel.

Example of a way Spotify does something like this when they recognize a duplicate file:

photo1 1

@eaquigley eaquigley modified the milestones: Dataverse 4.0: Beta 5, Dataverse 4.0: In Review Jul 15, 2014
@landreev
Copy link
Contributor

landreev commented Sep 8, 2014

OK, in its current form, the ticket still doesn't say how exactly this should be handled.
There's a discussion of different approaches; but it leaves it as an open question.
I'd like to get this out of the way; so let's try and finalize it quickly:

Elizabeth, per your last comment:

  • Yes, I feel we should definitely check if the content (md5) is the same as another file;
  • Yes, I like your analogy with the "song's already in playlist" popup from Spotify. I'll implement a similar warning.
  • Not sure about duplicate filenames. I.e., I agree we shouldn't allow them. But not sure if we should bother showing a warning popup for that. In DVN 3.* we allow the upload to proceed, but modify the filename quietly, making it unique. If a user is uploading README.txt, and there's already a file with this name, we add it to the study as README-1.txt. (and if that already exists also, we try README-2.txt, etc.)

What do you think? If you would rather just have a warning, I could do that too. (but that would give them a possibility to ignore the warning and add a file with the same name but different content... - kinda sounds like a mess - ?)

@eaquigley
Copy link
Contributor Author

@landreev The way duplicate filenames are done in DVN 3.* works for me. Since a user can edit a file name after its uploaded, they can always change a duplicate filename then so no need for a warning.

@landreev
Copy link
Contributor

landreev commented Sep 8, 2014

Great, thanks.
I'll go ahead and implement this asap.

@pdurbin
Copy link
Member

pdurbin commented Sep 9, 2014

What do we want to happen if users attempt to upload the same files via SWORD?

As https://redmine.hmdc.harvard.edu/issues/3301 explains, right now an error states, "Filename 50by1000.tab already exists."

@akio-sone
Copy link
Contributor

Phil,
your question sounds like the Deposit-API might have a processing route
different from the one used by the web-GUI.

Are you seeking a separate processing route for the Deposit API for some
design reasons rather than using the unified, core deposit-call that
avoids duplicated code?

On 9/9/2014 8:15 AM, Philip Durbin wrote:

What do we want to happen if users attempt to upload the same files via
SWORD?

As https://redmine.hmdc.harvard.edu/issues/3301 explains, right now an
error states, "Filename 50by1000.tab already exists."


Reply to this email directly or view it on GitHub
#357 (comment).

Akio Sone
Odum Inst.
UNC at Chapel Hill

@pdurbin
Copy link
Member

pdurbin commented Sep 9, 2014

@akio-sone in DVN 3.x SWORD code duplicates code elsewhere in the system, unfortunately. Refactoring to use common code would have been too much effort.

In Dataverse 4.0, as much as possible, I would like SWORD to use the same code path as the GUI. @landreev and I have already talked about how I should switch the SWORD code to his new back end method (#611 I think) for expanding zip files.

@mercecrosas
Copy link
Member

The aim (if not now, later) should be to use the same API. Same applies to
the data and metadata API - it should be the same as what the web UI uses
to access/download data and metadata.

On Tue, Sep 9, 2014 at 9:02 AM, Philip Durbin [email protected]
wrote:

@akio-sone
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_akio-2Dsone&d=AAMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=MoES6dokjPLLcKaEAd7qaCuTcYZ4jLjEOBQnbbJ9BaA&m=KtwmOoqrJVmJKIgYLA5oSgxX8i2mH2ahRfpmGN4sMdo&s=03nyOc5bbAfkpKrfdo4NYw05Nt8lxTnkiVkr498JiVY&e=
in DVN 3.x SWORD code duplicates code elsewhere in the system,
unfortunately. Refactoring to use common code would have been too much
effort.

In Dataverse 4.0, as much as possible, I would like SWORD to use the same
code path as the GUI. @landreev
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_landreev&d=AAMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=MoES6dokjPLLcKaEAd7qaCuTcYZ4jLjEOBQnbbJ9BaA&m=KtwmOoqrJVmJKIgYLA5oSgxX8i2mH2ahRfpmGN4sMdo&s=Zctk1zkuHautme7LgX67a-rU5u7G8SUOE2Fu_W51MgU&e=
and I have already talked about how I should switch the SWORD code to his
new back end method (#611
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_IQSS_dataverse_issues_611&d=AAMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=MoES6dokjPLLcKaEAd7qaCuTcYZ4jLjEOBQnbbJ9BaA&m=KtwmOoqrJVmJKIgYLA5oSgxX8i2mH2ahRfpmGN4sMdo&s=KTYcNEf4CaGnqE8LJ0VVwf9vkqkbWE5An5PgxKoRepY&e=
I think) for expanding zip files.

Reply to this email directly or view it on GitHub
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_IQSS_dataverse_issues_357-23issuecomment-2D54965029&d=AAMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=MoES6dokjPLLcKaEAd7qaCuTcYZ4jLjEOBQnbbJ9BaA&m=KtwmOoqrJVmJKIgYLA5oSgxX8i2mH2ahRfpmGN4sMdo&s=llt0lcVQf5Nue-HcDsIDTkdna9t1Pcq-RdwiR3CWDB4&e=
.

@sbarbosadataverse
Copy link

While I agree with having distinct file names, please note that
contributors who have multiple data waves will use identical names and they
usually have documentation noting such names. The change then doesn't just
become one on the data file page, but also their documentation which
becomes a hassle for them. The readme files usually have the files names in
them. They may not be hip on being" forced" to change names
Cheers

On Mon, Sep 8, 2014 at 3:54 PM, landreev [email protected] wrote:

OK, in its current form, the ticket still doesn't say how exactly this
should be handled.
There's a discussion of different approaches; but it leaves it as an open
question.
I'd like to get this out of the way; so let's try and finalize it quickly:

Elizabeth, per your last comment:

  • Yes, I feel we should definitely check if the content (md5) is the

    same as another file;

    Yes, I like your analogy with the "song's already in playlist" popup

    from Spotify. I'll implement a similar warning.

    Not sure about duplicate filenames. I.e., I agree we shouldn't allow
    them. But not sure if we should bother showing a warning popup for that. In
    DVN 3.* we allow the upload to proceed, but modify the filename quietly,
    making it unique. If a user is uploading README.txt, and there's already a
    file with this name, we add it to the study as README-1.txt. (and if that
    already exists also, we try README-2.txt, etc.)

What do you think? If you would rather just have a warning, I could do
that too. (but that would give them a possibility to ignore the warning and
add a file with the same name but different content... - kinda sounds like
a mess - ?)


Reply to this email directly or view it on GitHub
#357 (comment).

landreev added a commit that referenced this issue Sep 9, 2014
If an uploaded file appears to be a duplicate of an existing file, *by content* (i.e., by md5),
a warning message will be displayed, and the matching md5s highlighted.
This way the user has an option of either canceling the entire upload, or checking the delete
chekboxes next to the files that don't want, before they hit 'save'; or just to proceed with
adding the files as they are - if they have some kind of a weird reason to have multiple identical
files in the dataset...
@landreev landreev assigned kcondon and unassigned landreev Sep 9, 2014
@landreev
Copy link
Contributor

landreev commented Sep 9, 2014

OK, for this week's beta push, this is going to be implemented as agreed upon yesterday:

  1. duplicate file names - un-duplicated automatically on upload (similarly to DVN 3.*; README.txt will become README-1.txt; if that already exists, README-2.txt, etc.). The user then has a chance to change the names back - if they have some kind of a weird reason to do so.

  2. files that are duplicate by content, i.e. same md5: the Add Files page will display a warning and the matching md5s will be highlighted. So the user gets a chance to delete the duplicates (through the delete checkboxes) or cancel the deposit altogether.

Phil, Akio: answering your question - in 4.0 the part above, where identical/already existing file names are modified until unique, is done in the IngestService (not in the Dataset page). Both the page and SWORD deposit call the service to process the files that are being uploaded. So the Deposit API will match the default behavior of the page - modified until unique.

If anybody has suggestions/ideas for the GUI part of this, we may revisit this in Beta 8, when/if the dataset page is rebuilt to switch to using search for file display.

@landreev
Copy link
Contributor

landreev commented Sep 9, 2014

Phil: yes, you should definitely switch to the ingest service method that supports the "file spawning model" - where an uploaded file may result in several datafiles created (currently supported cases are zip files and geo shape files).
However, even if you continue using the deprecated single-file method, everything still goes through the same Ingest Service pipeline; so the file name rule above will still be applied.

@kcondon
Copy link
Contributor

kcondon commented Sep 9, 2014

Basically works but doesn't detect duplicate filename if file was previously ingested as subsettable:

ingest 50by1000.dta, then upload again. Name isn't automatically changed to -1. We think this is because subsettable filename changes to .tab after ingest when it should compare subsequent uploads to original filename.

@landreev landreev assigned landreev and kcondon and unassigned kcondon and landreev Sep 9, 2014
@landreev
Copy link
Contributor

landreev commented Sep 9, 2014

Good find - yeah, the filename check wasn't working on tabular data files (because foobar.dta becomes foobar.tab once ingested!).
should be working now, please retest.

@eaquigley
Copy link
Contributor Author

@sbarbosadataverse Have you gotten complaints about this? According to @landreev, this is how it is being done in 3.* already so we won't be implementing a change.

@landreev
Copy link
Contributor

@eaquigley, @sbarbosadataverse:
we are implementing a change;
in 3.* we do not allow duplicate names at all; even as an option.
so, did you ever get any complaints, from people who really wanted to have files with identical names in the same study?

@kcondon
Copy link
Contributor

kcondon commented Sep 10, 2014

ok, seems to be working now. Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Feature a feature request
Projects
None yet
Development

No branches or pull requests

7 participants