multiple users reporting duplicate images while signed in #1069

Closed
vrooje opened this issue Jul 4, 2015 · 65 comments

@vrooje

vrooje commented Jul 4, 2015

Our 2 moderators on GZ Bars have both reported seeing duplicate images. I'm going to check the subject list for duplicates and will update but just wanted to flag this in case it's related to #787 or something else that isn't duplicate subjects.

@vrooje
Author

vrooje commented Jul 6, 2015

Update: I've just checked the metadata in the subject download from the project and I can't find any duplicates in the image name, so I don't think this is a duplicate-subject issue.

@camallen
Contributor

camallen commented Jul 6, 2015

Any talk links that indicate they have seen duplicate images?

@vrooje
Author

vrooje commented Jul 6, 2015

I don't think so, yet... the mods posted on Talk but didn't link to specific images. Will ask them to do so in the future.

@camallen
Contributor

@vrooje closing for now, please re-open if you have more info.

@vrooje
Author

vrooje commented Jul 14, 2015

https://www.zooniverse.org/#/projects/vrooje/galaxy-zoo-bar-lengths/talk/21/280?page=1&comment=1515

From @Capella05:

Just got this image to classify for the second time - 465197

@vrooje vrooje reopened this Jul 14, 2015
@vrooje
Author

vrooje commented Jul 14, 2015

Also found the subject comment thread, which I suspect is redundant but am including it just in case.

@vrooje
Author

vrooje commented Jul 16, 2015

I've now got more information about this:

Strangely I don't have a duplicate in the classification export for Capella05 on subject id 465197, despite the fact that my export was requested more than 24 hours after the reported duplicate.

I do have a different duplicate for that user, however:

username subject_id metadata
Capella05 {"464985": {"started_at":"2015-07-01T19:23:02.756Z","user...
Capella05 {"464985": {"started_at":"2015-07-07T19:16:23.117Z","user...

And I have a total of 159 duplicates from various users. That's about 1% of the classifications. Examples (with dummy usernames):

thisuser {"458602": {"started_at":"2015-06-01T16:57:55.437Z","user...
thisuser {"458591": {"started_at":"2015-06-01T16:58:00.380Z","user...
thisuser {"458584": {"started_at":"2015-06-01T16:58:05.496Z","user...
thisuser {"458550": {"started_at":"2015-06-01T16:58:10.840Z","user...
[11 more classifications of non-duplicate subjects]
thisuser {"458602": {"started_at":"2015-06-01T16:59:23.785Z","user...
[2 non-duplicate classifications]
thisuser {"458591": {"started_at":"2015-06-01T16:59:40.418Z","user...
thisuser {"458584": {"started_at":"2015-06-01T16:59:44.218Z","user...
[2 non-duplicate classifications]
thisuser {"458550": {"started_at":"2015-06-01T16:59:55.702Z","user...

thisuser has done 30 classifications total. And, more recently, from a user who has done 18 classifications total:

user_name subject_id metadata
[1 non-dup classification]
thatuser {"464953": {"started_at":"2015-07-02T20:39:11.600Z","user...
thatuser {"464954": {"started_at":"2015-07-02T20:39:40.693Z","user...
thatuser {"464955": {"started_at":"2015-07-02T20:40:10.855Z","user...
thatuser {"464956": {"started_at":"2015-07-02T20:40:30.784Z","user...
thatuser {"464957": {"started_at":"2015-07-02T20:40:38.595Z","user...
thatuser {"464958": {"started_at":"2015-07-02T20:40:42.172Z","user...
thatuser {"464959": {"started_at":"2015-07-02T20:40:45.651Z","user...
[3 non-dup classifications]
thatuser {"464953": {"started_at":"2015-07-02T20:41:05.295Z","user...
thatuser {"464954": {"started_at":"2015-07-02T20:41:27.165Z","user...
thatuser {"464955": {"started_at":"2015-07-02T20:41:33.603Z","user...
thatuser {"464956": {"started_at":"2015-07-02T20:41:47.336Z","user...
thatuser {"464957": {"started_at":"2015-07-02T20:41:51.548Z","user...
thatuser {"464958": {"started_at":"2015-07-02T20:41:56.526Z","user...
thatuser {"464959": {"started_at":"2015-07-02T20:42:00.374Z","user...

I also have duplicates from users who have done many more classifications overall, though not proportionally more duplicates. And not everyone who has done hundreds of classifications has done duplicate classifications.
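
For anyone wanting to run the same check on their own export, here is a minimal sketch of how duplicates like the ones above could be found. It assumes each classification has already been reduced to a (user, subject_id, started_at) tuple; the function name `find_duplicates` and that row layout are illustrative assumptions, not the actual export format.

```python
from collections import defaultdict


def find_duplicates(rows):
    """Group classifications by (user, subject_id) and keep repeated pairs.

    `rows` is an iterable of (user, subject_id, started_at) tuples. This
    layout is an assumption for illustration; a real export needs parsing first.
    """
    seen = defaultdict(list)
    for user, subject_id, started_at in rows:
        seen[(user, subject_id)].append(started_at)
    # Keep only the pairs that were classified two or more times.
    return {key: sorted(times) for key, times in seen.items() if len(times) > 1}


# Rows taken from the "thisuser" excerpt above, plus one made-up single view.
rows = [
    ("thisuser", "458602", "2015-06-01T16:57:55.437Z"),
    ("thisuser", "458591", "2015-06-01T16:58:00.380Z"),
    ("thisuser", "458602", "2015-06-01T16:59:23.785Z"),
    ("thisuser", "458591", "2015-06-01T16:59:40.418Z"),
    ("thisuser", "999999", "2015-06-01T17:00:00.000Z"),  # hypothetical non-duplicate
]
duplicates = find_duplicates(rows)
```

Here `duplicates` maps each repeated (user, subject_id) pair to its sorted timestamps, so triplicates show up as lists of length three.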

@vrooje
Author

vrooje commented Jul 16, 2015

More details:

  • Some of the duplicates are from not-logged-in users, but most of them are from logged-in users.
  • The duplicates sometimes span days, and I have also seen them ~60s apart and once 14s apart (re: repeat classifications for logged-in users when still other data to classify #1128)
  • Some duplicates are actually triplicates. In that case they are often, but not always, minutes or seconds apart. (I saw one where the first and second classifications were 1 hour apart and the second and third were 2 minutes apart.)
  • Many of the duplicates (about a third) have the same subject across multiple users, i.e. logged-in users A and B both classified subject 99 twice. In those cases the users' first viewing of the same subject isn't always on the same day, e.g. user B classified subject 99 for the first time 25 hours after user A classified subject 99 for the first time.
  • There is an example where 2 logged-in users on different OSs and classifying about 3 days apart are repeating the same subjects. One did 18 classifications ("thatuser" above) and the other did 14. They each repeated the same 6 subjects.

I've isolated the duplicates from the full export and grouped them together and I'm happy to send that over if the full data would be helpful.
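
As a rough illustration of how the gaps between repeat classifications can be measured, here is a small sketch that parses the `started_at` strings shown in the export excerpts. The helper name `gap_seconds` is made up for this example, and it assumes every timestamp has the same millisecond-precision UTC format as the ones quoted in this thread.

```python
from datetime import datetime


def gap_seconds(first, second):
    """Return the number of seconds between two ISO-8601 UTC timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumed format, matching the excerpts
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return delta.total_seconds()


# Timestamps copied from the "thisuser" excerpt for subject 458602.
short_gap = gap_seconds("2015-06-01T16:57:55.437Z", "2015-06-01T16:59:23.785Z")

# Timestamps copied from the Capella05 excerpt for subject 464985.
long_gap_days = gap_seconds(
    "2015-07-01T19:23:02.756Z", "2015-07-07T19:16:23.117Z"
) / 86400
```

The first pair is roughly a minute and a half apart and the second is just under six days apart, which matches the spread described in the list above.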

@mschwamb

Duplicates for non-logged-in users are a won't-fix issue; see the conversation in #1127. That's how the system was designed.

The 60s thing on my ticket #1128 isn't important, there are classifications made in between.

Merging in the info from my ticket:

I've got a very recent one: subject 487143, user dweilant, at 2015-07-12 23:43:49 and 2015-07-12 23:45:07. This is after the fix for correctly accounting for the subjects went live. This user has seen almost 50% of the live subjects, if that matters, but that still leaves about 4000 subjects to select from.

A user at the other end of the classifier spectrum from the one above: Audriusa, for subject 484145, at 2015-07-12 07:24:10 and 2015-07-12 07:25:32.

I notice the same thing for nathalieg69 on subject 484066, at 2015-07-01 03:55:37 and 2015-07-01 04:05:37.

Worth noting this is still occurring after the fix for updating subject classification counts/retirement

@camallen
Contributor

@vrooje @mschwamb just letting you know we are working on this.

@mschwamb

Thanks @camallen

@camallen
Contributor

Linked to #1176

Update: we've had no luck tracing this to any directly implemented behaviour in the code, but it could be the result of message ordering between clients and busy / fast API endpoints, where subjects are not dequeued before the next request for subjects comes in.

We will be modifying the dequeue behaviour to happen directly after you've been served subjects and not wait till you submit a classification. This should make any race condition harder to come by as the timing between these "race" messages will be much greater.
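
To make the suspected timing problem concrete, here is a toy sketch of the two dequeue strategies. It is not the Panoptes code; the `SubjectQueue` class, its methods and the `dequeue_on_serve` flag are all hypothetical names used only to show why a second request that arrives before the first classification is submitted can be handed the same subjects.

```python
class SubjectQueue:
    """Toy per-user queue, used only to illustrate the timing issue."""

    def __init__(self, subject_ids, dequeue_on_serve):
        self.pending = list(subject_ids)
        self.dequeue_on_serve = dequeue_on_serve

    def serve(self, count):
        """Hand out the next `count` subjects."""
        batch = self.pending[:count]
        if self.dequeue_on_serve:
            # Proposed behaviour: remove subjects as soon as they are served.
            self.pending = self.pending[count:]
        return batch

    def classify(self, subject_id):
        """Record a classification; the older behaviour dequeues only here."""
        if subject_id in self.pending:
            self.pending.remove(subject_id)


def two_requests_before_classifying(dequeue_on_serve):
    """Simulate a second subject request arriving before any classification."""
    queue = SubjectQueue([1, 2, 3, 4], dequeue_on_serve)
    first = queue.serve(2)
    second = queue.serve(2)  # arrives before the first batch is classified
    return sorted(set(first) & set(second))


overlap_old = two_requests_before_classifying(dequeue_on_serve=False)
overlap_new = two_requests_before_classifying(dequeue_on_serve=True)
```

In this sketch the two batches overlap when dequeuing waits for the classification and do not overlap when dequeuing happens on serve.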

I'm going to leave this open. If any more duplicate reports come in (especially after #1176 is fixed) then please make us aware on this issue.

@mschwamb

Thanks for the update, and thanks for working on tracking it down. @camallen If you can tell me a time stamp when this fix should be online, I'll check it over the next few days after that and report back on P4: Terrains

@vrooje
Author

vrooje commented Jul 23, 2015

Thanks - will keep you posted if this keeps happening. If all we can do is minimize it instead of fix it we should make sure @ggdhines knows to explicitly check for duplicates in the data aggregation phase.
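
One possible shape for such a check in the aggregation phase is sketched below: keep only the earliest classification per (user, subject) pair and ignore later repeats. This is an illustration under assumed key names ("user", "subject_id", "started_at") and a made-up function name, not a description of what the aggregation code actually does.

```python
def drop_repeat_classifications(rows):
    """Keep only the earliest classification for each (user, subject_id) pair.

    `rows` is an iterable of dicts with "user", "subject_id" and "started_at"
    keys; these key names are assumptions made for this sketch. Timestamps are
    assumed to share one ISO-8601 format, so string order equals time order.
    """
    kept = {}
    for row in sorted(rows, key=lambda r: r["started_at"]):
        key = (row["user"], row["subject_id"])
        kept.setdefault(key, row)  # later repeats of the same pair are ignored
    return list(kept.values())


# Rows based on the "thatuser" excerpt earlier in this issue.
rows = [
    {"user": "thatuser", "subject_id": "464953", "started_at": "2015-07-02T20:39:11.600Z"},
    {"user": "thatuser", "subject_id": "464954", "started_at": "2015-07-02T20:39:40.693Z"},
    {"user": "thatuser", "subject_id": "464953", "started_at": "2015-07-02T20:41:05.295Z"},
]
unique_rows = drop_repeat_classifications(rows)
```

Whether the first or the last classification should win is a science-team decision; the sketch keeps the first.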

@camallen
Contributor

@vrooje pretty sure he already is as we always serve subjects to users in panoptes (we mark them seen / retired, etc) but allow them to keep classifying.

@mschwamb

@camallen though that status is not marked in the raw csv dump - if duplicates and retired images seen again can be marked that would be something handy to include in the csv for those doing their data reduction

@camallen
Contributor

@mschwamb can you open a separate issue for this? Perhaps reopen #1086 after reading zooniverse/Panoptes-Front-End#368

@mschwamb

@camallen I can't reopen. I'm not a collaborator on this repo. If the powers that be can add me I can start doing that, or if you want to reopen it I'll comment.

@camallen
Contributor

Just comment and @ mention me

@camallen
Contributor

@mschwamb @vrooje code has been deployed about 18:00 today BST. Let's see how it goes for the next few days.

@mschwamb

@camallen Still happening, I think. I'm just looking at duplicates from 2015-07-30 00:00:00 onward on P4T:

Uganalandia saw subject 491799 at 2015-07-30 12:11:50 and 2015-07-30 12:14:46

gaga7 saw subject 491810 at 2015-07-30 15:41:52 and 2015-07-30 16:08:10

4thplanet4444 saw subject 486319 at 2015-07-31 07:39:39 and 2015-07-31 08:25:08

@camallen
Contributor

I'm going to close this issue as I think we've got this sorted. In summary our queuing code had some bugs and some legacy use case was causing the queues to grow very very large, which allowed dups in.

If anyone finds duplicate classifications happening since the latest date quoted above please comment here / re-open the issue asap.

@chrislintott
Member

+1

On 18 Aug 2015, at 06:04, Campbell Allen wrote:

Closed #1069.

@vrooje
Author

vrooje commented Aug 28, 2015

Sadly, I have to re-open this. There were a handful of duplicates in GZ:BL between the closing of this issue and the sending of a newsletter to recruit people to GZ:BL, but since the recruitment there have been about 1,150, out of 36,000, so a duplicate percentage of about 3%.

800 of the duplicates were from not-logged-in users. I looked into this because another top user reported duplicates, but (as with Capella05's report previously) I couldn't find any recent duplicates in the database from this user.

Happy to provide more info as needed...

@vrooje vrooje reopened this Aug 28, 2015
@camallen
Contributor

camallen commented Sep 2, 2015

@vrooje the 800 from non-logged-in users are by design in the API. I've got an issue open in the front end to be smarter about ignoring these: zooniverse/Panoptes-Front-End#1427.

For the remaining 350 I'll need some extra information. Are these all from power users in the long tail? Since commit 8693854 we may get to a situation where we have nothing in the queue and just select something to show the user. They should have seen a banner saying they've seen this before, though. It seems there are still some bugs to iron out on this one.

@CKrawczyk

There were two more users on Talk who have reported duplicate images:

https://www.zooniverse.org/talk/18/115/?comment=19991
https://www.zooniverse.org/talk/18/115/?comment=22586

@bruggsy

bruggsy commented Oct 14, 2015

Also been noticing this problem when annotating any subject set with more than 10 members, while logged in.

@camallen
Contributor

@bruggsy, do you have reports, Talk subject IDs, or screenshots? Can you please confirm that these are real duplicates and not the expected behaviour, that is, the API will still return a set of subjects once you have classified all of them. The client should mark the images as retired / seen before in the browser, e.g.

seen_before

@vrooje
Author

vrooje commented Oct 29, 2015

Still getting occasional comments from users about duplicates. I think this is due to a failure to submit the classification the first time rather than a duplicate registered classification, but this still makes me uncomfortable because we don't really know how often this is happening, right?

@DarrenMcRoy

Oh, so this isn't just a WildCam issue? zooniverse/wildcam-gorongosa#192

@bruggsy

bruggsy commented Oct 29, 2015

@camallen Sorry for not responding, been busy with other projects. Not sure what you mean exactly, when I do an API request for a subject set / single subject I don't see those fields. The subjects I was having problems with were 1037498 - 1037697, and subject sets 2422 - 2440. The exact problem was getting "already seen" subjects before I had seen the whole set, as far as I can tell I wasn't getting the same image multiple times with no banner the latter times.

@camallen
Contributor

@vrooje I'm looking into the dup reports for GZ Bars.

@bruggsy the system should show you unseen, unretired images first, then fall back to unseen retired ones; once you have completed them all, it will just pick some at random to show you. Can you please check again and provide me with some workflow / subject set IDs to reproduce this for your account?
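
A simplified sketch of that fallback order, for reference when checking reports against expected behaviour. This is an illustration only, not the Panoptes selection code: the function name, the `"id"` / `"retired"` keys and the `seen_ids` set are assumed for the example, and unlike a real selector it does not top up a short batch from the next tier.

```python
import random


def select_subjects(subjects, seen_ids, count, rng=random):
    """Pick up to `count` subjects following the fallback order described above.

    `subjects` is a list of dicts with "id" and "retired" keys (assumed names).
    `seen_ids` is the set of subject ids this user has already classified.
    """
    unseen = [s for s in subjects if s["id"] not in seen_ids]
    unseen_active = [s for s in unseen if not s["retired"]]
    unseen_retired = [s for s in unseen if s["retired"]]

    if unseen_active:
        pool = unseen_active
    elif unseen_retired:
        pool = unseen_retired
    else:
        # Everything has been seen: fall back to a random pick, which the
        # client is expected to flag as already seen.
        pool = list(subjects)
    return rng.sample(pool, min(count, len(pool)))


# Hypothetical three-subject set: subject 2 is retired.
subjects = [
    {"id": 1, "retired": False},
    {"id": 2, "retired": True},
    {"id": 3, "retired": False},
]
next_batch = select_subjects(subjects, seen_ids={1}, count=5)
```

Under this model, an "already seen" subject should only appear once both unseen pools are empty, which is the condition the reports above appear to violate.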

@bruggsy

bruggsy commented Nov 4, 2015

@camallen Sure, the workflow is 861 and the subject set associated with it is 2592, project is 292.

@srallen
Contributor

srallen commented Dec 1, 2015

I set up a project today for Intro2Astro using a very small subject set with 22 subjects, and @JulieAnnKU reported this issue. She was served a subject with the already seen flag before having seen all 22. Project id 1616, subject set id 2754, workflow id 1064.

@camallen
Contributor

camallen commented Dec 4, 2015

I've got another lead on real duplicates for users. It seems some of the background workers haven't been running and creating the required tracking details for what users have seen. Linked to #1517: ensuring that these workers run as soon as possible after a failure decreases the window for duplicates to emerge.

@mschwamb

mschwamb commented Jan 2, 2016

@aliburchard

@camallen this seems to still be popping up on Wildcam Gorongosa - had a whole bunch reported mid/late December, and at least one person reported this week: https://www.zooniverse.org/projects/zooniverse/wildcam-gorongosa/talk/79/7125?page=3

@camallen
Contributor

thanks, there was a bug showing the proper already seen / retired tags from the API that got fixed in SGL. Also some of the issues in SGL would be leading to duplicates showing under heavy api loads.

We've got new code out to fix these and some more to get out that will hopefully alleviate this.

@aliburchard

@camallen
Contributor

camallen commented Feb 2, 2016

Hmm - seems the messaging about already seen / retired was removed at the end of last month. See this, zooniverse/Panoptes-Front-End#2170. I can only hope we aren't messaging correctly.

I'm checking this user's details to see what's happening.

@mschwamb

This seems to still be happening - example on Planet Four Terrains user_id number 1328372 classified subject 1328372 twice within a ten minute span. This was on February 9th.

@mschwamb

mschwamb commented Apr 2, 2016

This is still happening. A volunteer is seeing the 'already seen' flag on and off, but there are ~6000 new images on Comet Hunters. See the volunteer's description here

@camallen
Contributor

Closing this in favour of #1640 since we "are" labelling the duplicates as already seen. Please put all reports onto the "Already Seens" issue from now on.
