[Task] research duplicated data in family names text #29

Open
OriHoch opened this issue Jun 19, 2017 · 5 comments

Comments

@OriHoch

OriHoch commented Jun 19, 2017

reproduction

  • search for DARI, DERI, DER'I, DEREHA, EDRY, EDERY, EDREHY
  • the search returns 7 results, each a different family name, all of them DERHI variants
  • look at the content (unit text) of those family names

expected

  • no duplicated content

actual

  • content is the same for all the 7 family names

implications of this bug

  • while this might not look like a bug, it could have consequences for the search engine
  • it might skew results and prevent the search engine from determining relevancy properly
  • we need to research this to determine whether it really is a problem and what can be done about it

TODO

  • look for more examples of this problem - are there more family names which have duplicated content?
  • do we have duplicated content in other collections?
  • research duplicated content in elasticsearch and how we can deal with it
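One option worth researching for the last item is Elasticsearch field collapsing, which returns a single top hit per distinct value of a field. The sketch below only builds a search request body as a plain dict. It assumes the cluster version supports the `collapse` search parameter and that each document is indexed with a keyword field holding a hash of its unit text; the field names `content_hash`, `title` and `content` are hypothetical and not taken from the real index mapping.

```python
def build_collapsed_query(text, hash_field="content_hash"):
    """Build a search body that returns at most one hit per distinct unit text.

    hash_field is a hypothetical keyword field holding a hash of the content.
    """
    return {
        # "title" and "content" are assumed field names, for illustration only.
        "query": {"multi_match": {"query": text, "fields": ["title", "content"]}},
        # Field collapsing keeps only the top-scoring hit per hash_field value.
        "collapse": {
            "field": hash_field,
            # inner_hits lists the other documents that share the same hash.
            "inner_hits": {"name": "same_text", "size": 10},
        },
    }
```

With a body like this, a search for DERI would show one result for the shared text, and the other name variants could still be listed from the inner hits.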
@TheGrandVizier

This scenario keeps resurfacing and, to my knowledge, cannot be fixed without Haim's consent and his blessing to remodel this content into something better.

The last time we spoke about this, he insisted that these are individual names and that each must have its own item page, unlike other cases where a merge is even preferable. There seems to be a difference (one I don't understand) between variants that can be merged and variants that may not be merged.

@OriHoch

OriHoch commented Jun 19, 2017

great, thanks, I think we can solve this on our side - the content is exactly the same and we can detect this during the sync process (or at some other stage).
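A minimal sketch of how exact duplicates could be detected, assuming each synced item is available as a dict with a `name` and a `content` key (these key names, and the helper names, are hypothetical and not the actual sync pipeline's schema). It fingerprints the unit text after collapsing whitespace and lowercasing, then groups names that share a fingerprint.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text):
    """Hash the unit text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(items):
    """Return lists of names whose content has the same fingerprint.

    items: iterable of dicts with hypothetical "name" and "content" keys.
    Only groups with more than one name are returned.
    """
    groups = defaultdict(list)
    for item in items:
        groups[content_fingerprint(item["content"])].append(item["name"])
    return [names for names in groups.values() if len(names) > 1]
```

The same fingerprint could be stored on each item during sync, so that later stages can decide what to do with a group without comparing full texts again.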

the question is what to do once we detect this duplication and what kind of problems this duplication poses

I guess these are the main problems I can see:

  • UX - people don't like to see duplicated content
  • SEO - Google doesn't like duplicated content
  • search engine - duplicated content can skew results and interfere with relevancy scoring

now we need to think about whether and how to fix it.

@OriHoch

OriHoch commented Jun 19, 2017

also, if this has come up before, it would be great to know what kind of problems it caused

@TheGrandVizier

Just flat out refusal to change anything on the BHP side of things, content-wise.
We dropped it at that.

@nuritgazit

Three things:

  1. I think Haim's issue was that in some cases there are (minor) differences between articles, while in others the text is the same. It would be helpful to get an estimate of the number of items in each group.
  2. We know for sure that there are many of them, but only in family names.
  3. In terms of product, we should aspire to allow 2 different people, one looking for "Deri" and the other for "Der'i", for example, to get to the unified item, without them having to guess that it's the same name.
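To get the estimate mentioned in the first point, something along these lines could be run over a sample of family name items. It is only a sketch: the `content` key, the function name and the 0.9 similarity threshold are assumptions, and it compares every pair of items, so it is quadratic and suited to a sample rather than the whole collection.

```python
from difflib import SequenceMatcher
from itertools import combinations


def classify_pairs(items, threshold=0.9):
    """Count item pairs with identical text and pairs with minor differences.

    items: list of dicts with a hypothetical "content" key.
    threshold: assumed similarity ratio above which two texts count as
    near-duplicates; it would need tuning on real data.
    """
    counts = {"identical": 0, "near_duplicate": 0}
    for a, b in combinations(items, 2):
        if a["content"] == b["content"]:
            counts["identical"] += 1
        elif SequenceMatcher(None, a["content"], b["content"]).ratio() >= threshold:
            counts["near_duplicate"] += 1
    return counts
```

The two counts correspond to the two groups above: names whose articles are exactly the same, and names whose articles differ only slightly.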
