[Task] research duplicated data in family names text #29

OriHoch · 2017-06-19T13:49:01Z

reproduction

search for DARI, DERI, DER'I, DEREHA, EDRY, EDERY, EDREHY
- you can use this prepared search on Kibana
the search returns 7 results, each a different family name, all based on DERHI variants
look at the content (unit text) of those family names

expected

no duplicated content

actual

content is the same for all the 7 family names

implications of this bug

while this might not look like a bug - it could have consequences for the search engine
it might skew results and prevent the search engine from determining relevancy properly
we need to research this to determine whether it really is a problem and what can be done about it

TODO

look for more examples of this problem - are there more family names which have duplicated content?
do we have duplicated content in other collections?
research duplicated content in elasticsearch and how we can deal with it


TheGrandVizier · 2017-06-19T13:53:18Z

This is a scenario that keeps resurfacing and to my knowledge cannot be fixed without the consent of Haim and his blessing on remodeling this content into something better.

Last we spoke about this subject he insisted these are individual names that must each have their own item page, unlike others where a merge is even preferable. There seems to be a difference (that I don't understand) between varieties that can be merged and varieties that may not be merged.

OriHoch · 2017-06-19T13:58:04Z

great, thanks, I think we can solve this on our side - the content is exactly the same and we can detect this during the sync process (or at some other stage).
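A minimal sketch of how exact-duplicate detection might look, assuming each item is available as a (title, text) pair at the point where we run the check. The function names and the whitespace/case normalization are assumptions for illustration, not existing sync code:

```python
import hashlib
from collections import defaultdict


def normalize(text):
    # Assumption: differences in whitespace and letter case do not count
    # as "different content". Drop this step for strict byte equality.
    return " ".join(text.split()).lower()


def find_duplicate_groups(items):
    """Group item titles whose normalized text is identical.

    `items` is an iterable of (title, text) pairs (hypothetical shape).
    Returns a list of title lists, one per group with 2+ members.
    """
    groups = defaultdict(list)
    for title, text in items:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(title)
    return [titles for titles in groups.values() if len(titles) > 1]
```

This only catches text that is identical after normalization; items with minor differences between articles would need a fuzzier comparison.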

the question is what to do once we detect this duplication and what kind of problems this duplication poses

I guess these are the main problems I can see:

UX - people don't like to see duplicated content
SEO - google doesn't like duplicated content
search engine - search engines don't like duplicated content (e.g. it can skew results and interfere with relevancy)

now we need to think about whether and how to fix it.

OriHoch · 2017-06-19T13:59:06Z

also, if this came up before, it would be great to know what kind of problems it caused

TheGrandVizier · 2017-06-19T14:04:02Z

Just flat out refusal to change anything on the BHP side of things, content-wise.
We dropped it at that.

nuritgazit · 2017-06-19T15:05:40Z

Two things:

I think Haim's issue was that in some cases there are (minor) differences between articles, while in others the text is the same. It would be helpful to get an estimate of the number of items in each group.
we know for sure that there are many of them, but only in family names
In terms of product, we should aspire to allow 2 different people, one looking for "Deri" and the other for "Der'i", for example, to get to the unified item, without them having to guess that it's the same name.
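One way the product goal could be sketched, assuming a unified item carries a list of its spelling variants. The item id, the function names and the normalization rule (uppercase, letters only) are all hypothetical:

```python
def normalize_name(name):
    # Assumption: case and punctuation such as apostrophes should not
    # distinguish spellings, so "Der'i" and "DERI" share one key.
    return "".join(ch for ch in name.upper() if ch.isalpha())


def build_alias_index(unified_items):
    """Map every normalized spelling variant to its unified item id.

    `unified_items` is a dict of item id -> list of variants
    (hypothetical shape).
    """
    index = {}
    for item_id, variants in unified_items.items():
        for name in variants:
            index[normalize_name(name)] = item_id
    return index


def lookup(index, query):
    # Returns the unified item id, or None if the name is unknown.
    return index.get(normalize_name(query))
```

With an index like this, a search for "Deri" and a search for "Der'i" would both land on the same item, while the individual spellings stay listed on that item.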
