"Owned by a self-identified native" search criterion throws an error for English #2166

Closed
ckjpn opened this issue Feb 26, 2020 · 5 comments · Fixed by #2173
Labels
regression: Issue that describes a bug for a feature that used to work just fine.

Comments


ckjpn commented Feb 26, 2020

The Problem

For at least some advanced searches, the "Owned by a self-identified native" criterion produces an error message.

Experiments: Things you can try

Here are several searches with only minor differences, results sorted oldest first.
Only the first one results in an error message.

Query: liquor|brandy|ale|absinthe|daiquiri|margarita|sangria|wine|tea|soda|smoothie|milkshake|milk|lemonade|juice|coffee|espresso|cappuccino|cocoa|grog|cola|beer|whiskey|bourbon|tequila|rum|cocktail|cider|martini|vodka|gin|"white russian"|"bloody mary"|"tom collins"|"hot chocolate"|"piña colada"|"soft drink"|"soda water"|"black cow"|"mint julep"|"egg nog"|"tonic water"|"mineral water"

Limited to native speakers ** This one gets an error **

Search error
Invalid query. Please refer to the search documentation for more details.

https://tatoeba.org/eng/sentences/search?query=liquor%7Cbrandy%7Cale%7Cabsinthe%7Cdaiquiri%7Cmargarita%7Csangria%7Cwine%7Ctea%7Csoda%7Csmoothie%7Cmilkshake%7Cmilk%7Clemonade%7Cjuice%7Ccoffee%7Cespresso%7Ccappuccino%7Ccocoa%7Cgrog%7Ccola%7Cbeer%7Cwhiskey%7Cbourbon%7Ctequila%7Crum%7Ccocktail%7Ccider%7Cmartini%7Cvodka%7Cgin%7C%22white+russian%22%7C%22bloody+mary%22%7C%22tom+collins%22%7C%22hot+chocolate%22%7C%22pi%C3%B1a+colada%22%7C%22soft+drink%22%7C%22soda+water%22%7C%22black+cow%22%7C%22mint+julep%22%7C%22egg+nog%22%7C%22tonic+water%22%7C%22mineral+water%22&from=eng&to=und&user=&orphans=no&unapproved=no&has_audio=&tags=&list=&native=yes&trans_filter=limit&trans_to=und&trans_link=&trans_user=&trans_orphan=&trans_unapproved=&trans_has_audio=&sort=created&sort_reverse=yes

Limited to List 907 (1,000 results out of 4,499 occurrences)

https://tatoeba.org/eng/sentences/search?query=liquor%7Cbrandy%7Cale%7Cabsinthe%7Cdaiquiri%7Cmargarita%7Csangria%7Cwine%7Ctea%7Csoda%7Csmoothie%7Cmilkshake%7Cmilk%7Clemonade%7Cjuice%7Ccoffee%7Cespresso%7Ccappuccino%7Ccocoa%7Cgrog%7Ccola%7Cbeer%7Cwhiskey%7Cbourbon%7Ctequila%7Crum%7Ccocktail%7Ccider%7Cmartini%7Cvodka%7Cgin%7C%22white+russian%22%7C%22bloody+mary%22%7C%22tom+collins%22%7C%22hot+chocolate%22%7C%22pi%C3%B1a+colada%22%7C%22soft+drink%22%7C%22soda+water%22%7C%22black+cow%22%7C%22mint+julep%22%7C%22egg+nog%22%7C%22tonic+water%22%7C%22mineral+water%22&from=eng&to=und&user=&orphans=no&unapproved=no&has_audio=&tags=&list=907&native=&trans_filter=limit&trans_to=und&trans_link=&trans_user=&trans_orphan=&trans_unapproved=&trans_has_audio=&sort=created&sort_reverse=yes

Has Audio (1,000 results out of 3,069 occurrences)

https://tatoeba.org/eng/sentences/search?query=liquor%7Cbrandy%7Cale%7Cabsinthe%7Cdaiquiri%7Cmargarita%7Csangria%7Cwine%7Ctea%7Csoda%7Csmoothie%7Cmilkshake%7Cmilk%7Clemonade%7Cjuice%7Ccoffee%7Cespresso%7Ccappuccino%7Ccocoa%7Cgrog%7Ccola%7Cbeer%7Cwhiskey%7Cbourbon%7Ctequila%7Crum%7Ccocktail%7Ccider%7Cmartini%7Cvodka%7Cgin%7C%22white+russian%22%7C%22bloody+mary%22%7C%22tom+collins%22%7C%22hot+chocolate%22%7C%22pi%C3%B1a+colada%22%7C%22soft+drink%22%7C%22soda+water%22%7C%22black+cow%22%7C%22mint+julep%22%7C%22egg+nog%22%7C%22tonic+water%22%7C%22mineral+water%22&from=eng&to=und&user=&orphans=no&unapproved=no&has_audio=yes&tags=&list=&native=&trans_filter=limit&trans_to=und&trans_link=&trans_user=&trans_orphan=&trans_unapproved=&trans_has_audio=&sort=created&sort_reverse=yes

No limits (1,000 results out of 7,870 occurrences)

https://tatoeba.org/eng/sentences/search?query=liquor%7Cbrandy%7Cale%7Cabsinthe%7Cdaiquiri%7Cmargarita%7Csangria%7Cwine%7Ctea%7Csoda%7Csmoothie%7Cmilkshake%7Cmilk%7Clemonade%7Cjuice%7Ccoffee%7Cespresso%7Ccappuccino%7Ccocoa%7Cgrog%7Ccola%7Cbeer%7Cwhiskey%7Cbourbon%7Ctequila%7Crum%7Ccocktail%7Ccider%7Cmartini%7Cvodka%7Cgin%7C%22white+russian%22%7C%22bloody+mary%22%7C%22tom+collins%22%7C%22hot+chocolate%22%7C%22pi%C3%B1a+colada%22%7C%22soft+drink%22%7C%22soda+water%22%7C%22black+cow%22%7C%22mint+julep%22%7C%22egg+nog%22%7C%22tonic+water%22%7C%22mineral+water%22&from=eng&to=und&user=&orphans=no&unapproved=no&has_audio=&tags=&list=&native=&trans_filter=limit&trans_to=und&trans_link=&trans_user=&trans_orphan=&trans_unapproved=&trans_has_audio=&sort=created&sort_reverse=yes
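
To see which parameters distinguish the failing search from a working one, the URLs can be compared programmatically. The following is a minimal sketch using Python's standard library; the helper name `param_diff` and the shortened `query=tea%7Ccoffee` URLs are assumptions made for readability, not the exact URLs listed above.

```python
from urllib.parse import parse_qs, urlsplit


def param_diff(url_a, url_b):
    """Return the query parameters whose values differ between two URLs."""
    a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Abbreviated stand-ins for the search URLs (hypothetical short query).
base = "https://tatoeba.org/eng/sentences/search?query=tea%7Ccoffee&from=eng&to=und"
failing = base + "&has_audio=&list=&native=yes"
working = base + "&has_audio=yes&list=&native="

diff = param_diff(failing, working)
print(diff)
```

In this sketch only `has_audio` and `native` show up in the difference, which matches the observation that the error is tied to the `native=yes` parameter.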

I purposely avoided the "relevance" option because of #1895

jiru added the bug label Feb 26, 2020
Member

jiru commented Feb 26, 2020

I can only reproduce this issue on tatoeba.org, but not on dev.tatoeba.org or locally.

The search daemon returns the following error:

invalid attribute 'user_id' set length 4121 (should be in 0..4096 range)

The "owned by a self-identified native" criterion works by building a list of the native speakers of the searched language and filtering on all of those user ids. I think the number of native English speakers on Tatoeba has crossed a Manticore limit (4096 values per filter).
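
The failure mode can be illustrated with a small Python sketch. It does not call Manticore; `MAX_FILTER_VALUES`, `build_native_filter`, the filter dictionary layout, and the sample data are all hypothetical, and the check only mimics the range reported in the daemon's error message.

```python
MAX_FILTER_VALUES = 4096  # upper bound quoted in the daemon's error message


def build_native_filter(users, lang):
    """Collect the ids of users who declare `lang` as a native language.

    `users` is an iterable of (user_id, native_langs) pairs; this data
    shape is an assumption made for the sketch.
    """
    user_ids = [uid for uid, langs in users if lang in langs]
    if len(user_ids) > MAX_FILTER_VALUES:
        # Mimics the daemon rejecting an oversized value list.
        raise ValueError(
            f"invalid attribute 'user_id' set length {len(user_ids)} "
            f"(should be in 0..{MAX_FILTER_VALUES} range)"
        )
    return {"attribute": "user_id", "values": user_ids, "exclude": False}


# 4121 synthetic English natives reproduce the reported set length.
users = [(uid, {"eng"}) for uid in range(1, 4122)]
try:
    build_native_filter(users, "eng")
    error = None
except ValueError as exc:
    error = str(exc)
```

With 4096 or fewer ids the same call succeeds, which would be consistent with the problem not being reproducible on instances that have fewer native English speakers.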

Author

ckjpn commented Feb 26, 2020

> I think the number of native English speakers on Tatoeba crossed some limit of Manticore (4096).

Is there a possibility of limiting this list to only those who actually own sentences?
That might solve the problem.

Or, is that something that is already being done?
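
The suggestion amounts to intersecting the list of natives with the set of sentence owners. A minimal sketch, assuming both are available as plain lists of user ids (the function name and sample ids are hypothetical):

```python
def natives_owning_sentences(native_ids, sentence_owner_ids):
    """Keep only the native speakers who own at least one sentence."""
    owners = set(sentence_owner_ids)
    return sorted(uid for uid in set(native_ids) if uid in owners)


native_ids = [1, 2, 3, 4, 5]
sentence_owner_ids = [2, 2, 5, 9]  # one entry per sentence, duplicates allowed
filtered = natives_owning_sentences(native_ids, sentence_owner_ids)
print(filtered)
```

This shortens the list but does not remove the limit itself: the filtered list can still grow past 4096 as more natives contribute.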

Member

jiru commented Feb 26, 2020

> Is there a possibility of limiting this list to only those who actually own sentences?

Yes.

> Or, is that something that is already being done?

No.

Another, maybe more scalable solution is to split the list of users into slices of 4096 and to send a batch of search queries to the daemon, each with a different slice, and group the results. If I remember correctly, the Manticore API makes it easy to do things like that.
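
The slicing idea can be sketched as follows. This is not Manticore API code: `run_query` is a hypothetical callable standing in for one search request, and the merging happens on the client side, without handling sorting or pagination across batches.

```python
def slices(values, size=4096):
    """Yield consecutive slices of at most `size` values."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def search_in_slices(run_query, user_ids, size=4096):
    """Run one query per slice of user ids and merge the results."""
    seen = set()
    merged = []
    for chunk in slices(user_ids, size):
        for sentence_id in run_query(chunk):
            if sentence_id not in seen:
                seen.add(sentence_id)
                merged.append(sentence_id)
    return merged


# Toy index: sentence id -> owner id (made-up data).
owners = {10: 1, 11: 5000, 12: 4097, 13: 7}


def fake_query(chunk):
    """Stand-in for a search restricted to the given owner ids."""
    allowed = set(chunk)
    return [sid for sid, owner in owners.items() if owner in allowed]


result = search_in_slices(fake_query, list(range(1, 4200)))
```

Here the 4199 ids are split into two slices, and sentences owned by users from either slice end up in the merged result.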

Or, we can perform the native check at indexing time (instead of at query time) and store the result as a new attribute, which means we would also have to live-update this attribute.
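
A rough sketch of the indexing-time alternative, assuming each sentence is indexed as a dictionary and that a mapping from user id to declared native languages is available. The attribute name `owner_is_native` and the helper `index_row` are hypothetical.

```python
def index_row(sentence, native_langs_by_user):
    """Build the row to index, adding a precomputed native-owner flag."""
    langs = native_langs_by_user.get(sentence["user_id"], set())
    return {**sentence, "owner_is_native": sentence["lang"] in langs}


native_langs_by_user = {1: {"eng"}, 2: {"fra"}}
sentences = [
    {"id": 100, "lang": "eng", "user_id": 1},
    {"id": 101, "lang": "eng", "user_id": 2},
    {"id": 102, "lang": "eng", "user_id": None},  # orphan sentence
]
rows = [index_row(s, native_langs_by_user) for s in sentences]
flags = [row["owner_is_native"] for row in rows]
```

The search would then filter on a single boolean attribute instead of a long id list. The cost is that the flag goes stale whenever a sentence changes owner or a user edits their declared languages, hence the need for live updates.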

jiru changed the title on Feb 26, 2020
From: At least for some advanced searches, "Owned by a self-identified native" gets an error message.
To: "Owned by a self-identified native" search criterion throws an error for English
Author

ckjpn commented Feb 27, 2020

The last time I checked, there were only 5,893 identified native speakers who owned sentences.

http://tatoeba.ueuo.com/stats-200118.html

My numbers may differ from what's actually on the website for the following reasons.

  1. I have native language data on members who contributed before tatoeba.org had a special place to register this data.

  2. I also filter out those who claim multiple native languages, unless they've responded to me, telling me what their strongest or native language is.

jiru added the regression label and removed the bug label Feb 29, 2020
jiru added a commit that referenced this issue Mar 1, 2020
This fixes the error message when #2166 occurs:
it should be "search error" instead of "syntax error".
jiru added a commit that referenced this issue Mar 1, 2020
Closes #2166.

Work around Manticore filter values limitation.
When the number of natives is greater than 4096, Manticore
throws an error. To avoid this, we filter by excluding
non-natives instead. This is possible because filters are
combined with a boolean AND operation, so we can create
multiple filters with 4096 values each.

At the moment, on Tatoeba, there are 4128 English natives
and 5129 non-natives.
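
The workaround described in this commit message can be modelled in a few lines of Python. This is a sketch of the logic only, not the actual Tatoeba or Manticore code: the filter dictionaries, `exclusion_filters`, and `matches` are hypothetical names, and the ids are synthetic.

```python
def exclusion_filters(non_native_ids, size=4096):
    """Split the non-native ids into exclusion filters of at most `size` values."""
    return [
        {"attribute": "user_id", "values": set(non_native_ids[i:i + size]), "exclude": True}
        for i in range(0, len(non_native_ids), size)
    ]


def matches(owner_id, filters):
    """Filters are combined with AND: the owner must pass every one of them."""
    return all((owner_id in f["values"]) != f["exclude"] for f in filters)


# 5129 synthetic non-native ids, as in the commit message's count.
non_natives = list(range(1, 5130))
filters = exclusion_filters(non_natives)

passes_native = matches(6000, filters)      # id absent from the non-native list
passes_non_native = matches(5000, filters)  # id present in the second filter
```

Splitting works for exclusion because "not in A AND not in B" equals "not in (A or B)". The same split would not work for inclusion filters, since "in A AND in B" is empty when the slices are disjoint. Note that in this sketch any owner id missing from the non-native list passes, so that list has to cover every owner who should be excluded.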
Member

jiru commented Mar 1, 2020

> Another, maybe more scalable solution is to split the list of users into slices of 4096 and to send a batch of search queries to the daemon, each with a different slice, and group the results. If I remember correctly, the Manticore API makes it easy to do things like that.

It turns out that it doesn't seem to be possible to group the results of a batch of queries. You just get one result set per query.

jiru closed this as completed in #2173 on Mar 2, 2020