Search results contains a sentence in the wrong language #2226
Comments
The example mentioned by CK was stored as an English sentence for some time (see AlanF_US's comment). That's why Manticore indexed it both as an English and as a Portuguese sentence:
Changing the language from English to Portuguese should have removed the sentence from
But it isn't set anywhere in our configuration
|
But we have
|
I see but I think that doesn't work with how Suppose I have an English sentence in the database and merged the main and delta index, which gives me the following setup as a starting point:
Then I change the language from English to French:
Now updating the delta index gives me:
And merging main and delta results in:
You'll notice that the English delta index doesn't have a row that would suppress the corresponding row in the main index, and even if it did, what would/should its contents be? All columns null? As I understand it, we currently do not have a way to tell Manticore to remove a sentence from the main index and I think that's what |
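The scenario described above can be reproduced with a toy model. This is only a sketch: it assumes that a per-language index behaves like a mapping from sentence id to text and that merging adds or overwrites rows by id. The `merge` helper, the sentence id, and the sentence text are hypothetical and are not Manticore's actual implementation.

```python
# Toy model only: each per-language index is a dict {sentence_id: text}.
# This is NOT Manticore's real merge logic, just an illustration of the problem.

def merge(main, delta):
    """Merge delta into main: delta rows are added or overwrite rows with the same id."""
    merged = dict(main)
    merged.update(delta)
    return merged

# Starting point: sentence 1 is English and already merged into the English main index.
eng_main = {1: "Hello world"}  # hypothetical sentence
fra_main = {}

# The language of sentence 1 is changed from English to French.
# Rebuilding the deltas: the English delta no longer selects sentence 1,
# the French delta now does.
eng_delta = {}
fra_delta = {1: "Hello world"}

# Merging main and delta for each language.
eng_main = merge(eng_main, eng_delta)
fra_main = merge(fra_main, fra_delta)

print("eng main:", eng_main)  # sentence 1 is still in the English main index
print("fra main:", fra_main)  # sentence 1 is now also in the French main index
```

In this model nothing in the English delta refers to sentence 1, so the merge has no reason to drop the stale row from the English main index.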
I see. Thanks for taking the time to demonstrate the problem. 🙏 So I think we should use
and
|
This reminds me that we have a lot of dangling entries in reindex_flags that have lang as NULL. They are created whenever a sentence gets the "unknown" flag, and they are never removed by the sphinx_indexes shell because it only operates on languages (sentences with the "unknown" flag are not indexed at all). Apart from that, the sql_query_killlist I suggested above won't work when lang is NULL because MySQL is so smart when comparing NULL values:
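The NULL comparison behaviour can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for MySQL, since both follow the same three-valued logic for `=` and `<>`. The `sentence_id` column and the sample rows are assumptions made for illustration; only the `reindex_flags` table and its `lang` column are mentioned in this thread.

```python
import sqlite3

# SQLite is used here only as a stand-in for MySQL; both treat
# comparisons against NULL with = or <> as "unknown" (NULL).
conn = sqlite3.connect(":memory:")

eq, ne, is_null = conn.execute(
    "SELECT NULL = NULL, NULL <> 'eng', NULL IS NULL"
).fetchone()
print(eq, ne, is_null)  # the first two are NULL (None in Python), the third is true

# Hypothetical schema and rows: sentence 2 has the "unknown" flag (lang is NULL).
conn.execute("CREATE TABLE reindex_flags (sentence_id INTEGER, lang TEXT)")
conn.executemany(
    "INSERT INTO reindex_flags VALUES (?, ?)",
    [(1, "eng"), (2, None), (3, "fra")],
)

# A plain <> comparison silently skips the row whose lang is NULL.
plain = conn.execute(
    "SELECT sentence_id FROM reindex_flags WHERE lang <> 'eng' ORDER BY sentence_id"
).fetchall()

# A NULL-safe comparison keeps it. SQLite spells this IS NOT;
# in MySQL the NULL-safe equality operator is <=>.
null_safe = conn.execute(
    "SELECT sentence_id FROM reindex_flags WHERE lang IS NOT 'eng' ORDER BY sentence_id"
).fetchall()

print(plain)
print(null_safe)
```

In MySQL the second query would be written along the lines of `WHERE NOT (lang <=> 'eng')`.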
|
I'm afraid this query is not enough because it only returns the ids for sentences whose language changed. But the kill-list should also contain the sentences that got deleted[*] and this query won't find them because for deletion there would be a row for the sentence id in
Actually I think the query for the kill-list would simply be
[*] Deletion is also broken:
Searching for "We got along" says that there are 9 results but only 8 are shown. Running the query manually returns all 9 sentences including 950523:
(I'm pretty sure that #1952 is related to this problem.) |
Yes, I've noticed that.
I think we don't need it but the rather baroque
|
I leave you the privilege of opening a PR since you did all this insightful research. 😄 |
Will do so in the near future. But after a quick test I'm afraid
doesn't work. The problem is that the kill-list is still active while merging main and delta, so sentences that were new/updated won't make it into the new main. So I guess we need to distinguish between new/updated and deleted sentences. New/updated sentences would be included in the delta index (using I hope I'll have some time in the next few days to test this setup. |
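The split between new/updated and deleted sentences can be sketched with a toy model. It assumes that rows in the delta take precedence over rows in the main index and that a kill-list hides main rows by id at search time. The `search` helper and all sample data are hypothetical and only illustrate the idea, not the actual Manticore configuration.

```python
# Toy model only: indexes are dicts {sentence_id: text}, the kill-list is a set of ids.

def search(main, delta, kill_list):
    """Return the visible rows: main rows not in the kill-list, overridden by delta rows."""
    visible = {sid: text for sid, text in main.items() if sid not in kill_list}
    visible.update(delta)  # new/updated sentences are served from the delta
    return visible

main = {1: "old text", 2: "to be deleted", 3: "unchanged"}  # hypothetical rows
delta = {1: "new text"}   # sentence 1 was updated, so it lives in the delta
deleted_ids = {2}         # sentence 2 was deleted, so only its id goes on the kill-list

result = search(main, delta, deleted_ids)
print(result)  # sentence 1 with its new text, sentence 3 unchanged, sentence 2 hidden
```

Keeping the ids of new/updated sentences off the kill-list means nothing in this model can suppress their delta rows.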
Today I've tested the configuration with a separate kill-list for deleted sentences and I'm pretty confident that it works. I've created a little demo that simulates a few cycles of database changes -> delta index updates -> merging. You can find the necessary files at https://gist.github.com/AndiPersti/a13ce7491d3feba4769611a7e6d47655 If you want to test it yourself, just start a new VM (the test script will change tables in the database and the configuration for Manticore) and log into it.
I have already implemented most of the necessary changes but need a little more time for cleanup and final testing. |
I guess we still don't want to index sentences with an "unknown" language, do we? |
I agree. |
Reported by brauchinet on the Wall: