Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search results contains a sentence in the wrong language #2226

Closed
trang opened this issue Mar 21, 2020 · 13 comments · Fixed by #2431
Closed

Search results contains a sentence in the wrong language #2226

trang opened this issue Mar 21, 2020 · 13 comments · Fixed by #2431
Labels
bug Issue that describes a problem with a feature that doesn't work as expected.
Milestone

Comments

@trang
Copy link
Member

trang commented Mar 21, 2020

Reported by brauchinet on the Wall:

Sentence #8281856 is misclassified as Spanish by the search engine.
It shouldn't appear in this search: https://tatoeba.org/eng/sentences/search?query=como+diferente&from=spa&to=und

@trang trang added the bug Issue that describes a problem with a feature that doesn't work as expected. label Mar 21, 2020
@jiru jiru changed the title Rearch results contains a sentence in the wrong language Search results contains a sentence in the wrong language Apr 5, 2020
@AndiPersti
Copy link
Contributor

The example mentioned by CK was for some time stored as an English sentence (see AlanF_US's comment). That's why Manticore indexed it both as English and Portuguese sentence:

MySQL [(none)]> select id, created, modified from eng_main_index where id = 8101669;
+---------+------------+------------+
| id      | created    | modified   |
+---------+------------+------------+
| 8101669 | 1565525305 | 1565525378 |
+---------+------------+------------+
1 row in set (0.002 sec)

MySQL [(none)]> select id, created, modified from por_main_index where id = 8101669;
+---------+------------+------------+
| id      | created    | modified   |
+---------+------------+------------+
| 8101669 | 1565525305 | 1577641431 |
+---------+------------+------------+
1 row in set (0.002 sec)

Changing the language from English to Portuguese should have removed the sentence from eng_main_index but according to the Manticore docs we would need to define a sql_query_killlist for that to happen:

For the datasets that can have documents modified or deleted, the delta index should also provide a list with documents that sufffered changes in order to be suppressed and not be used in search queries. This is achieved with the feature called Kill lists. The document ids to be killed can be provided in an auxiliary query defined by sql_query_killlist. The delta must point the indexes for which the kill-lists will be applied by directive killlist_target. The effect of kill-lists is permanent on the target index, meaning even if the search is made without the delta index, the suppressed documents will not appear in searches.

But it isn't set anywhere in our configuration

root@sloth:~# grep -cF 'sql_query_killlist' /etc/manticoresearch/manticore.conf
0

@jiru
Copy link
Member

jiru commented Jun 13, 2020

But we have killlist_target = main:kl defined in all delta indexes. The documentation about killlist_target says:

killlist_target = main:id. All document ids from delta index are suppressed in the main index. Kill-list is ignored.

@AndiPersti
Copy link
Contributor

I see but I think that doesn't work with how reindex_flags is designed.

Suppose I have an English sentence in the database and merged the main and delta index, which gives me the following setup as a starting point:

MariaDB [tatoeba]> select id, lang, text from sentences where id = 139;
+-----+------+------------------------+
| id  | lang | text                   |
+-----+------+------------------------+
| 139 | eng  | Testing search engine. |
+-----+------+------------------------+
1 row in set (0.00 sec)

MariaDB [tatoeba]> select * from reindex_flags;
Empty set (0.00 sec)
MySQL [(none)]> select id, created, modified from eng_main_index where id = 139;
+------+------------+------------+
| id   | created    | modified   |
+------+------------+------------+
|  139 | 1592074942 | 1592074942 |
+------+------------+------------+
1 row in set (0.00 sec)

MySQL [(none)]> select id, created, modified from fra_main_index where id = 139;
Empty set (0.00 sec)

MySQL [(none)]> select id, created, modified from eng_delta_index where id = 139;
Empty set (0.00 sec)

MySQL [(none)]> select id, created, modified from fra_delta_index where id = 139;
Empty set (0.01 sec)

Then I change the language from English to French:

MariaDB [tatoeba]> select id, lang, text from sentences where id = 139;
+-----+------+------------------------+
| id  | lang | text                   |
+-----+------+------------------------+
| 139 | fra  | Testing search engine. |
+-----+------+------------------------+
1 row in set (0.00 sec)

MariaDB [tatoeba]> select * from reindex_flags;
+-----+-------------+------+---------+
| id  | sentence_id | lang | indexed |
+-----+-------------+------+---------+
| 733 |         139 | fra  |       0 |
| 734 |         139 | eng  |       0 |
+-----+-------------+------+---------+
2 rows in set (0.00 sec)

Now updating the delta index gives me:

MySQL [(none)]> select id, created, modified from eng_main_index where id = 139;
+------+------------+------------+
| id   | created    | modified   |
+------+------------+------------+
|  139 | 1592074942 | 1592074942 |
+------+------------+------------+
1 row in set (0.00 sec)

MySQL [(none)]> select id, created, modified from fra_main_index where id = 139;
Empty set (0.00 sec)

MySQL [(none)]> select id, created, modified from eng_delta_index where id = 139;
Empty set (0.00 sec)

MySQL [(none)]> select id, created, modified from fra_delta_index where id = 139;
+------+------------+------------+
| id   | created    | modified   |
+------+------------+------------+
|  139 | 1592074942 | 1592075208 |
+------+------------+------------+
1 row in set (0.00 sec)
MariaDB [tatoeba]> select * from reindex_flags;
+-----+-------------+------+---------+
| id  | sentence_id | lang | indexed |
+-----+-------------+------+---------+
| 733 |         139 | fra  |       1 |
| 734 |         139 | eng  |       1 |
+-----+-------------+------+---------+
2 rows in set (0.01 sec)

And merging main and delta results in:

MySQL [(none)]> select id, created, modified from eng_main_index where id = 139;
+------+------------+------------+
| id   | created    | modified   |
+------+------------+------------+
|  139 | 1592074942 | 1592074942 |
+------+------------+------------+
1 row in set (0.00 sec)

MySQL [(none)]> select id, created, modified from fra_main_index where id = 139;
+------+------------+------------+
| id   | created    | modified   |
+------+------------+------------+
|  139 | 1592074942 | 1592075208 |
+------+------------+------------+
1 row in set (0.00 sec)

MySQL [(none)]> select id, created, modified from eng_delta_index where id = 139;
Empty set (0.00 sec)

MySQL [(none)]> select id, created, modified from fra_delta_index where id = 139;
+------+------------+------------+
| id   | created    | modified   |
+------+------------+------------+
|  139 | 1592074942 | 1592075208 |
+------+------------+------------+
1 row in set (0.00 sec)
MariaDB [tatoeba]> select * from reindex_flags;
Empty set (0.00 sec)

You'll notice that the English delta index doesn't have a row which would suppress the corresponding row in the main index and even if it would have, what would/should be its contents? All columns null?

As I understand it we currently do not have a way to tell Manticore to remove a sentence from the main index and I think that's what sql_query_killist is used for.

@jiru
Copy link
Member

jiru commented Jun 14, 2020

I see. Thanks for taking the time to demonstrate the problem. 🙏

So I think we should use

    index fra_delta_index : fra_main_index
    {
        killlist_target = fra_main_index:kl

and

    source fra_delta_src : fra_main_src
    {
        sql_query_killlist = select sentence_id from reindex_flags rf join sentences s on s.id = rf.sentence_id where rf.lang = 'fra' and s.lang <> rf.lang

@jiru
Copy link
Member

jiru commented Jun 14, 2020

This reminds me we have a lot of dangling entries in reindex_flags having lang as NULL. They are create whenever a sentence gets the "unknown" flag, and they are never removed by the sphinx_indexes shell because it only operates on languages (sentences with "unknown" flag are not indexed at all).

Apart from that, the sql_query_killlist I suggested above wont work when lang is NULL because mysql is so smart when comparing NULL values:

MariaDB [tatoeba]> select IF(NULL <> 'fra', 'different', 'equal');
+-----------------------------------------+
| IF(NULL <> 'fra', 'different', 'equal') |
+-----------------------------------------+
| equal                                   |
+-----------------------------------------+

@AndiPersti
Copy link
Contributor

sql_query_killlist = select sentence_id from reindex_flags rf join sentences s on s.id = rf.sentence_id where rf.lang = 'fra' and s.lang <> rf.lang

I'm afraid this query is not enough because it only returns the ids for sentences whose language changed. But the kill-list should also contain the sentences that got deleted[*] and this query won't find them because for deletion there would be a row for the sentence id in reindex_flags but not in sentences any more.

Actually I think the query for the kill-list would simply be select sentence_id from reindex_flags where lang = 'fra'; because we record all modifications in reindex_flags (inserts, updates and deletions) and so all the sentence ids contained there are "dirty" and shouldn't be searched in the main index. But it is already late and I may miss some detail. 😄 It needs for sure further testing.

[*] Deletion is also broken:
#950523 was deleted a few weeks ago. But it is still in eng_main_index:

MySQL [(none)]> select count(*) from eng_main_index where id = 950523;
+----------+
| count(*) |
+----------+
|        1 |
+----------+

Searching for "We got along" says that there are 9 results but only 8 are shown. Running the query manually returns all 9 sentences including 950523:

MySQL [(none)]> select id from eng_main_index where match('We got along') order by id asc;
+---------+
| id      |
+---------+
|   27126 |
|  243981 |
|  662960 |
|  950523 |
| 1886651 |
| 3313052 |
| 5158739 |
| 8060496 |
| 8064221 |
+---------+

(I'm pretty sure that #1952 is related to this problem.)

@AndiPersti
Copy link
Contributor

This reminds me we have a lot of dangling entries in reindex_flags having lang as NULL. They are create whenever a sentence gets the "unknown" flag, and they are never removed by the sphinx_indexes shell because it only operates on languages (sentences with "unknown" flag are not indexed at all).

Yes, I've noticed that.

Apart from that, the sql_query_killlist I suggested above wont work when lang is NULL because mysql is so smart when comparing NULL values:

MariaDB [tatoeba]> select IF(NULL <> 'fra', 'different', 'equal');
+-----------------------------------------+
| IF(NULL <> 'fra', 'different', 'equal') |
+-----------------------------------------+
| equal                                   |
+-----------------------------------------+

I think we don't need it but the rather baroque not (null <=> 'fra') would work:

MariaDB [tatoeba]> select if(not (null <=> 'fra'), 'diff', 'eq'), if (not ('eng' <=> 'fra'), 'diff', 'eq'), if (not ('fra' <=> 'fra'), 'diff', 'eq'), if(not (null <=> null), 'diff', 'eq');
+----------------------------------------+------------------------------------------+------------------------------------------+---------------------------------------+
| if(not (null <=> 'fra'), 'diff', 'eq') | if (not ('eng' <=> 'fra'), 'diff', 'eq') | if (not ('fra' <=> 'fra'), 'diff', 'eq') | if(not (null <=> null), 'diff', 'eq') |
+----------------------------------------+------------------------------------------+------------------------------------------+---------------------------------------+
| diff                                   | diff                                     | eq                                       | eq                                    |
+----------------------------------------+------------------------------------------+------------------------------------------+---------------------------------------+

@jiru
Copy link
Member

jiru commented Jun 26, 2020

I leave you the privilege of opening a PR since you did all this insightful research. 😄

@AndiPersti
Copy link
Contributor

Will do so in the near future.

But after a quick test I'm afraid

Actually I think the query for the kill-list would simply be select sentence_id from reindex_flags where lang = 'fra'; because we record all modifications in reindex_flags (inserts, updates and deletions) and so all the sentence ids contained there are "dirty" and shouldn't be searched in the main index.

doesn't work. The problem is that the kill-list is still active while merging main and delta and so sentences that were new/updated won't make it into the new main.

So I guess we need to differ between new/updated and deleted sentences. New/updated sentences would be included in the delta index (using sql_query as we do now) and deleted sentences would be included in the kill-list (using sql_query_killlist). And the target would be killlist_target=xxx_main_index which will suppress ids from both the delta index and the killlist.
We also need to modify reindex_flags because we need three states (updated, deleted, indexed).

I hope I'll have some time in the next few days to test this setup.

@AndiPersti
Copy link
Contributor

I hope I'll have some time in the next few days to test this setup.

Today I've tested the configuration with a separate kill-list for deleted sentences and I'm pretty confident that it works. I've created a little demo that simulates a few cycles of database changes -> delta index updates -> merging. You can find the necessary files at https://gist.github.com/AndiPersti/a13ce7491d3feba4769611a7e6d47655

If you want to test it yourself, just start a new VM (the test script will change tables in the database and the configuration for Manticore) and log into it.
Then from the vagrant home directory enter

git clone https://gist.github.com/a13ce7491d3feba4769611a7e6d47655.git manticore_test
cd manticore_test && sh test.sh

I have implemented already most of the necessary changes but need a little more time for clean up and final testing.

@AndiPersti
Copy link
Contributor

This reminds me we have a lot of dangling entries in reindex_flags having lang as NULL. They are create whenever a sentence gets the "unknown" flag, and they are never removed by the sphinx_indexes shell because it only operates on languages (sentences with "unknown" flag are not indexed at all).

I guess we still don't want to index sentences with an "unknown" language, do we?
In that case I'd suggest to just not adding them to ReindexFlags.

@jiru
Copy link
Member

jiru commented Jul 3, 2020

I agree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue that describes a problem with a feature that doesn't work as expected.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants