[Bug]: fulltext index search in tpch 100g lineitem oom #20213

Open
tom-csf opened this issue Nov 20, 2024 · 15 comments
Labels: kind/bug (Something isn't working), phase/testing, severity/s1 (High impact: Logical errors or data errors that must occur)

tom-csf commented Nov 20, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

main

Commit ID

c93bbbd

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

(Two WeCom screenshots attached showing the failure.)

Expected Behavior

No response

Steps to Reproduce

1. Load 100G TPC-H data into MO.
2. set experimental_fulltext_index=1;
   create fulltext index fdx on lineitem(L_COMMENT);
3. select * from lineitem where match(l_comment) against('"olphins nag slyly after the regular packa"' in boolean mode);

Additional information

No response

@tom-csf tom-csf added kind/bug Something isn't working needs-triage severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Nov 20, 2024
@tom-csf tom-csf added this to the 2.0.1 milestone Nov 20, 2024
@aronchanisme (Contributor)

Fulltext index issue. Hi Eric @cpegeric, could you please kindly help take a look? Thanks.

cpegeric (Contributor) commented Nov 20, 2024

mysql> select count(*) from lineitem where match(l_comment) against('deposits' in boolean mode);
ERROR 20101 (HY000): internal error: Invalid alloc size 1147486208


@ouyuanning (Contributor)

OOM during the hash build. Not sure whether this is a similar problem to #20236.
@badboynt1 please help locate it first.

badboynt1 (Contributor) commented Nov 21, 2024

This is not the same problem as #20236. I can create the index successfully on 129, but the query hangs when I run it; it looks like this query needs more than 10 minutes to run? There is no error and no OOM. @cpegeric

@badboynt1 badboynt1 assigned cpegeric and unassigned badboynt1 Nov 21, 2024
mergify bot pushed a commit that referenced this issue Nov 25, 2024
…rser (#20269)

bug fixes for #20217 #20213 #20175

1. limit the batch size to 8192 on both fulltext_index_scan() and fulltext_tokenize() function
2. In fulltext_index_scan function, create a new thread to evaluate the score in 8192 documents per batch instead of waiting for all results from SQL.  It will speed up and avoid OOM in the function.  However, the score will be calculated based on each mini-batch instead of complete batch.  I think it doesn't matter as long as we have the correct answer.
3. support json_value parser
4. Pre-allocation of memory in fulltext_tokenize() function to avoid malloc
5. add monpl tokenizer repo to matrixone
6. bug fix json tokenizer to truncate value and increase the limit to 127 bytes
7. pushdown limit

Approved by: @badboynt1, @zhangxu19830126, @m-schen, @fengttt, @aunjgr, @ouyuanning, @sukki37, @aressu1985, @heni02, @XuPeng-SH, @qingxinhome
@cpegeric cpegeric assigned badboynt1 and unassigned cpegeric Nov 28, 2024
@badboynt1 (Contributor)

select * from lineitem where match(l_comment) against('"olphins nag slyly after the regular packa"' in boolean mode);

This query still hangs in my local environment, or needs more than ten minutes to run.

Yet even a full table scan, select * from lineitem where l_comment like "%olphins nag slyly after the regular packa%";
returns within a few seconds. It looks like there is something wrong with the fulltext index implementation.
@cpegeric

@badboynt1 badboynt1 assigned cpegeric and unassigned badboynt1 Nov 29, 2024
mergify bot pushed a commit that referenced this issue Nov 29, 2024
…rser (#20230)

bug fixes for #20217 #20213 #20175 #20149
and add json_value parser

1. limit the batch size to 8192 on both fulltext_index_scan() and fulltext_tokenize() function
2. In fulltext_index_scan function, create a new thread to evaluate the score in 8192 documents per batch instead of waiting for all results from SQL.  It will speed up and avoid OOM in the function.  However, the score will be calculated based on each mini-batch instead of complete batch.  I think it doesn't matter as long as we have the correct answer.
3. support json_value parser
4. Pre-allocation of memory in fulltext_tokenize() function to avoid malloc
5.  bug fix #20149 Delete table.   pkPos, pkType is needed but (doc_id, INT) is given.
6. add monpl tokenizer repo to matrixone
7. bug fix json tokenizer to truncate value and increase the limit to 127 bytes
8. pushdown limit
9. bug fix #20311. data race occurred during bvt test
10. alter table drop column with fulltext index
11. SQL executor add streaming mode.

Approved by: @fengttt, @badboynt1, @zhangxu19830126, @m-schen, @aunjgr, @ouyuanning, @aressu1985, @XuPeng-SH, @sukki37, @qingxinhome
cpegeric (Contributor) commented Dec 2, 2024

The slow query in the fulltext index is:

select * from `__mo_index_secondary_019386fe-f00c-7b42-8f99-ba608027eee7` where word in ('olphins', 'nag', 'slyly', 'after', 'the', 'regular', 'packa') order by doc_id;

order by is slow when the number of rows is large.

mysql> select count(*) from `__mo_index_secondary_019386fe-f00c-7b42-8f99-ba608027eee7` where word in ('olphins', 'nag', 'slyly', 'after', 'the', 'regular', 'packa');
+-----------+
| count(*)  |
+-----------+
| 309164415 |
+-----------+

Possible solutions:

  1. make ORDER BY faster
  2. use stop words
  3. return a vector to represent the text. Right now multiple rows are returned per doc_id, which is why we need to sort by doc_id.
  4. use our own index file format, which would let us sort the index in advance instead of sorting on the fly
  5. use a disk hashtable to speed up the word -> doc_id lookup: https://github.com/rosedblabs/diskhash
  6. use a disk cache: https://github.com/peterbourgon/diskv

@fengttt fengttt modified the milestones: 2.0.1, 2.1.0 Dec 4, 2024
fengttt (Contributor) commented Dec 4, 2024

First, let's double-check the fulltext index table schema and its primary key / cluster by key.

Second, the following query is not going to work well

select * from `__mo_index_secondary_019386fe-f00c-7b42-8f99-ba608027eee7` where word in ('olphins', 'nag', 'slyly', 'after', 'the', 'regular', 'packa') order by doc_id;

I would rather issue the following

with 
kw1 as (select docid, to_array(position) as pos from __mo_index where word = 'olphins' group by docid),
kw2 as (select docid, to_array(position) as pos from __mo_index where word = 'nag' group by docid),
...
select kw1.docid, kw1.pos as pos1, kw2.pos as pos2 ... kw6.pos6
from kw1, kw2, ... kw6
where kw1.docid = kw2.docid and kw1.docid = kw3.docid ....    

Note that we need a new agg function to_array; otherwise the join could become a cross product of positions and explode. Of course, for a phrase query we don't need this to_array, since we can put the position check directly in the join condition:

where kw1.docid = kw2.docid ....     AND kw1.pos + 1 = kw2.pos and kw2.pos +1 = kw3.pos ...

But there could be more complex position processing inside the fulltext function -- such as 'foo NEAR bar', etc, so to_array seems to be a better solution.
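
A concrete sketch of that plan for the first three keywords of the query in this issue, assuming the proposed to_array aggregate existed (it does not yet) and using __mo_index as a stand-in name for the secondary index table:

-- hypothetical: to_array is the proposed aggregate; table and column names are illustrative
with
kw1 as (select docid, to_array(position) as pos from __mo_index where word = 'olphins' group by docid),
kw2 as (select docid, to_array(position) as pos from __mo_index where word = 'nag' group by docid),
kw3 as (select docid, to_array(position) as pos from __mo_index where word = 'slyly' group by docid)
select kw1.docid, kw1.pos as pos1, kw2.pos as pos2, kw3.pos as pos3
from kw1, kw2, kw3
where kw1.docid = kw2.docid and kw1.docid = kw3.docid;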

cpegeric (Contributor) commented Dec 5, 2024

Checked: CLUSTER BY word gives some speed improvement, but ORDER BY is not going to work.

Only phrase search can use JOIN. For an OR operation like natural language mode, we cannot use JOIN at all. I think we need to spill the data from SQL into temporary files when the data size is large:

  1. store the SQL result into a map, just like we do now
  2. when the data size exceeds the limit, sort the data by doc_id and save it to a temp file
  3. after getting all results from SQL, we will have several temp files, each sorted by doc_id
  4. use a heap to get the sorted doc_ids across words by taking the top item from each file
  5. merge the doc_id into the data structure
  6. when the number of doc_ids reaches the 8192 limit, process the batch
  7. go back to step 4 to read more data

Note: use an ordered map so we don't have to sort the keys before saving the file.
https://github.com/elliotchance/orderedmap

fengttt (Contributor) commented Dec 8, 2024

I think it should be CLUSTER BY at least (word, docid), or even (word, docid, position).

Next, I am really curious what the performance is if we just issue:

where ...
AND kw1.docid = kw2.docid ....     AND kw1.pos + 1 = kw2.pos and kw2.pos +1 = kw3.pos ...

I don't see any reason this could be bad.
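
A minimal sketch of that setup with illustrative names (the real table name, column names, and exact CLUSTER BY syntax may differ), for a three-word phrase "w1 w2 w3":

-- hypothetical index table clustered so that word lookups come back ordered
create table __mo_index (word varchar(256), doc_id bigint, pos int) cluster by (word, doc_id, pos);

-- phrase match: self-join on doc_id plus consecutive-position predicates
select kw1.doc_id
from __mo_index kw1, __mo_index kw2, __mo_index kw3
where kw1.word = 'w1' and kw2.word = 'w2' and kw3.word = 'w3'
  and kw1.doc_id = kw2.doc_id and kw2.doc_id = kw3.doc_id
  and kw1.pos + 1 = kw2.pos and kw2.pos + 1 = kw3.pos;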

fengttt (Contributor) commented Dec 8, 2024

> Checked: CLUSTER BY word gives some speed improvement, but ORDER BY is not going to work. Only phrase search can use JOIN. For an OR operation like natural language mode, we cannot use JOIN at all. I think we need to spill the data from SQL into temporary files when the data size is large. […]

I still don't see any reason/benefit for the ORDER BY. Just avoid it.

And for OR, it is not a join but a UNION ALL followed by a GROUP BY. Basically, translate every fulltext query into proper SQL -- our query engine should be faster than any hand-rolled code, and if it is not, we should optimize our query engine.
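
A minimal sketch of that translation for an OR over three keywords, reusing the illustrative __mo_index name:

-- hypothetical: one scan per keyword, UNION ALLed, then grouped per document;
-- matched_words (how many distinct search words a document contains) can feed the score
select doc_id, count(distinct word) as matched_words
from (
    select doc_id, word from __mo_index where word = 'hello'
    union all
    select doc_id, word from __mo_index where word = 'happy'
    union all
    select doc_id, word from __mo_index where word = 'world'
) t
group by doc_id;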

fengttt (Contributor) commented Dec 8, 2024

> But there could be more complex position processing inside the fulltext function -- such as 'foo NEAR bar', etc, so to_array seems to be a better solution.

Now I believe foo NEAR bar should simply be compiled to foo.pos > bar.pos - 100 and foo.pos < bar.pos + 100, supposing NEAR means a position distance within 100. Why not?
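
For illustration, that compilation could look like the following (the illustrative __mo_index table again; the window of 100 is only an assumed example):

-- hypothetical translation of 'foo NEAR bar' into a position-window join
select kw1.doc_id
from __mo_index kw1, __mo_index kw2
where kw1.word = 'foo' and kw2.word = 'bar'
  and kw1.doc_id = kw2.doc_id
  and kw2.pos > kw1.pos - 100
  and kw2.pos < kw1.pos + 100;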

cpegeric (Contributor) commented Dec 9, 2024

I changed my mind on the design.

  1. For large data scale, we need a hashtable with spill (do we have one?).
  2. For the OR operation:
  • ignore position
  • minimize the memory footprint of the hashtable. Say the search has 3 words ["hello", "happy", "world"] with internal indexes [0, 1, 2]. The SQL results for the individual words are grouped by doc_id into a single row holding a []uint8 value that stores, per word, how often that word appears in the document. Say doc_id=1 contains the words "hello" and "world": the two rows from SQL [{0, doc_id("hello")}, {2, doc_id("world")}] aggregate into the uint8 array [1, 0, 1], and the hashtable is defined as map[doc_id][]uint8. This minimizes the hashtable size, and []uint8 is similar to a vector, so the score can be evaluated per vector instead of per batch. This new GROUP BY operation is currently done in the table function; maybe it would be better to implement it in the query engine (see the sketch after this list)?
  3. For the AND operation (phrase search), use JOIN and filter on position to do the phrase search. We will get the answer immediately from SQL.
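
If that GROUP BY were pushed into the query engine, a rough SQL equivalent of the per-document word vector could look like this (illustrative __mo_index name; the actual design keeps the counts in a Go map[doc_id][]uint8 rather than in separate columns):

-- hypothetical: one row per doc_id, one occurrence count per search word
select doc_id,
       sum(case when word = 'hello' then 1 else 0 end) as w0,
       sum(case when word = 'happy' then 1 else 0 end) as w1,
       sum(case when word = 'world' then 1 else 0 end) as w2
from __mo_index
where word in ('hello', 'happy', 'world')
group by doc_id;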

cpegeric (Contributor) commented Dec 9, 2024

It is even worse with SELECT UNION: ERROR 20101 (HY000): internal error: mpool memory allocation exceed limit with requested size 1147486208

SELECT doc_id from table where word in (w1, ... wn) has a smaller memory footprint.
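
For reference, that narrower scan has roughly this shape, using the index table name from the earlier comment:

-- fetch only doc_id for the matched words instead of SELECT * or a UNION of per-word scans
select doc_id
from `__mo_index_secondary_019386fe-f00c-7b42-8f99-ba608027eee7`
where word in ('olphins', 'nag', 'slyly', 'after', 'the', 'regular', 'packa');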


@Ariznawlll (Contributor)

testing

@Ariznawlll (Contributor)

commit: 2b6ab7e
Test steps:

1. Load 100G TPC-H data into MO.
2. set experimental_fulltext_index=1;
   create fulltext index fdx on lineitem(L_COMMENT);
3. select * from lineitem where match(l_comment) against('"olphins nag slyly after the regular packa"' in boolean mode);

(Screenshot of the query result attached.)

The local test passed.

@Ariznawlll Ariznawlll added severity/s1 High impact: Logical errors or data errors that must occur and removed severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Dec 26, 2024