[Bug]: fulltext index search in tpch 100g lineitem oom #20213
Comments
Fulltext index issue. Hi Eric @cpegeric, could you please take a look? Thanks.
OOM during the hash build. Not sure whether this is a similar problem to #20236.
It is not the same problem as #20236. I can create the index successfully on 129, but when I run the query it hangs; it looks like this query takes more than 10 minutes? No error and no OOM. @cpegeric
…rser (#20269) bug fixes for #20217 #20213 #20175

1. Limit the batch size to 8192 in both the fulltext_index_scan() and fulltext_tokenize() functions.
2. In the fulltext_index_scan function, create a new thread to evaluate the score over 8192 documents per batch instead of waiting for all results from SQL. This speeds things up and avoids OOM in the function. However, the score will be calculated per mini-batch instead of over the complete batch; I think it doesn't matter as long as we have the correct answer.
3. Support the json_value parser.
4. Pre-allocate memory in the fulltext_tokenize() function to avoid malloc.
5. Add the monpl tokenizer repo to matrixone.
6. Bug fix: truncate values in the json tokenizer and increase the limit to 127 bytes.
7. Push down limit.

Approved by: @badboynt1, @zhangxu19830126, @m-schen, @fengttt, @aunjgr, @ouyuanning, @sukki37, @aressu1985, @heni02, @XuPeng-SH, @qingxinhome
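The mini-batch scoring in point 2 can be sketched roughly as follows. This is an illustrative Python sketch, not matrixone's actual Go implementation; the placeholder scorer, queue, and row shape are assumptions, and only the shape of the fix (score 8192-row batches on a worker thread instead of buffering the full result set) comes from the commit message.

```python
import queue
import threading

BATCH_SIZE = 8192  # cap from the fix: never buffer more rows than this per batch


def score_batch(batch):
    # Placeholder scorer (assumption): count keyword hits per document.
    # The real engine computes a relevance score, but the batching logic
    # is the same either way.
    return {doc_id: text.count("keyword") for doc_id, text in batch}


def streaming_scores(rows):
    """Score documents in mini-batches on a worker thread instead of
    waiting for all SQL results first, which is what caused the OOM."""
    q = queue.Queue(maxsize=4)  # bounded queue: backpressure on the producer
    results = {}

    def worker():
        while True:
            batch = q.get()
            if batch is None:  # sentinel: no more batches
                break
            results.update(score_batch(batch))

    t = threading.Thread(target=worker)
    t.start()

    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            q.put(batch)
            batch = []
    if batch:
        q.put(batch)
    q.put(None)
    t.join()
    return results
```

Because each batch is scored independently, a score that depends on corpus-wide statistics would differ slightly per mini-batch, which is the trade-off the commit message accepts.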
select * from lineitem where match(l_comment) against('"olphins nag slyly after the regular packa"' in boolean mode);

This query still hangs in my local environment, or takes more than ten minutes to run. But even when I scan the whole table:

select * from lineitem where l_comment like "%olphins nag slyly after the regular packa%";
…rser (#20230) bug fixes for #20217 #20213 #20175 #20149 and add json_value parser

1. Limit the batch size to 8192 in both the fulltext_index_scan() and fulltext_tokenize() functions.
2. In the fulltext_index_scan function, create a new thread to evaluate the score over 8192 documents per batch instead of waiting for all results from SQL. This speeds things up and avoids OOM in the function. However, the score will be calculated per mini-batch instead of over the complete batch; I think it doesn't matter as long as we have the correct answer.
3. Support the json_value parser.
4. Pre-allocate memory in the fulltext_tokenize() function to avoid malloc.
5. Bug fix #20149: delete table. pkPos, pkType is needed but (doc_id, INT) is given.
6. Add the monpl tokenizer repo to matrixone.
7. Bug fix: truncate values in the json tokenizer and increase the limit to 127 bytes.
8. Push down limit.
9. Bug fix #20311: data race occurred during bvt test.
10. alter table drop column with fulltext index.
11. SQL executor: add streaming mode.

Approved by: @fengttt, @badboynt1, @zhangxu19830126, @m-schen, @aunjgr, @ouyuanning, @aressu1985, @XuPeng-SH, @sukki37, @qingxinhome
The slow query in fulltext index is
order by is slow when the number of rows is large.
Possible solutions:
First, let's double-check the fulltext index table schema and the primary key / cluster-by key. Second, the following query is not going to work well
I would rather issue the following
Note that we need a new agg function, to_array; otherwise the join could become a cross product of positions and explode. Of course, for a phrase query we don't need this to_array; we can put this in
But there could be more complex position processing inside the fulltext function -- such as 'foo NEAR bar', etc. -- so to_array seems to be a better solution.
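The point of to_array is that, once all positions for a (word, docid) pair are aggregated into one array per word, phrase and NEAR checks become simple in-memory passes over two position arrays. A hypothetical sketch (function names and the distance threshold are illustrative, not from matrixone):

```python
def phrase_match(pos_a, pos_b):
    """True if some position of word B is exactly one past a position of
    word A, i.e. the two words are adjacent -- a phrase hit."""
    next_to_a = {p + 1 for p in pos_a}
    return any(p in next_to_a for p in pos_b)


def near_match(pos_a, pos_b, max_dist=5):
    """True if any pair of positions is within max_dist tokens: the kind
    of check a 'foo NEAR bar' operator needs, which a plain equi-join on
    positions cannot express."""
    return any(abs(a - b) <= max_dist for a in pos_a for b in pos_b)
```

This is why aggregating positions first is more flexible than pushing the position predicate into the join condition: the same arrays serve phrase, NEAR, and any future proximity operator.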
Checked. CLUSTER BY word gives some speed improvement, but ORDER BY is not going to work. Only phrase search can use a JOIN; for OR operations like natural language mode, we cannot use a JOIN at all. I think we need to save the data from SQL into a temporary file when the data size is large.
Note: use an ordered map so we don't have to sort the keys before saving the file.
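The "ordered map" in the note is a map whose keys stay sorted on insert, so the spill-to-file step can just iterate and write, with no final sort over a large key set. A minimal sketch using Python's bisect (class and method names are illustrative assumptions):

```python
import bisect


class OrderedSpillMap:
    """Keeps keys sorted as they are inserted, so flushing to the
    temporary file writes entries in key order with no extra sort pass."""

    def __init__(self):
        self._keys = []   # always kept sorted
        self._data = {}

    def add(self, key, value):
        if key not in self._data:
            # O(n) insertion keeps keys ordered; avoids an O(n log n)
            # sort over the whole key set at flush time.
            bisect.insort(self._keys, key)
            self._data[key] = []
        self._data[key].append(value)

    def flush(self, fh):
        for k in self._keys:  # already in key order
            fh.write(f"{k}\t{self._data[k]}\n")
```

In Go (matrixone's language) the equivalent would be a sorted container or a B-tree keyed map rather than the built-in unordered map.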
I think it should at least be CLUSTER BY (word, docid), or even (word, docid, position). Next, I am really curious what the performance is if we just issue
I don't see any reason this could be bad.
I still don't see any reason/benefit for the ORDER BY. Just avoid it. And for OR, it is not a join but a UNION ALL followed by a GROUP BY. Basically, translate every fulltext query to proper SQL -- our query engine should be faster than any hand-rolled code, and if it isn't, we should optimize our query engine.
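The "UNION ALL then GROUP BY" translation for an OR query can be sketched as a query generator. The index table and column names below (fulltext_idx, word, doc_id) are assumptions for illustration, not matrixone's actual schema, and a real implementation would bind parameters rather than interpolate strings:

```python
def or_query_sql(words, index_table="fulltext_idx"):
    """Translate an OR fulltext query into plain SQL: one SELECT per
    search word, combined with UNION ALL, then GROUP BY doc_id so each
    matching document appears once with its hit count."""
    selects = " UNION ALL ".join(
        f"SELECT doc_id FROM {index_table} WHERE word = '{w}'" for w in words
    )
    return (
        f"SELECT doc_id, COUNT(*) AS hits FROM ({selects}) t GROUP BY doc_id"
    )
```

The hit count then feeds the relevance score, and the heavy lifting (scan, dedup, aggregation) stays inside the query engine instead of hand-rolled application code.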
But there could be more complex position processing inside the fulltext function -- such as 'foo NEAR bar', etc. -- so to_array seems to be a better solution. Now I believe
I changed my mind on the design.
testing
commit: 2b6ab7e
Passed local testing.
Is there an existing issue for the same bug?
Branch Name
main
Commit ID
c93bbbd
Other Environment Information
Actual Behavior
Expected Behavior
No response
Steps to Reproduce
Additional information
No response