How to search fast with a RAM chunk (RT table) in high-frequency write systems #2787

Open
4 tasks done
zhangdapao745 opened this issue Nov 26, 2024 · 8 comments

Comments

@zhangdapao745

Confirmation Checklist:

  • You have searched for an answer in the manual.
  • You have considered using the forum for general discussions, which can be more suitable for non-urgent or broad queries.
  • You are aware of our community support channels on Slack, Telegram EN, and Telegram RU, where you can interact with other users and our developers.
  • You know about Manticore Team's professional services. Engaging with our experts through a support subscription can significantly accelerate resolution times and provide tailored solutions to your specific needs.

Your question:

I found that queries slow down when a table has a RAM chunk. After flushing the RAM chunk to a new disk chunk with 'FLUSH RAMCHUNK rt_table', the same query returns quickly.

  • Server version: 6.3.2

  • Create table:

CREATE TABLE rt_table(
     field_b bool,
     field_str_1 string,
     field_str_2 string,
     mid int)
     rt_mem_limit='1G'
  • Data: 2,000,000 rows

  • Query SQL, where ${mids_1000} is a comma-separated list of 1,000 mid values (a sketch of how the profiles below were captured follows this list):
    SELECT mid FROM rt_table WHERE mid IN (${mids_1000}) GROUP BY mid LIMIT 1000;
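
For reference, the SHOW PROFILE outputs below need profiling enabled for the session first. A minimal sketch of capturing them, assuming the pymysql client and Manticore's default SQL protocol port 9306 (the mid list is a stand-in for ${mids_1000}):

import pymysql

# Connect to Manticore's SQL protocol endpoint (9306 by default).
conn = pymysql.connect(host='127.0.0.1', port=9306, user='', password='', autocommit=True)
cur = conn.cursor()

cur.execute("SET profiling=1")  # enable per-query profiling for this session

mids_1000 = ",".join(str(m) for m in range(100000, 101000))  # stand-in for ${mids_1000}
cur.execute(f"SELECT mid FROM rt_table WHERE mid IN ({mids_1000}) GROUP BY mid LIMIT 1000")
cur.fetchall()

cur.execute("SHOW PROFILE")  # per-stage timings of the last query
for status, duration, switches, percent in cur.fetchall():
    print(status, duration, switches, percent)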

  1. RAM chunk (90 ms)
993 rows in set (0.10 sec)
--- 993 out of 993 results in 90ms ---

mysql> show profile;
+---------------+----------+----------+---------+
| Status        | Duration | Switches | Percent |
+---------------+----------+----------+---------+
| unknown       | 0.000237 | 6        | 0.26    |
| local_search  | 0.000202 | 2        | 0.22    |
| sql_parse     | 0.000223 | 1        | 0.25    |
| fullscan      | 0.089095 | 1        | 98.89   |
| dict_setup    | 0.000003 | 1        | 0.00    |
| parse         | 0.000000 | 1        | 0.00    |
| init          | 0.000013 | 3        | 0.01    |
| clone_attrs   | 0.000233 | 1        | 0.26    |
| aggregate     | 0.000021 | 3        | 0.02    |
| net_write     | 0.000068 | 2        | 0.08    |
| eval_post     | 0.000002 | 1        | 0.00    |
| eval_getfield | 0.000000 | 1        | 0.00    |
| total         | 0.090097 | 23       | 0       |
+---------------+----------+----------+---------+
13 rows in set (0.00 sec)
mysql> show table rt_table status;
+-----------------------------+--------------------------------------------------------------------------------------------------------+
| Variable_name               | Value                                                                                                  |
+-----------------------------+--------------------------------------------------------------------------------------------------------+
| index_type                  | rt                                                                                                     |
| indexed_documents           | 2000000                                                                                                |
| ram_bytes                   | 70256383                                                                                               |
| disk_bytes                  | 835                                                                                                    |
| disk_mapped                 | 0                                                                                                      |
| disk_mapped_cached          | 0                                                                                                      |
| ram_chunk                   | 70253975                                                                                               |
| ram_chunk_segments_count    | 24                                                                                                     |
| disk_chunks                 | 0                                                                                                      |
| mem_limit                   | 1073741824                                                                                             |
| mem_limit_rate              | 33.33%                                                                                                 |
+-----------------------------+--------------------------------------------------------------------------------------------------------+
30 rows in set (0.00 sec)
  2. Disk chunk (fast, ~10 ms)
993 rows in set (0.01 sec)
--- 993 out of 993 results in 9ms ---

mysql> show profile;
+---------------+----------+----------+---------+
| Status        | Duration | Switches | Percent |
+---------------+----------+----------+---------+
| unknown       | 0.000209 | 6        | 0.20    |
| local_search  | 0.000118 | 2        | 0.11    |
| sql_parse     | 0.000200 | 1        | 0.19    |
| setup_iter    | 0.001972 | 16       | 1.86    |
| fullscan      | 0.101794 | 16       | 96.21   |
| dict_setup    | 0.000002 | 1        | 0.00    |
| parse         | 0.000143 | 1        | 0.14    |
| init          | 0.000219 | 4        | 0.21    |
| finalize      | 0.001060 | 1        | 1.00    |
| aggregate     | 0.000015 | 3        | 0.01    |
| net_write     | 0.000067 | 2        | 0.06    |
| eval_post     | 0.000002 | 1        | 0.00    |
| eval_getfield | 0.000000 | 1        | 0.00    |
| total         | 0.105801 | 55       | 0       |
+---------------+----------+----------+---------+
14 rows in set (0.00 sec)
mysql> show table rt_table status;
+-----------------------------+--------------------------------------------------------------------------------------------------------+
| Variable_name               | Value                                                                                                  |
+-----------------------------+--------------------------------------------------------------------------------------------------------+
| index_type                  | rt                                                                                                     |
| indexed_documents           | 2000000                                                                                                |
| ram_bytes                   | 63150560                                                                                               |
| disk_bytes                  | 109367489                                                                                              |
| disk_mapped                 | 63131149                                                                                               |
| disk_mapped_cached          | 63143936                                                                                               |
| ram_chunk                   | 0                                                                                                      |
| ram_chunk_segments_count    | 0                                                                                                      |
| disk_chunks                 | 1                                                                                                      |
| mem_limit                   | 1073741824                                                                                             |
| mem_limit_rate              | 33.33%                                                                                                 |
+-----------------------------+--------------------------------------------------------------------------------------------------------+
30 rows in set (0.00 sec)

I have three questions:

  1. How can I maintain fast search performance? Is it advisable to set rt_mem_limit to 10M or lower in a high-frequency write system? Should I periodically (maybe every 30 s?) flush the RAM chunk to a new disk chunk (a sketch of such a periodic flush follows the profile output below)? Could this increase CPU usage due to background merging?
  2. Does searching the RAM chunk perform a full scan over all row data in it (perhaps via CSphTightVector m_dRows), while disk chunks can use an index, which makes them faster than the RAM chunk's full table scan?
  3. Why doesn't the profile duration (0.1 s) match the reported query cost (9 ms) with a disk chunk? The two match with a RAM chunk.
mysql> show profile;
+---------------+----------+----------+---------+
| Status        | Duration | Switches | Percent |
+---------------+----------+----------+---------+
| unknown       | 0.000209 | 6        | 0.20    |
| local_search  | 0.000118 | 2        | 0.11    |
| sql_parse     | 0.000200 | 1        | 0.19    |
| setup_iter    | 0.001972 | 16       | 1.86    |
| fullscan      | 0.101794 | 16       | 96.21   |
| total         | 0.105801 | 55       | 0       |
+---------------+----------+----------+---------+

query result cost : --- 993 out of 993 results in 9ms ---
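
For reference, the fixed-interval flush from question 1 could look roughly like this (a sketch, assuming the pymysql client and Manticore's default SQL port 9306; the 30-second interval is just the value from the question):

import time
import pymysql

FLUSH_INTERVAL_SEC = 30  # illustrative interval from question 1

conn = pymysql.connect(host='127.0.0.1', port=9306, user='', password='', autocommit=True)
cur = conn.cursor()

while True:
    # Convert the current RAM chunk into a new disk chunk so that subsequent
    # searches go through the disk-chunk code paths.
    cur.execute("FLUSH RAMCHUNK rt_table")
    time.sleep(FLUSH_INTERVAL_SEC)
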
@sanikolaev
Collaborator

How can I maintain fast search performance? Is it advisable to set rt_mem_limit to 10M or lower in a high-frequency write system? Should I periodically(maybe in 30s?) flush the RAM chunk to a new disk chunk? Could this potentially increase CPU usage due to background merging?

Does the RAM chunk search perform a full scan on all row data within the RAM chunk (maybe use CSphTightVector m_dRows), while disk chunks can utilize an index, which makes them faster than the RAM chunk's full table scan?

The point is that searching in a RAM chunk doesn't utilize some of the performance optimization techniques that Manticore uses when searching in disk chunks, such as:

  • Pseudo sharding
  • Secondary indexes
  • Other optimizations

You're correct about the workaround if the RAM chunk's performance isn't sufficient. We've been considering improving it by flushing the RAM chunk to disk as soon as there are no changes in the table for a certain period. Since smaller disk chunks are merged first, we can expect the merging overhead to be minimal in this case.
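
Until such an improvement is implemented, a rough external approximation is to poll SHOW TABLE ... STATUS and flush only once the RAM chunk has stopped growing. A sketch, assuming pymysql and the default SQL port 9306; the poll interval and idle threshold are arbitrary:

import time
import pymysql

POLL_SEC = 5    # how often to check the table status (arbitrary)
IDLE_SEC = 30   # flush once the RAM chunk has been unchanged for this long (arbitrary)

conn = pymysql.connect(host='127.0.0.1', port=9306, user='', password='', autocommit=True)
cur = conn.cursor()

def ram_chunk_bytes():
    # SHOW TABLE ... STATUS reports the current RAM chunk size in bytes.
    cur.execute("SHOW TABLE rt_table STATUS")
    return int(dict(cur.fetchall())['ram_chunk'])

last_size, last_change = ram_chunk_bytes(), time.time()
while True:
    time.sleep(POLL_SEC)
    size = ram_chunk_bytes()
    if size != last_size:
        last_size, last_change = size, time.time()
    elif size > 0 and time.time() - last_change >= IDLE_SEC:
        cur.execute("FLUSH RAMCHUNK rt_table")  # convert the idle RAM chunk into a disk chunk
        last_size, last_change = 0, time.time()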

Why does the profile duration (0.1s) not match the query result cost (9ms) with a disk chunk? It matches with a RAM chunk.

This is likely because queries on disk chunks are parallelized using pseudo sharding. In the profile, you're seeing the combined sums from all threads.
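
One way to see the difference is to compare the client-side wall clock with the profile's 'total' for the same query (a sketch; the mid list is a stand-in):

import time
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=9306, user='', password='', autocommit=True)
cur = conn.cursor()
cur.execute("SET profiling=1")

start = time.perf_counter()
cur.execute("SELECT mid FROM rt_table WHERE mid IN (100001,100002,100003) GROUP BY mid LIMIT 1000")
cur.fetchall()
wall_ms = (time.perf_counter() - start) * 1000.0

cur.execute("SHOW PROFILE")
profile = {row[0]: float(row[1]) for row in cur.fetchall()}

# With disk chunks the stage durations are accumulated across all worker threads,
# so the profile 'total' can exceed the wall-clock time the client observes.
print(f"client wall clock : {wall_ms:.1f} ms")
print(f"profile 'total'   : {profile['total'] * 1000:.1f} ms")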

@sanikolaev
Collaborator

@zhangdapao745

We discussed your issue in detail during today's call. It seems there might be other ways to improve performance in this case. Could you share a reproducible example with us so we can test and investigate the issue locally?

@sanikolaev sanikolaev added the waiting Waiting for the original poster (in most cases) or something else label Nov 28, 2024
@zhangdapao745
Author

@sanikolaev Thank you for your reply!

We discussed your issue in detail during today's call. It seems there might be other ways to improve performance in this case. Could you share a reproducible example with us so we can test and investigate the issue locally?

I will provide you with a non-sensitive dataset for testing purposes. However, it will take a few days to prepare and share this data with you.

You're correct about the workaround if the RAM chunk's performance isn't sufficient. We've been considering improving it by flushing the RAM chunk to disk as soon as there are no changes in the table for a certain period. Since smaller disk chunks are merged first, we can expect the merging overhead to be minimal in this case.

I would be glad to contribute to this enhancement by implementing the functionality that flushes the RAM chunk when there are no changes in the table for a certain period. I am currently studying the code and working on this feature.

@sanikolaev
Collaborator

@zhangdapao745 Hello. Did you have a chance to prepare the test case?

@zhangdapao745
Author

Did you have a chance to prepare the test case?

@sanikolaev Of course. If there are no unexpected problems, I will upload the test case tomorrow. Is the following content enough for the upload?

  • Create table SQL
  • Upload a 200,000-row dataset in CSV format to your write-only S3 storage
  • Query SQL

@sanikolaev
Collaborator

Yes, please!

@zhangdapao745
Author

@sanikolaev Here is the reproducible test case.

  1. Create table SQL
 CREATE TABLE rt_table(
     field_b bool,
     field_str_1 string,
     field_str_2 string,
     mid int)
     rt_mem_limit='1G'
  2. Dataset: Use the following Python script to generate a dataset file named issue_2787.csv and import it into Manticore Search (a loading sketch follows the script).
import csv
import random

# filename
csv_filename = 'issue_2787.csv'
# rowcount
num_rows = 2000000

# open and write data
with open(csv_filename, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    # csv title
    writer.writerow(['id', 'field_b', 'field_str_1', 'field_str_2', 'mid'])
    
    # track how many times each mid value has been used, to cap the duplication rate
    mid_value_count = {}
    
    for _ in range(num_rows):
        # Generate id, a 10-digit random number
        id_value = random.randint(1000000000, 9999999999)
        
        # Generate field_b, 90% chance for 1, 10% chance for 0
        field_b_value = random.choice([1] * 9 + [0])
        
        # Generate field_str_1 and field_str_2, random 'a', 'b', or 'c' strings
        field_str_1 = random.choice(['a', 'b', 'c'])
        field_str_2 = random.choice(['a', 'b', 'c'])
        
        # Generate mid, a 6-digit random number
        mid_value = random.randint(100000, 999999)
        
        # Re-roll mid if it has already been used 5 times, capping each mid at 5 occurrences
        while mid_value in mid_value_count and mid_value_count[mid_value] > 4:
            mid_value = random.randint(100000, 999999)
        
        if mid_value in mid_value_count:
            mid_value_count[mid_value] += 1
        else:
            mid_value_count[mid_value] = 1
        
        # Write a row of data
        writer.writerow([id_value, field_b_value, field_str_1, field_str_2, mid_value])
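
To import the generated file into the table above, one option is batched multi-row INSERTs over the SQL protocol (a sketch, assuming pymysql and the default port 9306; the batch size is arbitrary):

import csv
import pymysql

BATCH_SIZE = 1000  # rows per INSERT batch (arbitrary)

conn = pymysql.connect(host='127.0.0.1', port=9306, user='', password='', autocommit=True)
cur = conn.cursor()

sql = ("INSERT INTO rt_table (id, field_b, field_str_1, field_str_2, mid) "
       "VALUES (%s, %s, %s, %s, %s)")

with open('issue_2787.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    batch = []
    for id_value, field_b, field_str_1, field_str_2, mid in reader:
        batch.append((int(id_value), int(field_b), field_str_1, field_str_2, int(mid)))
        if len(batch) == BATCH_SIZE:
            cur.executemany(sql, batch)  # pymysql rewrites this into a multi-row INSERT
            batch = []
    if batch:
        cur.executemany(sql, batch)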

  3. Query SQL
SELECT mid FROM rt_table WHERE mid IN (${id_1000}) group by mid LIMIT 1000;

Generate the ${id_1000} value with the following Python script and substitute it into the SQL above.

import csv

# Define the CSV file name
csv_filename = 'issue_2787.csv'
# number of unique mid values to collect
count_num = 1000
# Initialize a set to store unique mid values
unique_mids = set()
result_string = ""

# Open the file and read it as a stream
with open(csv_filename, mode='r', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    
    # Skip the header row
    next(reader)
    
    # Read the file line by line
    for row in reader:
        # Extract the mid value from each row
        mid_value = row[4]
        
        # If the mid value is not in the set, add it to the set and the result string
        if mid_value not in unique_mids:
            unique_mids.add(mid_value)
            result_string += mid_value + ","
            
            # If 1000 unique mid values have been collected, terminate early
            if len(unique_mids) == count_num:
                break

# Remove the trailing comma from the result string
result_string = result_string.strip(",")

# Output the result string
print(result_string)

Querying the whole dataset right after loading the file (while everything is still in the RAM chunk) takes 160 ms. After calling 'FLUSH RAMCHUNK rt_table', the same query takes 15 ms.
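
The 160 ms vs 15 ms comparison can be reproduced with a simple before/after timing (a sketch; the mid list below is a stand-in for the generated ${id_1000} value):

import time
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=9306, user='', password='', autocommit=True)
cur = conn.cursor()

mids = "100001,100002,100003"  # stand-in for the ${id_1000} list produced by the script above
query = f"SELECT mid FROM rt_table WHERE mid IN ({mids}) GROUP BY mid LIMIT 1000"

def timed_ms(sql):
    start = time.perf_counter()
    cur.execute(sql)
    cur.fetchall()
    return (time.perf_counter() - start) * 1000.0

print(f"RAM chunk : {timed_ms(query):.1f} ms")
cur.execute("FLUSH RAMCHUNK rt_table")  # move the RAM chunk into a new disk chunk
print(f"disk chunk: {timed_ms(query):.1f} ms")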

@sanikolaev sanikolaev assigned tomatolog and unassigned klirichek Dec 12, 2024
@sanikolaev
Collaborator

Thanks for preparing the test case @zhangdapao745

@tomatolog please reproduce it, profile it, and suggest what we could do to improve the performance.
