Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update doc #584

Merged
merged 1 commit into from
Jul 24, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view

Large diffs are not rendered by default.

Large diffs are not rendered by default.

351 changes: 267 additions & 84 deletions docs/MatrixOne/Develop/Publish-Subscribe/pub-sub-overview.md

Large diffs are not rendered by default.

169 changes: 169 additions & 0 deletions docs/MatrixOne/Develop/Vector/cluster_centers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,169 @@
# Cluster Center

## What is a Cluster Center

When using clustering algorithms, especially K-means, the number of clusters K represents the number of clusters into which you want to divide the data set. Each cluster is represented by its centroid, which is the central point or average position of all data points within the cluster.

In the K-means algorithm, the choice of K has a great influence on the clustering results. Choosing the right K value can help you better understand the structure and pattern of your data. If the K value is not chosen properly, it can cause the following problems:

- The K value is too small: it can cause different clusters to merge together, losing important patterns in the data.
- K-values are too large: may result in over-segmentation of data, with each cluster containing very few data points, which may mask general trends in the data.

Matrixone provides a cluster center query to determine the K cluster centers of a vector column.

## Application scenarios for clustering centers

Clustering plays an important role in data analysis and machine learning. Here are some of the main application scenarios for clustering centers:

- Market segmentation: In market analysis, clustering centers can help identify different customer group characteristics to customize marketing strategies for each group.

- Image segmentation: In image processing, clustering centers are used to distinguish different regions or objects in an image, often used for image compression and segmentation.

- Social network analysis: Cluster centers allow the identification of groups of users with similar behaviors or interests in social networks.

- Anomaly detection: Cluster centers can help identify anomalies in the data, as anomalies are often far from all cluster centers.

- Astronomical data analysis: In astronomy, clustering centers can be used to identify the characteristics of clusters of galaxies or star clusters.

## involve algorithms

Determining the clustering center of a vector dataset in Matrixone involves the following algorithms:

- Random: In random initialization, the algorithm randomly selects n\_clusters of observations from the data set as the initial centroid. This method is simple and fast, but may result in the quality of the clustering results depending on the selection of the initial centroid, as random selection may not fall in a dense region of the data.

- K-means++ (k-means++ initialization): k-means++ is a more advanced initialization method designed to improve the inadequacy of random initialization by selecting the initial centroid through a multi-step process to increase the probability that the selected centroid points represent the overall distribution of data.

- Regular Kmeans: A widely used clustering method designed to divide data points into K clusters so that data points within clusters are as similar as possible and data points between clusters as different as possible. This method measures similarity between data points based on Euclidean distance, so it is better suited for processing data in plane space.

- Spherical Kmeans: An algorithm for clustering data points. The process of calculating the center of a cluster by the Spherical K-means algorithm involves normalizing the data points. Especially suitable for high-dimensional and sparse high-dimensional and sparse, or data where the directionality of the data points is more important than the distance, such as text data, geographic location or user interest models.

## Examples

### Example 1

Suppose we have annual shopping data for a set of customers, including their annual income and total annual consumption. We want to use this data to understand our customers' consumption behavior and break it down into different consumer behavior groups.

#### Steps

1. Create customer table and insert data

Prepare a table named `customer_table` that inserts 10 pieces of customer data. The two-dimensional vector represents the customer's annual revenue and total annual consumption.

```sql
CREATE TABLE customer_table(id int auto_increment PRIMARY KEY,in_ex vecf64(2));
INSERT INTO customer_table(in_ex) VALUES("[120,50]"),("[80,25]"),("[200,100]"),("[100,40]"),("[300,120]"),("[150,75]"),("[90,30]"),("[250,90]"),("[75,20]"),("[150,60]");

mysql> select * from customer_table;
+------+------------+
| id | in_ex |
+------+------------+
| 1 | [120, 50] |
| 2 | [80, 25] |
| 3 | [200, 100] |
| 4 | [100, 40] |
| 5 | [300, 120] |
| 6 | [150, 75] |
| 7 | [90, 30] |
| 8 | [250, 90] |
| 9 | [75, 20] |
| 10 | [150, 60] |
+------+------------+
10 rows in set (0.01 sec)
```

2. Determining the Cluster Center

```sql
mysql> SELECT cluster_centers(in_ex kmeans '2,vector_l2_ops,random,false') AS centers FROM customer_table;
+------------------------------------------------------------------------+
| centers |
+------------------------------------------------------------------------+
| [ [109.28571428571428, 42.857142857142854],[250, 103.33333333333333] ] |
+------------------------------------------------------------------------+
1 row in set (0.00 sec)
```

3. Check Cluster Center

A good cluster usually appears as a distinctly separated group in the visualization. As can be seen from the figure below, the cluster center selection is more appropriate.

<div align="center">
<img src=https://community-shared-data-1308875761.cos.ap-beijing.myqcloud.com/artwork/docs/tutorial/Vector/cluster-center.png width=60% heigth=60%/>
</div>

By identifying cluster centers, we can divide our customers into two groups: those with middle income and middle consumption levels (cluster center A) and those with higher income and higher consumption levels (cluster center B). Merchants can tailor their product positioning to each group's consumption characteristics, such as offering better value for money for Cluster Center A and high-end or luxury brands for Cluster Center B.

### Example 2

A music streaming service wants to divide users into groups based on their preferences for different music types in order to offer personalized playlists. They collected user preference ratings (1 for disinterested and 5 for very fond) for five music types: rock, pop, jazz, classical, and hip-hop.

#### Steps

1. Build music type table and insert data

Prepare a table named `music_table` that inserts 5 pieces of user data. The five-dimensional vectors correspond to user preference scores for five genres of music: rock, pop, jazz, classical and hip-hop.

```sql
CREATE TABLE music_table(id int,grade vecf64(5));
INSERT INTO music_table VALUES(1,"[5,2,3,1,4]"),(2,"[3,5,2,1,4]"),(3,"[4,3,5,1,2]"),(4,"[2,5,4,3,1]"),(5,"[5,4,3,2,5]");

mysql> select * from music_table;
+------+-----------------+
| id | grade |
+------+-----------------+
| 1 | [5, 2, 3, 1, 4] |
| 2 | [3, 5, 2, 1, 4] |
| 3 | [4, 3, 5, 1, 2] |
| 4 | [2, 5, 4, 3, 1] |
| 5 | [5, 4, 3, 2, 5] |
+------+-----------------+
5 rows in set (0.01 sec)
```

2. View vector normalization results

```sql
mysql> select normalize_l2(grade) from music_table;
+---------------------------------------------------------------------------------------------------------+
| normalize_l2(grade) |
+---------------------------------------------------------------------------------------------------------+
| [0.6741998624632421, 0.26967994498529685, 0.40451991747794525, 0.13483997249264842, 0.5393598899705937] |
| [0.40451991747794525, 0.6741998624632421, 0.26967994498529685, 0.13483997249264842, 0.5393598899705937] |
| [0.5393598899705937, 0.40451991747794525, 0.6741998624632421, 0.13483997249264842, 0.26967994498529685] |
| [0.26967994498529685, 0.6741998624632421, 0.5393598899705937, 0.40451991747794525, 0.13483997249264842] |
| [0.562543950463012, 0.4500351603704096, 0.3375263702778072, 0.2250175801852048, 0.562543950463012] |
+---------------------------------------------------------------------------------------------------------+
5 rows in set (0.01 sec)
```

3. Determining the Cluster Center

```sql
mysql> SELECT cluster_centers(grade kmeans '2,vector_l2_ops,kmeansplusplus,true') AS centers FROM music_table;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| centers |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [ [0.3370999312316211, 0.6741998624632421, 0.40451991747794525, 0.26967994498529685, 0.3370999312316211],[0.5920345676322826, 0.3747450076112172, 0.4720820500729982, 0.16489917505683388, 0.4571945951396342] ] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
```

4. Check Cluster Center

Use t-SNE to reduce high-dimensional data to 2D and visualize clustering results. As can be seen from the figure below, the data points are clearly separated by cluster centers in the space after dimension reduction, which increases confidence in the correctness of the cluster centers.

<div align="center">
<img src=https://community-shared-data-1308875761.cos.ap-beijing.myqcloud.com/artwork/docs/tutorial/Vector/cluster_center2.png width=70% heigth=70%/>
</div>

By determining the cluster centers, we can divide users into two groups: Cluster 1 is primarily composed of users who prefer rock and hip hop music, which may represent a group of users seeking modern and rhythmic music. Cluster 2 is composed of users who prefer pop and jazz music, which may represent a group of users who prefer melodic and relaxed atmosphere music. Media companies can push out styles of music for users based on their preferences.

## Reference Documents

[Vector data type](../../Reference/Data-Types/vector-type.md)

[CLUSTER_CENTERS()](../../Reference/Functions-and-Operators/Vector/cluster_centers.md)

[L2_DISTANCE()](../../Reference/Functions-and-Operators/Vector/l2_distance.md)

[NORMALIZE_L2()](MatrixOne/Reference/Functions-and-Operators/Vector/normalize_l2.md)
79 changes: 79 additions & 0 deletions docs/MatrixOne/Develop/Vector/vector_search.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
# Vector retrieval

## What is vector retrieval

Vector retrieval is the retrieval of K vectors (K-Nearest Neighbor, KNN) that are close to the query vectors in a given vector dataset by some measure. This is a technique for finding vectors similar to a given query vector in large-scale high-dimensional vector data. Vector retrieval has a wide range of applications in many AI fields, such as image retrieval, text retrieval, speech recognition, recommendation systems, and more. Vector retrieval is very different from traditional database retrieval.Scalar search on traditional database mainly targets structured data for accurate data query,while vector search mainly targets vector data after vectorization of unstructured data for similar retrieval,which can only approximate the best match.

<div align="center">
<img src=https://community-shared-data-1308875761.cos.ap-beijing.myqcloud.com/artwork/docs/reference/vector/vector_vs_scalar.png width=60% heigth=60%/>
</div>

Matrixone currently supports vector retrieval using the following distance measure functions:

- Cosine similarity function [`cosine_similarity`](../../Reference/Functions-and-Operators/Vector/cosine_similarity.md)
- Cosine distance function [`cosine_distance`](../../Reference/Functions-and-Operators/Vector/cosine_distance.md)
- L2 distance function [`l2_distance`](../../Reference/Functions-and-Operators/Vector/l2_distance.md)

!!! note Matrixone currently only supports fast KNN queries using vector indexes on the l2\_distance measure.

## Application scenarios for vector retrieval

Having vector capability in a database means that the database system has the ability to store, query, and analyze vector data. These vectors are often associated with complex data analysis, machine learning, and data mining tasks. Here are some application scenarios where the database has vector processing power:

- **Generative AI applications**: These databases can serve as the backend for generative AI applications, enabling them to obtain nearest neighbor results based on user-supplied queries, improving output quality and relevance.
- **Advanced object recognition**: They are invaluable for developing advanced object recognition platforms that recognize similarities between different data sets. This has practical applications in areas such as plagiarism detection, facial recognition and DNA matching.
- **Personalized recommendation systems**: Vector databases can enhance recommendation systems by integrating user preferences and choices. This will result in more accurate and targeted recommendations that improve the user experience and engagement.
- **Anomaly detection**: A vector database can be used to store feature vectors representing normal behavior. The anomaly can then be detected by comparing the input vector with the storage vector. This is useful in cybersecurity and industrial quality control.
- **Marketing optimization**: Through the analysis and mining of user data, vector database can realize personalized recommendations, customer segmentation and market trend forecasting, and other functions to provide enterprises with accurate marketing strategies.
- **Natural language processing**: Vector database can process large-scale text data, realize semantic similarity search, text classification, document clustering and other natural language processing tasks, widely used in intelligent customer service, public opinion analysis and other fields.
- **Semantic Search and** Retrieval: In applications involving large language models, vector databases can store and retrieve massive amounts of text vectors, and intelligent text matching and semantic search can be achieved by calculating similarities between vectors.

## Examples

The Iris dataset is a well-known multi-class taxonomic dataset that can be searched and downloaded online by itself. This dataset contains 150 samples divided into 3 categories: Iris Setosa (mountain iris), Iris Versicolour (chromatic iris) and Iris Virginica (virginian iris). Each sample has 4 characteristics: sepal length, sepal width, petal length, and petal width. Below we perform a KNN query (based on l2\_distance) on the Iris dataset to determine the type of sample by identifying the K samples that most closely resemble a particular sample based on its characteristics.

1. Create Iris tables and import data

Prepare a table named `iris_table` and the corresponding Iris dataset data. The dataset has 150 rows of data, each row consisting of a four-dimensional eigenvector and species.

```sql
CREATE TABLE iris_table(
species varchar(100), --category
attributes vecf64(4) --feature
);
LOAD DATA INFILE '/your_path/iris.csv' INTO TABLE iris_table;
```

2. Use KNN to predict the category of this input feature

```sql
mysql> select * from iris_table order by l2_distance(attributes,"[4,3.3,3,0.9]") asc limit 1;
+------------------+--------------------+
| species | attributes |
+------------------+--------------------+
| Iris-versicolour | [4.9, 2.4, 3.3, 1] |
+------------------+--------------------+
1 row in set (0.00 sec)

mysql> select * from iris_table order by l2_distance(attributes,"[4,3.3,3,0.9]") asc limit 5;
+------------------+----------------------+
| species | attributes |
+------------------+----------------------+
| Iris-versicolour | [4.9, 2.4, 3.3, 1] |
| Iris-versicolour | [5.1, 2.5, 3, 1.1] |
| Iris-versicolour | [5, 2.3, 3.3, 1] |
| Iris-setosa | [4.8, 3.4, 1.9, 0.2] |
| Iris-versicolour | [5.2, 2.7, 3.9, 1.4] |
+------------------+----------------------+
5 rows in set (0.00 sec)
```

After searching, we can roughly determine that the sample type is discolored Iris.

To understand the role of vector retrieval in building RAG applications, refer to the [RAG Application Foundation](../../Tutorial/rag-demo.md) example in the View Application Development Example.

## Reference Documents

[Vector data type](../../Reference/Data-Types/vector-type.md)

[L2_DISTANCE()](../../Reference/Functions-and-Operators/Vector/l2_distance.md)
Loading
Loading