From 758573fff7154e433737f52a1d4e05ab9492d196 Mon Sep 17 00:00:00 2001
From: Alex Chi <iskyzh@gmail.com>
Date: Mon, 7 Sep 2020 12:45:54 +0800
Subject: [PATCH 1/7] enum support in copr

Signed-off-by: Alex Chi <iskyzh@gmail.com>
---
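Note: the sketch below illustrates the enum chunk layout proposed in this
patch. It is a minimal, dependency-free approximation and not the TiKV
implementation: `EnumChunkSketch` is a hypothetical name, a `Vec<bool>` stands
in for `BitVec`, and the `push` helper is an assumption made for illustration.
Indices are 1-based and a NULL row stores 0, as in the RFC example.

```rust
/// Hypothetical stand-in for the proposed `ChunkedVecEnum`.
/// `Vec<bool>` replaces `BitVec` so the sketch needs no external crate.
pub struct EnumChunkSketch {
    pub var_offset: Vec<usize>,
    pub enum_data: Vec<u8>,
    pub bitmap: Vec<bool>,
    pub values: Vec<usize>,
}

impl EnumChunkSketch {
    /// Packs all possible values into one byte vector plus an offset array.
    pub fn new(names: &[&str]) -> Self {
        let mut var_offset = vec![0];
        let mut enum_data = Vec::new();
        for name in names {
            enum_data.extend_from_slice(name.as_bytes());
            var_offset.push(enum_data.len());
        }
        Self {
            var_offset,
            enum_data,
            bitmap: Vec::new(),
            values: Vec::new(),
        }
    }

    /// Appends one row. `None` is NULL. Returns `false` (and stores nothing)
    /// when the name is not one of the possible values.
    pub fn push(&mut self, value: Option<&str>) -> bool {
        match value {
            None => {
                self.bitmap.push(false);
                self.values.push(0);
                true
            }
            Some(name) => match self.position(name) {
                Some(idx) => {
                    self.bitmap.push(true);
                    self.values.push(idx);
                    true
                }
                None => false,
            },
        }
    }

    /// Finds the 1-based index of `name` among the possible values.
    fn position(&self, name: &str) -> Option<usize> {
        self.var_offset
            .windows(2)
            .position(|w| &self.enum_data[w[0]..w[1]] == name.as_bytes())
            .map(|i| i + 1)
    }
}
```

Pushing "small", "medium", NULL into a chunk built from the `size` column
reproduces the offset array and the value array shown in the RFC text.
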
 text/2020-09-06-enum-in-copr.md | 157 ++++++++++++++++++++++++++++++++
 1 file changed, 157 insertions(+)
 create mode 100644 text/2020-09-06-enum-in-copr.md
diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
new file mode 100644
index 00000000..fcb1e40f
--- /dev/null
+++ b/text/2020-09-06-enum-in-copr.md
@@ -0,0 +1,157 @@
+# Enum and Set support in TiKV coprocessor
+
+## Motivation
+
+Currently, TiKV and TiDB see an enum as a string. In this RFC,
+we want to discuss adding real enum support in TiKV coprocessor.
+
+## Representation of Enum and Set
+
+Enum column stores a finite set of string values. To represent one enum column,
+we first need to introduce a chunk format for the enum column.
+
+Inside the enum chunk, we first store all possible values of this column. After
+that, we use a usize vector to store actual elements of an enum column. For
+example, we have the following column from MySQL reference manual:
+
+```text
+size ENUM('x-small', 'small', 'medium', 'large', 'x-large')
+```
+
+First, we store all possible values sequentially in one byte vector. We use an
+offset array to indicate the beginning of each element.
+
+```text
+Byte vector: x-smallsmallmediumlargex-large
+Offset array: 0, 7, 12, 18, 23, 30
+```
+
+Then, we have a bitmap and usize array to store each element. We take “small”,
+“medium”, NULL as an example.
+
+```text
+Bitmap: 110 (=6)
+Array: 2, 3, 0
+```
+
+This design leads to an enum chunk vector, which efficiently stores an enum column.
+
+```rust
+pub ChunkedVecEnum {
+    var_offset: Vec<usize>,
+    enum_data: Vec<u8>,
+    bitmap: BitVec,
+    values: Vec<usize>
+}
+```
+
+## Support Enum and Set in Vectorized Functions
+
+To add support for enums in vectorized functions, we need to change the `rpn_fn`
+macro, and define corresponding types to represent one enum value.
+
+Enums can only appear as parameters of vectorized functions. No function can
+return an enum value. This constraint greatly simplifies our design.
+
+An enum value must be bound to an enum chunk vector. Hence, to store enum
+values inside the coprocessor framework, we must define the following structures.
+
+To represent a single enum value, we could use the `Enum` structure. Note that this
+structure should only be used for unit tests. Enums should only be stored and
+accessed in the format of enum chunk vectors.
+
+```rust
+pub struct Enum {
+    var_offset: Vec<usize>,
+    enum_data: Vec<u8>,
+    value: usize
+}
+```
+
+To represent reference to an enum value, we could use `EnumRef` structure.
+Typically, `var_offset` and `enum_values` refers to the same fields in
+`ChunkedVecEnum`.
+
+```rust
+pub struct EnumRef <'a> {
+    var_offset: &'a[usize],
+    enum_data: &'a[u8],
+    index: usize
+}
+```
+
+After that, we could refactor the coprocesser framework to support using
+enums during computation.
+
+```rust
+#[derive(Debug, PartialEq, Clone)]
+pub enum VectorValue {
+    Int(ChunkedVecSized<Int>),
+    Real(ChunkedVecSized<Real>),
+    Decimal(ChunkedVecSized<Decimal>),
+    Bytes(ChunkedVecBytes),
+    DateTime(ChunkedVecSized<DateTime>),
+    Duration(ChunkedVecSized<Duration>),
+    Json(ChunkedVecJson),
+    Enum(ChunkedVecEnum)
+}
+```
+
+```rust
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+pub enum ScalarValueRef<'a> {
+    Int(Option<&'a Int>),
+    // ... other fixed-size types ...
+    Bytes(Option<BytesRef<'a>>),
+    Json(Option<JsonRef<'a>>),
+    Enum(Option<EnumRef<'a>>)
+}
+```
+
+```rust
+#[derive(Clone, Debug, PartialEq)]
+pub enum ScalarValue {
+    Int(Option<super::Int>),
+    Real(Option<super::Real>),
+    Decimal(Option<super::Decimal>),
+    Bytes(Option<super::Bytes>),
+    DateTime(Option<super::DateTime>),
+    Duration(Option<super::Duration>),
+    Json(Option<super::Json>),
+    Enum(Option<super::Enum>)
+}
+```
+
+Like `Bytes` and `Json`, an enum vectorized function accepts `EnumRef` as parameter.
+
+```rust
+#[rpn_fn]
+pub fn cast_enum_to_int(data: EnumRef) -> Result<Option<Int>>;
+```
+
+## Add Cast Functions for Enum
+
+From [MySQL docs](https://dev.mysql.com/doc/refman/8.0/en/enum.html), we can find
+all possible usage of enum column.
+
+For enums, as we only accept them as inputs of vectorized functions, the only functions
+we need to implement are casting functions. For other functions that may use enum
+as input, we could always first convert enums to `Bytes` or `Int`, and then use the
+casting result as inputs.
+
+Enum could be casted to `Bytes` and `Int`. Therefore, in TiKV coprocessor, we will
+need to implement `cast_enum_to_bytes` and `cast_enum_to_int`.
+
+## Integration with TiDB (future work)
+
+Currently, TiDB treat enum and set as (name, value) pair. To enable full support
+for enum functions, TiDB also need to be refactored. This may include:
+
+* Change EvalType and FieldType in tipb
+* Cast enum to string and enum to int in SQL plan
+* Implement enum and set Chunk vector on TiDB side
+* decode new chunk format in LazyColumn
+
+This task should be done on the TiDB side, so this RFC does not cover it.
+Here, we only ensure that the TiKV coprocessor will work correctly
+with the future enum/set chunk vector.

From 84b2ae6a28c798e122f7cc22fd6caf3333616439 Mon Sep 17 00:00:00 2001
From: Alex Chi <iskyzh@gmail.com>
Date: Mon, 7 Sep 2020 12:51:21 +0800
Subject: [PATCH 2/7] update integration

Signed-off-by: Alex Chi <iskyzh@gmail.com>
---
 text/2020-09-06-enum-in-copr.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index fcb1e40f..8279b055 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -148,6 +148,7 @@ Currently, TiDB treat enum and set as (name, value) pair. To enable full support
 for enum functions, TiDB also need to be refactored. This may include:
 
 * Change EvalType and FieldType in tipb
+* Add new signatures in tipb
 * Cast enum to string and enum to int in SQL plan
 * Implement enum and set Chunk vector on TiDB side
 * decode new chunk format in LazyColumn

From 60b3126bc40016c0fa7e481f2f052fe18ce9648d Mon Sep 17 00:00:00 2001
From: Alex Chi <iskyzh@gmail.com>
Date: Mon, 7 Sep 2020 12:59:29 +0800
Subject: [PATCH 3/7] add set in RFC

Signed-off-by: Alex Chi <iskyzh@gmail.com>
---
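Note: a minimal sketch of how a set value could be decoded against the shared
byte vector and offset array described in this patch. It is an illustration
under stated assumptions, not the TiKV implementation: the function names are
hypothetical, and a `u64` mask stands in for `BitVec` (MySQL limits a SET
column to 64 members). Bit `i`, counted from the least significant bit,
selects the `i`-th possible value.

```rust
/// Hypothetical helper: returns the member names selected by `mask`.
/// Bit `i` (least significant first) selects the `i`-th possible value.
pub fn set_members<'a>(var_offset: &[usize], set_data: &'a [u8], mask: u64) -> Vec<&'a [u8]> {
    let count = var_offset.len().saturating_sub(1);
    (0..count)
        .filter(|&i| i < 64 && (mask >> i) & 1 == 1)
        .map(|i| &set_data[var_offset[i]..var_offset[i + 1]])
        .collect()
}

/// Joins the selected members with commas, the way MySQL displays a SET value.
pub fn set_to_bytes(var_offset: &[usize], set_data: &[u8], mask: u64) -> Vec<u8> {
    let mut out = Vec::new();
    for (n, member) in set_members(var_offset, set_data, mask).iter().enumerate() {
        if n > 0 {
            out.push(b',');
        }
        out.extend_from_slice(member);
    }
    out
}
```

With `SET('a', 'b', 'c', 'd')`, the value 'a,d' corresponds to the mask
`0b1001`, 'a' to `0b0001`, and the empty set to `0`.
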
 text/2020-09-06-enum-in-copr.md | 43 +++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index 8279b055..c346c896 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -1,4 +1,4 @@
-# Enum and Set support in TiKV coprocessor
+# Enum and Set support in TiKV Coprocessor
 
 ## Motivation
 
@@ -7,6 +7,8 @@ we want to discuss adding real enum support in TiKV coprocessor.
 
 ## Representation of Enum and Set
 
+### Chunk Format of Enum and Set
+
 Enum column stores a finite set of string values. To represent one enum column,
 we first need to introduce a chunk format for the enum column.
 
@@ -16,6 +18,7 @@ example, we have the following column from MySQL reference manual:
 
 ```text
 size ENUM('x-small', 'small', 'medium', 'large', 'x-large')
+col SET('a', 'b', 'c', 'd')
 ```
 
 First, we store all possible values sequentially in one byte vector. We use an
@@ -26,6 +29,8 @@ Byte vector: x-smallsmallmediumlargex-large
 Offset array: 0, 7, 12, 18, 23, 30
 ```
 
+The same layout applies to sets.
+
 Then, we have a bitmap and usize array to store each element. We take “small”,
 “medium”, NULL as an example.
 
@@ -34,7 +39,16 @@ Bitmap: 110 (=6)
 Array: 2, 3, 0
 ```
 
-This design leads to an enum chunk vector, which efficiently stores an enum column.
+For a set, each array element is a `BitVec` marking which members are present. We take
+“('a,d'), ('a'), ('')” as an example.
+
+```text
+Bitmap: 111 (=7)
+Array: 1001B, 0001B, 0000B
+```
+
+This design leads to an enum chunk vector and a set chunk vector, which
+efficiently stores an enum column.
 
 ```rust
 pub ChunkedVecEnum {
@@ -45,6 +59,15 @@ pub ChunkedVecEnum {
 }
 ```
 
+```rust
+pub ChunkedVecSet {
+    var_offset: Vec<usize>,
+    set_data: Vec<u8>,
+    bitmap: BitVec,
+    values: Vec<BitVec>
+}
+```
+
 ## Support Enum and Set in Vectorized Functions
 
 To add support for enums in vectorized functions, we need to change the `rpn_fn`
@@ -68,6 +91,14 @@ pub struct Enum {
 }
 ```
 
+```rust
+pub struct Set {
+    var_offset: Vec<usize>,
+    set_data: Vec<u8>,
+    value: BitVec
+}
+```
+
 To represent reference to an enum value, we could use `EnumRef` structure.
 Typically, `var_offset` and `enum_values` refers to the same fields in
 `ChunkedVecEnum`.
@@ -80,6 +111,14 @@ pub struct EnumRef <'a> {
 }
 ```
 
+```rust
+pub struct SetRef<'a> {
+    var_offset: &'a [usize],
+    set_data: &'a [u8],
+    value: BitVec
+}
+```
+
 After that, we could refactor the coprocesser framework to support using
 enums during computation.
 

From 795b86c97ea4304cb065763d574f715802abfa27 Mon Sep 17 00:00:00 2001
From: Alex Chi <iskyzh@gmail.com>
Date: Mon, 7 Sep 2020 13:05:40 +0800
Subject: [PATCH 4/7] add aggregators

Signed-off-by: Alex Chi <iskyzh@gmail.com>
---
 text/2020-09-06-enum-in-copr.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index c346c896..9503607e 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -168,7 +168,7 @@ Like `Bytes` and `Json`, an enum vectorized function accepts `EnumRef` as parame
 pub fn cast_enum_to_int(data: EnumRef) -> Result<Option<Int>>;
 ```
 
-## Add Cast Functions for Enum
+## Add Cast Functions for Enum and Set
 
 From [MySQL docs](https://dev.mysql.com/doc/refman/8.0/en/enum.html), we can find
 all possible usage of enum column.
@@ -181,6 +181,11 @@ casting result as inputs.
 Enum could be casted to `Bytes` and `Int`. Therefore, in TiKV coprocessor, we will
 need to implement `cast_enum_to_bytes` and `cast_enum_to_int`.
 
+## Aggregators for Enum and Set
+
+For other SQL functions, such as `MAX`, `MIN`, and so on, we could implement them
+as aggregators. This can be done by modifying current implemented aggregators.
+
 ## Integration with TiDB (future work)
 
 Currently, TiDB treat enum and set as (name, value) pair. To enable full support

From cb2b33ff357568beab68f91ba224ccaa0791402e Mon Sep 17 00:00:00 2001
From: Alex Chi <iskyzh@gmail.com>
Date: Mon, 7 Sep 2020 13:09:55 +0800
Subject: [PATCH 5/7] add control functions

Signed-off-by: Alex Chi <iskyzh@gmail.com>
---
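Note: a standalone sketch of what the two enum cast functions named in this
patch might compute. It deliberately does not use the `rpn_fn` macro or any
TiKV types. The `EnumRef` here only mirrors the fields listed in the RFC, the
constructor is a hypothetical helper, `i64` stands in for `Int`, and mapping
index 0 to an empty string (MySQL's error value) is an assumption.

```rust
/// Mirrors the `EnumRef` fields from the RFC; not the real TiKV type.
#[derive(Clone, Copy)]
pub struct EnumRef<'a> {
    var_offset: &'a [usize],
    enum_data: &'a [u8],
    index: usize, // 1-based; 0 is treated as the empty string here
}

impl<'a> EnumRef<'a> {
    /// Hypothetical constructor, added only so the sketch can be exercised.
    pub fn new(var_offset: &'a [usize], enum_data: &'a [u8], index: usize) -> Self {
        Self { var_offset, enum_data, index }
    }
}

/// Casting to an integer yields the 1-based index; NULL stays NULL.
pub fn cast_enum_to_int(data: Option<EnumRef<'_>>) -> Option<i64> {
    data.map(|e| e.index as i64)
}

/// Casting to bytes copies the name out of the shared byte vector.
pub fn cast_enum_to_bytes(data: Option<EnumRef<'_>>) -> Option<Vec<u8>> {
    data.map(|e| {
        if e.index == 0 {
            Vec::new()
        } else {
            e.enum_data[e.var_offset[e.index - 1]..e.var_offset[e.index]].to_vec()
        }
    })
}
```

Both functions only read from the enum value, which matches the statement that
enums and sets are used purely as inputs of cast functions.
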
 text/2020-09-06-enum-in-copr.md | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index 9503607e..cfa2f29c 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -168,20 +168,21 @@ Like `Bytes` and `Json`, an enum vectorized function accepts `EnumRef` as parame
 pub fn cast_enum_to_int(data: EnumRef) -> Result<Option<Int>>;
 ```
 
-## Add Cast Functions for Enum and Set
+## Add Vectorized Functions for Enum and Set
 
-From [MySQL docs](https://dev.mysql.com/doc/refman/8.0/en/enum.html), we can find
-all possible usage of enum column.
+### Cast Functions
 
-For enums, as we only accept them as inputs of vectorized functions, the only functions
-we need to implement are casting functions. For other functions that may use enum
-as input, we could always first convert enums to `Bytes` or `Int`, and then use the
-casting result as inputs.
+Enums can be cast to `Bytes` and `Int`. In these functions, enums and sets
+are only used as inputs. Therefore, in the TiKV coprocessor, we will need to
+implement `cast_enum_to_bytes`, `cast_enum_to_int`, and similar functions.
 
-Enum could be casted to `Bytes` and `Int`. Therefore, in TiKV coprocessor, we will
-need to implement `cast_enum_to_bytes` and `cast_enum_to_int`.
+### Control Functions
 
-## Aggregators for Enum and Set
+For the `IF` and `CASE` functions, the output vector and the input vectors should
+share the same set of possible values. We will need to modify the coprocessor
+`rpn_fn` macro to support this kind of function.
+
+### Aggregators
 
 For other SQL functions, such as `MAX`, `MIN`, and so on, we could implement them
 as aggregators. This can be done by modifying current implemented aggregators.

From 212ba743212300f314a06c6bd10573af5dd68c3b Mon Sep 17 00:00:00 2001
From: Alex Chi <iskyzh@gmail.com>
Date: Mon, 7 Sep 2020 13:11:10 +0800
Subject: [PATCH 6/7] add set in enums

Signed-off-by: Alex Chi <iskyzh@gmail.com>
---
 text/2020-09-06-enum-in-copr.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index cfa2f29c..6304268a 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -132,7 +132,8 @@ pub enum VectorValue {
     DateTime(ChunkedVecSized<DateTime>),
     Duration(ChunkedVecSized<Duration>),
     Json(ChunkedVecJson),
-    Enum(ChunkedVecEnum)
+    Enum(ChunkedVecEnum),
+    Set(ChunkedVecSet)
 }
 ```
 
@@ -143,7 +144,8 @@ pub enum ScalarValueRef<'a> {
     // ... other fixed-size types ...
     Bytes(Option<BytesRef<'a>>),
     Json(Option<JsonRef<'a>>),
-    Enum(Option<EnumRef<'a>>)
+    Enum(Option<EnumRef<'a>>),
+    Set(Option<SetRef<'a>>)
 }
 ```
 
@@ -157,7 +159,8 @@ pub enum ScalarValue {
     DateTime(Option<super::DateTime>),
     Duration(Option<super::Duration>),
     Json(Option<super::Json>),
-    Enum(Option<super::Enum>)
+    Enum(Option<super::Enum>),
+    Set(Option<super::Set>)
 }
 ```
 

From 636545194a416c40a2272086c299060945ea1246 Mon Sep 17 00:00:00 2001
From: Alex Chi <iskyzh@gmail.com>
Date: Mon, 7 Sep 2020 13:14:10 +0800
Subject: [PATCH 7/7] fix doc

Signed-off-by: Alex Chi <iskyzh@gmail.com>
---
 text/2020-09-06-enum-in-copr.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index 6304268a..8e86f0ba 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -2,8 +2,8 @@
 
 ## Motivation
 
-Currently, TiKV and TiDB see an enum as a string. In this RFC,
-we want to discuss adding real enum support in TiKV coprocessor.
+Currently, TiKV and TiDB treat enums and sets as strings. In this RFC,
+we discuss adding real enum and set support to the TiKV coprocessor.
 
 ## Representation of Enum and Set
 
@@ -51,7 +51,7 @@ This design leads to an enum chunk vector and a set chunk vector, which
 efficiently stores an enum column.
 
 ```rust
-pub ChunkedVecEnum {
+pub struct ChunkedVecEnum {
     var_offset: Vec<usize>,
     enum_data: Vec<u8>,
     bitmap: BitVec,
@@ -60,7 +60,7 @@ pub ChunkedVecEnum {
 ```
 
 ```rust
-pub ChunkedVecSet {
+pub struct ChunkedVecSet {
     var_offset: Vec<usize>,
     set_data: Vec<u8>,
     bitmap: BitVec,