From 758573fff7154e433737f52a1d4e05ab9492d196 Mon Sep 17 00:00:00 2001
From: Alex Chi
Date: Mon, 7 Sep 2020 12:45:54 +0800
Subject: [PATCH 1/7] enum suppoort in copr

Signed-off-by: Alex Chi
---
 text/2020-09-06-enum-in-copr.md | 157 ++++++++++++++++++++++++++++++++
 1 file changed, 157 insertions(+)
 create mode 100644 text/2020-09-06-enum-in-copr.md

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
new file mode 100644
index 00000000..fcb1e40f
--- /dev/null
+++ b/text/2020-09-06-enum-in-copr.md
@@ -0,0 +1,157 @@
+# Enum and Set support in TiKV coprocessor
+
+## Motivation
+
+Currently, TiKV and TiDB see an enum as a string. In this RFC,
+we want to discuss adding real enum support in TiKV coprocessor.
+
+## Representation of Enum and Set
+
+Enum column stores a finite set of string values. To represent one enum column,
+we first need to introduce a chunk format for the enum column.
+
+Inside the enum chunk, we first store all possible values of this column. After
+that, we use a usize vector to store actual elements of an enum column. For
+example, we have the following column from MySQL reference manual:
+
+```text
+size ENUM('x-small', 'small', 'medium', 'large', 'x-large')
+```
+
+First, we store all possible values sequentially in one byte vector. We use an
+offset array to indicate the beginning of each element.
+
+```text
+Byte vector: x-smallsmallmediumlargex-large
+Offset array: 0, 7, 12, 18, 23, 30
+```
+
+Then, we have a bitmap and usize array to store each element. We take “small”,
+“medium”, NULL as an example.
+
+```text
+Bitmap: 110 (=6)
+Array: 2, 3, 0
+```
+
+This design leads to an enum chunk vector, which efficiently stores an enum column.
+
+```rust
+pub ChunkedVecEnum {
+    var_offset: Vec<usize>,
+    enum_data: Vec<u8>,
+    bitmap: BitVec,
+    values: Vec<usize>
+}
+```
+
+## Support Enum and Set in Vectorized Functions
+
+To add support for enums in vectorized functions, we need to change the `rpn_fn`
+macro, and define corresponding types to represent one enum value.
+
+Enum can only appear as parameters of vectorized functions. No function could
+return an enum value. This constraint would greatly simplify our design.
+
+An enum value must be binded with an enum chunk vector. Hence, to store enum
+values inside the coprocessor framework, we must define the following structures.
+
+To represent only one enum value, we could use `Enum` structure. Note that this
+structure should only be used for unit tests. Enums should only be stored and
+accesed in the format of enum chunk vectors.
+
+```rust
+pub struct Enum {
+    var_offset: Vec<usize>,
+    enum_data: Vec<u8>,
+    value: usize
+}
+```
+
+To represent reference to an enum value, we could use `EnumRef` structure.
+Typically, `var_offset` and `enum_values` refers to the same fields in
+`ChunkedVecEnum`.
+
+```rust
+pub struct EnumRef <'a> {
+    var_offset: &'a[usize],
+    enum_data: &'a[u8],
+    index: usize
+}
+```
+
+After that, we could refactor the coprocesser framework to support using
+enums during computation.
+
+```rust
+#[derive(Debug, PartialEq, Clone)]
+pub enum VectorValue {
+    Int(ChunkedVecSized<Int>),
+    Real(ChunkedVecSized<Real>),
+    Decimal(ChunkedVecSized<Decimal>),
+    Bytes(ChunkedVecBytes),
+    DateTime(ChunkedVecSized<DateTime>),
+    Duration(ChunkedVecSized<Duration>),
+    Json(ChunkedVecJson),
+    Enum(ChunkedVecEnum)
+}
+```
+
+```rust
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+pub enum ScalarValueRef<'a> {
+    Int(Option<&'a Int>),
+    // ... other fixed-size types ...
+    Bytes(Option<BytesRef<'a>>),
+    Json(Option<JsonRef<'a>>),
+    Enum(Option<EnumRef<'a>>)
+}
+```
+
+```rust
+#[derive(Clone, Debug, PartialEq)]
+pub enum ScalarValue {
+    Int(Option<Int>),
+    Real(Option<Real>),
+    Decimal(Option<Decimal>),
+    Bytes(Option<Bytes>),
+    DateTime(Option<DateTime>),
+    Duration(Option<Duration>),
+    Json(Option<Json>),
+    Enum(Option<Enum>)
+}
+```
+
+Like `Bytes` and `Json`, an enum vectorized function accepts `EnumRef` as parameter.
+
+```rust
+#[rpn_fn]
+pub fn cast_enum_to_int(data: EnumRef) -> Result<Option<Int>>;
+```
+
+## Add Cast Functions for Enum
+
+From [MySQL docs](https://dev.mysql.com/doc/refman/8.0/en/enum.html), we can find
+all possible usage of enum column.
+
+For enums, as we only accept them as inputs of vectorized functions, the only functions
+we need to implement are casting functions. For other functions that may use enum
+as input, we could always first convert enums to `Bytes` or `Int`, and then use the
+casting result as inputs.
+
+Enum could be casted to `Bytes` and `Int`. Therefore, in TiKV coprocessor, we will
+need to implement `cast_enum_to_bytes` and `cast_enum_to_int`.
+
+## Integration with TiDB (future work)
+
+Currently, TiDB treat enum and set as (name, value) pair. To enable full support
+for enum functions, TiDB also need to be refactored. This may include:
+
+* Change EvalType and FieldType in tipb
+* Cast enum to string and enum to int in SQL plan
+* Implement enum and set Chunk vector on TiDB side
+* decode new chunk format in LazyColumn
+
+This task should be done on TiDB side. In this RFC, we doesn’t consider this
+part. In this RFC, we only ensure that TiKV coprocessor would work correctly
+with the future enum/set chunk vector.
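
The enum chunk layout and the two cast functions proposed in the patch above can be illustrated with a small, dependency-free sketch. This is not the TiKV implementation: `EnumChunk` and its methods are hypothetical names, a `Vec<bool>` stands in for `BitVec`, and the 1-based row index (with 0 stored for NULL rows) is an assumption read off the `Array: 2, 3, 0` example.

```rust
/// Hypothetical, simplified stand-in for the proposed `ChunkedVecEnum`.
pub struct EnumChunk {
    var_offset: Vec<usize>, // start of each candidate value, plus one end sentinel
    enum_data: Vec<u8>,     // all candidate values, concatenated
    bitmap: Vec<bool>,      // true = row is not NULL (stand-in for `BitVec`)
    values: Vec<usize>,     // 1-based candidate index per row; 0 for NULL rows
}

impl EnumChunk {
    /// Builds the byte vector and offset array from the candidate strings.
    pub fn new(candidates: &[&str]) -> Self {
        let mut var_offset = vec![0];
        let mut enum_data = Vec::new();
        for c in candidates {
            enum_data.extend_from_slice(c.as_bytes());
            var_offset.push(enum_data.len());
        }
        EnumChunk { var_offset, enum_data, bitmap: Vec::new(), values: Vec::new() }
    }

    /// Appends one row: `None` is NULL, `Some(i)` is the 1-based candidate index.
    pub fn push(&mut self, row: Option<usize>) {
        if let Some(index) = row {
            let candidates = self.var_offset.len() - 1;
            assert!((1..=candidates).contains(&index), "enum index out of range");
        }
        self.bitmap.push(row.is_some());
        self.values.push(row.unwrap_or(0));
    }

    /// Returns the offset array, including the end sentinel.
    pub fn offsets(&self) -> &[usize] {
        &self.var_offset
    }

    /// Sketch of `cast_enum_to_int`: the stored index, or `None` for NULL.
    pub fn cast_to_int(&self, row: usize) -> Option<i64> {
        if self.bitmap[row] { Some(self.values[row] as i64) } else { None }
    }

    /// Sketch of `cast_enum_to_bytes`: slices the candidate out of the byte vector.
    pub fn cast_to_bytes(&self, row: usize) -> Option<&[u8]> {
        if !self.bitmap[row] {
            return None;
        }
        let index = self.values[row];
        let (start, end) = (self.var_offset[index - 1], self.var_offset[index]);
        Some(&self.enum_data[start..end])
    }
}
```

With the `size` column from the RFC, pushing `Some(2)`, `Some(3)` and `None` reproduces the “small”, “medium”, NULL rows; both casts read the shared dictionary without copying any candidate string.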
From 84b2ae6a28c798e122f7cc22fd6caf3333616439 Mon Sep 17 00:00:00 2001
From: Alex Chi
Date: Mon, 7 Sep 2020 12:51:21 +0800
Subject: [PATCH 2/7] update integration

Signed-off-by: Alex Chi
---
 text/2020-09-06-enum-in-copr.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index fcb1e40f..8279b055 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -148,6 +148,7 @@ Currently, TiDB treat enum and set as (name, value) pair. To enable full support
 for enum functions, TiDB also need to be refactored. This may include:
 
 * Change EvalType and FieldType in tipb
+* Add new signatures in tipb
 * Cast enum to string and enum to int in SQL plan
 * Implement enum and set Chunk vector on TiDB side
 * decode new chunk format in LazyColumn

From 60b3126bc40016c0fa7e481f2f052fe18ce9648d Mon Sep 17 00:00:00 2001
From: Alex Chi
Date: Mon, 7 Sep 2020 12:59:29 +0800
Subject: [PATCH 3/7] add set in RFC

Signed-off-by: Alex Chi
---
 text/2020-09-06-enum-in-copr.md | 43 +++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index 8279b055..c346c896 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -1,4 +1,4 @@
-# Enum and Set support in TiKV coprocessor
+# Enum and Set support in TiKV Coprocessor
 
 ## Motivation
 
@@ -7,6 +7,8 @@ we want to discuss adding real enum support in TiKV coprocessor.
 
 ## Representation of Enum and Set
 
+### Chunk Format of Enum and Set
+
 Enum column stores a finite set of string values. To represent one enum column,
 we first need to introduce a chunk format for the enum column.
 
@@ -16,6 +18,7 @@ example, we have the following column from MySQL reference manual:
 
 ```text
 size ENUM('x-small', 'small', 'medium', 'large', 'x-large')
+col SET('a', 'b', 'c', 'd')
 ```
 
 First, we store all possible values sequentially in one byte vector. We use an
@@ -26,6 +29,8 @@ Byte vector: x-smallsmallmediumlargex-large
 Offset array: 0, 7, 12, 18, 23, 30
 ```
 
+This also applies to set.
+
 Then, we have a bitmap and usize array to store each element. We take “small”,
 “medium”, NULL as an example.
 
@@ -34,7 +39,16 @@ Bitmap: 110 (=6)
 Array: 2, 3, 0
 ```
 
-This design leads to an enum chunk vector, which efficiently stores an enum column.
+And for set, we store `BitVec` inside array. We take “('a,d'), ('a'), ('')” as
+an example.
+
+```text
+Bitmap: 111 (=7)
+Array: 11B, 01B, 00B
+```
+
+This design leads to an enum chunk vector and a set chunk vector, which
+efficiently stores an enum column.
 
 ```rust
 pub ChunkedVecEnum {
@@ -45,6 +59,15 @@ pub ChunkedVecEnum {
 }
 ```
 
+```rust
+pub ChunkedVecSet {
+    var_offset: Vec<usize>,
+    set_data: Vec<u8>,
+    bitmap: BitVec,
+    values: Vec<BitVec>
+}
+```
+
 ## Support Enum and Set in Vectorized Functions
 
 To add support for enums in vectorized functions, we need to change the `rpn_fn`
@@ -68,6 +91,14 @@ pub struct Enum {
 }
 ```
 
+```rust
+pub struct Set {
+    var_offset: Vec<usize>,
+    set_data: Vec<u8>,
+    value: BitVec
+}
+```
+
 To represent reference to an enum value, we could use `EnumRef` structure.
 Typically, `var_offset` and `enum_values` refers to the same fields in
 `ChunkedVecEnum`.
@@ -80,6 +111,14 @@ pub struct EnumRef <'a> {
 }
 ```
 
+```rust
+pub struct SetRef <'a> {
+    var_offset: &'a[usize],
+    enum_data: &'a[u8],
+    value: BitVec
+}
+```
+
 After that, we could refactor the coprocesser framework to support using
 enums during computation.
 
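
For the set representation added in this patch, a rough sketch of turning one stored value back into its comma-joined string may help. It rests on assumptions that the RFC does not state: a `u64` mask stands in for the per-row `BitVec`, bit `i` counted from the low-order end selects candidate `i` (the bit order MySQL documents for `SET`), and `set_to_bytes` is a hypothetical helper name.

```rust
/// Hypothetical helper: renders one set value as a comma-joined byte string.
///
/// `var_offset` and `set_data` are the shared dictionary of candidate values;
/// `mask` stands in for the per-row `BitVec`, low-order bit first.
pub fn set_to_bytes(var_offset: &[usize], set_data: &[u8], mask: u64) -> Vec<u8> {
    let mut out = Vec::new();
    // Each adjacent pair of offsets delimits one candidate value.
    // At most 64 candidates are considered, matching the width of the mask.
    for (i, bounds) in var_offset.windows(2).enumerate().take(64) {
        if mask & (1u64 << i) != 0 {
            if !out.is_empty() {
                out.push(b',');
            }
            out.extend_from_slice(&set_data[bounds[0]..bounds[1]]);
        }
    }
    out
}
```

Under these assumptions, for `col SET('a', 'b', 'c', 'd')` the value `'a,d'` corresponds to the mask `0b1001`, `'a'` to `0b0001`, and the empty set to `0`.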
From 795b86c97ea4304cb065763d574f715802abfa27 Mon Sep 17 00:00:00 2001
From: Alex Chi
Date: Mon, 7 Sep 2020 13:05:40 +0800
Subject: [PATCH 4/7] add aggregators

Signed-off-by: Alex Chi
---
 text/2020-09-06-enum-in-copr.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index c346c896..9503607e 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -168,7 +168,7 @@ Like `Bytes` and `Json`, an enum vectorized function accepts `EnumRef` as parame
 pub fn cast_enum_to_int(data: EnumRef) -> Result<Option<Int>>;
 ```
 
-## Add Cast Functions for Enum
+## Add Cast Functions for Enum and Set
 
 From [MySQL docs](https://dev.mysql.com/doc/refman/8.0/en/enum.html), we can find
 all possible usage of enum column.
@@ -181,6 +181,11 @@ casting result as inputs.
 Enum could be casted to `Bytes` and `Int`. Therefore, in TiKV coprocessor, we will
 need to implement `cast_enum_to_bytes` and `cast_enum_to_int`.
 
+## Aggregators for Enum and Set
+
+For other SQL functions, such as `MAX`, `MIN`, and so on, we could implement them
+as aggregators. This can be done by modifying current implemented aggregators.
+
 ## Integration with TiDB (future work)
 
 Currently, TiDB treat enum and set as (name, value) pair. To enable full support

From cb2b33ff357568beab68f91ba224ccaa0791402e Mon Sep 17 00:00:00 2001
From: Alex Chi
Date: Mon, 7 Sep 2020 13:09:55 +0800
Subject: [PATCH 5/7] add control functions

Signed-off-by: Alex Chi
---
 text/2020-09-06-enum-in-copr.md | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index 9503607e..cfa2f29c 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -168,20 +168,21 @@ Like `Bytes` and `Json`, an enum vectorized function accepts `EnumRef` as parame
 pub fn cast_enum_to_int(data: EnumRef) -> Result<Option<Int>>;
 ```
 
-## Add Cast Functions for Enum and Set
+## Add Vectorized Functions for Enum and Set
 
-From [MySQL docs](https://dev.mysql.com/doc/refman/8.0/en/enum.html), we can find
-all possible usage of enum column.
+### Cast Functions
 
-For enums, as we only accept them as inputs of vectorized functions, the only functions
-we need to implement are casting functions. For other functions that may use enum
-as input, we could always first convert enums to `Bytes` or `Int`, and then use the
-casting result as inputs.
+Enum could be casted to `Bytes` and `Int`. In these functions, enums and sets
+are only used as inputs. Therefore, in TiKV coprocessor, we will need to
+implement `cast_enum_to_bytes` and `cast_enum_to_int`, etc.
 
-Enum could be casted to `Bytes` and `Int`. Therefore, in TiKV coprocessor, we will
-need to implement `cast_enum_to_bytes` and `cast_enum_to_int`.
+### Control Functions
 
-## Aggregators for Enum and Set
+For `IF` and `CASE` functions, the output vector and input vectors should have
+the same ranges of values. We will need to modify the coprocessor `rpn_fn` macro
+to support this kind of functions.
+
+### Aggregators
 
 For other SQL functions, such as `MAX`, `MIN`, and so on, we could implement them
 as aggregators. This can be done by modifying current implemented aggregators.
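
The constraint on control functions (output and inputs share the same range of values) can be sketched independently of the `rpn_fn` macro. Everything here is an assumption for illustration: `if_enum` is a hypothetical name, the `EnumRef` mirror below has public fields only so that the example is self-contained, and reporting a mismatch as a `String` error is a placeholder for whatever error type the coprocessor would use.

```rust
/// Simplified mirror of the proposed `EnumRef`, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct EnumRef<'a> {
    pub var_offset: &'a [usize],
    pub enum_data: &'a [u8],
    pub index: usize,
}

/// Hypothetical scalar form of `IF(cond, then_v, else_v)` over enum inputs.
///
/// Both branches must share one candidate dictionary, so whichever branch is
/// picked can be returned as-is and still points into a valid dictionary.
pub fn if_enum<'a>(
    cond: Option<i64>,
    then_v: Option<EnumRef<'a>>,
    else_v: Option<EnumRef<'a>>,
) -> Result<Option<EnumRef<'a>>, String> {
    if let (Some(a), Some(b)) = (then_v, else_v) {
        // Slice comparison checks contents, not addresses.
        if a.var_offset != b.var_offset || a.enum_data != b.enum_data {
            return Err("IF branches use different enum dictionaries".to_string());
        }
    }
    // A NULL or zero condition selects the else branch.
    Ok(match cond {
        Some(c) if c != 0 => then_v,
        _ => else_v,
    })
}
```

Because the result reuses the inputs' dictionary, an output chunk built from such values would need no new byte vector or offset array, only new row indices.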
From 212ba743212300f314a06c6bd10573af5dd68c3b Mon Sep 17 00:00:00 2001
From: Alex Chi
Date: Mon, 7 Sep 2020 13:11:10 +0800
Subject: [PATCH 6/7] add set in enums

Signed-off-by: Alex Chi
---
 text/2020-09-06-enum-in-copr.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index cfa2f29c..6304268a 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -132,7 +132,8 @@ pub enum VectorValue {
     DateTime(ChunkedVecSized<DateTime>),
     Duration(ChunkedVecSized<Duration>),
     Json(ChunkedVecJson),
-    Enum(ChunkedVecEnum)
+    Enum(ChunkedVecEnum),
+    Set(ChunkedVecSet)
 }
 ```
 
@@ -143,7 +144,8 @@ pub enum ScalarValueRef<'a> {
     // ... other fixed-size types ...
     Bytes(Option<BytesRef<'a>>),
     Json(Option<JsonRef<'a>>),
-    Enum(Option<EnumRef<'a>>)
+    Enum(Option<EnumRef<'a>>),
+    Set(Option<SetRef<'a>>)
 }
 ```
 
@@ -157,7 +159,8 @@ pub enum ScalarValue {
     DateTime(Option<DateTime>),
     Duration(Option<Duration>),
     Json(Option<Json>),
-    Enum(Option<Enum>)
+    Enum(Option<Enum>),
+    Set(Option<Set>)
 }
 ```
 

From 636545194a416c40a2272086c299060945ea1246 Mon Sep 17 00:00:00 2001
From: Alex Chi
Date: Mon, 7 Sep 2020 13:14:10 +0800
Subject: [PATCH 7/7] fix doc

Signed-off-by: Alex Chi
---
 text/2020-09-06-enum-in-copr.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/text/2020-09-06-enum-in-copr.md b/text/2020-09-06-enum-in-copr.md
index 6304268a..8e86f0ba 100644
--- a/text/2020-09-06-enum-in-copr.md
+++ b/text/2020-09-06-enum-in-copr.md
@@ -2,8 +2,8 @@
 
 ## Motivation
 
-Currently, TiKV and TiDB see an enum as a string. In this RFC,
-we want to discuss adding real enum support in TiKV coprocessor.
+Currently, TiKV and TiDB see an enum and a set as a string. In this RFC,
+we want to discuss adding real enum and sets support in TiKV coprocessor.
 
 ## Representation of Enum and Set
 
@@ -51,7 +51,7 @@ This design leads to an enum chunk vector and a set chunk vector, which
 efficiently stores an enum column.
 
 ```rust
-pub ChunkedVecEnum {
+pub struct ChunkedVecEnum {
     var_offset: Vec<usize>,
     enum_data: Vec<u8>,
     bitmap: BitVec,
@@ -60,7 +60,7 @@ pub ChunkedVecEnum {
 ```
 
 ```rust
-pub ChunkedVecSet {
+pub struct ChunkedVecSet {
     var_offset: Vec<usize>,
     set_data: Vec<u8>,
     bitmap: BitVec,