Schema doc #1553

Merged · 2 commits · Dec 16, 2022
4 changes: 3 additions & 1 deletion docs/en_US/concepts/sources/overview.md
@@ -16,7 +16,9 @@ Users can define the format to decode by setting `format` property. Currently, o

## Schema

Users can define the data schema like a common SQL table. In the eKuiper runtime, the data will be validated and transformed according to the schema. To avoid the conversion overhead when the data is already fixed and clean, or to consume data of unknown schema, users can define a schemaless source.
Users can define the schema of the data source like a relational database table. Some data formats come with their own schema, such as the `protobuf` format. When creating a source, you can define `schemaId` to point to the data structure definition in the Schema Registry.

The definition in the schema registry is the physical schema, while the data structure in the source definition statement is the logical schema. If both are defined, the physical schema overrides the logical schema. In that case, validation and formatting of the data are the responsibility of the configured format, e.g. `protobuf`. If only the logical schema is defined and `strictValidation` is set, the data will be validated and type-converted according to the defined structure in the eKuiper runtime. If validation is not set, the logical schema is mainly used for SQL statement validation at compile and load time. If the input data is pre-processed clean data, or if the data structure is unknown or variable, the user may leave the schema undefined, which also avoids the overhead of data conversion.

## Stream & Table

12 changes: 12 additions & 0 deletions docs/en_US/operation/restapi/schemas.md
@@ -26,12 +26,24 @@ Schema content in a file:
}
```

Schema with a static plugin:

```json
{
  "name": "schema2",
  "file": "file:///tmp/ekuiper/internal/schema/test/test2.proto",
  "soFile": "file:///tmp/ekuiper/internal/schema/test/so.so"
}
```


### Parameters

1. name: the unique name of the schema.
2. schema content, specified with either the `file` or the `content` parameter. After the schema is created, its content will be written into the file `data/schemas/$schema_type/$schema_name`.
   - file: the URL of the schema file. The URL can use the `http`, `https` or `file` scheme; the `file` scheme refers to a local file path on the eKuiper server. The schema file must match the corresponding schema type. For example, a protobuf schema file's extension must be `.proto`.
   - content: the text content of the schema.
3. soFile: the so file of the static plugin. For details about plugin creation, see [customize format](../../rules/codecs.md#format-extension).
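As a sketch, the parameters above can be assembled into the request body with a small helper; the `schemaReq` struct and `buildSchemaPayload` function are illustrative names, not part of an official eKuiper client:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// schemaReq mirrors the request body accepted by POST /schemas/{type},
// following the parameters described above.
type schemaReq struct {
	Name    string `json:"name"`
	File    string `json:"file,omitempty"`
	Content string `json:"content,omitempty"`
	SoFile  string `json:"soFile,omitempty"`
}

// buildSchemaPayload marshals a registration request. Exactly one of
// file or content should be set, per the parameter description.
func buildSchemaPayload(name, file, content, soFile string) ([]byte, error) {
	if (file == "") == (content == "") {
		return nil, fmt.Errorf("set exactly one of file or content")
	}
	return json.Marshal(schemaReq{Name: name, File: file, Content: content, SoFile: soFile})
}

func main() {
	// Illustrative paths only; the payload would then be sent with
	// POST http://localhost:9081/schemas/protobuf
	b, err := buildSchemaPayload("schema2", "file:///tmp/ekuiper/internal/schema/test/test2.proto", "", "file:///tmp/ekuiper/internal/schema/test/test2.so")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```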

## Show schemas

41 changes: 41 additions & 0 deletions docs/en_US/operation/restapi/streams.md
@@ -65,6 +65,47 @@ Response Sample:
}
```

## Get stream schema

The API is used to get the stream schema. The schema is inferred by merging the physical and logical schema definitions.

```shell
GET http://localhost:9081/streams/{id}/schema
```

The response format is similar to JSON Schema:

```json
{
  "id": {
    "type": "bigint"
  },
  "name": {
    "type": "string"
  },
  "age": {
    "type": "bigint"
  },
  "hobbies": {
    "type": "struct",
    "properties": {
      "indoor": {
        "type": "array",
        "items": {
          "type": "string"
        }
      },
      "outdoor": {
        "type": "array",
        "items": {
          "type": "string"
        }
      }
    }
  }
}
```
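As a sketch, such a response can also be consumed programmatically. The `fieldDef` struct below is inferred from the example shape above, not from a published specification:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// fieldDef models one entry of the schema response shown above:
// a type plus optional items (for arrays) and properties (for structs).
type fieldDef struct {
	Type       string              `json:"type"`
	Items      *fieldDef           `json:"items,omitempty"`
	Properties map[string]fieldDef `json:"properties,omitempty"`
}

// topLevelTypes flattens the response into sorted "name:type" pairs.
func topLevelTypes(raw []byte) ([]string, error) {
	var schema map[string]fieldDef
	if err := json.Unmarshal(raw, &schema); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(schema))
	for name, def := range schema {
		out = append(out, name+":"+def.Type)
	}
	sort.Strings(out)
	return out, nil
}

func main() {
	resp := []byte(`{"id":{"type":"bigint"},"name":{"type":"string"}}`)
	types, err := topLevelTypes(resp)
	if err != nil {
		panic(err)
	}
	fmt.Println(types) // [id:bigint name:string]
}
```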

## Update a stream

The API is used to update the stream definition.
9 changes: 9 additions & 0 deletions docs/en_US/operation/restapi/tables.md
@@ -71,6 +71,15 @@ Response Sample:
}
```

## Get table schema

The API is used to get the table schema. The schema is inferred by merging the physical and logical schema definitions.

```shell
GET http://localhost:9081/tables/{id}/schema
```


## Update a table

The API is used to update the table definition.
84 changes: 82 additions & 2 deletions docs/en_US/rules/codecs.md
@@ -4,7 +4,7 @@ The eKuiper uses a map based data structure internally during computation, so so

## Format

There are two types of formats for codecs: schema and schema-less formats. The formats currently supported by eKuiper are `json`, `binary` and `protobuf`. Among them, `protobuf` is the schema format.
There are two types of formats for codecs: schema and schema-less formats. The formats currently supported by eKuiper are `json`, `binary`, `protobuf` and `custom`. Among them, `protobuf` and `custom` are schema formats.
A schema format requires registering the schema first, and then setting the referenced schema along with the format. For example, when using an MQTT sink, the format and schema can be configured as follows:

```json
@@ -18,9 +18,89 @@ The schema format requires registering the schema first, and then setting the re
}
```

All formats provide codec capability and, optionally, a schema definition. The codec computation can be built-in, such as JSON parsing; it can parse the schema dynamically, as Protobuf does with `*.proto` files; or a user-defined static plugin (`*.so`) can be used for parsing. Among them, static parsing has the best performance, but it requires writing additional code and compiling it into a plugin, which makes changes harder. Dynamic parsing is more flexible.

All currently supported formats, their codec methods and schema modes are shown in the following table.

| Format | Codec | Custom Codec | Schema |
|----------|--------------|------------------------|------------------------|
| json | Built-in | Unsupported | Unsupported |
| binary | Built-in | Unsupported | Unsupported |
| protobuf | Built-in | Supported | Supported and required |
| custom | Not Built-in | Supported and required | Supported and optional |

### Format Extension

When using the `custom` format or the `protobuf` format, the user can customize the codec and schema in the form of a Go language plugin. Note that `protobuf` only supports custom codecs; its schema must be defined by a `*.proto` file. The steps to customize a format are as follows:

1. Implement the codec-related interfaces. The `Encode` function encodes the incoming data (currently always `map[string]interface{}`) into a byte array, while the `Decode` function decodes a byte array into `map[string]interface{}`. The decode function is called in the source; the encode function is called in the sink.
```go
// Converter converts bytes & map or []map according to the schema
type Converter interface {
Encode(d interface{}) ([]byte, error)
Decode(b []byte) (interface{}, error)
}
```
2. Implement the schema description interface. If the custom format is strongly typed, this interface can be implemented. It returns a JSON-schema-like string for use by the source. The returned data structure will be used as a physical schema, which helps eKuiper implement capabilities such as SQL validation and optimization during the parse and load phases.
```go
type SchemaProvider interface {
GetSchemaJson() string
}
```
3. Compile the code into a plugin so file. Usually, format extensions do not need to depend on the main eKuiper project. However, due to the limitations of the Go plugin system, the plugin must be compiled in the same environment as the main eKuiper application, including the same operating system, Go version, etc. If you need to deploy to the official docker image, you can use the corresponding docker image for compilation.
```shell
go build -trimpath --buildmode=plugin -o data/test/myFormat.so internal/converter/custom/test/*.go
```
4. Register the schema by REST API.
```shell
###
POST http://{{host}}/schemas/custom
Content-Type: application/json

{
"name": "custom1",
"soFile": "file:///tmp/custom1.so"
}
```
5. Use the custom format in a source or sink with the `format` and `schemaId` parameters.

A complete custom format example can be found in [myFormat.go](https://github.com/lf-edge/ekuiper/blob/master/internal/converter/custom/test/myformat.go). This file defines a simple custom format whose codec merely calls JSON for serialization, and which returns a data structure that can be used to infer the schema of the eKuiper source.
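The steps above can be sketched as follows. This minimal converter, like the sample, simply delegates to `encoding/json`; the type name `jsonConverter` and the returned schema are illustrative, and a real format would implement its own wire format:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonConverter is a minimal sketch of the Converter interface from
// step 1: Encode turns a map into bytes, Decode turns bytes back into
// a map, both via encoding/json.
type jsonConverter struct{}

func (jsonConverter) Encode(d interface{}) ([]byte, error) {
	return json.Marshal(d)
}

func (jsonConverter) Decode(b []byte) (interface{}, error) {
	var m map[string]interface{}
	if err := json.Unmarshal(b, &m); err != nil {
		return nil, err
	}
	return m, nil
}

// GetSchemaJson implements the optional SchemaProvider interface from
// step 2; the returned structure here is a hypothetical physical schema.
func (jsonConverter) GetSchemaJson() string {
	return `{"id":{"type":"bigint"},"name":{"type":"string"}}`
}

func main() {
	var c jsonConverter
	b, _ := c.Encode(map[string]interface{}{"id": 1, "name": "john"})
	m, _ := c.Decode(b)
	fmt.Println(m.(map[string]interface{})["name"]) // john
}
```

Compiled with `go build --buildmode=plugin` as in step 3, the resulting so file is what the registration API in step 4 refers to.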

### Static Protobuf

When using the Protobuf format, both dynamic and static parsing are supported. With dynamic parsing, the user only needs to specify the proto file when registering the schema. When parsing performance requirements are higher, the user can adopt static parsing instead. Static parsing requires developing a parsing plugin, with the following steps:

1. Assume we have a proto file `helloworld.proto`. Use the official protoc tool to generate Go code. Check the [Protocol Buffer Doc](https://developers.google.com/protocol-buffers/docs/reference/go-generated) for details.
```shell
protoc --go_opt=Mhelloworld.proto=com.main --go_out=. helloworld.proto
```
2. Move the generated code `helloworld.pb.go` into the Go project and rename its package to main.
3. Create a wrapper struct for each message type and implement the three methods `Encode`, `Decode`, and `GetXXX`. The main purpose of encoding and decoding is to convert between the struct and map representations of the messages. Note that, to ensure performance, reflection should not be used.
4. Compile the code into a plugin so file. Usually, format extensions do not need to depend on the main eKuiper project. However, due to the limitations of the Go plugin system, the plugin must be compiled in the same environment as the main eKuiper application, including the same operating system, Go version, etc. If you need to deploy to the official docker image, you can use the corresponding docker image for compilation.
```shell
go build -trimpath --buildmode=plugin -o data/test/helloworld.so internal/converter/protobuf/test/*.go
```
5. Register the schema by REST API. Note that both the proto file and the so file are required.
```shell
###
POST http://{{host}}/schemas/protobuf
Content-Type: application/json

{
"name": "helloworld",
"file": "file:///tmp/helloworld.proto",
"soFile": "file:///tmp/helloworld.so"
}
```
6. Use the format in a source or sink with the `format` and `schemaId` parameters.

The complete static protobuf plugin can be found in [helloworld protobuf](https://github.com/lf-edge/ekuiper/tree/master/internal/converter/protobuf/test).
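The wrapper struct of step 3 can be sketched as follows. `HelloReply` stands in for the struct that protoc would generate from `helloworld.proto`, and the method shapes are illustrative, converting map and struct with direct field access instead of reflection:

```go
package main

import "fmt"

// HelloReply stands in for the protoc-generated message struct;
// the field names here are illustrative.
type HelloReply struct {
	Message string
}

// helloReplyWrapper converts between the generated struct and the
// map type used internally by eKuiper, without reflection, as
// step 3 recommends.
type helloReplyWrapper struct{}

func (helloReplyWrapper) Encode(d interface{}) (*HelloReply, error) {
	m, ok := d.(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("unsupported type %T, must be a map", d)
	}
	msg, _ := m["message"].(string)
	return &HelloReply{Message: msg}, nil
}

func (helloReplyWrapper) Decode(r *HelloReply) map[string]interface{} {
	return map[string]interface{}{"message": r.Message}
}

func main() {
	var w helloReplyWrapper
	r, err := w.Encode(map[string]interface{}{"message": "hello"})
	if err != nil {
		panic(err)
	}
	fmt.Println(w.Decode(r)["message"]) // hello
}
```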


## Schema

A schema is a set of metadata that defines the data structure. For example, the Protobuf format uses a `.proto` file to define the structure of the transferred data. Currently, eKuiper supports only one schema type Protobuf.
A schema is a set of metadata that defines the data structure. For example, the Protobuf format uses a `.proto` file to define the structure of the transferred data. Currently, eKuiper supports the schema types `protobuf` and `custom`.

### Schema Registry

28 changes: 14 additions & 14 deletions docs/en_US/sqls/streams.md
@@ -92,22 +92,22 @@ demo (
) WITH (DATASOURCE="test", FORMAT="JSON", KEY="USERID", SHARED="true");
```

## Schema

The schema of a stream consists of two parts. One is the data structure defined in the source definition statement, i.e. the logical schema; the other is the `schemaId` specified when using a strongly typed data format such as Protobuf or a custom format, i.e. the physical schema.

Overall, we support three progressive levels of schema:

1. Schemaless: the user does not define any schema. This is mainly used for weakly structured data streams, or for data whose structure changes frequently.
2. Logical schema only: the user defines the schema at the source level. This is mostly used with weakly typed encodings such as the JSON format, by users whose data has a fixed or roughly fixed structure and who do not want to use a strongly typed codec format. In this case, the `strictValidation` parameter can be used to configure whether to perform data validation and conversion.
3. Physical schema: the user uses the protobuf or custom format and defines a `schemaId`; validation of the data structure is done by the format implementation.

Both the logical and physical schema definitions are used for SQL syntax validation in the parsing and loading phases of rule creation, as well as for runtime optimization. The inferred schema of the stream can be obtained via the [Schema API](../operation/restapi/streams.md#get-stream-schema).


### Strict Validation

```
The value of StrictValidation can be true or false.
1) True: Drop the message if the message is not satisfy with the stream definition.
2) False: Keep the message, but fill the missing field with default empty value.

bigint: 0
float: 0.0
string: ""
datetime: the current time
boolean: false
bytea: nil
array: zero length array
struct: null value
```
This applies only to streams with a logical schema. If strict validation is set, the rule will verify the existence of the fields and validate the field types against the schema. If the data is known to be well formatted, it is recommended to turn validation off.
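As an illustration of these semantics (a sketch, not eKuiper's actual implementation), a strict check over a flat logical schema might look like:

```go
package main

import "fmt"

// validate sketches the strictValidation behaviour described above:
// when strict, a message missing a declared field, or carrying a value
// of the wrong type, fails validation; when not strict, the message is
// kept as-is. Only two illustrative types are handled here.
func validate(msg map[string]interface{}, schema map[string]string, strict bool) bool {
	if !strict {
		return true // no validation: keep the message unchanged
	}
	for field, typ := range schema {
		v, ok := msg[field]
		if !ok {
			return false // declared field is missing
		}
		switch typ {
		case "bigint":
			if _, ok := v.(int); !ok {
				return false
			}
		case "string":
			if _, ok := v.(string); !ok {
				return false
			}
		}
	}
	return true
}

func main() {
	schema := map[string]string{"id": "bigint", "name": "string"}
	fmt.Println(validate(map[string]interface{}{"id": 1, "name": "john"}, schema, true)) // true
	fmt.Println(validate(map[string]interface{}{"id": "oops"}, schema, true))            // false
}
```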

### Schema-less stream
If the data type of the stream is unknown or varying, we can define it without fields. This is called schema-less. A schema-less stream is defined by leaving the fields empty.
4 changes: 3 additions & 1 deletion docs/zh_CN/concepts/sources/overview.md
@@ -16,7 +16,9 @@

## Schema

Users can define the structure of the data source like a relational database table. In the eKuiper runtime, data is validated and type-converted according to the defined structure. If the input data is pre-processed clean data, or its structure is unknown or not fixed, users may leave the schema undefined, thus avoiding the overhead of data conversion.
Users can define the structure of the data source like a relational database table. Some data formats carry their own schema, such as the `protobuf` format. When creating a source, users can define `schemaId` to point to a data structure definition in the Schema Registry.

The definition in the schema registry is the physical schema, while the data structure in the source definition statement is the logical schema. If both are defined, the physical schema overrides the logical schema. In that case, validation and formatting of the data are the responsibility of the configured format, e.g. `protobuf`. If only the logical schema is defined and `strictValidation` is set, data will be validated and type-converted in the eKuiper runtime according to the defined structure. If validation is not set, the logical schema is mainly used for SQL statement validation at compile and load time. If the input data is pre-processed clean data, or its structure is unknown or not fixed, users may leave the schema undefined, thus avoiding the overhead of data conversion.

## 流和表

11 changes: 11 additions & 0 deletions docs/zh_CN/operation/restapi/schemas.md
@@ -26,12 +26,23 @@ POST http://localhost:9081/schemas/protobuf
}
```

Schema with a static plugin:

```json
{
  "name": "schema2",
  "file": "file:///tmp/ekuiper/internal/schema/test/test2.proto",
  "soFile": "file:///tmp/ekuiper/internal/schema/test/so.so"
}
```

### Parameters

1. name: the unique name of the schema.
2. schema content, specified with either the `file` or the `content` parameter. After the schema is created, its content will be written into the file `data/schemas/$schema_type/$schema_name`.
   - file: the URL of the schema file. The URL supports the `http`, `https` and `file` schemes. When the `file` scheme is used, the file must be on the machine where the eKuiper server runs. It must match the corresponding schema type; for example, a protobuf schema file's extension should be `.proto`.
   - content: the text content of the schema.
3. soFile: the so file of the static plugin. For plugin creation, see [customize format](../../rules/codecs.md#格式扩展).

## Show schemas

42 changes: 42 additions & 0 deletions docs/zh_CN/operation/restapi/streams.md
@@ -66,6 +66,48 @@ GET http://localhost:9081/streams/{id}
}
```

## Get stream schema

The API is used to get the stream schema, i.e. the actual structure inferred by merging the physical and logical schemas.

```shell
GET http://localhost:9081/streams/{id}/schema
```

The response format is a JSON-Schema-like structure. An example:

```json
{
  "id": {
    "type": "bigint"
  },
  "name": {
    "type": "string"
  },
  "age": {
    "type": "bigint"
  },
  "hobbies": {
    "type": "struct",
    "properties": {
      "indoor": {
        "type": "array",
        "items": {
          "type": "string"
        }
      },
      "outdoor": {
        "type": "array",
        "items": {
          "type": "string"
        }
      }
    }
  }
}
```


## Update a stream

The API is used to update the stream definition.
8 changes: 8 additions & 0 deletions docs/zh_CN/operation/restapi/tables.md
@@ -65,6 +65,14 @@ GET http://localhost:9081/tables/{id}
}
```

## Get table schema

The API is used to get the table schema, i.e. the actual structure inferred by merging the physical and logical schemas.

```shell
GET http://localhost:9081/tables/{id}/schema
```

## Update a table

The API is used to update the table definition.