Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Support NULL and MISSING value in response #667

Merged
merged 2 commits into from
Aug 10, 2020

Conversation

penghuo
Copy link
Contributor

@penghuo penghuo commented Aug 6, 2020

Description of changes

  1. Auto add FIELDS * for PPL command.
  2. In Analyzer, expend SELECT * to SELECT all fields.
  3. Extract type info from QueryPlan and add to QueryResponse.
  4. Support NULL and MISSING value in response. https://github.com/penghuo/sql/blob/pr-select-all/docs/user/general/values.rst

Problem Statements

Background

Before explain the current issue, firstly, let’s setup the context.

Sample Data

Let’s explain the problem with an example, the bank index which has 2 fields.

    "mappings" : {
      "properties" : {
        "account_number" : {
          "type" : "long"
        },
        "age" : {
          "type" : "integer"
        }
      }
    }

Then we add some data to the index.

POST bank/_doc/1
{"account_number":1,"age":31}

// age is null
POST bank/_doc/2
{"account_number":2,"age":null}

// age is missing
POST bank/_doc/3
{"account_number":3}

JDBC and JSON data format

Then, we define the response data format for query “SELECT account_number FROM bank” as follows.

  • JDBC Format. There there mandatory fields, “field name”, “field type” and “data”. e.g.
{
  "schema": [{
    "name": "account_number",
    "type": "long"
  }],
  "total": 3,
  "datarows": [
    [1],
    [2],
    [3]
  ],
  "size": 3
}
  • JSON Format, comparing with JDBC format, it doesn’t have schema field
{
  "datarows": [
    {"account_number": 1},
    {"account_number": 2},
    {"account_number": 3}
  ]
}

Issue 1. Represent NULL and MISSING in Response

With these sample data and response data format in mind, let go through more query and examine their results.
Considering the query:** SELECT age, account_number FROM bank. **
The JDBC format doesn’t have MISSING value. If the field exist in the schema but missed in the document, it should considered as NULL value.
The JSON format could represent the MISSING value properly.

* JDBC Format
{
  "schema": [
    {
      "name": "age",
      "type": "integer"
    },
    {
      "name": "account_number",
      "type": "long"
    }
  ],
  "total": 3,
  "datarows": [
    [
      31,
      1
    ],
    [
      null,
      2
    ],
    [
      null,
      3
    ],
  ],
  "size": 3
}

* JSON Format
{
  "datarows": [
    {"age": 1, "account_number": 1},
    {"age": null, "account_number": 2},
    {"account_number": 3}
  ]
}

Issue 2. ExprValue to JDBC format

Based on our current impementation, all the SQL operator is translated to chain of PhysicalOpeartor. Each PhysicalOpeartor provide the ExprValue as the return value. The protocol pull the result from PhysicalOperator and transalte to the expected format. e.g. Taking the above query as example, the result of the PhysicalOpeartor is a list of ExprValues.

[
    {"age": ExprIntegerValue(1), "account_number": ExprIntegerValue(1)},
    {"age": ExprNullValue(), "account_number": ExprIntegerValue(2)},
    {"account_number": ExprIntegerValue(3)}
]

The current solution is extract field name and field type from the data itself. This solution has two problems

  1. It is impossible to derive the type from NULL value.
  2. If the field is missing in the ExprValue, there is no way to derive it.

Issue 3. The Type info is missing

In current design, the Protocol is a seperate module which work independently with QueryEngine. The Protocol module receive the list of ExprTupleValue from QueryEngine, then the Protocol module format the result based on the type of ExprValue. the problem is ExprNullValue and ExprMissingValue doesn’t have type assosicate with it. thus the Protocol module can’t derive the type info from input ExprTupleValue directly.

Issue 4. What is *(all field) means in SELECT

In current design, the SELECT * clause ingored in the AST builder logic, because it means select all the data from input operator. The issue is similar as Issue 3 that if the input operator produce NULL or MISSING value, then the Protocol have no idea to derive type info from it.

Requirements

  • The JDBC format should been supported. The MISSING and NULL value should been representd as NULL.
  • The JSON format should been supported.
  • The Protocol module should receive the QueryResponse which include schema and data.

Solution

Includ NULL and MISSING value in the QueryResult (Issue 1, 2)

The SELECT operator will be translated to PhysicalOpeartor with a list of expression to resolve ExprValue from input data. With the above example, when handling NULL and MISSING value, the expected output data should be as follows.

[
    {"age": ExprIntegerValue(1), "account_number": ExprIntegerValue(1)},
    {"age": ExprNullValue(), "account_number": ExprIntegerValue(2)},
    {"age": ExprMissingValue(), "account_number": ExprIntegerValue(3)}
]

An aditionial list of Schema is also required to when protocol is JDBC.

{
  "schema": [
    {
      "name": "age",
      "type": "integer"
    },
    {
      "name": "account_number",
      "type": "long"
    }
  ]    
}

Then the protocol module could easily translate the JDBC format or JSON format.

Expend SELECT * to SELECT ...fields (Issue 4)

In our current implementation, in SQL, the SELECT * is ignored and in PPL there even no fields * command. This solution works fine for JSON format which doesn’t require schema, but it doens’s works for JDBC format.
The proposal in here is

  1. Automatically add the fields command to PPL query
  2. Expand SELECT * to SELECT ...fields.

Automatically add fields * to PPL query

Comparing with SQL, the PPL grammer doesn’t require the Fields command is the last command. Thus, the fields * command should been automatically added.
The automatically added logic is if the last operator is not Fields command, the Fields * command will been added.

Expand SELECT * to SELECT ...fields

In Analyzer, we should expend the * to all fields in the current scope. There are two issues we need to address,

  1. No all the fields in the current scope should been used to expand *. The original scope is from Elasticsearch mapping which include nested mapping. In current design, only the top fields will be retrived from the current scope, all the nested fields will been ignored.
  2. The scope should been dynamtic maintain in the Analyzer. For example, the stats command will define the new scope.

Retrive Type Info from ProjectOperator and Expose to Protocol (Issue 3)

After expending the * and automatically add fields, the type info could been retrived from ProjectOperator. Then the Protocol could get schema and data from QueryEngine.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@penghuo penghuo requested review from chloe-zh and dai-chen August 6, 2020 00:50
@penghuo penghuo self-assigned this Aug 6, 2020
@penghuo penghuo marked this pull request as ready for review August 6, 2020 16:17
Copy link
Member

@dai-chen dai-chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

Copy link
Member

@chloe-zh chloe-zh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the changes!

@penghuo penghuo merged commit 6021470 into opendistro-for-elasticsearch:develop Aug 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants