This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Improve query verification by adding semantic analyzer (#201)
* Generate new parser by ANTLR

* Integrate ANTLR parser to existing code

* Fix string similarity and add visitor generated

* Use visitor to change default visiting order

* Add function arg type check

* Add compatibility check for basic operator

* Refactor inner classes to different upper class and subpackage

* Change default MySQL grammar file for our case

* Change default MySQL grammar file for our case

* Change default MySQL grammar file for our case

* Remove ANTLR plugin dependencies from runtime

* Initial commit for syntax analysis by new ANTLR parser

* Add missing function to grammar for unit test

* Add support for index name with - or /type

* Add more ES special functions to pass the unit test

* Add more ES special syntax to pass IT

* Add more syntax for MINUS to pass all IT

* Fix checkstyle violation

* Remove unsupported statements

* Remove SELECT INTO

* Remove more unused syntax

* Remove unused function

* Remove unused tokens

* Remove unused tokens

* Remove unused interval and charset syntax

* Remove unused interval and charset syntax

* Remove more unused syntax

* Add setting for enabling new parser

* Improve offending symbol location

* Improve offending symbol location

* Update 3rd party attribution with ANTLR

* Add supported functions missing in existing test code

* Add integration test

* Change grammar for new merged code from master

* Add integration test

* Add more test cases

* Add more test cases, enable/disable setting

* Add more test cases

* Add more test cases

* Simplify exception for now

* Add test mapping and cases for semantic check

* Initial adding of semantic analyzer

* Move generated source back to build folder

* Rename analyze method and assert error message

* Fix typo

* Initial commit for semantic analysis copied from prototype

* Initial commit for semantic analysis copied from prototype

* Define all field names with index alias when visiting index

* Define all field names with index alias when visiting index

* Add support for nested field

* Add support for deep nested field

* Add more test cases for alias

* Split test cases and organize into suite to avoid mapping reload

* Simple test cases for identifier with suggestion

* Add more test cases for identifier

* Add IT for semantic check

* Add IT for semantic check

* Add more IT for semantic check

* Add unit test for scalar function

* Move inner mappings class to upper level

* Move flat method to FieldMapping class

* Add test cases for FieldMapping flat method

* Add and pass basic test cases for scalar function

* Add and pass basic test cases for scalar function

* Fix function call visit issue

* Fix function call visit issue

* Add generic type to type system

* Add unit test for generic type

* Fix generic type bug in type system

* Fix generic type usage() and add all basic functions to type system

* Add IT for all math functions and constants

* Add more IT for string function and failed cases

* Add more IT for string function and failed cases

* Add more IT for date_format and other functions

* Add more IT for date_format

* Introduce ES special functions into type system

* Introduce ES special functions into type system

* Fix real number visit issue

* Add more test cases for text search functions

* Support overloaded specification for type expression

* Add unit test for type expression interface

* Bypass scope checking for ES hidden fields

* Fix alias is optional issue and skip naming conflict check for now

* Add overloaded spec for functions

* Add basic test cases for aggregate function

* Add basic test cases for aggregate function

* Add more test cases for aggregate function

* Resolve symbol by prefix on all levels

* Extract semantic context class to manage environment

* Add UT for semantic context

* Add UT for environment

* Add UT for semantic analyzer

* Add UT for semantic analyzer

* Add isEmpty for context and overloaded date_format

* Add substring

* Add support for alias in select item

* Add more UT for select item alias

* Add UT for unknown data type

* Add UT for subquery

* Add UT for function composition

* Fix IT

* Fix all broken ITs

* Fix all broken ITs

* Fix all broken ITs

* Fix check style

* Add more Javadoc and move LocalClusterState to context class

* Add more Javadoc and dev doc

* Add more IT

* Add more IT

* Add IT for visitor and return result of select clause

* Support operator type check and fix compatible base type relations

* Add more UT for operator test

* Fix broken ITs

* Fix broken ITs by checking named argument

* Operator implements type instead of type expression

* Add IT for operator type check

* Add img for dev doc

* Pass all unit tests

* Fix imports

* Add more IT

* Rename syntax exception

* Add support for multi query type check

* Add support for minus and refactor reduce method in visitor

* Add more test cases for multi query

* Add more test cases

* Refactor types package into sub packages

* Add support for subquery type check

* Add more test cases for subquery

* Disable field name auto suggestion for security concern

* Add more tests for aggregation on nested field

* Separate context initialization to a new visitor

* Refactor semantic analyzer into 2 separate visitors

* Address PR comments and finish dev doc

* Add analysis config

* Add UT and IT for analyzer config

* Remove to-dos

* Refactor string similarity class with UT

* Move some test cases to proper place

* Final code review
dai-chen authored Oct 7, 2019
1 parent 473a42d commit 5034571
Showing 86 changed files with 6,768 additions and 381 deletions.
329 changes: 329 additions & 0 deletions docs/dev/SemanticAnalysis.md
@@ -0,0 +1,329 @@
# Semantic Analysis

---
## 1. Overview

Previously, the SQL plugin didn't perform semantic analysis, for example checking whether a field exists or whether a function is called on a field of the wrong type. It had to rely on the Elasticsearch engine to perform the "check", which is actual execution.
This led to a bad user experience because of the missing verification, the cost of actual execution, and confusing error messages. In this work, we built a new semantic analyzer on top of the recently introduced ANTLR-generated parser.
With the new semantic analyzer, we can verify various aspects of a query's meaning and return clear, helpful messages to users for troubleshooting.

Our initial goal in this work is to add the capability to perform basic semantic analysis, including:

1. Check field names and detect typos.
2. Check that functions are used with correct arguments.
3. Apart from the basic checks above, perform some simple checks for JOIN, subqueries, and multi-queries.

For each of these errors, we want to return a useful message to the customer and even suggest replacing the wrong symbol with a possibly correct one.

---
## 2. Use Cases

First, go through the following examples of semantic checks performed by the new analyzer. These use cases should give you a taste of the benefits the new semantic analyzer brings:

### 2.1 Field Name Typo

```
POST _opendistro/_sql
{
  "query": "SELECT balace FROM accounts"
}

{
  "error": {
    "reason": "Invalid SQL query",
    "details": "Field [balace] cannot be found or used here. Did you mean [balance]?",
    "type": "SemanticAnalysisException"
  },
  "status": 400
}
```

### 2.2 Function Call on Incompatible Field Type

```
POST _opendistro/_sql
{
  "query": "SELECT * FROM accounts WHERE SUBSTRING(balance, 0, 1) = 'test'"
}

{
  "error": {
    "reason": "Invalid SQL query",
    "details": "Function [SUBSTRING] cannot work with [LONG, INTEGER, INTEGER]. Usage: SUBSTRING(STRING T, INTEGER, INTEGER) -> T",
    "type": "SemanticAnalysisException"
  },
  "status": 400
}
```

### 2.3 Index JOIN with Non-nested Field

```
POST _opendistro/_sql
{
  "query": "SELECT * FROM accounts a, a.firstname"
}

{
  "error": {
    "reason": "Invalid SQL query",
    "details": "Operator [JOIN] cannot work with [INDEX, TEXT]. Usage: Please join index with other index or its nested field.",
    "type": "SemanticAnalysisException"
  },
  "status": 400
}
```

### 2.4 Wrong Reference in Subquery

```
POST _opendistro/_sql
{
  "query": "SELECT * FROM accounts a WHERE EXISTS (SELECT * FROM accounts b WHERE b.address LIKE 'Seattle') AND b.age > 10"
}

{
  "error": {
    "reason": "Invalid SQL query",
    "details": "Field [b.age] cannot be found or used here. Did you mean [a.age]?",
    "type": "SemanticAnalysisException"
  },
  "status": 400
}
```

### 2.5 Operator Use on Incompatible Field Type

```
POST _opendistro/_sql
{
  "query": "SELECT * FROM accounts WHERE lastname IS FALSE"
}

{
  "error": {
    "reason": "Invalid SQL query",
    "details": "Operator [IS] cannot work with [TEXT, BOOLEAN]. Usage: Please use compatible types from each side.",
    "type": "SemanticAnalysisException"
  },
  "status": 400
}
```

### 2.6 Subquery Returns Incompatible Type

```
POST _opendistro/_sql
{
  "query": "SELECT * FROM accounts WHERE lastname IN (SELECT age FROM accounts)"
}

{
  "error": {
    "reason": "Invalid SQL query",
    "details": "Operator [IN] cannot work with [TEXT, LONG]. Usage: Please return field(s) of compatible type from each query.",
    "type": "SemanticAnalysisException"
  },
  "status": 400
}
```

### 2.7 Multi-query on Incompatible Type

```
POST _opendistro/_sql
{
  "query": "SELECT balance FROM accounts UNION ALL SELECT city FROM accounts"
}

{
  "error": {
    "reason": "Invalid SQL query",
    "details": "Operator [UNION] cannot work with [LONG, TEXT]. Usage: Please return field(s) of compatible type from each query.",
    "type": "SemanticAnalysisException"
  },
  "status": 400
}
```

---
## 3. High Level Design

The semantic analyzer consists of two core components: the semantic context and the type system.

### 3.1 Semantic Context

The semantic context manages Environments in a stack to provide scope management for nested subqueries. Precisely, an Environment is the storage of symbols associated with their types. To perform analysis, looking up symbols in the environments of the semantic context is the first step. Only after determining that a symbol exists, and what its type is, can further analysis such as type checking happen. We use the same terminology as compiler theory:

* **Define**: stores symbol name along with its attribute (type only for now) to the current environment in context.
* **Resolve**: looks up symbol name to find its attribute associated.

To avoid naming conflicts, we introduce one more layer inside each environment: the namespace. For example, a field name or alias in a SQL query should be allowed to have the same name as a built-in function such as `SUM`. To implement this, we divide each environment into three namespaces:

* **Field namespace**: field names loaded from the index mapping.
* **Function namespace**: built-in function names.
* **Operator namespace**: mainly comparison operators such as `=`, `>`, `IS`, etc.
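The define/resolve workflow over this environment stack can be sketched as follows. This is an illustrative Python sketch, not the plugin's actual Java implementation, and all class and method names here are hypothetical:

```python
from contextlib import contextmanager

class SemanticContext:
    """Stack of environments; one is pushed per (sub)query scope."""

    NAMESPACES = ("FIELD", "FUNCTION", "OPERATOR")

    def __init__(self):
        # Root environment for the outermost query
        self.stack = [{ns: {} for ns in self.NAMESPACES}]

    @contextmanager
    def new_environment(self):
        """Push a scope on entering a subquery, pop it on exit."""
        self.stack.append({ns: {} for ns in self.NAMESPACES})
        try:
            yield
        finally:
            self.stack.pop()

    def define(self, namespace, symbol, type_):
        self.stack[-1][namespace][symbol] = type_

    def resolve(self, namespace, symbol):
        # Search from innermost to outermost scope.
        for env in reversed(self.stack):
            if symbol in env[namespace]:
                return env[namespace][symbol]
        raise KeyError(f"Symbol [{symbol}] cannot be found or used here")

ctx = SemanticContext()
ctx.define("FIELD", "age", "INTEGER")
ctx.define("FUNCTION", "SUM", "AGGREGATE")  # function namespace
ctx.define("FIELD", "SUM", "KEYWORD")       # same name, no conflict: field namespace
with ctx.new_environment():
    ctx.define("FIELD", "b.age", "LONG")
    assert ctx.resolve("FIELD", "age") == "INTEGER"  # found in the outer scope
# after the subquery scope is popped, b.age is gone again
```

Entering a subquery pushes a fresh environment, so symbols defined inside it disappear once the visitor exits that scope, while outer symbols remain resolvable.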

Here is a simple diagram showing what Semantic Context looks like at runtime:

![What Semantic Context Looks Like](img/what-semantic-context-looks-like.png)

### 3.2 Type System

The type system enables type checking for all symbols present in a SQL query. First of all, we need to define what a type is. Types come in two kinds:

* **Base type**: based on Elasticsearch data types and organized internally into a hierarchy with a "class" type.
  * **ES data type**: for example, INTEGER and LONG belong to NUMBER, while TEXT and KEYWORD belong to STRING.
  * **ES index**: we also have specific types for an index, an index pattern, and a nested field.
* **Type expression**: an expression over multiple base types as argument types, together with a constructor. For example, an array constructor can construct an integer into an integer array, and a struct constructor can construct several base types into a new struct type. Similarly, functions and comparison operators accept arguments and produce a result type.
  * **Function**: scalar and aggregate functions in the SQL standard, as well as functions specific to Elasticsearch.
  * **Operator**: comparison operators (`=`, `<`, `>`), set operators (UNION, MINUS), and the join operator (JOIN).

Support for only simple types is not sufficient, however. The following special cases also need to be covered:

* **Generic type**: when we say `LOG(NUMBER) -> NUMBER`, we actually want to return whatever the specific input type is (INTEGER, FLOAT) instead of NUMBER.
* **Vararg**: for example, the function `CONCAT` can apply to an arbitrary number of strings.
* **Overloading**: a function like `LOG` can have multiple specifications, one without a base argument and another with one.
* **Named argument**: most Elasticsearch-specific functions use named arguments, for example `TOPHITS('size'=3,'age'='desc')`.
* **Optional argument**: essentially the same as function overloading.

Currently we support generic types and overloading. To clarify, a function specification involving a generic type, like `SUBSTRING(func(T(STRING), INTEGER, INTEGER).to(T))`, is similar to the Java declaration `<T extends String> T substring(T, int, int)`.
For the other unsupported features, we use an empty specification to convey that the function exists in our type system but we want to skip type checking for now.
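A minimal sketch of how generic-type resolution could work, assuming a simple base-type hierarchy. This is illustrative Python; `check_call`, `PARENTS`, and the spec format are hypothetical, not the plugin's API:

```python
# Base-type hierarchy: a concrete type is compatible with its ancestors.
PARENTS = {
    "INTEGER": "NUMBER", "LONG": "NUMBER", "FLOAT": "NUMBER", "DOUBLE": "NUMBER",
    "TEXT": "STRING", "KEYWORD": "STRING",
}

def is_compatible(concrete, base):
    """Walk up the hierarchy to see whether concrete is a subtype of base."""
    while concrete is not None:
        if concrete == base:
            return True
        concrete = PARENTS.get(concrete)
    return False

def check_call(spec, args):
    """spec is (param_types, return_type); a param 'T:BOUND' is generic.
    Returns the resolved return type, or raises TypeError on mismatch."""
    params, ret = spec
    if len(params) != len(args):
        raise TypeError("wrong number of arguments")
    binding = {}
    for param, arg in zip(params, args):
        if param.startswith("T:"):                  # generic parameter
            bound = param.split(":")[1]
            if not is_compatible(arg, bound):
                raise TypeError(f"{arg} does not satisfy bound {bound}")
            binding["T"] = arg                      # T resolves to the actual type
        elif not is_compatible(arg, param):
            raise TypeError(f"{arg} is not a {param}")
    return binding.get(ret, ret)                    # LOG(INTEGER) stays INTEGER

LOG = (("T:NUMBER",), "T")
assert check_call(LOG, ("INTEGER",)) == "INTEGER"   # not widened to NUMBER
SUBSTRING = (("T:STRING", "INTEGER", "INTEGER"), "T")
assert check_call(SUBSTRING, ("TEXT", "INTEGER", "INTEGER")) == "TEXT"
```

Overloading then amounts to trying each specification of a function in turn and accepting the first one whose `check_call` succeeds.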

### 3.3 String Similarity

Apart from the core components above, another interesting part is the string similarity algorithm that provides a suggestion when a semantic analysis error occurs. Currently we make use of the classic edit distance algorithm in the Lucene library to guess a similar symbol when we suspect the user put in a wrong name. That's why you can see "Did you mean XXX?" in the examples in the Use Cases section.
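The plugin relies on Lucene's edit distance support; the idea itself can be sketched stand-alone with a plain Levenshtein distance (the `suggest` helper below is hypothetical, not the plugin's code):

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(symbol, candidates, max_distance=2):
    """Return the closest known symbol, used to build 'Did you mean ...?'."""
    best = min(candidates, key=lambda c: levenshtein(symbol, c))
    return best if levenshtein(symbol, best) <= max_distance else None

assert suggest("balace", ["balance", "age", "city"]) == "balance"
```

Capping the distance (here at 2) avoids suggesting an unrelated symbol when nothing is actually close to what the user typed.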


---
## 4. Detailed Design

Basically, the components of semantic analysis work together in the manner shown in the following diagram:

![How Semantic Analysis Works](img/how-semantic-analysis-works.png)

### 4.1 Parse Tree Visitor

The parse tree visitor that walks the SQL parse tree is the driver of the whole semantic analysis process. This section tries to make clear how the semantic context, the type system, and the other parts work together, using an example.
Suppose an index `accounts` has the mapping below:

```
"mappings": {
  "account": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "city": {
        "type": "keyword"
      },
      "birthday": {
        "type": "date"
      },
      "employer": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "projects": {
        "type": "nested",
        "properties": {
          "members": {
            "type": "nested",
            "properties": {
              "name": {
                "type": "text"
              }
            }
          },
          "active": {
            "type": "boolean"
          }
        }
      },
      "manager": {
        "properties": {
          "name": {
            "type": "text"
          }
        }
      }
    }
  }
}
```

First, the visitor needs to enforce the visiting order over the SQL query. Because some clauses, like FROM, are essentially definitions of symbols, they are required to be visited before clauses such as WHERE, which resolve symbols. Currently the visiting is performed in the following order:

1. **FROM**: defines all symbols from the index mapping in the context for later resolution
2. **WHERE**
3. **SELECT**: visited this early because an alias in SELECT can be used in GROUP BY, e.g. `SELECT SUBSTRING(city) substr ... GROUP BY substr`
4. **GROUP BY**
5. **HAVING**
6. **ORDER BY**
7. **LIMIT**
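The fixed clause order above can be sketched as a small driver. This is illustrative Python with hypothetical names; the real visitor changes the default visiting order in the ANTLR-generated visitor:

```python
# Clauses are visited in this fixed order, regardless of parse-tree order,
# so that FROM defines symbols before WHERE and later clauses resolve them.
CLAUSE_ORDER = ["FROM", "WHERE", "SELECT", "GROUP BY", "HAVING", "ORDER BY", "LIMIT"]

def visit_query(clauses, visit):
    """clauses maps a clause name to its parse subtree; visit each in order."""
    for name in CLAUSE_ORDER:
        if name in clauses:
            visit(name, clauses[name])

visited = []
visit_query({"SELECT": ..., "FROM": ..., "WHERE": ...},
            lambda name, tree: visited.append(name))
assert visited == ["FROM", "WHERE", "SELECT"]  # FROM always comes first
```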

### 4.2 Context Initialization

This part is done in the `ESMappingLoader` visitor, each of whose visit methods runs ahead of the corresponding `TypeChecker` method. Take the query `SELECT * FROM accounts a, a.projects p WHERE age > 20 AND p.active IS TRUE` as an example. After visiting the FROM clause, the context completes its initialization with symbols defined as follows:

```
# field names without alias prefix because alias is optional
age -> INTEGER
city -> KEYWORD
birthday -> DATE
employer -> TEXT
employer.keyword -> KEYWORD
projects -> NESTED
projects.active -> BOOLEAN
projects.members -> NESTED
projects.members.name -> TEXT
manager -> OBJECT
manager.name -> TEXT
# field names with alias prefix
a.age -> INTEGER
a.city -> KEYWORD
a.birthday -> DATE
a.employer -> TEXT
a.employer.keyword -> KEYWORD
a.projects -> NESTED
a.projects.active -> BOOLEAN
a.projects.members -> NESTED
a.projects.members.name -> TEXT
a.manager -> OBJECT
a.manager.name -> TEXT
# nested field names with nested field alias prefix
p -> NESTED
p.active -> BOOLEAN
p.members -> NESTED
p.members.name -> TEXT
```

Then, when we meet a symbol such as `age` or `p.active` in the WHERE clause or elsewhere, we resolve it in the context and identify its type. If it is not found, semantic analysis ends up throwing an exception with the root cause and a suggestion.
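Flattening the index mapping into these entries can be sketched as below. This is illustrative Python with a hypothetical `flatten` helper; the plugin does this in Java, in its `FieldMapping` and `ESMappingLoader` classes:

```python
def flatten(properties, prefix=""):
    """Flatten an ES mapping's properties into field-name -> type entries."""
    entries = {}
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:
            # nested objects keep their own type; plain objects become OBJECT
            entries[path] = spec.get("type", "object").upper()
            entries.update(flatten(spec["properties"], path + "."))
        else:
            entries[path] = spec["type"].upper()
            for sub, subspec in spec.get("fields", {}).items():  # multi-fields
                entries[f"{path}.{sub}"] = subspec["type"].upper()
    return entries

mapping = {
    "age": {"type": "integer"},
    "employer": {"type": "text",
                 "fields": {"keyword": {"type": "keyword"}}},
    "projects": {"type": "nested",
                 "properties": {"active": {"type": "boolean"}}},
    "manager": {"properties": {"name": {"type": "text"}}},
}
fields = flatten(mapping)
assert fields["employer.keyword"] == "KEYWORD"
assert fields["projects.active"] == "BOOLEAN"
assert fields["manager"] == "OBJECT"
# alias-prefixed copies are defined the same way:
aliased = {f"a.{k}": v for k, v in fields.items()}
assert aliased["a.age"] == "INTEGER"
```

Defining both the bare and the alias-prefixed names matches the symbol listing above, where the alias is optional in the query.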

### 4.3 Type Checking

Trivial field name symbol resolution is straightforward, so let's take a look at how type checking works for functions and operators.

* **Leaf node**: simply resolve the symbol and return its type. Leaf nodes include constant literals (number, boolean), function names, operator names, etc.
* **Internal node**:
  1. **Define alias if any**: an internal node such as `FROM` or `SELECT` may define an alias for an index or field, for example `SELECT AVG(age) AS avg` or `SELECT * FROM accounts a`.
  2. **Synthesize types**: the types returned from child nodes need to be synthesized into a single result type. Precisely, synthesize here means reducing multiple types into one by applying the `construct` method defined by each type in our type system.

![How Types Synthesized](img/how-types-synthesized.png)

---
## 5. What's Next

Although we read a lot of literature and other open source code for reference, the semantic analysis introduced here is far from mature. For example, we don't yet enforce that only fields appearing in GROUP BY can be used in SELECT without being wrapped in an aggregate function.
Besides improvements to the semantic analyzer itself, there are other things we can benefit from:

1. **A `HELP` command**: exposes the information in the type system, so customers don't have to learn what's right only by failing their queries.
2. **Correctness testing**: generate test cases from the grammar. Those cases can be used either for finding gaps between the grammar/semantics and our backend code or for correctness testing by comparing results with another database as a baseline.
3. **Symbol table**: useful for the entire process from semantic analysis to logical and physical planning. It should either be kept (flattened into a single table, or by avoiding a real pop when exiting a query visit) or be annotated onto the Abstract Syntax Tree and passed to the backend.
Binary file added docs/dev/img/how-semantic-analysis-works.png
Binary file added docs/dev/img/how-types-synthesized.png
2 changes: 2 additions & 0 deletions src/main/antlr/OpenDistroSqlLexer.g4
@@ -173,7 +173,9 @@ EXTENDED_STATS: 'EXTENDED_STATS';
FIELD: 'FIELD';
FILTER: 'FILTER';
GEO_BOUNDING_BOX: 'GEO_BOUNDING_BOX';
+ GEO_CELL: 'GEO_CELL';
GEO_DISTANCE: 'GEO_DISTANCE';
+ GEO_DISTANCE_RANGE: 'GEO_DISTANCE_RANGE';
GEO_INTERSECTS: 'GEO_INTERSECTS';
GEO_POLYGON: 'GEO_POLYGON';
HISTOGRAM: 'HISTOGRAM';
13 changes: 6 additions & 7 deletions src/main/antlr/OpenDistroSqlParser.g4
@@ -220,7 +220,7 @@ tableName
;

fullColumnName
- : uid (dottedId dottedId? )?
+ : uid dottedId*
;

uid
@@ -316,7 +316,6 @@ aggregateWindowedFunction

scalarFunctionName
: functionNameBase
- | SUBSTRING | TRIM
;

functionArgs
@@ -345,7 +344,7 @@ expression
predicate
: predicate NOT? IN '(' (selectStatement | expressions) ')' #inPredicate
| predicate IS nullNotnull #isNullPredicate
- | left=predicate comparisonOperator right=predicate #binaryComparasionPredicate
+ | left=predicate comparisonOperator right=predicate #binaryComparisonPredicate
| predicate NOT? BETWEEN predicate AND predicate #betweenPredicate
| predicate NOT? LIKE predicate #likePredicate
| predicate NOT? regex=REGEXP predicate #regexpPredicate
@@ -399,16 +398,16 @@ keywordsCanBeId

functionNameBase
: esFunctionNameBase
- | ABS | ASIN | ATAN | CBRT | CEIL | CONCAT | CONCAT_WS
+ | ABS | ASIN | ATAN | ATAN2 | CBRT | CEIL | CONCAT | CONCAT_WS
| COS | COSH | DATE_FORMAT | DEGREES
| E | EXP | EXPM1 | FLOOR | LOG | LOG10 | LOG2 | LOWER
- | PI | POW | RADIANS | RANDOM | RINT
- | SIN | SINH | TAN | UPPER | YEAR
+ | PI | POW | RADIANS | RANDOM | RINT | ROUND
+ | SIN | SINH | SQRT | SUBSTRING | TAN | TRIM | UPPER | YEAR
;

esFunctionNameBase
: DATE_HISTOGRAM | DAY_OF_MONTH | DAY_OF_YEAR | DAY_OF_WEEK | EXCLUDE
- | EXTENDED_STATS | FILTER | GEO_BOUNDING_BOX | GEO_DISTANCE | GEO_INTERSECTS
+ | EXTENDED_STATS | FILTER | GEO_BOUNDING_BOX | GEO_CELL | GEO_DISTANCE | GEO_DISTANCE_RANGE | GEO_INTERSECTS
| GEO_POLYGON | INCLUDE | IN_TERMS | HISTOGRAM | HOUR_OF_DAY
| MATCHPHRASE | MATCH_PHRASE | MATCHQUERY | MATCH_QUERY | MINUTE_OF_DAY
| MINUTE_OF_HOUR | MISSING | MONTH_OF_YEAR | MULTIMATCH | MULTI_MATCH | NESTED
