This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Add identifier and datatype for PPL (#873)
penghuo authored Dec 2, 2020
1 parent 85202b6 commit 370ff6d
Showing 4 changed files with 327 additions and 1 deletion.
3 changes: 2 additions & 1 deletion docs/category.json
@@ -12,7 +12,8 @@
"experiment/ppl/cmd/search.rst",
"experiment/ppl/cmd/sort.rst",
"experiment/ppl/cmd/stats.rst",
"experiment/ppl/cmd/where.rst"
"experiment/ppl/cmd/where.rst",
"experiment/ppl/general/identifiers.rst"
],
"sql_cli": [
"user/dql/expressions.rst",
226 changes: 226 additions & 0 deletions docs/experiment/ppl/general/datatypes.rst
@@ -0,0 +1,226 @@

==========
Data Types
==========

.. rubric:: Table of contents

.. contents::
   :local:
   :depth: 2


Overview
========

PPL Data Types
-------------------

PPL supports the following data types.

+---------------+
| PPL Data Type |
+===============+
| boolean       |
+---------------+
| byte          |
+---------------+
| short         |
+---------------+
| integer       |
+---------------+
| long          |
+---------------+
| float         |
+---------------+
| double        |
+---------------+
| string        |
+---------------+
| text          |
+---------------+
| timestamp     |
+---------------+
| datetime      |
+---------------+
| date          |
+---------------+
| time          |
+---------------+
| interval      |
+---------------+
| ip            |
+---------------+
| geo_point     |
+---------------+
| binary        |
+---------------+
| struct        |
+---------------+
| array         |
+---------------+

Data Types Mapping
------------------

The table below lists the mapping between Elasticsearch data types, PPL data types, and SQL types.

+--------------------+---------------+-----------+
| Elasticsearch Type | PPL Type      | SQL Type  |
+====================+===============+===========+
| boolean            | boolean       | BOOLEAN   |
+--------------------+---------------+-----------+
| byte               | byte          | TINYINT   |
+--------------------+---------------+-----------+
| short              | byte          | SMALLINT  |
+--------------------+---------------+-----------+
| integer            | integer       | INTEGER   |
+--------------------+---------------+-----------+
| long               | long          | BIGINT    |
+--------------------+---------------+-----------+
| float              | float         | REAL      |
+--------------------+---------------+-----------+
| half_float         | float         | FLOAT     |
+--------------------+---------------+-----------+
| scaled_float       | float         | DOUBLE    |
+--------------------+---------------+-----------+
| double             | double        | DOUBLE    |
+--------------------+---------------+-----------+
| keyword            | string        | VARCHAR   |
+--------------------+---------------+-----------+
| text               | text          | VARCHAR   |
+--------------------+---------------+-----------+
| date               | timestamp     | TIMESTAMP |
+--------------------+---------------+-----------+
| ip                 | ip            | VARCHAR   |
+--------------------+---------------+-----------+
| binary             | binary        | VARBINARY |
+--------------------+---------------+-----------+
| object             | struct        | STRUCT    |
+--------------------+---------------+-----------+
| nested             | array         | STRUCT    |
+--------------------+---------------+-----------+

Note: Not every PPL type has a corresponding Elasticsearch type, for example date and time. To use a function that requires such a data type, you must convert the data type explicitly.
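
The mapping table can be read as a simple lookup. The snippet below is an illustrative Python sketch, not plugin code: the names ``ES_TO_PPL_SQL`` and ``lookup_types`` are hypothetical, and only a few rows of the table are reproduced.

```python
from typing import Tuple

# Hypothetical lookup built from a few rows of the mapping table above:
# Elasticsearch type -> (PPL type, SQL type).
ES_TO_PPL_SQL = {
    "boolean": ("boolean", "BOOLEAN"),
    "integer": ("integer", "INTEGER"),
    "keyword": ("string", "VARCHAR"),
    "text": ("text", "VARCHAR"),
    "date": ("timestamp", "TIMESTAMP"),
    "object": ("struct", "STRUCT"),
}


def lookup_types(es_type: str) -> Tuple[str, str]:
    """Return (PPL type, SQL type) for an Elasticsearch type name."""
    try:
        return ES_TO_PPL_SQL[es_type]
    except KeyError:
        raise ValueError(f"no mapping listed for {es_type!r}") from None


print(lookup_types("keyword"))  # ('string', 'VARCHAR')
```

A type that is absent from the lookup raises an error here, which mirrors the note above: some types have no direct counterpart and need an explicit conversion.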



Numeric Data Types
==================

TODO


Date and Time Data Types
========================

The date and time data types represent temporal values. The PPL plugin supports DATE, TIME, DATETIME, TIMESTAMP and INTERVAL. By default, the Elasticsearch DSL uses the date type as its only date and time related type, and that type contains all the information about an absolute time point. To integrate with the PPL language, each of the types other than timestamp holds only part of the temporal or timezone information. The distinction between the date and time types shows up in the datetime functions (see `Functions <functions.rst>`_ for details), some of which restrict the types of their input arguments.


Date
----

Date represents a calendar date regardless of the time zone. A given date value represents a 24-hour period, that is, a day, but this period varies across timezones and may have a different number of hours under Daylight Saving Time. The date type does not contain time information. The supported range is listed in the table below.

+------+--------------+------------------------------+
| Type | Syntax       | Range                        |
+======+==============+==============================+
| Date | 'yyyy-MM-dd' | '0001-01-01' to '9999-12-31' |
+------+--------------+------------------------------+
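
As a rough illustration of the ``'yyyy-MM-dd'`` syntax, the sketch below parses date literals with Python's standard ``datetime.date``. This is not how the plugin validates dates; ``parse_date_literal`` is a hypothetical helper, and Python's ``date`` type simply happens to span the same range as the table above.

```python
from datetime import date


def parse_date_literal(text: str) -> date:
    """Parse a 'yyyy-MM-dd' literal; raises ValueError for invalid dates."""
    return date.fromisoformat(text)


print(parse_date_literal("2020-08-17"))  # 2020-08-17
print(date.min, date.max)                # 0001-01-01 9999-12-31

try:
    parse_date_literal("2020-13-01")     # month 13 does not exist
except ValueError as err:
    print("rejected:", err)
```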


Time
----

Time represents the time on a clock, with no regard for the timezone it might relate to. Time type data does not contain date information.

+------+-----------------------+----------------------------------------+
| Type | Syntax                | Range                                  |
+======+=======================+========================================+
| Time | 'hh:mm:ss[.fraction]' | '00:00:00.000000' to '23:59:59.999999' |
+------+-----------------------+----------------------------------------+


Datetime
--------

Datetime type is the combination of date and time. The rules for converting a date or time to a datetime are described in `Conversion between date and time types`_. Datetime type does not contain timezone information. For an absolute time point that contains date, time, and timezone information, see `Timestamp`_.

+----------+----------------------------------+--------------------------------------------------------------+
| Type     | Syntax                           | Range                                                        |
+==========+==================================+==============================================================+
| Datetime | 'yyyy-MM-dd hh:mm:ss[.fraction]' | '0001-01-01 00:00:00.000000' to '9999-12-31 23:59:59.999999' |
+----------+----------------------------------+--------------------------------------------------------------+



Timestamp
---------

A timestamp instance is an absolute instant, independent of timezone or convention. For a given point in time, expressing the timestamp in another timezone changes the displayed value accordingly. Timestamps are also stored differently from the other types: a timestamp is converted from the current timezone to UTC for storage, and converted back from UTC to the set timezone on retrieval.

+-----------+----------------------------------+------------------------------------------------------------------+
| Type      | Syntax                           | Range                                                            |
+===========+==================================+==================================================================+
| Timestamp | 'yyyy-MM-dd hh:mm:ss[.fraction]' | '0001-01-01 00:00:01.000000' UTC to '9999-12-31 23:59:59.999999' |
+-----------+----------------------------------+------------------------------------------------------------------+
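
The store-in-UTC, read-in-session-timezone behaviour described above can be sketched with Python's standard library. This is an analogy under stated assumptions, not the plugin's storage code: ``SESSION_TZ`` is an assumed session timezone (UTC-08:00), and ``to_storage`` and ``from_storage`` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

# Assumed session timezone for this sketch: a fixed UTC-08:00 offset.
SESSION_TZ = timezone(timedelta(hours=-8))


def to_storage(value: datetime) -> datetime:
    """Convert a timezone-aware value to UTC before storing it."""
    return value.astimezone(timezone.utc)


def from_storage(stored: datetime, session_tz: timezone) -> datetime:
    """Convert a stored UTC value back to the session timezone."""
    return stored.astimezone(session_tz)


written = datetime(2020, 8, 17, 14, 9, 0, tzinfo=SESSION_TZ)
stored = to_storage(written)
print(stored.isoformat())  # 2020-08-17T22:09:00+00:00

# The wall-clock reading differs, but the instant is the same.
read_back = from_storage(stored, SESSION_TZ)
print(read_back == written)  # True
```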


Interval
--------

Interval data type represents a temporal duration or a period. The syntax is as follows:

+----------+--------------------+
| Type     | Syntax             |
+==========+====================+
| Interval | INTERVAL expr unit |
+----------+--------------------+

The expr is any expression that eventually evaluates to a quantity value; see `Expressions <expressions.rst>`_ for details. The unit specifies how to interpret the quantity and is one of MICROSECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER and YEAR. The INTERVAL keyword and the unit specifier are not case sensitive. Note that there are two classes of intervals. Year-week intervals can store years, quarters, months and weeks. Day-time intervals can store days, hours, minutes, seconds and microseconds. An interval is comparable only with another interval of the same class: year-week with year-week, and day-time with day-time.
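
The two interval classes and the comparability rule can be sketched as follows. The grouping of units follows the paragraph above; the names ``interval_class`` and ``comparable`` are hypothetical and do not correspond to plugin functions.

```python
# Unit groups as described in the text above.
YEAR_WEEK_UNITS = {"YEAR", "QUARTER", "MONTH", "WEEK"}
DAY_TIME_UNITS = {"DAY", "HOUR", "MINUTE", "SECOND", "MICROSECOND"}


def interval_class(unit: str) -> str:
    """Classify an interval unit; the unit specifier is not case sensitive."""
    normalized = unit.upper()
    if normalized in YEAR_WEEK_UNITS:
        return "year-week"
    if normalized in DAY_TIME_UNITS:
        return "day-time"
    raise ValueError(f"unknown interval unit: {unit!r}")


def comparable(unit_a: str, unit_b: str) -> bool:
    """Two intervals are comparable only if they belong to the same class."""
    return interval_class(unit_a) == interval_class(unit_b)


print(comparable("month", "YEAR"))  # True
print(comparable("day", "week"))    # False
```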


Conversion between date and time types
--------------------------------------

The date and time types, except interval, can generally be converted to each other, but the conversion may alter the value or lose information, for example when extracting the time value from a datetime value or converting a date value to a datetime value. The conversion rules that the PPL plugin supports for each type are summarized below:

Conversion from DATE
>>>>>>>>>>>>>>>>>>>>

- Since the date value does not have any time information, conversion to `Time`_ type is not useful, and will always return a zero time value '00:00:00'.

- Conversion from date to datetime fills in the missing time information: it attaches the time '00:00:00' to the original date by default and forms a datetime instance. For example, converting date '2020-08-17' to datetime yields datetime '2020-08-17 00:00:00'.

- Conversion to timestamp fills in both the time value and the timezone information: it attaches the zero time value '00:00:00' and the session timezone (UTC by default) to the date. For example, converting date '2020-08-17' to timestamp with session timezone UTC yields timestamp '2020-08-17 00:00:00' UTC.
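
The fill-up rules for DATE can be mirrored in Python as a minimal sketch. The helpers ``date_to_datetime`` and ``date_to_timestamp`` are hypothetical names; a timezone-aware ``datetime`` stands in for the PPL timestamp type.

```python
from datetime import date, datetime, time, timezone

MIDNIGHT = time(0, 0, 0)  # the zero time value attached to a date


def date_to_datetime(d: date) -> datetime:
    """Attach '00:00:00' to a date; the result carries no timezone."""
    return datetime.combine(d, MIDNIGHT)


def date_to_timestamp(d: date, session_tz: timezone = timezone.utc) -> datetime:
    """Attach '00:00:00' and the session timezone (UTC by default)."""
    return datetime.combine(d, MIDNIGHT, tzinfo=session_tz)


d = date(2020, 8, 17)
print(date_to_datetime(d))   # 2020-08-17 00:00:00
print(date_to_timestamp(d))  # 2020-08-17 00:00:00+00:00
```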


Conversion from TIME
>>>>>>>>>>>>>>>>>>>>

- A time value cannot be converted to any other date and time type because it does not contain any date information, and a date, datetime or timestamp instance is not meaningful without a date.


Conversion from DATETIME
>>>>>>>>>>>>>>>>>>>>>>>>

- Conversion from datetime to date extracts the date part from the datetime value. For example, converting datetime '2020-08-17 14:09:00' to date yields date '2020-08-17'.

- Conversion to time extracts the time part from the datetime value. For example, converting datetime '2020-08-17 14:09:00' to time yields time '14:09:00'.

- Since the datetime type does not contain timezone information, conversion to timestamp fills in the timezone part with the session timezone. For example, converting datetime '2020-08-17 14:09:00' with a session timezone of UTC to timestamp yields timestamp '2020-08-17 14:09:00' UTC.
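
The three DATETIME conversions have close analogues in Python's ``datetime`` API. The sketch below uses a naive ``datetime`` for the PPL datetime type and assumes a session timezone of UTC; it illustrates the rules and is not plugin code.

```python
from datetime import datetime, timezone

dt = datetime(2020, 8, 17, 14, 9, 0)  # naive value: no timezone attached

as_date = dt.date()                             # keep only the date part
as_time = dt.time()                             # keep only the time part
as_timestamp = dt.replace(tzinfo=timezone.utc)  # fill in the session timezone

print(as_date)       # 2020-08-17
print(as_time)       # 14:09:00
print(as_timestamp)  # 2020-08-17 14:09:00+00:00
```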


Conversion from TIMESTAMP
>>>>>>>>>>>>>>>>>>>>>>>>>

- Conversion from timestamp is more straightforward. Conversion to date extracts the date value, and conversion to time extracts the time value. Conversion to datetime extracts the datetime value and drops the timezone information. For example, converting timestamp '2020-08-17 14:09:00' UTC to date yields date '2020-08-17', to time yields time '14:09:00', and to datetime yields datetime '2020-08-17 14:09:00'.
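
Dropping the timezone when converting a timestamp to a datetime can be pictured as follows. This is an illustrative Python sketch in which a timezone-aware ``datetime`` stands in for the PPL timestamp type; ``timestamp_to_datetime`` is a hypothetical name.

```python
from datetime import datetime, timezone


def timestamp_to_datetime(ts: datetime) -> datetime:
    """Keep the date and time fields and drop the timezone information."""
    return ts.replace(tzinfo=None)


ts = datetime(2020, 8, 17, 14, 9, 0, tzinfo=timezone.utc)
print(ts.date())                  # 2020-08-17
print(ts.time())                  # 14:09:00
print(timestamp_to_datetime(ts))  # 2020-08-17 14:09:00
```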


String Data Types
=================

TODO

93 changes: 93 additions & 0 deletions docs/experiment/ppl/general/identifiers.rst
@@ -0,0 +1,93 @@
===========
Identifiers
===========

.. rubric:: Table of contents

.. contents::
   :local:
   :depth: 2


Introduction
============

Identifiers are used for naming your database objects, such as index names, field names, and aliases. There are two types of identifiers: regular identifiers and delimited identifiers.


Regular Identifiers
===================

Description
-----------

A regular identifier is a string of characters that must start with an ASCII letter (lower or upper case). Subsequent characters can be any combination of letters, digits, and underscores (``_``). It cannot be a reserved keyword, and whitespace and other special characters are not allowed.
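
The rule for regular identifiers can be written as a regular expression. The sketch below is an approximation of the description above, not the plugin's actual grammar: ``is_regular_identifier`` is a hypothetical helper, and the set of reserved keywords is left to the caller because it is not listed here.

```python
import re

# An ASCII letter first, then any mix of letters, digits, and underscores.
_REGULAR = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_regular_identifier(name: str, reserved=frozenset()) -> bool:
    """Check the regular-identifier rule; `reserved` is supplied by the caller."""
    return _REGULAR.fullmatch(name) is not None and name not in reserved


print(is_regular_identifier("account_number"))  # True
print(is_regular_identifier("2fast"))           # False: starts with a digit
print(is_regular_identifier("acc*"))            # False: special character
# "fields" is used here only as an assumed example of a reserved word.
print(is_regular_identifier("fields", reserved={"fields"}))  # False
```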

Examples
--------

Here is an example of using an index name and field names directly, without quotes::

    od> source=accounts | fields account_number, firstname, lastname;
    fetched rows / total rows = 4/4
    +------------------+-------------+------------+
    | account_number   | firstname   | lastname   |
    |------------------+-------------+------------|
    | 1                | Amber       | Duke       |
    | 6                | Hattie      | Bond       |
    | 13               | Nanette     | Bates      |
    | 18               | Dale        | Adams      |
    +------------------+-------------+------------+


Delimited Identifiers
=====================

Description
-----------

A delimited identifier is an identifier enclosed in back ticks `````. In this case, the enclosed identifier is not necessarily a regular identifier; in other words, it can contain special characters that a regular identifier does not allow. For Elasticsearch, the following identifiers are additionally supported:

1. Identifiers prefixed by dot ``.``: this is called a hidden index in Elasticsearch, for example ``.kibana``.
2. Identifiers prefixed by at sign ``@``: this is common for meta fields generated in Logstash ingestion.
3. Identifiers with ``-`` in the middle: this is mostly the case for index names with date information.
4. Identifiers with star ``*`` present: this is mostly an index pattern for wildcard match.
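
A client that builds queries might decide when to add back ticks along these lines. This is a minimal sketch under stated assumptions: ``quote_if_needed`` is a hypothetical helper, the sample names ``@timestamp`` and ``logs-2020-12-02`` are made up for illustration, reserved keywords are ignored, and escaping a back tick inside a name is deliberately not handled.

```python
import re

# Same approximation of the regular-identifier rule: letter, then
# letters, digits, or underscores.
_REGULAR = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def quote_if_needed(name: str) -> str:
    """Wrap a name in back ticks unless it is already a regular identifier."""
    if _REGULAR.fullmatch(name):
        return name
    if "`" in name:
        raise ValueError("back ticks inside a name are not handled in this sketch")
    return f"`{name}`"


for name in ("accounts", ".kibana", "@timestamp", "logs-2020-12-02", "acc*"):
    print(quote_if_needed(name))
```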

Use Cases
---------

Here are typical examples of the use of delimited identifiers:

1. Identifiers that have the same name as a reserved keyword.
2. Identifiers with dot ``.`` present: like ``-`` in an index name that includes date information, the identifier must be quoted so that the parser can distinguish it from an identifier with qualifiers.
3. Identifiers with other special characters: Elasticsearch has its own rules, which allow more special characters; for example, Unicode characters are supported in index names.

Examples
--------

Here is an example of quoting an index name and a field name with back ticks::

    od> source=`acc*` | fields `account_number`;
    fetched rows / total rows = 4/4
    +------------------+
    | account_number   |
    |------------------|
    | 1                |
    | 6                |
    | 13               |
    | 18               |
    +------------------+


Case Sensitivity
================

Description
-----------

Identifiers are treated in a case-sensitive manner, so an identifier must exactly match what is stored in Elasticsearch.

Examples
--------

For example, if you run ``source=Accounts``, the plugin throws an index-not-found exception because the actual index name is in lower case.
6 changes: 6 additions & 0 deletions docs/experiment/ppl/index.rst
@@ -61,3 +61,9 @@ The query start with search command and then flowing a set of command delimited
* **Functions**

- `PPL Functions <../../user/dql/functions.rst>`_

* **Language Structure**

- `Identifiers <general/identifiers.rst>`_

- `Data Types <general/datatypes.rst>`_
