Replies: 3 comments 8 replies
-
@alamb I would like to ask a few questions.
For example, most of ideas from DuckDB, ClickHouse, Vertica or Azure can make an array a more complete data type. |
Beta Was this translation helpful? Give feedback.
-
Hi @izveigor -- I read through the description again -- while I would not say I followed everything what I did made sense to me. Thank you for writing it. I think the biggest thing that is unclear to me is what are you proposing. Is it the current (postgres based) set of functions / operators for working with Or perhaps you mean this document to serve as the intellectual foundation for a coherent design for List support in DataFusion and not really a change at all. BTW I have a blog post in the works about the various types of structured arrays supported by Arrow (ListArray, StructArray, MapArray) that can hopefully help provide some more background for this discussion too. |
Beta Was this translation helpful? Give feedback.
-
Thanks for this @izveigor. I found your syntax summary and the comparison matrix very useful. I think we should have an array subpage in Datafusion docs that presents our syntax along with your comparison matrix -- I'm sure others will find it useful too. |
Beta Was this translation helpful? Give feedback.
-
Foreword
The article is about arrays in DataFusion. I decided to write it due to this topic become more popular among users and developers.
Active development is currently underway, dedicated to the formatio formation of an array as a full-fledged tool for working with information. I can not call the development itself successful, I think the main reason is the absence of an unified development concept. So, the main goal of this article is to draw up a general concept and give meaning to the technology for the user.
Backstory
Let's start from the beginning and how exactly the idea of implementing arrays was born. Prior to my introduction to DataFusion there were one array function (
make_array
) and one internal function (unnest_columns
). After that, there was a lull in this topic until a user with a nickname @bubbajoe brought this topic up again. During the discussions, we came to the conclusion that it is necessary to implement some array functions, and the relational database PostgreSQL was chosen as a reference. Most of the features have been implemented in PR: #6384. After the merge, we begin to realize that nested data types are not simple, since many DataFusion mechanisms are not adapted to work with them.The naive implementation of array functions brought a lot of problems and ambiguities. It is related to the fact that the most developers (including me) do not understand how arrays work, which is why they cannot solve the problem correctly. I see the solution to the problem in the use of some concepts of development ethics, because the development of such a voluminous topic should be introduced reasonably.
About development ethics
Let's start with the first mistake that has been made in naive implementation - not having a plan. The development was introduced incrementally, any feature was made taking into the account by individual need. This led to an increase in the cost of development, the lack of communication between individual features. Therefore, in my opinion, further development should go according to the plan. It will also helps to avoid the parallelism work that constantly arises during the wrong order of implementation of specific features.
New development plan
This part will be devoted to developing a plan for solving array problems.
This part is not complete, so more information will be added here over time.
The main postulates DataFusion
In order to form the principles by which arrays will work, it is necessary to understand the areas of application of this technology. Let's start with the purpose of the existence of DataFusion technology. @alamb repeatedly cited similar reasoning in his discussions. Key ideas include:
About array
Basic postulates:
ListArray
(Apache Arrow);FixedSizeListArray
,LargeList
andListArray
transform toListArray
;List
objects are nested;Thus we can conclude that the main characteristics of the array are:
About array functions:
Basic postulates:
Why are arrays needed?
Now let's see what is the main point of using an array from the user's point of view. The main points of use are:
So, an array can be represented as:
About the features of the user query language
DataFusion adheres to the principles of SQL databases. Every command in SQL is atomic. The set of SQL command results forms a query language.
A language can be composed of many other languages.
Now let's figure out which languages will arrange work with an array:
Language DML
Let's start with the data manipulation language. For effective. Its principles:
Set the characteristics of the array:
In quantity:
By quality:
Based on the characteristics of the array, we obtain the corresponding functions.
Corresponding index table:
array_element
array_insert
SET array[i] = value
array_delete
array_slice
array_insert
SET array[i:j] = value
array_delete
Corresponding table by values:
array_position(array, [i])
array_resize(x, 1)
array_replace(array, from, to)
array_remove(array, element)
array_positions(array)[:i]
array_resize(x, n)
array_replace(array, from, to, max)
array_remove(array, element, max)
array_positions(array)
array_resize(x, array_length(array))
array_replaces(array, from, to)
array_removes(array, element)
Corresponding table by borders:
array_first
array_append
SET array[0] = value
array_pop_front
array_last
array_prepend
SET array[array_length(array) - 1] = value
array_pop_back
array_slice(0, n)
array_insert(0, n)
SET array[0:n] = value
array_delete(0, n)
array_slice(n, array_length(array))
array_insert(n, array_length(array))
SET array[n:array_length(array) - 1] = value
array_delete(n, array_length(array))
Corresponding table by predicates:
array_filter
array_transform
array_filter
Set Language
An array can be viewed not only as a vector, but also as a set.
Basic set operations
array_concat
orarray_union
)array_intersect
or&&
)array_except
)All subsequent operations can be derived from these concepts.
Cartesian product:
array_zip
Example (ClickHouse):
Checking if a subset belongs to a set
array_has_all
Checking if any element of another set belongs to a set
array_has_any
Checking if an element belongs to a set
array_has
orarray_contains
About aliases
My postulate regarding aliases is simple: an alias is admissible under conditions:
So, I propose to introduce the alias
list_
for each function as a mutable object andarray_
as an immutable object.Other aliases of functions should be assigned based on other technologies (for instance, the PostgreSQL function
array_cat
is an alias forarray_concat
).Sort
An array can be sorted in two ways: left to right (
array_sort
) and right to left (array_reverse_sort
).About aggregate functions
Based on the postulate - convinience from the user's point of view, all aggregate functions must be adapted to the array.
Two approaches can be distinguished:
unnest
)list_sum
)list_aggregate
)I think all these ways should be used.
Working with Array Characteristics
Characteristics can be divided into categories:
Quality:
Quantity:
For qualitative categories, the following behavior corresponds:
For quantitative categories, the following behavior corresponds:
Data type
array_length
);CAST
);Length
array_length
);array_element(array, 0)
);cardinality
);array_dims
);Emptiness
empty
);Dimension
array_ndims
);flatten
);Homogeneity
NOT (array_filter(array, x -> x IS NULL))
);array_filter(array, x -> x IS NOT NULL)
);Uniqueness
array_distinct(array) == array
);array_distinct
);Documentation
After the implementation of each function, the corresponding documentation must be written.
Documentation should have the following features:
Other applications
For other areas (such as optimization, hashing, and so on), should be consistent with the general concept, but it is considered redundant to write about it here.
Criticism of various array implementations among other databases
PostgreSQL
Pros:
array_ndims
,array_dims
...);Cons:
It is diffucult to represent an array as a set in PostgreSQL, so its language is limited in the following operations:
Other restrictions include:
DuckDB
Pros:
Cons:
ClickHouse
Pros:
Cons:
Comparison
Here we will consider three binary relations: an equivalence relation, a parial order relation, and a strict order relation.
With order relationships (partial and strict), the results are based on the first other pair of elements, not on the sizes of the arrays.
With equivalence relations, results are based on exact element-by-element comparison (
=
,!=
).Operators
@>
=array_contains(first_array, second_array)
;<@
=array_contains(second_array, first_array)
;&&
=array_intersect(first_array, second_array)
;||
=array_concat(first_array, second_array)
array[i]
=array_element(array, i)
;array[i:j]
=array_slice(array, i, j)
;array[i] = value
=array_update(array, value, i)
;array[i:j] = values
=array_update_slice(array, values, i, j)
;<
,<=
,=
,!=
,>
,>=
(See: Comparison);[a1,...,an]
=make_array(a1,...,an)
;Reference: Examples of Difficult Scenarios
Example with stones
Let's say in some country there is a vote for the approval of some reform. Every citizen can vote for and against. The color of the stones will be used as the subject of the vote, namely black (against) and white (for). From the point of view of the array, this reconstruction will look like this, we have an array consisting of 0 (black color) and 1 (white color). However, a situation may arise the elections. For example, 2 white balls and 3 black balls were dishonestly counted in the vote. Having noticed the falsification, the state decides to remove or replace these stones with NULL.
Now let's try to solve these problems using different databases:
Initial list: [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
My decision
Other databases
There are no solutions using Turing incomplete declarative language (classic SQL) for databases like DuckDB, PostgreSQL.
List of array functions
List of array functions and their aliases
make_array
make_list
array_element
list_element
,array[i]
array_insert
array_insert
,list_insert
array_delete
list_delete
array_slice
array[i:j]
,list_slice
array_position
list_position
array_repeat
list_repeat
array_replace
list_replace
array_remove
list_remove
array_positions
list_positions
array_replaces
list_replaces
array_removes
list_removes
array_first
list_first
array_append
list_append
array_pop_front
list_pop_front
array_last
list_last
array_prepend
list_prepend
array_pop_back
list_pop_back
array_filter
list_filter
array_transform
list_transform
array_concat
list_concat
,array_cat
,list_cat
,array_union
,list_union
array_intersect
&&
,list_intersect
array_except
list_except
array_zip
list_zip
array_has
array_contains
,list_contains
,list_has
array_has_any
list_has_any
array_has_all
list_has_all
array_sort
list_sort
array_reverse_sort
list_reverse_sort
unnest
array_aggregate
list_aggr
,list_aggregate
,array_aggr
array_length
list_length
cardinality
array_dims
list_dims
empty
array_ndims
list_ndims
flatten
array_distinct
list_distinct
List of array functions and their counterparts from other databases
make_array
array[]
array
list_value
array
array_element
array[i]
array[i]
list_element(array, i)
arrayElement
array_insert
array_insert
array_slice
list_slice
array_position
array_position
array_position
array_position
indexOf
array_repeat
array_fill
array_repeat
arrayResize
array_replace
array_remove
array_positions
array_positions
array_replaces
array_replace
array_replace
array_transform
array_transform
array_removes
array_remove
array_remove
array_filter
array_filter
array_first
array[0]
array[0]
array[0]
arrayFirst
array_append
array_append
array_append
array_append
arrayPushBack
array_pop_front
array_pop_front
arrayPopFront
array_last
array[array_length(array) - 1]
array[array_length(array) - 1]
array[array_length(array) - 1]
array[length(array) - 1]
array_prepend
array_prepend
array_prepend
arrayPushFront
array_pop_back
array_pop_back
arrayPopBack
array_filter
array_filter
arrayFilter
array_transform
array_transform
arrayTransform
array_concat
array_concat
array_union
array_concat
arrayConcat
array_intersect
array_intersect
array_intersect
arrayIntersect
array_except
array_except
array_except
array_zip
arrays_zip
array_zip
array_has
array_contains
array_has
has
array_has_any
&&
arrays_overlap
array_has_any
hasAny
array_has_all
@>
or<@
array_has_all
hasAll
array_sort
array_sort
array_sort
arraySort
array_reverse_sort
array_sort
array_reverse_sort
arrayReverseSort
unnest
unnest
unnest
arrayJoin
array_aggregate
list_aggregate
array_length
array_length
array_length
array_length
length
cardinality
cardinality
array_dims
array_dims
empty
Empty
array_ndims
array_ndims
flatten
flatten
arrayFlatten
array_distinct
array_distinct
array_distinct
arrayDistinct
Sources:
Articles:
https://www.usenix.org/system/files/login/articles/login_winter18_08_khurana.pdf
Videos:
https://www.youtube.com/watch?v=kHpy64smpYQ - From flat files to deconstructed database The evolution and future of the big data ecosyst
Documentations:
Pull Requests:
#6384
Issues:
#6119
Discussions:
#6441
Beta Was this translation helpful? Give feedback.
All reactions