From 288fc4d3d3d2f7dcfcf665b6100d62a84a0e8250 Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Mon, 8 Nov 2021 16:30:39 -0500
Subject: [PATCH 1/7] Add DataFusion 6.0.0 blog

---
 _posts/2021-11-8-datafusion-6.0.0.md | 129 +++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)
 create mode 100644 _posts/2021-11-8-datafusion-6.0.0.md

diff --git a/_posts/2021-11-8-datafusion-6.0.0.md b/_posts/2021-11-8-datafusion-6.0.0.md
new file mode 100644
index 000000000000..4bcccbbe8f49
--- /dev/null
+++ b/_posts/2021-11-8-datafusion-6.0.0.md
@@ -0,0 +1,129 @@
+---
+layout: post
+title: Apache Arrow DataFusion 6.0.0 Release
+date: "2021-11-8 00:00:00"
+author: pmc
+categories: [release]
+---
+
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an embedded
+query engine which leverages the unique features of
+[Rust](https://www.rust-lang.org/) and [Apache
+Arrow](https://arrow.apache.org/) to provide a system that is high
+performance, easy to connect, easy to embed, and high quality.
+
+The Apache Arrow team is pleased to announce the DataFusion 6.0.0 release. This covers 4 months of development work
+and includes 122 commits from the following 28 distinct contributors.
+
+```
+# TODO update when we have a final 6.0 tag
+git shortlog -sn 5.0.0..87c8eaa datafusion datafusion-cli datafusion-examples
+    28  Andrew Lamb
+    25  Jiayu Liu
+     9  rdettai
+     8  QP Hou
+     5  carlos
+     4  Daniël Heres
+     4  Guillaume Balaine
+     4  Matthew Turner
+     4  Carlos
+     3  Francis Du
+     3  Jon Mease
+     3  Nga Tran
+     3  Marco Neumann
+     2  Andy Grove
+     2  Ruihang Xia
+     2  Yijie Shen
+     2  baishen
+     1  Krisztián Szűcs
+     1  Antoine Wendlinger
+     1  Qingping Hou
+     1  Conner Murphy
+     1  Taehoon Moon
+     1  Tiphaine Ruy
+     1  Jason Tianyi Wang
+     1  adsharma
+     1  Mike Seddon
+     1  Nan Zhu
+     1  Patrick More
+```
+
+
+
+The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes
+and improvements have been made: we refer you to the complete
+[changelog](https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/CHANGELOG.md).
+
+# New Website
+
+Befitting a growing project, DataFusion now has its
+[own website](https://arrow.apache.org/datafusion/) hosted as part of the
+main [Apache Arrow Website](https://arrow.apache.org)
+
+# Roadmap
+The community worked to gather their thoughts about where we are
+taking DataFusion into a public
+[Roadmap](https://arrow.apache.org/datafusion/specification/roadmap.html)
+for the first time
+
+# Performance
+TODO: Anything to report??
+
+# New Features
+
+- Support for `EXPLAIN ANALYZE` and a new runtime metric collection
+- DataFrame support for: `show`, `limit`,
+- Support for `trim ( [ LEADING | TRAILING | BOTH ] [ FROM ] string text [, characters text ] )` syntax
+- Support for Postgres style regular expression matching operators `~`, `~*`, `!~`, and `!~*`
+- Automatic schema inference for CSV files
+- Support for SQL set operators `UNION`, `INTERSECT`, and `EXCEPT`
+- `cume_dist`, `percent_rank`, Window Functions
+- `digest`, `blake2s`, `blake2b` functions
+- HyperLogLog based `approx_distinct`
+- `is distinct from` and `is not distinct from`
+- HIVE style partitioning support, for Parquet, CSV, Avro and Json files on local or remote storage
+- Generic constant evaluation and simplification framework
+- Support for `CREATE TABLE AS SELECT`
+- Better interactive editing support in `datafusion-cli` as well as `psql` style commands such as `\d`, `\?`, and `\q`
+- Support for accessing elements of `Struct` and `List` columns (e.g. `SELECT struct_column['field_name'] FROM ...`)
+
+
+# `async` Planning and Split File Format and Layout
+Driven by the need to support hive style metadata partitioning, the
+code for reading specific file formats (`Parquet`, `Avro`, `CSV`, and
+`JSON`) was separated from the logic that handles grouping sets of
+files into execution partitions, and the process was made
+`async`. This sets up DataFusion and its plug-in ecosystem to
+supporting remote catalogs and various object store implementations.
+
+
+# How to Get Involved
+
+If you are interested in contributing to DataFusion, we would love to have you! You
+can help by trying out DataFusion on some of your own data and projects and filing bug reports and helping to
+improve the documentation, or contribute to the documentation, tests or code. A list of open issues suitable for
+beginners is [here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
+and the full list is [here](https://github.com/apache/arrow-datafusion/issues).

From 9846e52c58ad73b3009cba1c2e6104385a99c175 Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Wed, 10 Nov 2021 16:11:35 -0500
Subject: [PATCH 2/7] Add direct links to async planning PR and doc

---
 _posts/2021-11-8-datafusion-6.0.0.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/_posts/2021-11-8-datafusion-6.0.0.md b/_posts/2021-11-8-datafusion-6.0.0.md
index 4bcccbbe8f49..7a0617287830 100644
--- a/_posts/2021-11-8-datafusion-6.0.0.md
+++ b/_posts/2021-11-8-datafusion-6.0.0.md
@@ -118,6 +118,9 @@ code for reading specific file formats (`Parquet`, `Avro`, `CSV`, and
 files into execution partitions, and the process was made
 `async`. This sets up DataFusion and its plug-in ecosystem to
 supporting remote catalogs and various object store implementations.
+You can read more about this change in the
+[design document](https://docs.google.com/document/d/1Bd4-PLLH-pHj0BquMDsJ6cVr_awnxTuvwNJuWsTHxAQ)
+and on the [arrow-datafusion#1010 PR](https://github.com/apache/arrow-datafusion/pull/1010).
 
 
 # How to Get Involved

From 714e6b0d30d330ba8bb40ee9352779270d4578ed Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Thu, 18 Nov 2021 07:06:40 -0500
Subject: [PATCH 3/7] Update _posts/2021-11-8-datafusion-6.0.0.md

Co-authored-by: Carlos
---
 _posts/2021-11-8-datafusion-6.0.0.md | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/_posts/2021-11-8-datafusion-6.0.0.md b/_posts/2021-11-8-datafusion-6.0.0.md
index 7a0617287830..003c32cdec35 100644
--- a/_posts/2021-11-8-datafusion-6.0.0.md
+++ b/_posts/2021-11-8-datafusion-6.0.0.md
@@ -36,36 +36,35 @@ The Apache Arrow team is pleased to announce the DataFusion 6.0.0 release. This
 and includes 122 commits from the following 28 distinct contributors.
 
 ```
-# TODO update when we have a final 6.0 tag
-git shortlog -sn 5.0.0..87c8eaa datafusion datafusion-cli datafusion-examples
+git shortlog -sn 5.0.0..7824a8d datafusion datafusion-cli datafusion-examples
     28  Andrew Lamb
-    25  Jiayu Liu
+    26  Jiayu Liu
+    13  xudong963
      9  rdettai
-     8  QP Hou
-     5  carlos
-     4  Daniël Heres
+     9  QP Hou
+     6  Matthew Turner
+     5  Daniël Heres
      4  Guillaume Balaine
-     4  Matthew Turner
-     4  Carlos
      3  Francis Du
+     3  Marco Neumann
      3  Jon Mease
      3  Nga Tran
-     3  Marco Neumann
-     2  Andy Grove
-     2  Ruihang Xia
      2  Yijie Shen
+     2  Ruihang Xia
+     2  Liang-Chi Hsieh
      2  baishen
-     1  Krisztián Szűcs
+     2  Andy Grove
+     2  Jason Tianyi Wang
+     1  Nan Zhu
      1  Antoine Wendlinger
-     1  Qingping Hou
+     1  Krisztián Szűcs
+     1  Mike Seddon
      1  Conner Murphy
+     1  Patrick More
      1  Taehoon Moon
      1  Tiphaine Ruy
-     1  Jason Tianyi Wang
      1  adsharma
-     1  Mike Seddon
-     1  Nan Zhu
-     1  Patrick More
+     1  lichuan6
 ```
 
 

 ```
-git shortlog -sn 5.0.0..7824a8d datafusion datafusion-cli datafusion-examples
     28  Andrew Lamb
     26  Jiayu Liu
     13  xudong963
@@ -67,11 +76,6 @@ git shortlog -sn 5.0.0..7824a8d datafusion datafusion-cli datafusion-examples
      1  lichuan6
 ```
-
-
 The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes
 and improvements have been made: we refer you to the complete
 [changelog](https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/CHANGELOG.md).

From f2d80395f8d850979efcae25a573bd1bc0b5190a Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Thu, 18 Nov 2021 07:23:29 -0500
Subject: [PATCH 5/7] Update _posts/2021-11-8-datafusion-6.0.0.md

---
 _posts/2021-11-8-datafusion-6.0.0.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2021-11-8-datafusion-6.0.0.md b/_posts/2021-11-8-datafusion-6.0.0.md
index e587792383e5..e766ea31ab50 100644
--- a/_posts/2021-11-8-datafusion-6.0.0.md
+++ b/_posts/2021-11-8-datafusion-6.0.0.md
@@ -1,7 +1,7 @@
 ---
 layout: post
 title: Apache Arrow DataFusion 6.0.0 Release
-date: "2021-11-8 00:00:00"
+date: "2021-11-18 00:00:00"
 author: pmc
 categories: [release]
 ---

From f8fb38dddd07b9ed4bc9012c7a0cf465023c2b53 Mon Sep 17 00:00:00 2001
From: Qingping Hou
Date: Thu, 18 Nov 2021 23:29:10 -0800
Subject: [PATCH 6/7] add more changelog

---
 _posts/2021-11-8-datafusion-6.0.0.md | 72 ++++++++++++++++++----------
 1 file changed, 48 insertions(+), 24 deletions(-)

diff --git a/_posts/2021-11-8-datafusion-6.0.0.md b/_posts/2021-11-8-datafusion-6.0.0.md
index e766ea31ab50..5897d23a9be2 100644
--- a/_posts/2021-11-8-datafusion-6.0.0.md
+++ b/_posts/2021-11-8-datafusion-6.0.0.md
@@ -37,7 +37,7 @@ and includes 134 commits from the following 28 distinct contributors.