diff --git a/_posts/2024-10-28-18.0.0-release.md b/_posts/2024-10-28-18.0.0-release.md new file mode 100644 index 000000000000..943c772eb14f --- /dev/null +++ b/_posts/2024-10-28-18.0.0-release.md @@ -0,0 +1,238 @@ +--- +layout: post +title: "Apache Arrow 18.0.0 Release" +date: "2024-10-28 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 18.0.0 release. This covers +over 3 months of development work and includes [**334 resolved issues**][1] +on [**530 distinct commits**][2] from [**89 distinct contributors**][2]. +See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 17.0.0 release, Will Ayd and Rossi Sun have been invited to be committer. +No new members have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar format + +The Arrow columnar format now allows 32-bit and 64-bit decimal data, in +addition to the already existing 128-bit and 256-bit decimal data types +(GH-43956). + + +## Arrow Flight RPC notes + +The Java implementation now transparently handles compressed Arrow data when reading, instead of requiring explicit configuration. (GH-43469) +The Ruby bindings now support implementing DoPut on the server. (GH-43814) + +## C++ notes + +The default memory pool has changed to mimalloc on all platforms (GH-43254). +Previously, jemalloc was used by default on Linux. Using mimalloc by default +provides a more consistent experience across different platforms, and +makes configuration easier. It is expected that this might either increase +or decrease performance on user workloads that use the default memory pool; +please benchmark accordingly. Jemalloc can still be selected by setting +the [`ARROW_DEFAULT_MEMORY_POOL`](https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL) environment variable to "jemalloc". + +A new class `arrow::ArrayStatistics` has been added to encode basic statistics +about an Arrow array. It provides a source-agnostic representation for statistics +provided by third-party sources such as Parquet files (GH-41909). + +The new Decimal32 and Decimal64 types have been made available (GH-43956). + +Several canonical extension types have been implemented: +- the [Opaque](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#opaque) extension type (GH-43454); +- the [8-bit boolean](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#bit-boolean) extension type (GH-17682); +- the [UUID](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#uuid) extension type (GH-15058); +- the [JSON](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#json) extension type (GH-32538). + +**Flight UCX is deprecated.** We plan to remove this experiment in the next couple of releases. + +### Acero + +- Enhanced the row-oriented representation by widening the offset type from 32-bit to 64-bit, resolving crashes and data corruption in aggregation and hash join on large datasets due to offset overflow (GH-43495). +- Improved ordered aggregation performance by reducing complexity from `O(n*m)` to `O(n)`, where `n` is the number of rows and `m` the number of segments in the batch (GH-44052). + +### Compute + +Casting between string-like and string-view-like types has been implemented (GH-42247). + +### Dataset + + +### Filesystems + +Writing small files to S3 can use a single S3 API call instead of three, +provided the new option `allow_delayed_open` is enabled (GH-40557). +Files larger than 5 MB still go through the regular multipart +upload mechanism. + +Background writes are now implemented and enabled by default for the Azure +filesystem, dramatically improving the performance of writing to remote files +(GH-40036). + +Finalization of the S3 filesystem layer should hopefully be more robust (GH-44071). + +### Gandiva + +LLVM 19.1 is now supported (GH-44222). + +### GPU + + +### IPC + +The seed corpus used for fuzzing the IPC reader has been improved, hopefully +helping make the IPC reader even more robust against corrupt or malicious +IPC streams (GH-38041). + +### Parquet + +A new command line utility `parquet-dump-footer` allows dumping the Thrift-encoded +footer metadata of a Parquet file, optionally scrubbing confidential data +(GH-42102). This is part of the effort to collect real-world Parquet metadata +so as to evaluate the efficiency of future improvements to the Parquet format. +Please see https://github.com/apache/parquet-benchmark for instructions to submit +footers representative of your own workloads. + + +## C# notes + +- Partial support has been added for LargeBinary, LargeString and LargeList. The column sizes cannot exceed 2 GB in length. (GH-43266). +- Changes to Flight support were made for better control and compatibility, and to allow Flight Server to be hosted in pre-Kestrel versions of .NET (GH-43907, GH-43672, GH-41347). +- Support has been added for newly-defined types decimal32 and decimal64 (GH-44271). +- The import of sliced arrays through the C Data interface now works correctly. (GH-43267) +## Java notes + +**Java 8 is no longer supported.** (GH-38051) + +**Gandiva may not work in this release.** For details, please see [GH-43576](https://github.com/apache/arrow/issues/43576). + +Basic support for RunEndEncoded was added (GH-39982). The ListView/StringView vector implementations are now more complete, including C Data support (multiple issues). + +Several APIs have been updated to accept `long` for addresses in preparation for FFM/large buffer support (GH-43902). We no longer expose `sun.misc.Unsafe` (GH-43479). We no longer ship the `shaded` flight-core JARs (GH-43217). + +More options were added to the Dataset ScanOptions API (GH-28866). + +## JavaScript notes +- Accessing individual rows in Tables or Structs should now be more performant ([GH-30863](https://github.com/apache/arrow/issues/30863)). + +## Python notes +Compatibility notes: +* NumPy required dependency has been removed from pyarrow packaging + [GH-43846](https://github.com/apache/arrow/issues/43846) and has been + made an optional runtime dependency [GH-25118](https://github.com/apache/arrow/issues/25118). +* Support for Python 3.8 has been dropped [GH-43518](https://github.com/apache/arrow/issues/43518) +* No longer used serialize/deserialize Pyarrow C++ functions have been + deprecated [GH-44063](https://github.com/apache/arrow/issues/44063). +* Passing of build flags to setup.py (e.g. `setup.py --with-parquet`) has been + deprecated [GH-43514](https://github.com/apache/arrow/issues/43514) + +New features: +* Non-cpu work has continued with [GH-43973](https://github.com/apache/arrow/issues/43973), + [GH-43728](https://github.com/apache/arrow/issues/43728), [GH-43727](https://github.com/apache/arrow/issues/43727), + [GH-43391](https://github.com/apache/arrow/issues/43391), + [GH-42222](https://github.com/apache/arrow/issues/42222) and + [GH-41665](https://github.com/apache/arrow/issues/41665). +* Arrow C++ ``arrow::dataset::Partitioning::Format`` method has been exposed in + Python [GH-43684](https://github.com/apache/arrow/issues/43684). +* UUID canonical extension type is now supported in Python + [GH-15058](https://github.com/apache/arrow/issues/15058). +* Opaque canonical extension type has been implemented + [GH-43454](https://github.com/apache/arrow/issues/43454). +* ``StructArray.from_array`` now accepts a type in addition to names or fields + [GH-42014](https://github.com/apache/arrow/issues/42014). +* New attributes have been added to ``StructType`` in order to access all its fields + [GH-30058](https://github.com/apache/arrow/issues/30058). + +Other improvements: +* In order to support free-threaded build of CPython 3.13 additional work has been made: + [GH-44046](https://github.com/apache/arrow/issues/44046), + [GH-44355](https://github.com/apache/arrow/issues/44355) and + [GH-43964](https://github.com/apache/arrow/issues/43964). Umbrella issue + [GH-43536](https://github.com/apache/arrow/issues/43536). +* PyCapsule interface now has precedence over others in pa.schema(..) + [GH-43388](https://github.com/apache/arrow/issues/43388). +* Usage of deprecated ``pkg_resources`` in setup.py has been replaced with + ``numpy.get_include()`` [GH-43532](https://github.com/apache/arrow/issues/43532). +* Conversion from Arrow to JAX via dlpack as added to the documentation examples + [GH-44229](https://github.com/apache/arrow/issues/44229). + +Relevant bug fixes: +* ``pyarrow.Table.rename_columns`` has been updated and should have accepted ``tuples``, + not only ``list`` or ``dict``. This has been fixed + [GH-43588](https://github.com/apache/arrow/issues/43588). +* Python reference handling in UDF implementation has been sanitized + [GH-43487](https://github.com/apache/arrow/issues/43487). +* Files included when building wheels have been cleaned (unnecessary files removed) + [GH-43299](https://github.com/apache/arrow/issues/43299). + +## R notes + +For more on what’s in the 18.0.0 R package, see the [R changelog][4]. + +## Ruby and C GLib notes + +### Ruby + +* Add workaround for install failure due to `re2.pc` on Ubuntu 20.04: [GH-41396](https://github.com/apache/arrow/issues/41396) +* Add support for `0` decimal value: [GH-43877](https://github.com/apache/arrow/issues/43877) + +C GLib related improvements are also available in Ruby. + +### C GLib + +* Add support for Azure file system: [GH-43738](https://github.com/apache/arrow/issues/43738) +* FlightRPC: Add support for DoPut: [GH-41056](https://github.com/apache/arrow/issues/41056) +* FlightRPC: Add support for timeout: [GH-44178](https://github.com/apache/arrow/issues/44178) +* Parquet: Add support for writing a record batch: [GH-40860](https://github.com/apache/arrow/issues/40860) +* Add support for pull style IPC stream format decoder: [GH-40493](https://github.com/apache/arrow/issues/40493) + +## Rust notes and Go notes + +The Rust and Go projects have moved to separate repositories outside the +main Arrow monorepo. For notes on the latest release of the Rust +implementation, see the latest [Arrow Rust changelog][5]. +For notes on the latest release of the Go implementation, see the latest +[Arrow Go changelog][6] + +## Linux packages notes + +The Azure file system is now enabled. + +[1]: https://github.com/apache/arrow/milestone/64?closed=1 +[2]: {{ site.baseurl }}/release/18.0.0.html#contributors +[3]: {{ site.baseurl }}/release/18.0.0.html#changelog +[4]: {{ site.baseurl }}/docs/r/news/ +[5]: https://github.com/apache/arrow-rs/tags +[6]: https://github.com/apache/arrow-go/tags