Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nanoarrow 0.6.0 release post #545

Merged
merged 5 commits into from
Oct 23, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
289 changes: 289 additions & 0 deletions _posts/2024-10-07-nanoarrow-0.6.0-release.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,289 @@
---
layout: post
title: "Apache Arrow nanoarrow 0.6.0 Release"
date: "2024-10-07 00:00:00"
author: pmc
categories: [release]
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

The Apache Arrow team is pleased to announce the 0.6.0 release of
Apache Arrow nanoarrow. This release covers 114 resolved issues from
10 contributors.

## Release Highlights

- Run End Encoding support
- StringView support
- IPC Write support
- DLPack/device support
- IPC/Device available from CMake/Meson as feature flags

See the
[Changelog](https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.6.0/CHANGELOG.md)
for a detailed list of contributions to this release.

## Breaking Changes

Most changes included in the nanoarrow 0.6.0 release will not break downstream
code; however, two changes with respect to packaging and distribution may require
users to update the code used to bring nanoarrow in as a dependency.

In nanoarrow 0.5.0 and earlier, the bundled single-file amalgamation was included in
the `dist/` subdirectory or could be generated using a specially-crafted CMake
command. The nanoarrow 0.6.0 release removes the pre-compiled includes and migrates
the code used to generate it to Python. This setup is less confusing for contributors
(whose editors would frequently jump into the wrong `nanoarrow.h`) and is a less confusing
use of CMake. Users can generate the `dist/` subdirectory as it previously existed
with:

``` shell
python ci/scripts/bundle.py \
--source-output-dir=dist \
--include-output-dir=dist \
--header-namespace= \
--with-device \
--with-ipc \
--with-testing \
--with-flatcc
```

Second, the Arrow IPC and ArrowDeviceArray implementations previously lived in the `extensions/`
subdirectory of the repository. This was helpful during the initial development of these
features; however, the nanoarrow 0.6.0 release added the requisite feature coverage and testing
such that the appropriate home for them is now the main `src/` directory. As such, one
can now build nanoarrow with IPC and/or device support using:

``` shell
cmake -S . -B build -DNANOARROW_IPC=ON -DNANOARROW_DEVICE=ON
```

## Features

### Float16, StringView, and REE Support

The nanoarrow 0.6.0 release adds support for Arrow's float16 (half float), string view,
and run-end encoding support. The C library supports building float16 arrays using
`ArrowArrayAppendDouble()` and supports building string view and binary view arrays
using `ArrowArrayAppendString()` and/or `ArrowArrayAppendBytes()` and supports consuming
using `ArrowArrayViewGetStringUnsafe()` and/or `ArrowArrayViewGetBytesUnsafe()`. R and
Python users can request a string view or float16 type when building an array, and
conversion back to R/Python strings is suppored.

``` python
# pip install nanoarrow
# conda install nanoarrow -c conda-forge
import nanoarrow as na

na.Array(["abc", "def", None], na.string_view())
#> nanoarrow.Array<string_view>[3]
#> 'abc'
#> 'def'
#> None
na.Array([1, 2, 3], na.float16())
#> nanoarrow.Array<half_float>[3]
#> 1.0
#> 2.0
#> 3.0
```

``` r
# install.packages("nanoarrow")
library(nanoarrow)

as_nanoarrow_array(c("abc", "def", NA), schema = na_string_view()) |>
convert_array()
#> [1] "abc" "def" NA
as_nanoarrow_array(c(1, 2, 3), schema = na_half_float()) |>
convert_array()
#> [1] 1 2 3
```

Creating/consuming run-end encoding arrays by element is not yet
supported in C, R, or Python; however, arrays can be built or consumed by assembling
the correct array/buffer structure in C.

Thank you to [cocoa-xu](https://github.com/cocoa-xu) for adding float16 and run-end encoding
support and thank you to [WillAyd](https://github.com/WillAyd) for adding string view support!

### IPC Write Support

The nanoarrow library has supported reading
[Arrow IPC streams](https://arrow.apache.org/docs/format/Columnar.html)
since 0.4.0; however, could not write streams of its own. The nanoarrow 0.6.0 release adds
support for stream writing from C using the `ArrowIpcWriter` and stream writing
from R and Python:

```python
import io
import nanoarrow as na
from nanoarrow import ipc

out = io.BytesIO()
with ipc.StreamWriter.from_writable(out) as writer:
writer.write_stream(ipc.InputStream.example())

out.seek(0)
na.ArrayStream.from_readable(out).read_all()
#> nanoarrow.Array<non-nullable struct<some_col: int32>>[3]
#> {'some_col': 1}
#> {'some_col': 2}
#> {'some_col': 3}
```

``` r
library(nanoarrow)

tf <- tempfile()
nycflights13::flights |> write_nanoarrow(tf)

read_nanoarrow(tf) |> tibble::as_tibble()
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
```

As a result of the IPC write support, nanoarrow now joins the Arrow IPC integration tests
to ensure compatability across implementations. With the exception of
[arrow-rs due to a bug in the Rust flatbuffers implementation](https://github.com/apache/arrow-rs/issues/5052),
nanoarrow is now tested against all participating Arrow implementations with every commit.

A huge thank you to [bkietz](https://github.com/bkietz) for implementing this support and
the tests (which included multiple bugfixes and identification of inconsistencies of
flatbuffer verification in C, Rust, and C++!).

### DLPack/CUDA Support

The nanoarrow 0.6.0 release includes improved support for the
[Arrow C Device data interface](https://arrow.apache.org/docs/format/CDeviceDataInterface.html).
In particular, the CUDA device implementation was improved to more efficiently coordinate
synchronization when copying arrays to/from the GPU and migrated to use the driver API
for wider compatibility. The nanoarrow Python bindings have limited support for creating
`ArrowDeviceArray` wrappers that implement the
[`__arrow_c_device_array__` protocol](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#export-protocol)
from anything that implements DLPack:

``` python
# Currently requires:
# export NANOARROW_PYTHON_CUDA=/usr/local/cuda
# pip install --force-reinstall --no-binary=":all:" nanoarrow
import nanoarrow as na
from nanoarrow import device
import cupy as cp

device.c_device_array(cp.array([1, 2, 3]))
#> <nanoarrow.device.CDeviceArray>
#> - device_type: CUDA <2>
#> - device_id: 0
#> - array: <nanoarrow.c_array.CArray int64>
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers: (0, 133980798058496)
#> - dictionary: NULL
#> - children[0]:

darray = device.c_device_array(cp.array([1, 2, 3]))
cp.from_dlpack(darray.array.view().buffer(1))
#> array([1, 2, 3])
```

Thank you to [AlenkaF](https://github.com/AlenkaF), [shwina](https://github.com/shwina),
and [danepitkin](https://github.com/danepitkin) for their contributions to and
review of this feature!

### Build System Support for IPC/Device

Lastly, the CMake build system was refactored to enable `FetchContent` to
work in an even wider variety of
[develop/build/install scenarios](https://github.com/apache/arrow-nanoarrow/tree/main/examples/cmake-scenarios). In most cases, CMake-based projects should be able
to add the nanoarrow C library with device and/or IPC support as a dependency with:

``` cmake
include(FetchContent)

# If required:
# set(NANOARROW_IPC ON)
# set(NANOARROW_DEVICE ON)
fetchcontent_declare(nanoarrow
URL "https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/nanoarrow-0.6.0/apache-arrow-0.6.0.tar.gz")
fetchcontent_makeavailable(nanoarrow)

add_executable(some_target ...)
target_link_libraries(some_target
PRIVATE
nanoarrow::nanoarrow
# If needed
# nanoarrow::nanoarrow_ipc
# nanoarrow::nanoarrow_device
)
```

Linking against nanoarrow installed via `cmake --install` and located
via `find_package()` is also supported.

Users of the Meson build system can install the latest nanoarrow with:

``` shell
mkdir subprojects
meson wrap install nanoarrow
```

...and declared as a dependency with:

``` shell
nanoarrow_dep = dependency('nanoarrow')
example_exec = executable('example_meson_minimal_app',
'src/app.cc',
dependencies: [nanoarrow_dep])
```

## Contributors

This release consists of contributions from 10 contributors in addition
to the invaluable advice and support of the Apache Arrow community.

```console
$ git shortlog -sn apache-arrow-nanoarrow-0.6.0.dev..apache-arrow-nanoarrow-0.6.0 | grep -v "GitHub Actions"
64 Dewey Dunnington
19 William Ayd
16 Benjamin Kietzman
5 Cocoa
2 Abhishek Singh
1 Ashwin Srinath
1 Dane Pitkin
1 Jacob Wujciak-Jens
1 Matt Topol
1 Tao Zuhong
```