Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Commit

Permalink
More improvements.
Browse files Browse the repository at this point in the history
  • Loading branch information
jorgecarleitao committed Oct 2, 2021
1 parent 2be5d34 commit a5bbffe
Show file tree
Hide file tree
Showing 11 changed files with 110 additions and 61 deletions.
19 changes: 0 additions & 19 deletions guide/src/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,22 +10,3 @@ Arrow2 is divided into three main parts:
* a [low-level API](./low_level.md) to efficiently operate with contiguous memory regions;
* a [high-level API](./high_level.md) to operate with arrow arrays;
* a [metadata API](./metadata.md) to declare and operate with logical types and metadata.

## Cargo features

This crate has a significant number of cargo features to reduce compilation
time and number of dependencies. The feature `"full"` activates most
functionality, such as:

* `io_ipc`: to interact with the IPC format
* `io_ipc_compression`: to read and write compressed IPC (v2)
* `io_csv` to read and write CSV
* `io_json` to read and write JSON
* `io_parquet` to read and write parquet
* `io_parquet_compression` to read and write compressed parquet
* `io_print` to write batches to formatted ASCII tables
* `compute` to operate on arrays (addition, sum, sort, etc.)

The feature `simd` (not part of `full`) produces more explicit SIMD instructions
via [`packed_simd`](https://github.com/rust-lang/packed_simd), but requires the
nightly channel.
2 changes: 1 addition & 1 deletion guide/src/arrow.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Introduction

Welcome to the Arrow2 guide for the Rust programming language. This guide was
Welcome to the Arrow2 guide for the Rust programming language. This guide was
created to help you become familiar with the Arrow2 crate and its
functionalities.

Expand Down
26 changes: 19 additions & 7 deletions guide/src/compute.md
Original file line number Diff line number Diff line change
@@ -1,12 +1,18 @@
# Compute API

When compiled with the feature `compute`, this crate offers a wide range of functions to perform both vertical (e.g. add two arrays) and horizontal (compute the sum of an array) operations.
When compiled with the feature `compute`, this crate offers a wide range of functions
to perform both vertical (e.g. add two arrays) and horizontal
(compute the sum of an array) operations.

```rust
{{#include ../../examples/arithmetics.rs}}
```
The overall design of the `compute` module is that it offers two APIs:

* statically typed, such as `sum_primitive<T>(&PrimitiveArray<T>) -> Option<T>`
* dynamically typed, such as `sum(&dyn Array) -> Box<dyn Scalar>`

the dynamically typed API usually has a function `can_*(&DataType) -> bool` denoting whether
the operation is defined for the particular logical type.

An overview of the implemented functionality.
Overview of the implemented functionality:

* arithmetics, checked, saturating, etc.
* `sum`, `min` and `max`
Expand All @@ -17,7 +23,13 @@ An overview of the implemented functionality.
* `sort`, `hash`, `merge-sort`
* `if-then-else`
* `nullif`
* `lenght` (of string)
* `hour`, `year` (of temporal logical types)
* `length` (of string)
* `hour`, `year`, `month`, `iso_week` (of temporal logical types)
* `regex`
* (list) `contains`

and an example of how to use them:

```rust
{{#include ../../examples/arithmetics.rs}}
```
7 changes: 6 additions & 1 deletion guide/src/extension.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,12 @@
# Extension types

This crate supports Arrows' ["extension type"](https://arrow.apache.org/docs/format/Columnar.html#extension-types), to declare, use, and share custom logical types.
The follow example shows how to declare one:

An extension type is just a `DataType` with a name and some metadata.
In particular, its physical representation is equal to its inner `DataType`, which implies
that all functionality in this crate works as if it was the inner `DataType`.

The following example shows how to declare one:

```rust
{{#include ../../examples/extension.rs}}
Expand Down
4 changes: 2 additions & 2 deletions guide/src/ffi.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,8 @@ has a specification, which allows languages to share data
structures via foreign interfaces at zero cost (i.e. via pointers).
This is known as the [C Data interface](https://arrow.apache.org/docs/format/CDataInterface.html).

This crate supports importing from and exporting to all `DataType`s. Follow the
example below to learn how to import and export:
This crate supports importing from and exporting to all its physical types. The
example below demonstrates how to use the API:

```rust
{{#include ../../examples/ffi.rs}}
Expand Down
57 changes: 39 additions & 18 deletions guide/src/high_level.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,12 @@
# High-level API

The simplest way to think about an arrow `Array` is that it represents
`Vec<Option<T>>` and has a logical type (see [metadata](../metadata.md))) associated with it.
Arrow core trait the `Array`, which you can think of as representing `Arc<Vec<Option<T>>`
with associated metadata (see [metadata](../metadata.md))).
Contrarily to `Arc<Vec<Option<T>>`, arrays in this crate are represented in such a way
that they can be zero-copied to any other Arrow implementation via foreign interfaces (FFI).

Probably the simplest array in this crate is `PrimitiveArray<T>`. It can be constructed
from a slice as follows:
Probably the simplest `Array` in this crate is the `PrimitiveArray<T>`. It can be
constructed as from a slice of option values,

```rust
# use arrow2::array::{Array, PrimitiveArray};
Expand Down Expand Up @@ -72,7 +74,7 @@ let array = PrimitiveArray::<i32>::from_slice([1, 0, 123]);
assert_eq!(array.data_type(), &DataType::Int32);
# }
```
they can be cheaply converted to via `.to(DataType)`.
they can be cheaply (`O(1)`) converted to via `.to(DataType)`.

The following arrays are supported:

Expand All @@ -88,11 +90,10 @@ The following arrays are supported:
* `UnionArray` (every row has a different logical type)
* `DictionaryArray<K>` (nested array with encoded values)

## Dynamic Array
## Array as a trait object

There is a more powerful aspect of arrow arrays, and that is that they all
implement the trait `Array` and can be cast to `&dyn Array`, i.e. they can be turned into
a trait object. This enables arrays to have types that are dynamic in nature.
`Array` is object safe, and all implementations of `Array` and can be casted
to `&dyn Array`, which enables run-time nesting.

```rust
# use arrow2::array::{Array, PrimitiveArray};
Expand All @@ -106,7 +107,7 @@ let a: &dyn Array = &a;

Given a trait object `array: &dyn Array`, we know its physical type via
`PhysicalType: array.data_type().to_physical_type()`, which we use to downcast the array
to its concrete type:
to its concrete physical type:

```rust
# use arrow2::array::{Array, PrimitiveArray};
Expand Down Expand Up @@ -135,6 +136,7 @@ an each implementation of `Array` (a struct):
| `FixedSizeList` | `FixedSizeListArray` |
| `Struct` | `StructArray` |
| `Union` | `UnionArray` |
| `Map` | `MapArray` |
| `Dictionary(_)` | `DictionaryArray<_>` |

where `_` represents each of the variants (e.g. `PrimitiveType::Int32 <-> i32`).
Expand Down Expand Up @@ -174,13 +176,18 @@ and how to make them even more efficient.
This crate's APIs are generally split into two patterns: whether an operation leverages
contiguous memory regions or whether it does not.

If yes, then use:
What this means is that certain operations can be performed irrespectively of whether a value
is "null" or not (e.g. `PrimitiveArray<i32> + i32` can be applied to _all_ values via SIMD and
only copy the validity bitmap independently).

When an operation benefits from such arrangement, it is advantageous to use

* `Buffer::from_iter`
* `Buffer::from_trusted_len_iter`
* `Buffer::try_from_trusted_len_iter`

If not, then use the builder API, such as `MutablePrimitiveArray<T>`, `MutableUtf8Array<O>` or `MutableListArray`.
If not, then use the `MutableArray` API, such as
`MutablePrimitiveArray<T>`, `MutableUtf8Array<O>` or `MutableListArray`.

We have seen examples where the latter API was used. In the last example of this page
you will be introduced to an example of using the former for SIMD.
Expand Down Expand Up @@ -209,14 +216,23 @@ Like `FromIterator`, this crate contains two sets of APIs to iterate over data.
an array `array: &PrimitiveArray<T>`, the following applies:

1. If you need to iterate over `Option<&T>`, use `array.iter()`
2. If you can operate over the values and validity independently, use `array.values() -> &Buffer<T>` and `array.validity() -> Option<&Bitmap>`
2. If you can operate over the values and validity independently,
use `array.values() -> &Buffer<T>` and `array.validity() -> Option<&Bitmap>`

Note that case 1 is useful when e.g. you want to perform an operation that depends on both validity and values, while the latter is suitable for SIMD and copies, as they return contiguous memory regions (buffers and bitmaps). We will see below how to leverage these APIs.
Note that case 1 is useful when e.g. you want to perform an operation that depends on both
validity and values, while the latter is suitable for SIMD and copies, as they return
contiguous memory regions (buffers and bitmaps). We will see below how to leverage these APIs.

This idea holds more generally in this crate's arrays: `values()` returns something that has a contiguous in-memory representation, while `iter()` returns items taking validity into account. To get an iterator over contiguous values, use `array.values().iter()`.
This idea holds more generally in this crate's arrays: `values()` returns something that has
a contiguous in-memory representation, while `iter()` returns items taking validity into account.
To get an iterator over contiguous values, use `array.values().iter()`.

There is one last API that is worth mentioning, and that is `Bitmap::chunks`. When performing
bitwise operations, it is often more performant to operate on chunks of bits instead of single bits. `chunks` offers a `TrustedLen` of `u64` with the bits + an extra `u64` remainder. We expose two functions, `unary(Bitmap, Fn) -> Bitmap` and `binary(Bitmap, Bitmap, Fn) -> Bitmap` that use this API to efficiently perform bitmap operations.
bitwise operations, it is often more performant to operate on chunks of bits
instead of single bits. `chunks` offers a `TrustedLen` of `u64` with the bits
+ an extra `u64` remainder. We expose two functions, `unary(Bitmap, Fn) -> Bitmap`
and `binary(Bitmap, Bitmap, Fn) -> Bitmap` that use this API to efficiently
perform bitmap operations.

## Vectorized operations

Expand All @@ -238,18 +254,23 @@ where
O: NativeType,
F: Fn(I) -> O,
{
// create the iterator over _all_ values
let values = array.values().iter().map(|v| op(*v));
let values = Buffer::from_trusted_len_iter(values);

// create the new array, cloning its validity
PrimitiveArray::<O>::from_data(data_type.clone(), values, array.validity().cloned())
}
```

Some notes:

1. We used `array.values()`, as described above: this operation leverages a contiguous memory region.
1. We used `array.values()`, as described above: this operation leverages a
contiguous memory region.

2. We leveraged normal rust iterators for the operation.

3. We used `op` on the array's values irrespectively of their validity,
and cloned its validity. This approach is suitable for operations whose branching off is more expensive than operating over all values. If the operation is expensive, then using `PrimitiveArray::<O>::from_trusted_len_iter` is likely faster.
and cloned its validity. This approach is suitable for operations whose branching off
is more expensive than operating over all values. If the operation is expensive,
then using `PrimitiveArray::<O>::from_trusted_len_iter` is likely faster.
19 changes: 14 additions & 5 deletions guide/src/low_level.md
Original file line number Diff line number Diff line change
@@ -1,17 +1,24 @@
# Low-level API

The starting point of this crate is the idea that data must be stored in memory in a specific arrangement to be interoperable with Arrow's ecosystem. With this in mind, this crate does not use `Vec` but instead has its own containers to store data, including sharing and consuming it via FFI.
The starting point of this crate is the idea that data is stored in memory in a specific arrangement to be interoperable with Arrow's ecosystem.

The most important design decision of this crate is that contiguous regions are shared via an `Arc`. In this context, the operation of slicing a memory region is `O(1)` because it corresponds to changing an offset and length. The tradeoff is that once under an `Arc`, memory regions are immutable.
The most important design aspect of this crate is that contiguous regions are shared via an
`Arc`. In this context, the operation of slicing a memory region is `O(1)` because it
corresponds to changing an offset and length. The tradeoff is that once under
an `Arc`, memory regions are immutable.

The second important aspect is that Arrow has two main types of data buffers: bitmaps, whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are measured in bytes. With this in mind, this crate has 2 main types of containers of contiguous memory regions:
The second most important aspect is that Arrow has two main types of data buffers: bitmaps,
whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are
measured in bytes. With this in mind, this crate has 2 main types of containers of
contiguous memory regions:

* `Buffer<T>`: handle contiguous memory regions of type T whose offsets are measured in items
* `Bitmap`: handle contiguous memory regions of bits whose offsets are measured in bits

These hold _all_ data-related memory in this crate.

Due to their intrinsic immutability, each container has a corresponding mutable (and non-shareable) variant:
Due to their intrinsic immutability, each container has a corresponding mutable
(and non-shareable) variant:

* `MutableBuffer<T>`
* `MutableBitmap`
Expand Down Expand Up @@ -44,7 +51,8 @@ assert_eq!(x.as_slice(), &[0, 5, 2, 10])
```

The following demonstrates how to efficiently
perform an operation from an iterator of [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html):
perform an operation from an iterator of
[TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html):

```rust
# use arrow2::buffer::MutableBuffer;
Expand All @@ -65,6 +73,7 @@ the following physical types:
* `u8-u64`
* `f32` and `f64`
* `arrow2::types::days_ms`
* `arrow2::types::months_days_ns`

This is because the arrow specification only supports the above Rust types; all other complex
types supported by arrow are built on top of these types, which enables Arrow to be a highly
Expand Down
9 changes: 4 additions & 5 deletions guide/src/metadata.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ In Arrow2, logical types are declared as variants of the `enum` `arrow2::datatyp
For example, `DataType::Int32` represents a signed integer of 32 bits.

Each `DataType` has an associated `enum PhysicalType` (many-to-one) representing the
particular in-memory representation, and is associated to specific semantics.
particular in-memory representation, and is associated to a specific semantics.
For example, both `DataType::Date32` and `DataType::Int32` have the same `PhysicalType`
(`PhysicalType::Primitive(PrimitiveType::Int32)`) but `Date32` represents the number of
days since UNIX epoch.
Expand All @@ -23,10 +23,9 @@ Logical types are metadata: they annotate physical types with extra information
## `Field` (column metadata)

Besides logical types, the arrow format supports other relevant metadata to the format.
All this information is stored in `arrow2::datatypes::Field`.

A `Field` is arrow's metadata associated to a column in the context of a columnar format.
It has a name, a logical type `DataType`, whether the column is nullable, etc.
An important one is `Field` broadly corresponding to a column in traditional columnar formats.
A `Field` is composed by a name (`String`), a logical type (`DataType`), whether it is
nullable (`bool`), and optional metadata.

## `Schema` (table metadata)

Expand Down
2 changes: 1 addition & 1 deletion src/compute/merge_sort/mod.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
//! This module exposes functions to perform merge-sorts.
//! Functions to perform merge-sorts.
//!
//! The goal of merge-sort is to merge two sorted arrays, `[a0, a1]`, `merge_sort(a0, a1)`,
//! so that the resulting array is sorted, i.e. the following invariant upholds:
Expand Down
2 changes: 1 addition & 1 deletion src/compute/partition.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@
// specific language governing permissions and limitations
// under the License.

//! Defines partition kernel for `ArrayRef`
//! Defines partition kernel for [`crate::array::Array`]

use crate::compute::sort::{build_compare, Compare, SortColumn};
use crate::error::{ArrowError, Result};
Expand Down
24 changes: 23 additions & 1 deletion src/doc/lib.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
Welcome to arrow2's documentation. Thanks for checking it out!

This is a library for efficient in-memory data operations using
This is a library for efficient in-memory data operations with
[Arrow in-memory format](https://arrow.apache.org/docs/format/Columnar.html).
It is a re-write from the bottom up of the official `arrow` crate with soundness
and type safety in mind.
Expand Down Expand Up @@ -66,3 +66,25 @@ fn main() -> Result<()> {
Ok(())
}
```

## Cargo features

This crate has a significant number of cargo features to reduce compilation
time and number of dependencies. The feature `"full"` activates most
functionality, such as:

* `io_ipc`: to interact with the Arrow IPC format
* `io_ipc_compression`: to read and write compressed Arrow IPC (v2)
* `io_csv` to read and write CSV
* `io_json` to read and write JSON
* `io_parquet` to read and write parquet
* `io_parquet_compression` to read and write compressed parquet
* `io_print` to write batches to formatted ASCII tables
* `compute` to operate on arrays (addition, sum, sort, etc.)

The feature `simd` (not part of `full`) produces more explicit SIMD instructions
via [`packed_simd`](https://github.com/rust-lang/packed_simd), but requires the
nightly channel.

The feature `cache_aligned` uses a custom allocator instead of `Vec`, which may be
more performant but is not interoperable with `Vec`.

0 comments on commit a5bbffe

Please sign in to comment.