More improvements.

jorgecarleitao · Oct 2, 2021 · a5bbffe · a5bbffe
1 parent 2be5d34
commit a5bbffe
Show file tree

Hide file tree

Showing 11 changed files with 110 additions and 61 deletions.
diff --git a/guide/src/README.md b/guide/src/README.md
@@ -10,22 +10,3 @@ Arrow2 is divided into three main parts:
 * a [low-level API](./low_level.md) to efficiently operate with contiguous memory regions;
 * a [high-level API](./high_level.md) to operate with arrow arrays;
 * a [metadata API](./metadata.md) to declare and operate with logical types and metadata.
-
-## Cargo features
-
-This crate has a significant number of cargo features to reduce compilation
-time and number of dependencies. The feature `"full"` activates most
-functionality, such as:
-
-* `io_ipc`: to interact with the IPC format
-* `io_ipc_compression`: to read and write compressed IPC (v2)
-* `io_csv` to read and write CSV
-* `io_json` to read and write JSON
-* `io_parquet` to read and write parquet
-* `io_parquet_compression` to read and write compressed parquet
-* `io_print` to write batches to formatted ASCII tables
-* `compute` to operate on arrays (addition, sum, sort, etc.)
-
-The feature `simd` (not part of `full`) produces more explicit SIMD instructions
-via [`packed_simd`](https://github.com/rust-lang/packed_simd), but requires the 
-nightly channel.
diff --git a/guide/src/arrow.md b/guide/src/arrow.md
@@ -1,6 +1,6 @@
 # Introduction
 
-Welcome to the Arrow2 guide for the Rust programming language.  This guide was
+Welcome to the Arrow2 guide for the Rust programming language. This guide was
 created to help you become familiar with the Arrow2 crate and its
 functionalities.
 

diff --git a/guide/src/compute.md b/guide/src/compute.md
@@ -1,12 +1,18 @@
 # Compute API
 
-When compiled with the feature `compute`, this crate offers a wide range of functions to perform both vertical (e.g. add two arrays) and horizontal (compute the sum of an array) operations.
+When compiled with the feature `compute`, this crate offers a wide range of functions
+to perform both vertical (e.g. add two arrays) and horizontal
+(compute the sum of an array) operations.
 
-```rust
-{{#include ../../examples/arithmetics.rs}}
-```
+The overall design of the `compute` module is that it offers two APIs:
+
+* statically typed, such as `sum_primitive<T>(&PrimitiveArray<T>) -> Option<T>`
+* dynamically typed, such as `sum(&dyn Array) -> Box<dyn Scalar>`
+
+the dynamically typed API usually has a function `can_*(&DataType) -> bool` denoting whether
+the operation is defined for the particular logical type.
 
-An overview of the implemented functionality.
+Overview of the implemented functionality:
 
 * arithmetics, checked, saturating, etc.
 * `sum`, `min` and `max`
@@ -17,7 +23,13 @@ An overview of the implemented functionality.
 * `sort`, `hash`, `merge-sort`
 * `if-then-else`
 * `nullif`
-* `lenght` (of string)
-* `hour`, `year` (of temporal logical types)
+* `length` (of string)
+* `hour`, `year`, `month`, `iso_week` (of temporal logical types)
 * `regex`
 * (list) `contains`
+
+and an example of how to use them:
+
+```rust
+{{#include ../../examples/arithmetics.rs}}
+```
diff --git a/guide/src/extension.md b/guide/src/extension.md
@@ -1,7 +1,12 @@
 # Extension types
 
 This crate supports Arrows' ["extension type"](https://arrow.apache.org/docs/format/Columnar.html#extension-types), to declare, use, and share custom logical types.
-The follow example shows how to declare one:
+
+An extension type is just a `DataType` with a name and some metadata.
+In particular, its physical representation is equal to its inner `DataType`, which implies
+that all functionality in this crate works as if it was the inner `DataType`.
+
+The following example shows how to declare one:
 
 ```rust
 {{#include ../../examples/extension.rs}}

diff --git a/guide/src/ffi.md b/guide/src/ffi.md
@@ -5,8 +5,8 @@ has a specification, which allows languages to share data
 structures via foreign interfaces at zero cost (i.e. via pointers).
 This is known as the [C Data interface](https://arrow.apache.org/docs/format/CDataInterface.html).
 
-This crate supports importing from and exporting to all `DataType`s. Follow the
-example below to learn how to import and export:
+This crate supports importing from and exporting to all its physical types. The
+example below demonstrates how to use the API:
 
 ```rust
 {{#include ../../examples/ffi.rs}}

diff --git a/guide/src/high_level.md b/guide/src/high_level.md
@@ -1,10 +1,12 @@
 # High-level API
 
-The simplest way to think about an arrow `Array` is that it represents
-`Vec<Option<T>>` and has a logical type (see [metadata](../metadata.md))) associated with it.
+Arrow core trait the `Array`, which you can think of as representing `Arc<Vec<Option<T>>`
+with associated metadata (see [metadata](../metadata.md))).
+Contrarily to `Arc<Vec<Option<T>>`, arrays in this crate are represented in such a way
+that they can be zero-copied to any other Arrow implementation via foreign interfaces (FFI).
 
-Probably the simplest array in this crate is `PrimitiveArray<T>`. It can be constructed
-from a slice as follows:
+Probably the simplest `Array` in this crate is the `PrimitiveArray<T>`. It can be
+constructed as from a slice of option values,
 
 ```rust
 # use arrow2::array::{Array, PrimitiveArray};
@@ -72,7 +74,7 @@ let array = PrimitiveArray::<i32>::from_slice([1, 0, 123]);
 assert_eq!(array.data_type(), &DataType::Int32);
 # }
 ```
-they can be cheaply converted to via `.to(DataType)`.
+they can be cheaply (`O(1)`) converted to via `.to(DataType)`.
 
 The following arrays are supported:
 
@@ -88,11 +90,10 @@ The following arrays are supported:
 * `UnionArray` (every row has a different logical type)
 * `DictionaryArray<K>` (nested array with encoded values)
 
-## Dynamic Array
+## Array as a trait object
 
-There is a more powerful aspect of arrow arrays, and that is that they all
-implement the trait `Array` and can be cast to `&dyn Array`, i.e. they can be turned into
-a trait object. This enables arrays to have types that are dynamic in nature.
+`Array` is object safe, and all implementations of `Array` and can be casted
+to `&dyn Array`, which enables run-time nesting.
 
 ```rust
 # use arrow2::array::{Array, PrimitiveArray};
@@ -106,7 +107,7 @@ let a: &dyn Array = &a;
 
 Given a trait object `array: &dyn Array`, we know its physical type via
 `PhysicalType: array.data_type().to_physical_type()`, which we use to downcast the array
-to its concrete type:
+to its concrete physical type:
 
 ```rust
 # use arrow2::array::{Array, PrimitiveArray};
@@ -135,6 +136,7 @@ an each implementation of `Array` (a struct):
 | `FixedSizeList`   | `FixedSizeListArray`   |
 | `Struct`          | `StructArray`          |
 | `Union`           | `UnionArray`           |
+| `Map`             | `MapArray`             |
 | `Dictionary(_)`   | `DictionaryArray<_>`   |
 
 where `_` represents each of the variants (e.g. `PrimitiveType::Int32 <-> i32`).
@@ -174,13 +176,18 @@ and how to make them even more efficient.
 This crate's APIs are generally split into two patterns: whether an operation leverages
 contiguous memory regions or whether it does not.
 
-If yes, then use:
+What this means is that certain operations can be performed irrespectively of whether a value
+is "null" or not (e.g. `PrimitiveArray<i32> + i32` can be applied to _all_ values via SIMD and 
+only copy the validity bitmap independently).
+
+When an operation benefits from such arrangement, it is advantageous to use
 
 * `Buffer::from_iter`
 * `Buffer::from_trusted_len_iter`
 * `Buffer::try_from_trusted_len_iter`
 
-If not, then use the builder API, such as `MutablePrimitiveArray<T>`, `MutableUtf8Array<O>` or `MutableListArray`.
+If not, then use the `MutableArray` API, such as
+`MutablePrimitiveArray<T>`, `MutableUtf8Array<O>` or `MutableListArray`.
 
 We have seen examples where the latter API was used. In the last example of this page
 you will be introduced to an example of using the former for SIMD.
@@ -209,14 +216,23 @@ Like `FromIterator`, this crate contains two sets of APIs to iterate over data.
 an array `array: &PrimitiveArray<T>`, the following applies:
 
 1. If you need to iterate over `Option<&T>`, use `array.iter()`
-2. If you can operate over the values and validity independently, use `array.values() -> &Buffer<T>` and `array.validity() -> Option<&Bitmap>`
+2. If you can operate over the values and validity independently,
+   use `array.values() -> &Buffer<T>` and `array.validity() -> Option<&Bitmap>`
 
-Note that case 1 is useful when e.g. you want to perform an operation that depends on both validity and values, while the latter is suitable for SIMD and copies, as they return contiguous memory regions (buffers and bitmaps). We will see below how to leverage these APIs.
+Note that case 1 is useful when e.g. you want to perform an operation that depends on both
+validity and values, while the latter is suitable for SIMD and copies, as they return
+contiguous memory regions (buffers and bitmaps). We will see below how to leverage these APIs.
 
-This idea holds more generally in this crate's arrays: `values()` returns something that has a contiguous in-memory representation, while `iter()` returns items taking validity into account. To get an iterator over contiguous values, use `array.values().iter()`.
+This idea holds more generally in this crate's arrays: `values()` returns something that has
+a contiguous in-memory representation, while `iter()` returns items taking validity into account. 
+To get an iterator over contiguous values, use `array.values().iter()`.
 
 There is one last API that is worth mentioning, and that is `Bitmap::chunks`. When performing
-bitwise operations, it is often more performant to operate on chunks of bits instead of single bits. `chunks` offers a `TrustedLen` of `u64` with the bits + an extra `u64` remainder. We expose two functions, `unary(Bitmap, Fn) -> Bitmap` and `binary(Bitmap, Bitmap, Fn) -> Bitmap` that use this API to efficiently perform bitmap operations.
+bitwise operations, it is often more performant to operate on chunks of bits
+instead of single bits. `chunks` offers a `TrustedLen` of `u64` with the bits
++ an extra `u64` remainder. We expose two functions, `unary(Bitmap, Fn) -> Bitmap`
+and `binary(Bitmap, Bitmap, Fn) -> Bitmap` that use this API to efficiently
+perform bitmap operations.
 
 ## Vectorized operations
 
@@ -238,18 +254,23 @@ where
     O: NativeType,
     F: Fn(I) -> O,
 {
+    // create the iterator over _all_ values
     let values = array.values().iter().map(|v| op(*v));
     let values = Buffer::from_trusted_len_iter(values);
 
+    // create the new array, cloning its validity
     PrimitiveArray::<O>::from_data(data_type.clone(), values, array.validity().cloned())
 }
 ```
 
 Some notes:
 
-1. We used `array.values()`, as described above: this operation leverages a contiguous memory region.
+1. We used `array.values()`, as described above: this operation leverages a
+   contiguous memory region.
 
 2. We leveraged normal rust iterators for the operation.
 
 3. We used `op` on the array's values irrespectively of their validity,
-and cloned its validity. This approach is suitable for operations whose branching off is more expensive than operating over all values. If the operation is expensive, then using `PrimitiveArray::<O>::from_trusted_len_iter` is likely faster.
+   and cloned its validity. This approach is suitable for operations whose branching off
+   is more expensive than operating over all values. If the operation is expensive,
+   then using `PrimitiveArray::<O>::from_trusted_len_iter` is likely faster.
diff --git a/guide/src/low_level.md b/guide/src/low_level.md
@@ -1,17 +1,24 @@
 # Low-level API
 
-The starting point of this crate is the idea that data must be stored in memory in a specific arrangement to be interoperable with Arrow's ecosystem. With this in mind, this crate does not use `Vec` but instead has its own containers to store data, including sharing and consuming it via FFI.
+The starting point of this crate is the idea that data is stored in memory in a specific arrangement to be interoperable with Arrow's ecosystem.
 
-The most important design decision of this crate is that contiguous regions are shared via an `Arc`. In this context, the operation of slicing a memory region is `O(1)` because it corresponds to changing an offset and length. The tradeoff is that once under an `Arc`, memory regions are immutable.
+The most important design aspect of this crate is that contiguous regions are shared via an
+`Arc`. In this context, the operation of slicing a memory region is `O(1)` because it
+corresponds to changing an offset and length. The tradeoff is that once under
+an `Arc`, memory regions are immutable.
 
-The second important aspect is that Arrow has two main types of data buffers: bitmaps, whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are measured in bytes. With this in mind, this crate has 2 main types of containers of contiguous memory regions:
+The second most important aspect is that Arrow has two main types of data buffers: bitmaps,
+whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are
+measured in bytes. With this in mind, this crate has 2 main types of containers of
+contiguous memory regions:
 
 * `Buffer<T>`: handle contiguous memory regions of type T whose offsets are measured in items
 * `Bitmap`: handle contiguous memory regions of bits whose offsets are measured in bits
 
 These hold _all_ data-related memory in this crate.
 
-Due to their intrinsic immutability, each container has a corresponding mutable (and non-shareable) variant:
+Due to their intrinsic immutability, each container has a corresponding mutable
+(and non-shareable) variant:
 
 * `MutableBuffer<T>`
 * `MutableBitmap`
@@ -44,7 +51,8 @@ assert_eq!(x.as_slice(), &[0, 5, 2, 10])
 ```
 
 The following demonstrates how to efficiently
-perform an operation from an iterator of [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html):
+perform an operation from an iterator of
+[TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html):
 
 ```rust
 # use arrow2::buffer::MutableBuffer;
@@ -65,6 +73,7 @@ the following physical types:
 * `u8-u64`
 * `f32` and `f64`
 * `arrow2::types::days_ms`
+* `arrow2::types::months_days_ns`
 
 This is because the arrow specification only supports the above Rust types; all other complex
 types supported by arrow are built on top of these types, which enables Arrow to be a highly

diff --git a/guide/src/metadata.md b/guide/src/metadata.md
@@ -13,7 +13,7 @@ In Arrow2, logical types are declared as variants of the `enum` `arrow2::datatyp
 For example, `DataType::Int32` represents a signed integer of 32 bits.
 
 Each `DataType` has an associated `enum PhysicalType` (many-to-one) representing the
-particular in-memory representation, and is associated to specific semantics.
+particular in-memory representation, and is associated to a specific semantics.
 For example, both `DataType::Date32` and `DataType::Int32` have the same `PhysicalType`
 (`PhysicalType::Primitive(PrimitiveType::Int32)`) but `Date32` represents the number of
 days since UNIX epoch.
@@ -23,10 +23,9 @@ Logical types are metadata: they annotate physical types with extra information
 ## `Field` (column metadata)
 
 Besides logical types, the arrow format supports other relevant metadata to the format.
-All this information is stored in `arrow2::datatypes::Field`.
-
-A `Field` is arrow's metadata associated to a column in the context of a columnar format. 
-It has a name, a logical type `DataType`, whether the column is nullable, etc.
+An important one is `Field` broadly corresponding to a column in traditional columnar formats.
+A `Field` is composed by a name (`String`), a logical type (`DataType`), whether it is
+nullable (`bool`), and optional metadata.
 
 ## `Schema` (table metadata)
 

diff --git a/src/compute/merge_sort/mod.rs b/src/compute/merge_sort/mod.rs
@@ -1,4 +1,4 @@
-//! This module exposes functions to perform merge-sorts.
+//! Functions to perform merge-sorts.
 //!
 //! The goal of merge-sort is to merge two sorted arrays, `[a0, a1]`, `merge_sort(a0, a1)`,
 //! so that the resulting array is sorted, i.e. the following invariant upholds:

diff --git a/src/compute/partition.rs b/src/compute/partition.rs
@@ -15,7 +15,7 @@
 // specific language governing permissions and limitations
 // under the License.
 
-//! Defines partition kernel for `ArrayRef`
+//! Defines partition kernel for [`crate::array::Array`]
 
 use crate::compute::sort::{build_compare, Compare, SortColumn};
 use crate::error::{ArrowError, Result};

diff --git a/src/doc/lib.md b/src/doc/lib.md
@@ -1,6 +1,6 @@
 Welcome to arrow2's documentation. Thanks for checking it out!
 
-This is a library for efficient in-memory data operations using
+This is a library for efficient in-memory data operations with
 [Arrow in-memory format](https://arrow.apache.org/docs/format/Columnar.html).
 It is a re-write from the bottom up of the official `arrow` crate with soundness
 and type safety in mind.
@@ -66,3 +66,25 @@ fn main() -> Result<()> {
     Ok(())
 }
 ```
+
+## Cargo features
+
+This crate has a significant number of cargo features to reduce compilation
+time and number of dependencies. The feature `"full"` activates most
+functionality, such as:
+
+* `io_ipc`: to interact with the Arrow IPC format
+* `io_ipc_compression`: to read and write compressed Arrow IPC (v2)
+* `io_csv` to read and write CSV
+* `io_json` to read and write JSON
+* `io_parquet` to read and write parquet
+* `io_parquet_compression` to read and write compressed parquet
+* `io_print` to write batches to formatted ASCII tables
+* `compute` to operate on arrays (addition, sum, sort, etc.)
+
+The feature `simd` (not part of `full`) produces more explicit SIMD instructions
+via [`packed_simd`](https://github.com/rust-lang/packed_simd), but requires the 
+nightly channel.
+
+The feature `cache_aligned` uses a custom allocator instead of `Vec`, which may be
+more performant but is not interoperable with `Vec`.