[![](https://img.shields.io/crates/dv/arrow2.svg)](https://crates.io/crates/arrow2)
[![](https://docs.rs/arrow2/badge.svg)](https://docs.rs/arrow2/)

A Rust crate to work with [Apache Arrow](https://arrow.apache.org/).
The most feature-complete implementation of the Arrow format after the C++
implementation.

Check out [the guide](https://jorgecarleitao.github.io/arrow2/) for a general introduction
to using this crate, and the
[API docs](https://jorgecarleitao.github.io/arrow2/docs/arrow2/index.html) for detailed
documentation of each of its APIs.

## Features

* Most feature-complete implementation of Apache Arrow after the reference implementation (C++)
  * Float16 unsupported (not a Rust native type)
  * Decimal256 unsupported (not a Rust native type)
* FFI supported for all Arrow types
* Full interoperability with Rust's `Vec` (see the sketch after this list)
* `MutableArray` API to work with bitmaps and arrays in-place
* Full support for timestamps with timezones, including arithmetic that takes
  timezones into account
* Support for reading from, and writing to:
  * CSV
  * Apache Arrow IPC (all types)
  * Apache Arrow Flight (all types)
  * Apache Parquet (except deeply nested types)
  * Apache Avro (not all types yet)
  * NDJSON
* Extensive suite of compute operations
  * aggregations
  * arithmetic
  * cast
  * comparison
  * sort and merge-sort
  * boolean (AND, OR, etc.) and Kleene boolean logic
  * filter, take
  * hash
  * if-then-else
  * nullif
  * temporal (day, month, weekday, hour, etc.)
  * window
  * ... and more ...
* Extensive set of cargo feature flags to reduce compilation time and binary size
* Fully decoupled IO between CPU-bound and IO-bound tasks, allowing
  this crate to be used in `async` contexts without blocking and to leverage parallelism
* Fastest known implementation of Avro and Parquet (e.g. faster than the official
C++ implementations)
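
As a taste of the APIs above, here is a minimal sketch of constructing arrays, covering the `Vec` interoperability and the mutable API mentioned in the list. Names assume a recent `arrow2` (around the `0.9` series); constructors vary slightly between versions:

```rust
use arrow2::array::{Array, Int32Array, MutablePrimitiveArray};

fn main() {
    // Zero-copy construction from a `Vec` (no validity bitmap, i.e. all values valid).
    let array = Int32Array::from_vec(vec![1, 2, 3]);
    assert_eq!(array.values().as_slice(), &[1, 2, 3]);

    // Nullable values mirror `Vec<Option<T>>`.
    let nullable = Int32Array::from(&[Some(1), None, Some(3)]);
    assert_eq!(nullable.null_count(), 1);

    // Build in place with the mutable API, then freeze into an immutable array.
    let mut mutable = MutablePrimitiveArray::<i32>::new();
    mutable.push(Some(4));
    mutable.push(None);
    let frozen: Int32Array = mutable.into();
    assert_eq!(frozen.len(), 2);
}
```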

## Safety and Security

This crate uses `unsafe` when strictly necessary:
* when the compiler cannot prove certain invariants, and
* for FFI.

We have extensive tests over these, all of which run and pass under MIRI.
Most uses of `unsafe` fall into three categories:
* the Arrow format has invariants over utf8 that cannot be expressed in safe Rust
* `TrustedLen` and trait specialization are still nightly features (see the sketch after this list)
* FFI
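
To illustrate the `TrustedLen` point, here is a hedged sketch of the two constructor flavors, assuming `arrow2`'s `from_trusted_len_iter_unchecked`-style constructors around the `0.9` series:

```rust
use arrow2::array::Utf8Array;

fn main() {
    let iter = (0..3).map(|i| Some(i.to_string()));

    // Safe constructor: lengths are verified at runtime.
    let checked = Utf8Array::<i32>::from_iter(iter.clone());

    // Unsafe constructor: the caller vouches that the iterator reports its exact
    // length (the `TrustedLen` contract), so runtime checks are skipped.
    let unchecked = unsafe { Utf8Array::<i32>::from_trusted_len_iter_unchecked(iter) };

    assert_eq!(checked, unchecked);
}
```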

We actively monitor for vulnerabilities in Rust's advisory and either patch or mitigate
them (see e.g. `.cargo/audit.yaml` and `.github/workflows/security.yaml`).

Reading parquet and IPC currently `panic!`s when receiving invalid data. We are
actively addressing this.

## Integration tests

Our tests include roundtrips against:
* Apache Arrow IPC (both little and big endian) generated by the C++, Java, Go, C# and JS
  implementations
* Apache Parquet format (in its different configurations) generated by Arrow's C++ and
  Spark's implementations
* Apache Avro generated by the official Rust Avro implementation

Check [DEVELOPMENT.md](DEVELOPMENT.md) for our development practices.

## Features in `arrow2` not in the `arrow` crate

### Safety and Security

* safe by design (i.e. no transmutes, runtime type checking, or pointer casts)
* uses Rust's compiler whenever possible to prove that memory reads are sound
* all tests pass MIRI checks (MIRI cannot open files at the moment, so IO is not checked)

### Arrow Format

* IPC passes all integration tests and supports both big and little endian
* `MutableArray` API to work with arrays in-place.
* faster IPC reader (different design that avoids an extra copy of all data)
* supports reading and writing IPC 2.0 (also known as Feather, or Arrow with compression)
* supports extension types
* Passes all FFI integration tests against pyarrow / C++
* Passes all IPC integration tests against other implementations except two tests

### Parquet

* 5-10x faster at reading parquet (single core) and deserialization is parallelizable
* 3-10x faster at writing parquet (single core) and serialization is parallelizable
* parquet IO has no `unsafe`
* support for `async` read and write via `futures`.

### Others

* Compatibility with Rust's `Vec`
* Support to read `avro` format
* Support for timestamps with timezones.
* More predictable JSON reader
* Generalized parsing of CSV based on logical data types

## Features in `arrow` not yet available in this crate

* Parquet read and write of struct and nested lists.

## Features in this crate not in pyarrow

* Read and write of delta-encoded utf8 to and from parquet
* parquet roundtrip of all supported arrow types.
* Read from `avro`
* Async writer of the Arrow stream format
* Async read and write of parquet

## Features in pyarrow not in this crate

* Parquet read and write of struct and nested lists.

## Versioning

We use SemVer 2.0, as used by Cargo and the rest of the Rust ecosystem, …

Design documents about each of the parts of this repo are available on their respective …

## FAQ

### Why rewrite `arrow`?

This crate started as a re-write of the official `arrow` crate.

The `arrow` crate uses `Buffer`, a generic struct to store contiguous memory regions (of bytes).
This construct is used to store data from all arrays in the rust implementation.
The simplest example is a buffer containing `1i32`, which is represented as
`&[0u8, 0u8, 0u8, 1u8]` or `&[1u8, 0u8, 0u8, 0u8]` depending on endianness.
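
For reference, plain Rust makes both layouts easy to inspect:

```rust
fn main() {
    // The same `1i32`, as raw bytes, in each byte order.
    assert_eq!(1i32.to_be_bytes(), [0u8, 0, 0, 1]); // big-endian
    assert_eq!(1i32.to_le_bytes(), [1u8, 0, 0, 0]); // little-endian
}
```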

When a user wishes to read from a buffer, e.g. to perform a mathematical operation with
its values, they need to interpret the buffer as the target type. Because `Buffer` is
a contiguous region of bytes with no type information, users must transmute its data
into the respective type.

Arrow currently transmutes buffers on almost all operations, and very often does not
verify type alignment or correct length when transmuting them to a slice
of type `&[T]`.

Just as an example, in v5.0.0 the following code compiles, does not panic, is
unsound, and results in undefined behavior:

```rust
use std::sync::Arc;
use arrow::array::{ArrayData, Float64Array};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

let buffer = Buffer::from_slice_ref(&[0i32, 2i32]);
let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
let array = Float64Array::from(Arc::new(data));

println!("{:?}", array.value(1)); // undefined behavior
```

Note how this initializes a buffer with bytes from `i32`, initializes an `ArrayData` with dynamic type `Int64`, and then an array `Float64Array` from `Arc<ArrayData>`. `Float64Array`'s internals will essentially consume the pointer from the buffer, re-interpret it as `f64`, and offset it by `1`.

Still within this example, if we were to use `ArrayData`'s datatype, `Int64`, to transmute the buffer, we would be creating `&[i64]` out of a buffer created out of `i32`.

Any Rust developer acknowledges that this behavior goes very much against Rust's core premise that a function's behavior must not be undefined depending on whether its arguments are correct. The obvious observation is that transmute is one of the most `unsafe` operations in Rust, and not allowing the compiler to verify the necessary invariants places a large burden on users and developers.

This simple example indicates a broader problem with the current design, which we now explore in detail.

#### Root cause analysis

At its core, Arrow's current design is centered around two main `structs`:

1. untyped `Buffer`
2. untyped `ArrayData`

##### 1. untyped `Buffer`

The crate's buffers are untyped, which implies that once created, the type
information is lost. Consequently, the compiler has no way
of verifying that a certain read can be performed. As such, any read from them requires an alignment and size check at runtime. This is not only detrimental to performance, but also very cumbersome.
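
To make the burden concrete, here is a hypothetical sketch (not the crate's actual code) of what every such read has to do at runtime:

```rust
// Reinterpreting untyped bytes as `&[T]` forces alignment and length checks
// that the compiler cannot perform for us.
fn try_as_slice<T>(bytes: &[u8]) -> Option<&[T]> {
    // `align_to` is unsafe because the cast is only sound for plain-old-data types.
    let (prefix, slice, suffix) = unsafe { bytes.align_to::<T>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(slice)
    } else {
        None
    }
}
```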

For the past 4 months, I have identified and fixed more than 10 instances of unsound code derived from misuse, within the crate itself, of `Buffer`. This suggests that downstream dependencies using this API are likely to be even more affected.

##### 2. untyped `ArrayData`

`ArrayData` is a `struct` containing buffers and child data that does not differentiate which type of array it represents at compile time.

Consequently, all reads from `ArrayData`'s buffers are effectively `unsafe`, as they require certain invariants to hold. These invariants are strictly related to `ArrayData::datatype`: this `enum` determines how to transmute `ArrayData::buffers`. For example, an `ArrayData::datatype` equal to `DataType::UInt32` implies that the buffer should be transmuted to `u32`.

The challenge with the above struct is that it is not possible to prove that `ArrayData`'s creation
is sound at compile time. As the sample above showed, there was nothing wrong, during compilation, with passing a buffer with `i32` to an `ArrayData` expecting `i64`. We could of course check it at runtime, and we should, but doing so defeats the whole purpose of using a type system as powerful as Rust's.

The main consequence of this observation is that the current code has a significant maintenance cost, as we have to rigorously check the types of the buffers we are working with. The example above shows
that, even with that rigour, we fail to identify obvious problems.

Overall, there are many instances in our code where we expose public APIs marked as safe that are actually `unsafe` and lead to undefined behavior if used incorrectly. This goes against the core goals of the Rust language, and significantly weakens the Arrow Rust implementation's core premise that the compiler and borrow checker prove many of the memory-safety concerns we may have.

Equally important, the compiler's inability to prove certain invariants is detrimental to performance. As an example, the implementation of the `take` kernel in this repo is semantically equivalent to the one on current master, but 1.3-2x faster.
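
For reference, a minimal sketch of using this repo's `take` kernel (assuming the relevant `compute` feature flags are enabled; paths follow `arrow2` around the `0.9` series):

```rust
use arrow2::array::{Array, Int32Array};
use arrow2::compute::take::take;

fn main() {
    let values = Int32Array::from_slice([10, 20, 30]);
    let indices = Int32Array::from_slice([2, 0]);
    // Gather the values at positions 2 and 0, yielding [30, 10].
    let taken = take(&values, &indices).unwrap();
    assert_eq!(taken.len(), 2);
}
```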

### How?

Contrary to the original implementation, this implementation does not transmute
byte buffers based on runtime types, and instead requires all buffers to be typed
(in the sense of Rust generics).
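
A minimal sketch of what "typed" means here (assuming `arrow2::buffer::Buffer` around the `0.9` series):

```rust
use arrow2::buffer::Buffer;

fn main() {
    // The element type is part of the buffer's type, so the compiler rejects
    // reinterpreting `i32` data as `i64` or `f64` outright.
    let buffer: Buffer<i32> = vec![0i32, 2].into();
    let values: &[i32] = buffer.as_slice();
    assert_eq!(values, &[0, 2]);
}
```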

This removes many existing bugs and enables the compiler to prove all type invariants,
with the only exceptions being FFI and IPC boundaries.

This crate also has a different design for arrays' `offsets` that removes many
out-of-bounds reads that resulted from using byte and slot offsets interchangeably.

This crate's design of primitive types is also more explicit about their logical and
physical representations, enabling support for `Timestamp`s with timezones and a
safe implementation of the `Interval` type.
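
A hedged sketch of the logical/physical split for timestamps (assuming `PrimitiveArray::to` for attaching a logical type, as in `arrow2` around the `0.9` series):

```rust
use arrow2::array::{Array, PrimitiveArray};
use arrow2::datatypes::{DataType, TimeUnit};

fn main() {
    // Physically plain `i64`s; logically millisecond timestamps in UTC.
    let array = PrimitiveArray::<i64>::from_vec(vec![0, 1_500_000_000_000])
        .to(DataType::Timestamp(TimeUnit::Millisecond, Some("UTC".to_string())));
    assert_eq!(array.len(), 2);
}
```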

Consequently, this crate is easier to use, develop, maintain, and debug.

### Any plans to merge with the Apache Arrow project?

Maybe. The primary reason to have this repo and crate is to be able to prototype …