An implementation of Arrow targeting .NET Standard.
See the feature matrix below for currently available features.
- Arrow specification 1.0.0. (Support for reading 0.11+.)
- C# 11
- .NET Standard 2.0 and .NET 6.0
- Asynchronous I/O
- Uses modern .NET runtime features such as Span<T>, Memory<T>, MemoryManager<T>, and System.Buffers primitives for memory allocation, memory storage, and fast serialization.
- Uses Acyclic Visitor Pattern for array types and arrays to facilitate serialization, record batch traversal, and format growth.
- Cannot read Arrow files containing tensors.
- Cannot easily modify the allocation strategy without implementing a custom memory pool. All allocations are currently 64-byte aligned and padded to 8 bytes.
- The default memory allocation strategy over-allocates with pointer fixing, which results in significant memory overhead for small buffers. A buffer that requires a single byte of storage may be backed by an allocation of up to 64 bytes to satisfy alignment requirements.
- There are currently few builder APIs available for specific array types, so arrays must be built manually with the Arrow buffer builder abstraction (see the sketch after this list).
- FlatBuffer code generation is not included in the build process.
- Serialization implementation does not perform exhaustive validation checks during deserialization in every scenario.
- Throws exceptions with vague, inconsistent, or non-localized messages in many situations.
- Throws exceptions that are non-specific to the Arrow implementation in some circumstances where it probably should (e.g., it does not throw ArrowException).
- Lack of code documentation
- Lack of usage examples
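As a rough illustration of the buffer-builder approach noted above, the sketch below assembles an Int32Array from a value buffer built with ArrowBuffer.Builder&lt;T&gt;. This is a minimal sketch rather than an official example, and the exact constructor overloads may differ between library versions.

    using Apache.Arrow;

    // Build the raw value buffer for three int32 values.
    var valueBuffer = new ArrowBuffer.Builder<int>()
        .Append(1)
        .Append(2)
        .Append(3)
        .Build();

    // Wrap the buffer in an Int32Array. With no null values, an empty
    // null bitmap, a null count of 0, and an offset of 0 are used.
    var array = new Int32Array(valueBuffer, ArrowBuffer.Empty, 3, 0, 0);

The following example shows how to read a record batch from an Arrow IPC file: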
    using System.Diagnostics;
    using System.IO;
    using System.Threading.Tasks;
    using Apache.Arrow;
    using Apache.Arrow.Ipc;

    public static async Task<RecordBatch> ReadArrowAsync(string filename)
    {
        // Open the Arrow IPC file and wrap it in a file reader.
        using (var stream = File.OpenRead(filename))
        using (var reader = new ArrowFileReader(stream))
        {
            // Read the next record batch; null is returned when no batches remain.
            var recordBatch = await reader.ReadNextRecordBatchAsync();
            Debug.WriteLine("Read record batch with {0} column(s)", recordBatch.ColumnCount);
            return recordBatch;
        }
    }
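A record batch can be written back out in the IPC file format with ArrowFileWriter. The sketch below is illustrative rather than an official example; overloads may differ slightly between versions.

    using System.IO;
    using System.Threading.Tasks;
    using Apache.Arrow;
    using Apache.Arrow.Ipc;

    public static async Task WriteArrowAsync(RecordBatch recordBatch, string filename)
    {
        using (var stream = File.OpenWrite(filename))
        using (var writer = new ArrowFileWriter(stream, recordBatch.Schema))
        {
            // Write the batch, then finish the file footer.
            await writer.WriteRecordBatchAsync(recordBatch);
            await writer.WriteEndAsync();
        }
    }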
- Allocations are 64-byte aligned and padded to 8 bytes.
- Allocations are automatically garbage collected.
- Int8, Int16, Int32, Int64
- UInt8, UInt16, UInt32, UInt64
- Float, Double, Half-float (.NET 6+)
- Binary (variable-length)
- String (utf-8)
- Null
- Timestamp
- Date32
- Date64
- Decimal
- Time32
- Time64
- Binary (fixed-length)
- List
- Struct
- Union
- Map
- Duration
- Interval
- Data Types
- Fields
- Schema
- File
- Stream
Buffer compression and decompression is supported, but requires installing the Apache.Arrow.Compression package. When reading compressed data, you must pass an Apache.Arrow.Compression.CompressionCodecFactory instance to the ArrowFileReader or ArrowStreamReader constructor, and when writing compressed data a CompressionCodecFactory must be set in the IpcOptions. Alternatively, a custom implementation of ICompressionCodecFactory can be used.
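As a rough sketch of how these pieces fit together (assuming the Apache.Arrow.Compression package is referenced; exact constructor overloads and option names may vary between versions):

    using System.IO;
    using Apache.Arrow.Compression;
    using Apache.Arrow.Ipc;

    // Reading: supply a codec factory so compressed buffers can be decompressed.
    using (var stream = File.OpenRead("data.arrow"))
    using (var reader = new ArrowFileReader(stream, new CompressionCodecFactory()))
    {
        var batch = reader.ReadNextRecordBatch();
    }

    // Writing: request compression through IpcOptions and pass the options
    // to the ArrowFileWriter or ArrowStreamWriter constructor.
    var options = new IpcOptions
    {
        CompressionCodecFactory = new CompressionCodecFactory(),
        CompressionCodec = CompressionCodecType.Zstd
    };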
The following are not yet implemented:

- Serialization
  - Exhaustive validation
  - Run End Encoding
- Types
  - Tensor
- Arrays
  - Large Arrays. There are large array types provided to help with interoperability with other libraries,
    but these do not support buffers larger than 2 GiB and an exception will be raised if trying to import an array that is too large.
    - Large Binary
    - Large List
    - Large String
  - Views
    - Binary
    - List
    - String
    - Large Binary
    - Large List
    - Large String
- Array Operations
  - Equality / Comparison
  - Casting
- Compute
  - There is currently no API available for a compute / kernel abstraction.
Install the latest .NET SDK from https://dotnet.microsoft.com/download, then build with:

    dotnet build
To build the NuGet package, run the following command to build a debug-flavored preview package into the artifacts folder:

    dotnet pack
When building the officially released version, run the following (see the note below about the current git repository):

    dotnet pack -c Release

This builds the final/stable package.
NOTE: When building the officially released version, ensure that your git repository has the origin remote set to https://github.com/apache/arrow.git, which ensures Source Link is set correctly. See https://github.com/dotnet/sourcelink/blob/main/docs/README.md for more information.
There are two output artifacts:

1. Apache.Arrow.<version>.nupkg - this contains the executable assemblies
2. Apache.Arrow.<version>.snupkg - this contains the debug symbol files
Both of these artifacts can then be uploaded to https://www.nuget.org/packages/manage/upload.
Build from the Apache Arrow project root:

    docker build -f csharp/build/docker/Dockerfile .
Run the tests with:

    dotnet test

All build artifacts are placed in the artifacts folder in the project root.
This project follows the coding style specified in the Coding Style guidelines.
See https://google.github.io/flatbuffers/flatbuffers_guide_use_java_c-sharp.html for how to get the flatc executable.

Run flatc --csharp on each .fbs file in the format folder, and replace the checked-in .cs files under FlatBuf with the generated files.

Update the non-generated FlatBuffers .cs files with the files from the google/flatbuffers repo.