Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Reduce the overhead of DataTypes #1469

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

teh-cmc
Copy link

@teh-cmc teh-cmc commented Apr 17, 2023

Fixes #439

The entire PR pretty much comes down to this diff:

@@ -70,7 +111,7 @@ pub enum DataType {
     /// * An absolute time zone offset of the form +XX:XX or -XX:XX, such as +07:30
     /// When the timezone is not specified, the timestamp is considered to have no timezone
     /// and is represented _as is_
-    Timestamp(TimeUnit, Option<String>),
+    Timestamp(TimeUnit, Option<Arc<String>>),
     /// An [`i32`] representing the elapsed time since UNIX epoch (1970-01-01)
     /// in days.
     Date32,
@@ -100,16 +141,16 @@ pub enum DataType {
     /// A variable-length UTF-8 encoded string whose offsets are represented as [`i64`].
     LargeUtf8,
     /// A list of some logical data type whose offsets are represented as [`i32`].
-    List(Box<Field>),
+    List(Arc<Field>),
     /// A list of some logical data type with a fixed number of elements.
-    FixedSizeList(Box<Field>, usize),
+    FixedSizeList(Arc<Field>, usize),
     /// A list of some logical data type whose offsets are represented as [`i64`].
-    LargeList(Box<Field>),
+    LargeList(Arc<Field>),
     /// A nested [`DataType`] with a given number of [`Field`]s.
-    Struct(Vec<Field>),
+    Struct(Arc<Vec<Field>>),
     /// A nested datatype that can represent slots of differing types.
     /// Third argument represents mode
-    Union(Vec<Field>, Option<Vec<i32>>, UnionMode),
+    Union(Arc<Vec<Field>>, Option<Arc<Vec<i32>>>, UnionMode),
     /// A nested type that is represented as
     ///
     /// List<entries: Struct<key: K, value: V>>
@@ -135,7 +176,7 @@ pub enum DataType {
     /// The metadata is structured so that Arrow systems without special handling
     /// for Map can make Map an alias for List. The "layout" attribute for the Map
     /// field must have the same contents as a List.
-    Map(Box<Field>, bool),
+    Map(Arc<Field>, bool),
     /// A dictionary encoded array (`key_type`, `value_type`), where
     /// each array element is an index of `key_type` into an
     /// associated dictionary of `value_type`.
@@ -148,7 +189,7 @@ pub enum DataType {
     /// arrays or a limited set of primitive types as integers.
     ///
     /// The `bool` value indicates the `Dictionary` is sorted if set to `true`.
-    Dictionary(IntegerType, Box<DataType>, bool),
+    Dictionary(IntegerType, Arc<DataType>, bool),
     /// Decimal value with precision and scale
     /// precision is the number of digits in the number and
     /// scale is the number of decimal places.
@@ -157,12 +198,15 @@ pub enum DataType {
     /// Decimal backed by 256 bits
     Decimal256(usize, usize),
     /// Extension type.
-    Extension(String, Box<DataType>, Option<String>),
+    Extension(String, Arc<DataType>, Option<Arc<String>>),
 }

everything else is just a lot of grunt work and pain to accommodate for these new types.

As mentioned in #439 (comment): I went for the path of least resistance, so this isn't optimal, but it is already quite the improvement.

I have branches ready for arrow2_convert, polars and rerun.
In Rerun, we've seen up to 50% reduced memory requirements in some use cases with this PR.

@ritchie46
Copy link
Collaborator

ritchie46 commented Apr 17, 2023

Thanks for the PR. This hits a lot of code so I want to tune in @jorgecarleitao as well.

In Rerun, we've seen up to 50% reduced memory requirements in some use cases with this PR.

Could you elaborate a bit on this? What did you benchmark? How was the data type such a huge bottleneck?

What is the new data type size and what was the old one?

Somewhat related, In polars we use smallstring which might also be an interesting route.

@codecov
Copy link

codecov bot commented Apr 17, 2023

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (3ddc6a1) 83.39% compared to head (3c3c6ed) 83.96%.

❗ Current head 3c3c6ed differs from pull request most recent head 40541b4. Consider uploading reports for the commit 40541b4 to get more accurate results

Files Patch % Lines
src/datatypes/mod.rs 89.74% 4 Missing ⚠️
src/io/avro/read/schema.rs 50.00% 3 Missing ⚠️
src/array/struct_/mod.rs 0.00% 2 Missing ⚠️
src/compute/cast/dictionary_to.rs 0.00% 2 Missing ⚠️
src/io/parquet/read/schema/convert.rs 98.07% 2 Missing ⚠️
src/temporal_conversions.rs 50.00% 2 Missing ⚠️
src/io/parquet/write/mod.rs 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1469      +/-   ##
==========================================
+ Coverage   83.39%   83.96%   +0.57%     
==========================================
  Files         391      387       -4     
  Lines       43008    41739    -1269     
==========================================
- Hits        35867    35048     -819     
+ Misses       7141     6691     -450     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@emilk
Copy link

emilk commented Jan 12, 2024

Any chance to get this merged? We do a lot of cloning of datatypes, and the memory use adds up extremely quickly. Using Arc reduces the memory footprint tremendously.

emilk added a commit to rerun-io/re_arrow2 that referenced this pull request Jan 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate how to reduce size of DataType or arrays without making the API too awkward
3 participants