Reduce the overhead of `DataType`s #1469

teh-cmc · 2023-04-17T08:56:44Z

Fixes #439

The entire PR pretty much comes down to this diff:

```diff
@@ -70,7 +111,7 @@ pub enum DataType {
     /// * An absolute time zone offset of the form +XX:XX or -XX:XX, such as +07:30
     /// When the timezone is not specified, the timestamp is considered to have no timezone
     /// and is represented _as is_
-    Timestamp(TimeUnit, Option<String>),
+    Timestamp(TimeUnit, Option<Arc<String>>),
     /// An [`i32`] representing the elapsed time since UNIX epoch (1970-01-01)
     /// in days.
     Date32,
@@ -100,16 +141,16 @@ pub enum DataType {
     /// A variable-length UTF-8 encoded string whose offsets are represented as [`i64`].
     LargeUtf8,
     /// A list of some logical data type whose offsets are represented as [`i32`].
-    List(Box<Field>),
+    List(Arc<Field>),
     /// A list of some logical data type with a fixed number of elements.
-    FixedSizeList(Box<Field>, usize),
+    FixedSizeList(Arc<Field>, usize),
     /// A list of some logical data type whose offsets are represented as [`i64`].
-    LargeList(Box<Field>),
+    LargeList(Arc<Field>),
     /// A nested [`DataType`] with a given number of [`Field`]s.
-    Struct(Vec<Field>),
+    Struct(Arc<Vec<Field>>),
     /// A nested datatype that can represent slots of differing types.
     /// Third argument represents mode
-    Union(Vec<Field>, Option<Vec<i32>>, UnionMode),
+    Union(Arc<Vec<Field>>, Option<Arc<Vec<i32>>>, UnionMode),
     /// A nested type that is represented as
     ///
     /// List<entries: Struct<key: K, value: V>>
@@ -135,7 +176,7 @@ pub enum DataType {
     /// The metadata is structured so that Arrow systems without special handling
     /// for Map can make Map an alias for List. The "layout" attribute for the Map
     /// field must have the same contents as a List.
-    Map(Box<Field>, bool),
+    Map(Arc<Field>, bool),
     /// A dictionary encoded array (`key_type`, `value_type`), where
     /// each array element is an index of `key_type` into an
     /// associated dictionary of `value_type`.
@@ -148,7 +189,7 @@ pub enum DataType {
     /// arrays or a limited set of primitive types as integers.
     ///
     /// The `bool` value indicates the `Dictionary` is sorted if set to `true`.
-    Dictionary(IntegerType, Box<DataType>, bool),
+    Dictionary(IntegerType, Arc<DataType>, bool),
     /// Decimal value with precision and scale
     /// precision is the number of digits in the number and
     /// scale is the number of decimal places.
@@ -157,12 +198,15 @@ pub enum DataType {
     /// Decimal backed by 256 bits
     Decimal256(usize, usize),
     /// Extension type.
-    Extension(String, Box<DataType>, Option<String>),
+    Extension(String, Arc<DataType>, Option<Arc<String>>),
 }
```

Everything else is just a lot of grunt work and pain to accommodate these new types.
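To give a feel for what that grunt work looks like at call sites, here is a minimal sketch. It uses simplified stand-in `Field` and `DataType` types defined locally (not the real arrow2 definitions), and the helper names `list_of`, `child_type` and `rename_child` are hypothetical. Construction swaps `Box::new` for `Arc::new`, reads are unchanged thanks to `Deref`, and in-place mutation has to go through `Arc::make_mut` because `Arc` does not implement `DerefMut`.

```rust
use std::sync::Arc;

// Simplified stand-ins for illustration only; the real arrow2 types
// have many more variants and fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub is_nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    List(Arc<Field>),
    Struct(Arc<Vec<Field>>),
}

/// Construction: `Box::new(field)` becomes `Arc::new(field)`.
pub fn list_of(inner: DataType) -> DataType {
    DataType::List(Arc::new(Field {
        name: "item".to_string(),
        data_type: inner,
        is_nullable: true,
    }))
}

/// Reading: `Arc<Field>` derefs to `Field`, so match arms barely change.
pub fn child_type(dt: &DataType) -> Option<&DataType> {
    match dt {
        DataType::List(field) => Some(&field.data_type),
        _ => None,
    }
}

/// Mutation: `Arc::make_mut` clones the inner `Field` only if it is
/// currently shared with another `Arc` handle (copy-on-write).
pub fn rename_child(dt: &mut DataType, new_name: &str) {
    if let DataType::List(field) = dt {
        Arc::make_mut(field).name = new_name.to_string();
    }
}
```

Calling `rename_child` on one clone of a `DataType` leaves other clones untouched, since `Arc::make_mut` copies the shared `Field` before writing to it.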

As mentioned in #439 (comment): I went for the path of least resistance, so this isn't optimal, but it is already quite the improvement.

I have branches ready for arrow2_convert, polars and rerun.
In Rerun, we've seen up to 50% reduced memory requirements in some use cases with this PR.

ritchie46 · 2023-04-17T09:19:06Z

Thanks for the PR. This touches a lot of code, so I want to loop in @jorgecarleitao as well.

> In Rerun, we've seen up to 50% reduced memory requirements in some use cases with this PR.

Could you elaborate a bit on this? What did you benchmark? How was the data type such a huge bottleneck?

What is the new data type size and what was the old one?
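One way to start answering the size question is to compare the inline size of the payload types the diff swaps. The sketch below is not a measurement of arrow2 itself: `payload_sizes` is a hypothetical helper, and `u8`/`u64` stand in for `Field`, since the inline size of `Vec<T>`, `Box<T>` and `Arc<T>` does not depend on a sized `T`. The exact numbers depend on the target and are not guaranteed by the language, so the function reports them rather than hard-coding them.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Returns `(description, old_size, new_size)` in bytes for a few of the
/// payload swaps made in the diff. Element types are stand-ins.
pub fn payload_sizes() -> [(&'static str, usize, usize); 3] {
    [
        (
            "Option<String> -> Option<Arc<String>>",
            size_of::<Option<String>>(),
            size_of::<Option<Arc<String>>>(),
        ),
        (
            "Vec<T> -> Arc<Vec<T>>",
            size_of::<Vec<u8>>(),
            size_of::<Arc<Vec<u8>>>(),
        ),
        (
            "Box<T> -> Arc<T>",
            size_of::<Box<u64>>(),
            size_of::<Arc<u64>>(),
        ),
    ]
}

/// Prints one line per swap, e.g. for eyeballing on a given target.
pub fn print_payload_sizes() {
    for (what, old, new) in payload_sizes() {
        println!("{what}: {old} bytes -> {new} bytes");
    }
}
```

An enum is at least as large as its largest variant, so shrinking these payloads can shrink `DataType` itself, but whether it does in arrow2 is not stated in this thread. The inline size also says nothing about heap usage, which is where sharing through `Arc` matters when datatypes are cloned.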

Somewhat related: in Polars we use smallstring, which might also be an interesting route.

codecov · 2023-04-17T09:23:26Z

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (3ddc6a1) 83.39% compared to head (3c3c6ed) 83.96%.

❗ Current head 3c3c6ed differs from pull request most recent head 40541b4. Consider uploading reports for the commit 40541b4 to get more accurate results

| Files | Patch % | Lines |
|---|---|---|
| src/datatypes/mod.rs | 89.74% | 4 Missing ⚠️ |
| src/io/avro/read/schema.rs | 50.00% | 3 Missing ⚠️ |
| src/array/struct_/mod.rs | 0.00% | 2 Missing ⚠️ |
| src/compute/cast/dictionary_to.rs | 0.00% | 2 Missing ⚠️ |
| src/io/parquet/read/schema/convert.rs | 98.07% | 2 Missing ⚠️ |
| src/temporal_conversions.rs | 50.00% | 2 Missing ⚠️ |
| src/io/parquet/write/mod.rs | 50.00% | 1 Missing ⚠️ |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1469      +/-   ##
==========================================
+ Coverage   83.39%   83.96%   +0.57%
==========================================
  Files         391      387       -4
  Lines       43008    41739    -1269
==========================================
- Hits        35867    35048     -819
+ Misses       7141     6691     -450
```


emilk · 2024-01-12T10:59:06Z

Any chance to get this merged? We do a lot of cloning of datatypes, and the memory use adds up extremely quickly. Using Arc reduces the memory footprint tremendously.
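The effect of cloning can be illustrated with a small sketch. The `Field` type here is a local stand-in and the functions `clone_shared` and `owned_clone_is_distinct` are hypothetical, not part of arrow2. Cloning an `Arc<Vec<Field>>` only bumps a reference count, so every clone points at the same heap allocation, while cloning a plain `Vec<Field>` copies the whole buffer.

```rust
use std::sync::Arc;

// Local stand-in for illustration; not the arrow2 `Field`.
#[derive(Debug, Clone)]
pub struct Field {
    pub name: String,
}

fn sample_fields() -> Vec<Field> {
    vec![
        Field { name: "x".to_string() },
        Field { name: "y".to_string() },
    ]
}

/// Clones a shared field list `n` times. Returns the clones and the strong
/// count observed while the original handle is still alive.
pub fn clone_shared(n: usize) -> (Vec<Arc<Vec<Field>>>, usize) {
    let fields = Arc::new(sample_fields());
    // Each clone is a pointer copy plus an atomic increment; no `Field`
    // or `String` is duplicated.
    let clones: Vec<_> = (0..n).map(|_| Arc::clone(&fields)).collect();
    let count = Arc::strong_count(&fields);
    (clones, count)
}

/// Cloning an owned `Vec` allocates a new buffer and deep-copies every
/// element, so the two vectors never share storage.
pub fn owned_clone_is_distinct() -> bool {
    let original = sample_fields();
    let copy = original.clone();
    original.as_ptr() != copy.as_ptr()
}
```

With `n` clones, the `Arc` variant holds one copy of the field list and `n + 1` handles to it; the owned variant would hold `n + 1` separate copies, which is the kind of growth described in the comment above.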

See jorgecarleitao#1469 --------- Co-authored-by: Clement Rey <[email protected]>

teh-cmc added 4 commits April 13, 2023 16:53

Taking care of List, LargeList, FixedSizeList and Map

3f75972

Taking care of Timestamp in the most pain-free way I could come up with

c1e58ad

Every other arm, limiting pain as much as possible

d274275

Merge remote-tracking branch 'upstream/main' into cmc/arc_datatype

3c3c6ed

clarkzinzow mentioned this pull request May 18, 2023

[Extension Types] Add support for cross-lang extension types. Eventual-Inc/Daft#899

Merged

1 task

emilk mentioned this pull request Jan 11, 2024

Fork arrow2 and get rid of polars rerun-io/rerun#4789

Closed

Merge branch 'main' into cmc/arc_datatype

664d021

cargo fmt

40541b4

This was referenced Jan 15, 2024

Use Arc in Datatype to reduce memory use #1603

Closed

Use Arc in Datatype to reduce memory overhead rerun-io/re_arrow2#3

Merged

emilk force-pushed the cmc/arc_datatype branch from ec5b7f2 to 40541b4 Compare January 15, 2024 15:36

emilk added a commit to rerun-io/re_arrow2 that referenced this pull request Jan 15, 2024

Use Arc in Datatype to reduce memory overhead (#3)

33a3200

See jorgecarleitao#1469 --------- Co-authored-by: Clement Rey <[email protected]>
