GH-41136: [C#] Recompute null count for sliced arrays on demand #41144

Merged (6 commits) on Apr 12, 2024
2 changes: 1 addition & 1 deletion csharp/src/Apache.Arrow/Arrays/Array.cs
@@ -31,7 +31,7 @@ protected Array(ArrayData data)

public int Offset => Data.Offset;

public int NullCount => Data.NullCount;
public int NullCount => Data.GetNullCount();

public ArrowBuffer NullBitmapBuffer => Data.Buffers[0];

60 changes: 57 additions & 3 deletions csharp/src/Apache.Arrow/Arrays/ArrayData.cs
@@ -15,7 +15,6 @@

using Apache.Arrow.Memory;
using Apache.Arrow.Types;
using Google.FlatBuffers;
using System;
using System.Collections.Generic;
using System.Linq;
@@ -28,12 +27,30 @@ public sealed class ArrayData : IDisposable

public readonly IArrowType DataType;
public readonly int Length;
public readonly int NullCount;

/// <summary>
/// The number of null values in the Array. May be -1 if the null count has not been computed.
/// </summary>
public int NullCount;

public readonly int Offset;
public readonly ArrowBuffer[] Buffers;
public readonly ArrayData[] Children;
public readonly ArrayData Dictionary; // Only used for dictionary type

/// <summary>
/// Get the number of null values in the Array, computing the count if required.
/// </summary>
public int GetNullCount()
{
if (NullCount == RecalculateNullCount)
{
NullCount = ComputeNullCount();
}

return NullCount;
}

// This is left for compatibility with lower version binaries
// before the dictionary type was supported.
public ArrayData(
@@ -111,7 +128,25 @@ public ArrayData Slice(int offset, int length)
length = Math.Min(Length - offset, length);
offset += Offset;

return new ArrayData(DataType, length, RecalculateNullCount, offset, Buffers, Children, Dictionary);
int nullCount;
if (NullCount == 0)
Contributor: This would force calculation of the null count on the original array. Is it worth preserving laziness and saying "if we haven't calculated the original null count then mark the sliced null count as requiring calculation"?

Contributor Author: Good catch, yes definitely

Contributor Author: This was fixed as a side effect of reverting NullCount back to being a normal int member.

{
nullCount = 0;
}
else if (NullCount == Length)
{
nullCount = length;
}
else if (offset == Offset && length == Length)
{
nullCount = NullCount;
}
else
{
nullCount = RecalculateNullCount;
}

return new ArrayData(DataType, length, nullCount, offset, Buffers, Children, Dictionary);
}

public ArrayData Clone(MemoryAllocator allocator = default)
@@ -125,5 +160,24 @@ public ArrayData Clone(MemoryAllocator allocator = default)
Children?.Select(b => b.Clone(allocator))?.ToArray(),
Dictionary?.Clone(allocator));
}

private int ComputeNullCount()
{
if (DataType.TypeId == ArrowTypeId.Union)
{
return UnionArray.ComputeNullCount(this);
}

if (Buffers == null || Buffers.Length == 0 || Buffers[0].IsEmpty)
{
return 0;
}

// Note: Dictionary arrays may be logically null if there is a null in the dictionary values,
// but this isn't accounted for by the IArrowArray.IsNull implementation,
// so we maintain consistency with that behaviour here.
Comment on lines +176 to +178
Contributor Author: The C++ implementation actually has a separate ComputeLogicalNullCount method on ArrayData and Array that handles union arrays, dictionary arrays and run end encoded arrays, and there the null_count for union arrays always returns 0. The C# UnionArray.IsValid method depends on the NullCount value though, so I figured it made sense to compute the actual null count for union arrays.


return Length - BitUtility.CountBits(Buffers[0].Span, Offset, Length);
}
}
}
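
The lazy null-count pattern in the ArrayData.cs changes above can be sketched in isolation. This is a minimal illustration, not the Apache.Arrow implementation: `LazyNullCountData`, its `Unknown` constant, and the raw `byte[]` validity bitmap are hypothetical stand-ins for `ArrayData`, `RecalculateNullCount`, and `ArrowBuffer`, and the bitmap is assumed to be LSB-first with a set bit meaning "valid".

```csharp
using System;

// Hypothetical, simplified stand-in for ArrayData: a validity bitmap plus offset/length.
public sealed class LazyNullCountData
{
    public const int Unknown = -1; // sentinel meaning "not computed yet"

    private readonly byte[] _validity; // bit i set => element i is valid (LSB-first, assumed)
    public int Offset { get; }
    public int Length { get; }
    public int NullCount { get; private set; } // may be Unknown until requested

    public LazyNullCountData(byte[] validity, int offset, int length, int nullCount = Unknown)
    {
        _validity = validity;
        Offset = offset;
        Length = length;
        NullCount = nullCount;
    }

    // Compute the count on first use, then cache it.
    public int GetNullCount()
    {
        if (NullCount == Unknown)
        {
            NullCount = ComputeNullCount();
        }
        return NullCount;
    }

    public LazyNullCountData Slice(int offset, int length)
    {
        length = Math.Min(Length - offset, length);
        int newOffset = offset + Offset;

        int nullCount;
        if (NullCount == 0) nullCount = 0;                 // no nulls anywhere
        else if (NullCount == Length) nullCount = length;  // every element is null
        else if (newOffset == Offset && length == Length) nullCount = NullCount; // same range
        else nullCount = Unknown;                          // defer the count until requested

        return new LazyNullCountData(_validity, newOffset, length, nullCount);
    }

    // Count the unset validity bits in [Offset, Offset + Length).
    private int ComputeNullCount()
    {
        int valid = 0;
        for (int i = Offset; i < Offset + Length; i++)
        {
            if ((_validity[i / 8] & (1 << (i % 8))) != 0) valid++;
        }
        return Length - valid;
    }
}
```

Because an uncomputed count never equals 0 or `Length`, slicing an array whose count is still unknown yields a slice whose count is also unknown, which is the laziness discussed in the review thread.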
2 changes: 1 addition & 1 deletion csharp/src/Apache.Arrow/Arrays/ArrayDataConcatenator.cs
@@ -71,7 +71,7 @@ public ArrayDataConcatenationVisitor(IReadOnlyList<ArrayData> arrayDataList, Mem
foreach (ArrayData arrayData in _arrayDataList)
{
_totalLength += arrayData.Length;
_totalNullCount += arrayData.NullCount;
_totalNullCount += arrayData.GetNullCount();
Contributor Author: It could make sense to check NullCount for -1 here, and then set the final null count to -1 if it was -1 in any array, rather than forcing a computation. But this class has bigger problems as it doesn't account for non-zero offsets anywhere as far as I can see, although it does seem to account for array lengths being less than the full buffer size. Maybe it should throw a NotImplementedException if it encounters a non-zero offset?

Contributor: Can you file a separate bug for this?

Contributor Author: Yep sure, I've made #41164

}
}
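
The alternative raised in the comment thread above (propagating -1 instead of forcing a computation) is not what this change does, but it could look roughly like the following. `NullCountSum`, `Combine`, and `Unknown` are hypothetical names used only for this sketch.

```csharp
using System.Collections.Generic;

public static class NullCountSum
{
    public const int Unknown = -1; // same sentinel convention as an uncomputed null count

    // Returns the summed null count, or Unknown if any input has not been computed yet.
    public static int Combine(IEnumerable<int> nullCounts)
    {
        int total = 0;
        foreach (int count in nullCounts)
        {
            if (count == Unknown)
            {
                return Unknown; // leave the total lazy rather than forcing a scan
            }
            total += count;
        }
        return total;
    }
}
```

The trade-off is that the concatenated array then pays for a full bitmap scan the first time its null count is requested, instead of each input paying for a smaller scan up front.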

23 changes: 23 additions & 0 deletions csharp/src/Apache.Arrow/Arrays/DenseUnionArray.cs
@@ -53,5 +53,28 @@ protected override bool FieldIsValid(IArrowArray fieldArray, int index)
{
return fieldArray.IsValid(ValueOffsets[index]);
}

internal new static int ComputeNullCount(ArrayData data)
{
var offset = data.Offset;
var length = data.Length;
var typeIds = data.Buffers[0].Span.Slice(offset, length);
var valueOffsets = data.Buffers[1].Span.CastTo<int>().Slice(offset, length);
var childArrays = new IArrowArray[data.Children.Length];
for (var childIdx = 0; childIdx < data.Children.Length; ++childIdx)
{
childArrays[childIdx] = ArrowArrayFactory.BuildArray(data.Children[childIdx]);
}

var nullCount = 0;
for (var i = 0; i < length; ++i)
{
var typeId = typeIds[i];
var valueOffset = valueOffsets[i];
nullCount += childArrays[typeId].IsNull(valueOffset) ? 1 : 0;
}

return nullCount;
}
}
}
2 changes: 1 addition & 1 deletion csharp/src/Apache.Arrow/Arrays/NullArray.cs
@@ -95,7 +95,7 @@ public NullArray(int length)

public int Offset => Data.Offset;

public int NullCount => Data.NullCount;
public int NullCount => Data.GetNullCount();

public void Dispose() { }
public bool IsNull(int index) => true;
21 changes: 21 additions & 0 deletions csharp/src/Apache.Arrow/Arrays/SparseUnionArray.cs
@@ -47,5 +47,26 @@ protected override bool FieldIsValid(IArrowArray fieldArray, int index)
{
return fieldArray.IsValid(index);
}

internal new static int ComputeNullCount(ArrayData data)
{
var offset = data.Offset;
var length = data.Length;
var typeIds = data.Buffers[0].Span.Slice(offset, length);
var childArrays = new IArrowArray[data.Children.Length];
for (var childIdx = 0; childIdx < data.Children.Length; ++childIdx)
{
childArrays[childIdx] = ArrowArrayFactory.BuildArray(data.Children[childIdx]);
}

var nullCount = 0;
for (var i = 0; i < data.Length; ++i)
{
var typeId = typeIds[i];
nullCount += childArrays[typeId].IsNull(offset + i) ? 1 : 0;
}

return nullCount;
}
}
}
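
The difference between the dense and sparse ComputeNullCount methods above comes down to which child slot is consulted for each element. The sketch below restates both with plain arrays so the indexing is easier to compare. `UnionNullCountSketch` and the `bool[][] childIsNull` representation are hypothetical, and, like the diff's `childArrays[typeId]`, it assumes type ids can be used directly as child indices.

```csharp
public static class UnionNullCountSketch
{
    // Sparse union: every child has one slot per parent element,
    // so element (offset + i) reads slot (offset + i) of the selected child.
    public static int CountSparse(byte[] typeIds, bool[][] childIsNull, int offset, int length)
    {
        int nulls = 0;
        for (int i = 0; i < length; i++)
        {
            int child = typeIds[offset + i]; // assumption: type id == child index
            if (childIsNull[child][offset + i]) nulls++;
        }
        return nulls;
    }

    // Dense union: children are compacted, so the parent's value offsets
    // give the slot to read within the selected child.
    public static int CountDense(byte[] typeIds, int[] valueOffsets, bool[][] childIsNull, int offset, int length)
    {
        int nulls = 0;
        for (int i = 0; i < length; i++)
        {
            int child = typeIds[offset + i]; // assumption: type id == child index
            if (childIsNull[child][valueOffsets[offset + i]]) nulls++;
        }
        return nulls;
    }
}
```

In both cases only the type id and value offset buffers are sliced by the parent's offset; the children themselves are indexed with absolute positions.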
12 changes: 11 additions & 1 deletion csharp/src/Apache.Arrow/Arrays/UnionArray.cs
@@ -41,7 +41,7 @@ public abstract class UnionArray : IArrowArray

public int Offset => Data.Offset;

public int NullCount => Data.NullCount;
public int NullCount => Data.GetNullCount();

public bool IsValid(int index) => NullCount == 0 || FieldIsValid(Fields[TypeIds[index]], index);

@@ -91,6 +91,16 @@ protected static void ValidateMode(UnionMode expected, UnionMode actual)
}
}

internal static int ComputeNullCount(ArrayData data)
{
return ((UnionType)data.DataType).Mode switch
{
UnionMode.Sparse => SparseUnionArray.ComputeNullCount(data),
UnionMode.Dense => DenseUnionArray.ComputeNullCount(data),
_ => throw new InvalidOperationException("unknown union mode in null count computation")
};
}

private IReadOnlyList<IArrowArray> InitializeFields()
{
IArrowArray[] result = new IArrowArray[Data.Children.Length];
2 changes: 1 addition & 1 deletion csharp/src/Apache.Arrow/C/CArrowArrayExporter.cs
@@ -115,7 +115,7 @@ private unsafe static void ConvertArray(ExportedAllocationOwner sharedOwner, Arr
{
cArray->length = array.Length;
cArray->offset = array.Offset;
cArray->null_count = array.NullCount;
cArray->null_count = array.NullCount; // The C Data interface allows the null count to be -1
cArray->release = ReleaseArrayPtr;
cArray->private_data = MakePrivateData(sharedOwner);

2 changes: 1 addition & 1 deletion csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs
@@ -376,7 +376,7 @@ private void CreateSelfAndChildrenFieldNodes(ArrayData data)
CreateSelfAndChildrenFieldNodes(data.Children[i]);
}
}
Flatbuf.FieldNode.CreateFieldNode(Builder, data.Length, data.NullCount);
Flatbuf.FieldNode.CreateFieldNode(Builder, data.Length, data.GetNullCount());
}

private static int CountAllNodes(IReadOnlyList<Field> fields)
15 changes: 15 additions & 0 deletions csharp/test/Apache.Arrow.Tests/ArrowArrayTests.cs
@@ -185,6 +185,7 @@ public void SlicePrimitiveArrayWithNulls()
TestSlice<Date64Array, Date64Array.Builder>(x => x.Append(new DateTime(2019, 1, 1)).Append(new DateTime(2019, 1, 2)).AppendNull().Append(new DateTime(2019, 1, 3)));
TestSlice<Time32Array, Time32Array.Builder>(x => x.Append(10).Append(20).AppendNull().Append(30));
TestSlice<Time64Array, Time64Array.Builder>(x => x.Append(10).Append(20).AppendNull().Append(30));
TestSlice<Int32Array, Int32Array.Builder>(x => x.AppendNull().AppendNull().AppendNull()); // All nulls

static void TestNumberSlice<T, TArray, TBuilder>()
where T : struct, INumber<T>
@@ -314,6 +315,8 @@ private void ValidateArrays<T>(PrimitiveArray<T> slicedArray)
.SequenceEqual(slicedArray.Values));

Assert.Equal(baseArray.GetValue(slicedArray.Offset), slicedArray.GetValue(0));

ValidateNullCount(slicedArray);
}

private void ValidateArrays(BooleanArray slicedArray)
@@ -333,6 +336,8 @@ private void ValidateArrays(BooleanArray slicedArray)
#pragma warning disable CS0618
Assert.Equal(baseArray.GetBoolean(slicedArray.Offset), slicedArray.GetBoolean(0));
#pragma warning restore CS0618

ValidateNullCount(slicedArray);
}

private void ValidateArrays(BinaryArray slicedArray)
@@ -347,6 +352,16 @@ private void ValidateArrays(BinaryArray slicedArray)
.SequenceEqual(slicedArray.ValueOffsets));

Assert.True(baseArray.GetBytes(slicedArray.Offset).SequenceEqual(slicedArray.GetBytes(0)));

ValidateNullCount(slicedArray);
}

private static void ValidateNullCount(IArrowArray slicedArray)
{
var expectedNullCount = Enumerable.Range(0, slicedArray.Length)
.Select(i => slicedArray.IsNull(i) ? 1 : 0)
.Sum();
Assert.Equal(expectedNullCount, slicedArray.NullCount);
}
}
}
49 changes: 42 additions & 7 deletions csharp/test/Apache.Arrow.Tests/UnionArrayTests.cs
@@ -25,6 +25,46 @@ public class UnionArrayTests
[InlineData(UnionMode.Sparse)]
[InlineData(UnionMode.Dense)]
public void UnionArray_IsNull(UnionMode mode)
{
var (array, expectedNull) = BuildUnionArray(mode, 100);

for (var i = 0; i < array.Length; ++i)
{
Assert.Equal(expectedNull[i], array.IsNull(i));
Assert.Equal(!expectedNull[i], array.IsValid(i));
}
}

[Theory]
[InlineData(UnionMode.Sparse)]
[InlineData(UnionMode.Dense)]
public void UnionArray_Slice(UnionMode mode)
{
var (array, expectedNull) = BuildUnionArray(mode, 10);

for (var offset = 0; offset < array.Length; ++offset)
{
for (var length = 0; length < array.Length - offset; ++length)
{
var slicedArray = ArrowArrayFactory.Slice(array, offset, length);

var nullCount = 0;
for (var i = 0; i < slicedArray.Length; ++i)
{
// TODO: Shouldn't need to add offset in IsNull/IsValid calls,
// see https://github.com/apache/arrow/issues/41140
Assert.Equal(expectedNull[offset + i], slicedArray.IsNull(offset + i));
Assert.Equal(!expectedNull[offset + i], slicedArray.IsValid(offset + i));
nullCount += expectedNull[offset + i] ? 1 : 0;
}

Assert.True(nullCount == slicedArray.NullCount, $"offset = {offset}, length = {length}");
Assert.Equal(nullCount, slicedArray.NullCount);
}
}
}

private static (UnionArray array, bool[] isNull) BuildUnionArray(UnionMode mode, int length)
{
var fields = new Field[]
{
@@ -34,7 +74,6 @@ public void UnionArray_IsNull(UnionMode mode)
var typeIds = fields.Select(f => (int) f.DataType.TypeId).ToArray();
var type = new UnionType(fields, typeIds, mode);

const int length = 100;
var nullCount = 0;
var field0Builder = new Int32Array.Builder();
var field1Builder = new FloatArray.Builder();
@@ -44,7 +83,7 @@ public void UnionArray_IsNull(UnionMode mode)

for (var i = 0; i < length; ++i)
{
var isNull = i % 5 == 0;
var isNull = i % 3 == 0;
expectedNull[i] = isNull;
nullCount += isNull ? 1 : 0;

@@ -104,10 +143,6 @@ public void UnionArray_IsNull(UnionMode mode)
? new DenseUnionArray(type, length, children, typeIdsBuffer, valuesOffsetBuffer, nullCount)
: new SparseUnionArray(type, length, children, typeIdsBuffer, nullCount);

for (var i = 0; i < length; ++i)
{
Assert.Equal(expectedNull[i], array.IsNull(i));
Assert.Equal(!expectedNull[i], array.IsValid(i));
}
return (array, expectedNull);
}
}