Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c++] Fix bug with sliced Arrow-table writes, with string columns #3433

Merged
merged 1 commit into from
Dec 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 32 additions & 0 deletions apis/python/tests/test_dataframe.py
Original file line number Diff line number Diff line change
Expand Up @@ -1968,3 +1968,35 @@ def test_pass_configs(tmp_path, arrow_schema):
}
)
)


def test_arrow_table_sliced_writer(tmp_path):
"""Tests writes of sliced Arrow tables, with fixed-length and variable-length attributes"""
uri = tmp_path.as_posix()
num_rows = 50

schema = pa.schema(
[
("myint", pa.int32()),
("mystring", pa.large_string()),
]
)

pydict = {
"soma_joinid": list(range(num_rows)),
"myint": [(e + 1) * 10 for e in range(num_rows)],
"mystring": ["s_%08d" % e for e in range(num_rows)],
}
table = pa.Table.from_pydict(pydict)

domain = [[0, len(table) - 1]]

with soma.DataFrame.create(uri, schema=schema, domain=domain) as sdf:
mid = num_rows // 2
sdf.write(table[:mid])
sdf.write(table[mid:])

with soma.DataFrame.open(uri) as sdf:
pdf = sdf.read().concat().to_pandas()
assert list(pdf["myint"]) == pydict["myint"]
assert list(pdf["mystring"]) == pydict["mystring"]
54 changes: 32 additions & 22 deletions libtiledbsoma/src/soma/managed_query.cc
Original file line number Diff line number Diff line change
Expand Up @@ -966,35 +966,45 @@ bool ManagedQuery::_cast_column_aux<std::string>(
ArrowSchema* schema, ArrowArray* array, ArraySchemaEvolution se) {
(void)se; // se is unused in std::string specialization

const void* data = nullptr;
const void* offset = nullptr;
const void* validity = nullptr;

if (array->n_buffers == 3) {
data = array->buffers[2];
offset = array->buffers[1];
validity = array->buffers[0];
} else {
data = array->buffers[1];
offset = nullptr;
validity = array->buffers[0];
// A few things in play here:
// * Whether the column (array) has 3 buffers (validity, offset, data)
// or 2 (validity, data).
Comment on lines +970 to +971
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this comment confusing since we throw if the column (array) doesn't have 3 buffers.

// * The data is always char* and the validity is always uint8*
// but the offsets are 32-bit or 64-bit.
// * The array-level offset might not be zero. (This happens
// when people pass of things like arrow_table[:n] or arrow_table[n:]
// from Python/R.)

if (array->n_buffers != 3) {
throw TileDBSOMAError(std::format(
"[ManagedQuery] internal error: Arrow-table string column should "
"have 3 buffers; got {}",
array->n_buffers));
}

const char* data = (const char*)array->buffers[2];
uint8_t* validity = (uint8_t*)array->buffers[0];

// If this is a table-slice, slice into the validity buffer.
if (validity != nullptr) {
validity += array->offset;
}
// If this is a table-slice, do *not* slice into the data
// buffer since it is indexed via offsets, which we slice
// into below.

if ((strcmp(schema->format, "U") == 0) ||
(strcmp(schema->format, "Z") == 0)) {
// If this is a table-slice, slice into the offsets buffer.
uint64_t* offset = (uint64_t*)array->buffers[1] + array->offset;
setup_write_column(
schema->name,
array->length,
(const void*)data,
(uint64_t*)offset,
(uint8_t*)validity);
schema->name, array->length, (const void*)data, offset, validity);

} else {
// If this is a table-slice, slice into the offsets buffer.
uint32_t* offset = (uint32_t*)array->buffers[1] + array->offset;
setup_write_column(
schema->name,
array->length,
(const void*)data,
(uint32_t*)offset,
(uint8_t*)validity);
schema->name, array->length, (const void*)data, offset, validity);
}
return false;
}
Expand Down
8 changes: 7 additions & 1 deletion libtiledbsoma/src/soma/managed_query.h
Original file line number Diff line number Diff line change
Expand Up @@ -614,11 +614,17 @@ class ManagedQuery {
// additional processing steps

UserType* buf;
// The array->offset is non-zero when we are passed sliced
// Arrow tables like arrow_table[:m] or arrow_table[m:].
if (array->n_buffers == 3) {
buf = (UserType*)array->buffers[2] + array->offset;
} else {
buf = (UserType*)array->buffers[1] + array->offset;
}
uint8_t* validity = (uint8_t*)array->buffers[0];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This would be a good opportunity to move away from the C-style cast to a named cast (static_cast in this case).

if (validity != nullptr) {
validity += array->offset;
}

bool has_attr = schema_->has_attribute(schema->name);
if (has_attr && attr_has_enum(schema->name)) {
Expand Down Expand Up @@ -646,7 +652,7 @@ class ManagedQuery {
casted_values.size(),
(const void*)casted_values.data(),
(uint64_t*)nullptr,
(uint8_t*)array->buffers[0]);
validity);

// Return false because we do not extend the enumeration
return false;
Expand Down