[python] Use true ASCII attributes in dataframes #359

Merged
merged 9 commits into from
Oct 11, 2022


@johnkerl johnkerl commented Oct 3, 2022

What

This is a forward-port of #273 from the main-old branch to the main branch.

Why

Lack of backward compatibility

Prerequisites

PR context

This is the second in a group of three related PRs:

These three changesets could be done in a single PR, but that would be unmerciful to the reviewers.

Problem

An issue I'm seeing on this PR that confuses me:

Script:

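#!/usr/bin/env python

import os
import shutil

import pyarrow as pa
import tiledb
import tiledbsoma as t

# Start from a clean slate: remove any array left over from a previous run.
uri = "foo"
if os.path.exists(uri):
    shutil.rmtree(uri)

# Arrow schema for the dataframe, including the soma_rowid column.
schema = pa.schema(
    [
        ("soma_rowid", pa.uint64()),
        ("A", pa.int64()),
        ("B", pa.float64()),
        ("C", pa.string()),
    ]
)

# Create the dataframe, passing the schema with soma_rowid removed.
sdf = t.SOMADataFrame(uri=uri)
sdf.create(schema=schema.remove(schema.get_field_index("soma_rowid")))

# Four rows of sample data; column C holds the string values.
data = {
    "soma_rowid": [0, 1, 2, 3],
    "A": [10, 11, 12, 13],
    "B": [100.1, 200.2, 300.3, 400.4],
    "C": ["this", "is", "a", "test"],
}
n_data = len(data["soma_rowid"])

# Build an Arrow table from the dict and write it to the dataframe.
rb = pa.Table.from_pydict(data)
sdf.write(rb)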

TileDB schema:

ArraySchema(
  domain=Domain(*[
    Dim(name='soma_rowid', domain=(0, 18446744073709551614), tile=2048, dtype='uint64', filters=FilterList([ZstdFilter(level=3), ])),
  ]),
  attrs=[
    Attr(name='A', dtype='int64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='B', dtype='float64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='C', dtype='ascii', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])), <----- AS DESIRED
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)

Reading back from C++ or from Python:

>>> import tiledbsoma.libtiledbsoma as clib
>>> clib.SOMAReader('foo')
<tiledbsoma.libtiledbsoma.SOMAReader object at 0x7fe69b851070>
>>> sr=clib.SOMAReader('foo')
>>> sr.
sr.read_next(        sr.results_complete( sr.set_dim_points(   sr.set_dim_ranges(   sr.submit(
>>> sr.submit()
>>> o=sr.read_next()
>>> o
pyarrow.Table
soma_rowid: uint64
A: int64
B: double
C: large_string <------- this is why I am seeing unit-test failures -- same reason
----
soma_rowid: [[0,1,2,3]]
A: [[10,11,12,13]]
B: [[100.1,200.2,300.3,400.4]]
C: [["this","is","a","test"]]

In summary:

  • TileDB arrays are created with "ascii" attributes
  • When I read them back as Arrow using C++ or Python, I now get large_string, not string, in the Arrow readback schema

cc @gspowley and @nguyenv

@nguyenv I'll get you a simpler reprex tomorrow

Update 2022-10-04: simpler repro https://gist.github.com/johnkerl/278bb246c3aa05cc4dfb1694503744c6

@johnkerl johnkerl force-pushed the kerl/ascii branch 2 times, most recently from 46b4ae3 to a8bb90b Compare October 11, 2022 00:18
@johnkerl johnkerl changed the title Use true ASCII attributes in dataframes [WIP] Use true ASCII attributes in dataframes Oct 11, 2022
@johnkerl johnkerl requested a review from bkmartinjr October 11, 2022 00:18
@johnkerl johnkerl marked this pull request as ready for review October 11, 2022 00:18
@johnkerl johnkerl requested a review from gspowley October 11, 2022 00:27
@johnkerl
Member Author

FYI, now that the Python unit-test cases are passing, there are C++ unit-test failures, which I am currently debugging.

@johnkerl
Member Author

> FYI, now that the Python unit-test cases are passing, there are C++ unit-test failures, which I am currently debugging.

Fixed on the latest commit:

9bdd02f

@johnkerl
Member Author

johnkerl commented Oct 11, 2022

Note: tests will fail with

E   TypeError: dtype is not compatible with var-length attribute

until TileDB-Inc/TileDB-Py#1337 wends its way into a TileDB-Py release we can update to -- cc @nguyenv and @ihnorton

@johnkerl johnkerl merged commit b589716 into main Oct 11, 2022
@johnkerl johnkerl deleted the kerl/ascii branch October 11, 2022 18:56
@johnkerl johnkerl changed the title Use true ASCII attributes in dataframes [python] Use true ASCII attributes in dataframes Oct 26, 2022
3 participants