[python] Unicode data generates error upon write to dataframe #415

Closed
bkmartinjr opened this issue Oct 15, 2022 · 8 comments · Fixed by #777

Comments

@bkmartinjr
Member

bkmartinjr commented Oct 15, 2022

Creating a dataframe with a string column works fine, but writing non-ASCII string data to it raises an error. There also appears to be no unit test for this case, which would be a nice addition.

See #420 for an example unit test (currently marked xfail).

Test case:

import numpy as np
import pyarrow as pa
import pandas as pd
import tiledbsoma as soma

soma_df = soma.SOMADataFrame("./test_dataframe")

df = pd.DataFrame(data={
  'soma_rowid': np.arange(2, dtype=np.int64),
  'soma_joinid': np.arange(2, dtype=np.int64),
  'unicode': ['\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}', 'a \N{GREEK CAPITAL LETTER DELTA} test'],
  'ascii': ['aa', 'bbb'],
})
tbl = pa.Table.from_pandas(df)
print(tbl.schema)

soma_df.create(schema=tbl.schema)
print(soma_df.schema)

soma_df.write(tbl)

Output:

soma_rowid: int64
soma_joinid: int64
unicode: string
ascii: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 718
soma_rowid: int64
soma_joinid: int64
unicode: large_string
ascii: large_string
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
File tiledb/libtiledb.pyx:4617, in tiledb.libtiledb._setitem_impl_sparse()

UnicodeEncodeError: 'ascii' codec can't encode character '\u0302' in position 1: ordinal not in range(128)

During handling of the above exception, another exception occurred:

TileDBError                               Traceback (most recent call last)
Cell In [18], line 20
     17 soma_df.create(schema=tbl.schema)
     18 print(soma_df.schema)
---> 20 soma_df.write(tbl)

File ~/projects/TileDB-SOMA/apis/python/src/tiledbsoma/soma_dataframe.py:263, in SOMADataFrame.write(self, values)
    260 if self._get_is_sparse():
    261     # sparse write
    262     with self._tiledb_open("w") as A:
--> 263         A[rowids] = attr_cols_map
    264 else:
    265     # TODO: This was a quick thing to bootstrap some early ingestion tests but needs more thought.
    266     # In particular, rowids needn't be either zero-up or contiguous.
    267     assert len(rowids) > 0

File tiledb/libtiledb.pyx:4691, in tiledb.libtiledb.SparseArrayImpl.__setitem__()

File tiledb/libtiledb.pyx:4619, in tiledb.libtiledb._setitem_impl_sparse()

TileDBError: Attr's dtype is "ascii" but attr_val contains invalid ASCII characters
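
The first exception in the traceback can be reproduced without TileDB. The following is a minimal sketch using only the Python standard library. It assumes, as the error message suggests, that the failure comes down to encoding the first test string with the `ascii` codec; it does not exercise the TileDB code path itself.

```python
# First test string: "E" followed by a combining circumflex accent (U+0302).
text = "\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}"

try:
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Position and code point of the character the ascii codec rejected.
print(error.start, hex(ord(error.object[error.start])))  # 1 0x302

# UTF-8 can encode the same string; the combining accent takes two bytes.
print(text.encode("utf-8"))  # b'E\xcc\x82'
```

The position (1) and character (`\u0302`) match the `UnicodeEncodeError` reported above.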
@johnkerl
Member

See #274 -- it is correct that we do not support writes of non-ASCII data to TileDB-SOMA at present. This is pending true Unicode support in TileDB Core.
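
Until non-ASCII writes are supported, one possible workaround is to check string columns before writing. The sketch below is only an illustration: `find_non_ascii` is a hypothetical helper, not part of tiledbsoma or TileDB, and it works on a plain dict of column name to list of values rather than on an Arrow table.

```python
def find_non_ascii(columns):
    """Return {column_name: [row indices]} for string values that are not pure ASCII.

    Hypothetical helper for illustration; `columns` maps column names to lists of values.
    """
    offenders = {}
    for name, values in columns.items():
        rows = [
            i for i, value in enumerate(values)
            if isinstance(value, str) and not value.isascii()
        ]
        if rows:
            offenders[name] = rows
    return offenders


data = {
    "soma_joinid": [0, 1],
    "unicode": [
        "E\N{COMBINING CIRCUMFLEX ACCENT}",
        "a \N{GREEK CAPITAL LETTER DELTA} test",
    ],
    "ascii": ["aa", "bbb"],
}
offenders = find_non_ascii(data)
print(offenders)  # {'unicode': [0, 1]}
```

For a pandas DataFrame, `df.to_dict(orient="list")` produces a dict in the shape this helper expects.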

@bkmartinjr
Member Author

FYI: writes succeeded until last week, when they started generating errors. I'm not sure how, or what was actually stored.

@johnkerl
Member

johnkerl commented Oct 17, 2022

#359 was merged last week -- this along with #355 was one of the 'stack of three' setting up #360 (which has become #400). This was part of the conversation we had about needing to regenerate test-census data.

@johnkerl johnkerl removed their assignment Oct 31, 2022
@johnkerl
Member

@maniarathi @bkmartinjr I believe this is a duplicate of #274

@bkmartinjr
Member Author

> @maniarathi @bkmartinjr I believe this is a duplicate of #274

Potentially - the description in #274 is a bit unclear to me. But as long as "support Unicode in attributes, including value_filter'ing" is the intent, OK w/ me to close as dup.

@johnkerl
Member

@bkmartinjr I assert that these are the same. I closed #274 in favor of this one.

@johnkerl
Member

@maniarathi I see you assigned this to @bkmartinjr -- I'm happy to help here as well, as I was deep in this last year. Also note that much of this is core work within TileDB; on this side we will mainly be merging and validating a core release.

@johnkerl johnkerl changed the title from "Unicode data generates error upon write to dataframe" to "[python] Unicode data generates error upon write to dataframe" on Jan 19, 2023
@johnkerl
Member

🚀

3 participants