ak.transform converts string to list of bytes if multiple arrays are passed #1672

Closed
ivirshup opened this issue Sep 5, 2022 · 8 comments · Fixed by #1679
@ivirshup

ivirshup commented Sep 5, 2022

Version of Awkward Array

1.9.0

Description and code to reproduce

Calling ak.transform on a single array of strings preserves the string type, but when multiple arrays are passed, the strings come back as lists of characters (type var * char instead of string).

In [1]: import awkward._v2 as ak

In [2]: a = ak.Array(["foo", "foo", "bar"])
   ...: a
Out[2]: <Array ['foo', 'foo', 'bar'] type='3 * string'>

In [3]: ak.transform(lambda *args, **kwargs: None, a)
Out[3]: <Array ['foo', 'foo', 'bar'] type='3 * string'>

In [4]: ak.transform(lambda *args, **kwargs: None, a, a)
Out[4]: 
(<Array ['foo', 'foo', 'bar'] type='3 * var * char'>,
 <Array ['foo', 'foo', 'bar'] type='3 * var * char'>)

ivirshup added the "bug (unverified)" label (the problem described would be a bug, but needs to be triaged) on Sep 5, 2022

@jpivarski
Member

Apparently, the __array__ = "string" parameter is being lost from the ListType (but not the __array__ = "char" from the NumpyType).

Thanks!

@ivirshup
Author

ivirshup commented Sep 5, 2022

Something that I have been finding confusing is which of type, form, or layout is equivalent to the data shape string. I think this is quite related, since strings and byte strings are cases where this breaks down.

@agoose77
Collaborator

agoose77 commented Sep 5, 2022

I think it's probably this region that is producing a list layout without the array parameter:

if isinstance(offsets, Index):
    return tuple(
        ListOffsetArray(offsets, x).toListOffsetArray64(False)
        for x in outcontent
    )
elif isinstance(starts, Index) and isinstance(stops, Index):
    return tuple(
        ListArray(starts, stops, x).toListOffsetArray64(False)
        for x in outcontent
    )
else:
    raise ak._v2._util.error(
        AssertionError(
            "unexpected offsets, starts: {}, {}".format(
                type(offsets), type(starts)
            )
        )
    )
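
To make the suspected failure mode concrete, here is a minimal toy model in plain Python. ToyLeaf, ToyList, type_string, and both rebuild functions are hypothetical names invented for this sketch; they are not Awkward's classes, and forwarding the parameters is only one plausible remedy, not necessarily what the eventual fix does.

```python
from dataclasses import dataclass, field


# Toy stand-ins for layout nodes. These are NOT Awkward's classes.
@dataclass
class ToyLeaf:
    data: bytes
    parameters: dict = field(default_factory=dict)


@dataclass
class ToyList:
    offsets: list
    content: ToyLeaf
    parameters: dict = field(default_factory=dict)


def type_string(node: ToyList) -> str:
    """Render a Datashape-like string for the toy list node."""
    length = len(node.offsets) - 1
    if node.parameters.get("__array__") == "string":
        return f"{length} * string"
    is_char = node.content.parameters.get("__array__") == "char"
    return f"{length} * var * {'char' if is_char else 'uint8'}"


def rebuild_dropping_parameters(node: ToyList) -> ToyList:
    # Models the suspected problem: the new list is built from
    # offsets and content only, so the list-level parameter is lost.
    return ToyList(node.offsets, node.content)


def rebuild_keeping_parameters(node: ToyList) -> ToyList:
    # One possible remedy (an assumption): forward the list's parameters.
    return ToyList(node.offsets, node.content, parameters=dict(node.parameters))


original = ToyList(
    offsets=[0, 3, 6, 9],
    content=ToyLeaf(b"foofoobar", {"__array__": "char"}),
    parameters={"__array__": "string"},
)

print(type_string(original))                               # 3 * string
print(type_string(rebuild_dropping_parameters(original)))  # 3 * var * char
print(type_string(rebuild_keeping_parameters(original)))   # 3 * string
```

In this toy model the "char" parameter survives because the leaf node is reused unchanged, while the "string" parameter lives on the list node that gets reconstructed, which matches the symptom reported above.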

@agoose77
Collaborator

agoose77 commented Sep 5, 2022

@ivirshup strings in Awkward Array are just special views over lists of characters. We have a reasonable amount of special-case logic to detect this, but ultimately the layout is:

list {__array__: string}
    numpyarray {__array__: char}
        [116 104 105 115 116 104  97 116]

The "string" __array__ parameter means that we have an array of strings, e.g. ['this', 'that']. The "char" __array__ parameter provides a "character" view over an array of uint8 values that hold the UTF-8 encoded bytes.
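
As a rough illustration of that layout, the sketch below uses only the Python standard library. The byte values come from the dump above; the offsets [0, 4, 8] and the helper names as_lists and as_strings are assumptions made for this example, not part of Awkward's API.

```python
# Byte values copied from the layout dump above.
data = bytes([116, 104, 105, 115, 116, 104, 97, 116])

# Assumed offsets: entry i spans data[offsets[i]:offsets[i + 1]].
offsets = [0, 4, 8]


def as_lists(data: bytes, offsets: list) -> list:
    """View the buffer as variable-length lists of uint8 values."""
    return [list(data[a:b]) for a, b in zip(offsets, offsets[1:])]


def as_strings(data: bytes, offsets: list) -> list:
    """View the same buffer as UTF-8 strings (the 'string' interpretation)."""
    return [data[a:b].decode("utf-8") for a, b in zip(offsets, offsets[1:])]


print(as_lists(data, offsets))    # [[116, 104, 105, 115], [116, 104, 97, 116]]
print(as_strings(data, offsets))  # ['this', 'that']
```

Both views read exactly the same bytes and offsets; only the interpretation differs, which is the role the __array__ parameters play.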

agoose77 added the "bug" label (the problem described is something that must be fixed) and removed the "bug (unverified)" label on Sep 5, 2022

@jpivarski
Member

> Something that I have been finding confusing is which of type, form, or layout is equivalent to the data shape string. I think this is quite related, since strings and byte strings are cases where this breaks down.

The Datashape string is the Type.

@ivirshup
Author

ivirshup commented Sep 6, 2022

> The Datashape string is the Type.

I find this confusing when both string and var are ListType. The number of entries in the typestr field is based on how far I can recurse down .type, except for some cases.

> strings in Awkward array are just special views over lists of characters

I agree that strings are like variable size lists of characters, but text strings can be pretty special. For instance:

In [28]: "怎么样了"[1:]
Out[28]: '么样了'

In [29]: ak.Array(["怎么样了"])[:, 1:]
Out[29]: <Array ['\udc80\udc8e么样了'] type='1 * string'>

@jpivarski
Member

Yes, strings are special, but they're special in how they're operated upon, not how they're stored or represented in memory. There's a lot of code that would be the same between strings and lists, so strings are implemented as a list with a parameter that we can use to override behaviors. (At a very early stage in Awkward's development, they were distinct implementations, but there was a lot of duplication.)

One such behavior that has already been overridden is

>>> import awkward._v2 as ak
>>> array = ak.Array(["one", "two", "three", "two", "two", "one"])
>>> array == "two"
<Array [False, True, False, True, True, False] type='6 * bool'>

as opposed to

>>> array2 = ak.without_parameters(ak.Array(["one", "two", "two"]))
>>> array2 == [[ord(x) for x in "two"]]
<Array [[False, False, False], ..., [True, ..., True]] type='3 * var * bool'>

and the idea is to keep adding behaviors that recognize the specialness of strings. (The ak.behavior mechanism was invented first to implement strings, and then domain-specific objects like https://github.com/scikit-hep/vector.)
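
The difference between the two comparisons can be mimicked in plain Python. This is only a sketch of the semantics, not of how ak.behavior dispatches; the variable names are made up for the example.

```python
strings = ["one", "two", "two"]
target = "two"

# With string semantics, each entry is compared as a whole value.
as_strings = [s == target for s in strings]

# Without them, each entry is a list of byte values compared position by
# position (zip is enough here because every entry has the same length).
byte_lists = [list(s.encode("utf-8")) for s in strings]
target_bytes = [ord(x) for x in target]
as_lists = [[a == b for a, b in zip(row, target_bytes)] for row in byte_lists]

print(as_strings)  # [False, True, True]
print(as_lists)    # [[False, False, False], [True, True, True], [True, True, True]]
```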

@martindurant is planning on adding a suite of string manipulation functions (I found this mention of it, #1269, though we've been talking about it in a lot of different places).

Whether strings should have codepoint-aware slicing is a significant question. It's what users might assume (and then they have to modify what they're doing when they find out that slicing is numbered by bytes, not codepoints). It would be possible to implement through some new behavior mechanism (overloading deep __getitem__), though it would be a major project, in part because finding the byte position of a specific codepoint is not random access: something has to scan, and that should be in compiled code. This isn't hampered by the fact that strings are implemented through lists; it's just a hard problem overall. And since it would break backward compatibility of slicing, it would have to happen at a major version boundary, like Awkward v3.
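
For a sense of what "something has to scan" means, here is a minimal pure-Python sketch. It first reproduces the byte-numbered slice from the example earlier in this thread (assuming invalid bytes are surfaced with the surrogateescape error handler, which matches the output shown there), then locates a codepoint boundary by a linear scan. byte_offset_of_codepoint is a hypothetical helper written for this example, not an Awkward function.

```python
text = "怎么样了"
encoded = text.encode("utf-8")  # each of these characters takes 3 bytes

# Byte-numbered slice: cuts through the first character's encoding.
byte_sliced = encoded[1:].decode("utf-8", errors="surrogateescape")


def byte_offset_of_codepoint(data: bytes, index: int) -> int:
    """Scan for the byte position where codepoint number `index` starts."""
    seen = 0
    for position, byte in enumerate(data):
        # UTF-8 continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new codepoint.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return position
            seen += 1
    return len(data)  # index is past the end of the string


start = byte_offset_of_codepoint(encoded, 1)
codepoint_sliced = encoded[start:].decode("utf-8")

print(start)                                        # 3
print(byte_sliced == "\udc80\udc8e" + text[1:])     # True
print(codepoint_sliced == text[1:])                 # True
```

The scan is linear in the number of bytes before the requested codepoint, which is why doing this for every string in a large array would belong in compiled code.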

@martindurant
Contributor

> whether strings should have codepoint-aware slicing

Please, no! You should have different methods or a whole namespace for string-specific operations: string_len, string_slice, or something similar (ak.str?).

Aside: the current hacky implementation of string functions in awkward-pandas does push all the ak.* (array) functions and the utf8 and ascii functions into one namespace, but the description of each method, and often its name, makes it clear what the expected input is. This is just because series.ak.str.string_op seems too long to type.
