ak.transform converts string to list of bytes if multiple arrays are passed #1672
Comments
Apparently, the Thanks! |
Something that I have been finding confusing is which of type, form, or layout is equivalent to the data shape string. I think this is quite related, since strings and byte strings are cases where this breaks down. |
I think it's probably this region that is producing a list layout without the array parameter: awkward/src/awkward/_v2/_broadcasting.py Lines 592 to 609 in 1bcfd70
|
@ivirshup strings in Awkward array are just special views over lists of characters. We have a reasonable amount of special-case logic to detect this, but ultimately there is:
The |
The Datashape string is the |
I find this confusing when both
I agree that strings are like variable-size lists of characters, but text strings can be pretty special. For instance:
In [28]: "怎么样了"[1:]
Out[28]: '么样了'
In [29]: ak.Array(["怎么样了"])[:, 1:]
Out[29]: <Array ['\udc80\udc8e么样了'] type='1 * string'> |
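For what it's worth, the output above is consistent with the slice being applied to the UTF-8 bytes rather than to the codepoints. Here is a minimal pure-Python sketch of that (no Awkward involved; the `surrogateescape` error handler is an assumption about how the dangling bytes end up rendered, not something confirmed from the library's code):

```python
text = "怎么样了"

# Codepoint-aware slicing, as Python's str does it.
codepoint_sliced = text[1:]

# Byte-wise slicing: drop the first UTF-8 byte. This cuts through the
# three-byte encoding of "怎" (E6 80 8E) and leaves two stray bytes.
raw = text.encode("utf-8")
byte_sliced = raw[1:].decode("utf-8", errors="surrogateescape")

print(repr(codepoint_sliced))  # '么样了'
print(repr(byte_sliced))       # '\udc80\udc8e么样了'
```

The two stray continuation bytes (0x80 and 0x8E) are what show up as `\udc80\udc8e` in the array output.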
Yes, strings are special, but they're special in how they're operated upon, not how they're stored or represented in memory. There's a lot of code that would be the same between strings and lists, so strings are implemented as a list with a
One such behavior that has already been overridden is
>>> import awkward._v2 as ak
>>> array = ak.Array(["one", "two", "three", "two", "two", "one"])
>>> array == "two"
<Array [False, True, False, True, True, False] type='6 * bool'>
as opposed to
>>> array2 = ak.without_parameters(ak.Array(["one", "two", "two"]))
>>> array2 == [[ord(x) for x in "two"]]
<Array [[False, False, False], ..., [True, ..., True]] type='3 * var * bool'>
and the idea is to keep adding behaviors that recognize the specialness of strings. (The
@martindurant is planning on adding a suite of string manipulation functions (I found this mention of it, #1269, though we've been talking about it in a lot of different places). Whether strings should have codepoint-aware slicing is a significant question. It's what users might assume (and then have to modify what they're doing when they find out that slicing is numbered by bytes, not codepoints). It would be possible to implement through some new behavior mechanism (overloading deep |
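To make the "same storage, different behavior" point concrete, here is a toy sketch in plain Python. It is not Awkward's implementation: `ListNode`, `from_strings`, `equals`, and the `"tag"` parameter key are all hypothetical names made up for illustration, and the plain-list branch assumes every list has the same length as the comparison target.

```python
from dataclasses import dataclass, field


@dataclass
class ListNode:
    """Hypothetical variable-length list layout: flat bytes plus offsets."""
    content: bytes
    offsets: list
    parameters: dict = field(default_factory=dict)

    def items(self):
        # Cut the flat buffer into one bytes object per list entry.
        return [self.content[a:b] for a, b in zip(self.offsets, self.offsets[1:])]


def from_strings(strings):
    # Same storage as a plain list of bytes, plus a tag marking it as strings.
    encoded = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for e in encoded:
        offsets.append(offsets[-1] + len(e))
    return ListNode(b"".join(encoded), offsets, {"tag": "string"})


def equals(node, other):
    if node.parameters.get("tag") == "string":
        # String behavior: one bool per string.
        target = other.encode("utf-8")
        return [item == target for item in node.items()]
    # Plain-list behavior: one bool per byte (assumes equal lengths).
    return [[b == t for b, t in zip(item, other)] for item in node.items()]


strings = from_strings(["one", "two", "two"])
# Identical buffers, but with the parameters dropped.
plain = ListNode(strings.content, strings.offsets)

print(equals(strings, "two"))
print(equals(plain, [ord(c) for c in "two"]))
```

The first call gives one boolean per string and the second gives one boolean per byte, mirroring the two REPL results above. The buffers are identical in both cases; only the tag differs, which is why losing the parameter anywhere along the way turns strings back into lists of characters.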
Please, no! You should have different methods or a whole namespace for string-specific operations. string_len, string_slice or something (ak.str?). Aside: the current hacking implementation for string functions in awkward-pandas does push all the ak.* (array) functions and utf8 and ascii functions into a namespace, but the description of each method and often its name makes it clear what the expected input is. This is just because series.ak.str.string_op seems too long to type. |
Version of Awkward Array
1.9.0
Description and code to reproduce
Calling ak.transform on a single array containing strings will maintain the string type, but if multiple arrays are passed the strings are converted to lists of characters.