ak.transform converts string to list of bytes if multiple arrays are passed #1672

ivirshup · 2022-09-05T20:11:47Z

Version of Awkward Array

1.9.0

Description and code to reproduce

Calling ak.transform on a single array containing strings will maintain the string type, but if multiple arrays are passed the strings are converted to lists of characters.

In [1]: import awkward._v2 as ak

In [2]: a = ak.Array(["foo", "foo", "bar"])
   ...: a
Out[2]: <Array ['foo', 'foo', 'bar'] type='3 * string'>

In [3]: ak.transform(lambda *args, **kwargs: None, a)
Out[3]: <Array ['foo', 'foo', 'bar'] type='3 * string'>

In [4]: ak.transform(lambda *args, **kwargs: None, a, a)
Out[4]: 
(<Array ['foo', 'foo', 'bar'] type='3 * var * char'>,
 <Array ['foo', 'foo', 'bar'] type='3 * var * char'>)


jpivarski · 2022-09-05T20:29:25Z

Apparently, the __array__ = "string" parameter is being lost from the ListType (but not the __array__ = "char" from the NumpyType).

Thanks!

ivirshup · 2022-09-05T20:52:26Z

Something that I have been finding confusing is which of type, form, or layout is equivalent to the data shape string. I think this is quite related, since strings and byte strings are cases where this breaks down.

agoose77 · 2022-09-05T21:04:09Z

I think it's probably this region that is producing a list layout without the array parameter:

awkward/src/awkward/_v2/_broadcasting.py

Lines 592 to 609 in 1bcfd70

    
    if isinstance(offsets, Index):
        return tuple(
            ListOffsetArray(offsets, x).toListOffsetArray64(False)
            for x in outcontent
        )
    elif isinstance(starts, Index) and isinstance(stops, Index):
        return tuple(
            ListArray(starts, stops, x).toListOffsetArray64(False)
            for x in outcontent
        )
    else:
        raise ak._v2._util.error(
            AssertionError(
                "unexpected offsets, starts: {}, {}".format(
                    type(offsets), type(starts)
                )
            )
        )
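
To make the suspected failure mode concrete, here is a toy model in plain Python. It is not Awkward's actual code: `ListNode`, `rewrap_dropping_parameters`, and `rewrap_carrying_parameters` are hypothetical names invented for illustration. It only shows how rebuilding a list node around existing content can lose the "string" marker when the parameters are not passed along.

```python
from dataclasses import dataclass, field


@dataclass
class ListNode:
    """Hypothetical stand-in for a list layout node (not an Awkward class)."""

    offsets: list
    content: object
    parameters: dict = field(default_factory=dict)


def rewrap_dropping_parameters(node, new_content):
    # Same shape as the quoted snippet: only offsets and content are passed,
    # so the new node starts with empty parameters.
    return ListNode(node.offsets, new_content)


def rewrap_carrying_parameters(node, new_content):
    # Copy the parameters so the rebuilt node keeps its "string" marker.
    return ListNode(node.offsets, new_content, dict(node.parameters))


strings = ListNode([0, 3, 6, 9], b"foofoobar", {"__array__": "string"})

lost = rewrap_dropping_parameters(strings, strings.content)
kept = rewrap_carrying_parameters(strings, strings.content)

print(lost.parameters)  # {}
print(kept.parameters)  # {'__array__': 'string'}
```

In this toy model the inner content is reused untouched, which would be consistent with the "char" parameter surviving while the outer "string" parameter disappears.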

agoose77 · 2022-09-05T21:07:01Z

@ivirshup strings in Awkward array are just special views over lists of characters. We have a reasonable amount of special-case logic to detect this, but ultimately there is:

list {__array__: string}
    numpyarray {__array__: char}
        [116 104 105 115 116 104  97 116]

The "string" __array__ parameter means that we have an array of strings, e.g. ['this', 'that']. The "char" __array__ parameter provides a "character" view over an array of uint8 (utf8) characters.
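
A rough sketch of that idea in plain Python, without using Awkward itself: the byte values are the ones shown above, while the `offsets` list and the helper names `as_lists` and `as_strings` are assumptions made for illustration. The same buffer can be read either as lists of integers or as strings, depending on how it is interpreted.

```python
# The uint8 buffer from the layout above.
content = bytes([116, 104, 105, 115, 116, 104, 97, 116])
# Assumed offsets: two strings of four bytes each.
offsets = [0, 4, 8]


def as_lists(offsets, content):
    # Without the "string" parameter: plain variable-length lists of integers.
    return [list(content[a:b]) for a, b in zip(offsets, offsets[1:])]


def as_strings(offsets, content):
    # With the "string" parameter: the same byte ranges, decoded as UTF-8.
    return [content[a:b].decode("utf-8") for a, b in zip(offsets, offsets[1:])]


print(as_lists(offsets, content))    # [[116, 104, 105, 115], [116, 104, 97, 116]]
print(as_strings(offsets, content))  # ['this', 'that']
```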

jpivarski · 2022-09-05T21:49:53Z

Something that I have been finding confusing is which of type, form, or layout is equivalent to the data shape string. I think this is quite related, since strings and byte strings are cases where this breaks down.

The Datashape string is the Type.

ivirshup · 2022-09-06T10:06:33Z

The Datashape string is the Type.

I find this confusing when both string and var are ListType. The number of entries in the typestr field is based on how far I can recurse down .type, except for some cases.

strings in Awkward array are just special views over lists of characters

I agree that strings are like variable size lists of characters, but text strings can be pretty special. For instance:

In [28]: "怎么样了"[1:]
Out[28]: '么样了'

In [29]: ak.Array(["怎么样了"])[:, 1:]
Out[29]: <Array ['\udc80\udc8e么样了'] type='1 * string'>
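
The result above can be reproduced in plain Python by slicing the UTF-8 bytes instead of the codepoints. This is only a sketch: the `surrogateescape` error handler is used because it yields the same escaped output as shown, and whether Awkward decodes this way internally is an assumption.

```python
text = "怎么样了"
encoded = text.encode("utf-8")

# Each of these four characters occupies three bytes in UTF-8.
assert (len(text), len(encoded)) == (4, 12)

# Codepoint slicing (Python str) drops the whole first character.
by_codepoint = text[1:]

# Byte slicing drops only the first of the three bytes of the first character,
# leaving two stray continuation bytes (0x80, 0x8e) at the front.
by_byte = encoded[1:].decode("utf-8", errors="surrogateescape")

print(ascii(by_codepoint))  # '\u4e48\u6837\u4e86'
print(ascii(by_byte))       # '\udc80\udc8e\u4e48\u6837\u4e86'
```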

jpivarski · 2022-09-06T17:13:25Z

Yes, strings are special, but they're special in how they're operated upon, not how they're stored or represented in memory. There's a lot of code that would be the same between strings and lists, so strings are implemented as a list with a parameter that we can use to override behaviors. (At a very early stage in Awkward's development, they were distinct implementations, but there was a lot of duplication.)

One such behavior that has already been overridden is

>>> import awkward._v2 as ak
>>> array = ak.Array(["one", "two", "three", "two", "two", "one"])
>>> array == "two"
<Array [False, True, False, True, True, False] type='6 * bool'>

as opposed to

>>> array2 = ak.without_parameters(ak.Array(["one", "two", "two"]))
>>> array2 == [[ord(x) for x in "two"]]
<Array [[False, False, False], ..., [True, ..., True]] type='3 * var * bool'>

and the idea is to keep adding behaviors that recognize the specialness of strings. (The ak.behavior mechanism was invented first to implement strings, and then domain-specific objects like https://github.com/scikit-hep/vector.)

@martindurant is planning on adding a suite of string manipulation functions (I found this mention of it, #1269, though we've been talking about it in a lot of different places).

Whether strings should have codepoint-aware slicing is a significant question. It's what users might assume (and then they have to modify what they're doing when they find out that slicing is numbered by bytes, not codepoints). It would be possible to implement through some new behavior mechanism (overloading deep __getitem__), though it would be a major project, in part because finding the byte position of a specific codepoint is not random access: something has to scan, and that should be in compiled code. This isn't hampered by the fact that strings are implemented through lists; it's just a hard problem overall. And since it would break backward-compatibility of slicing, it would have to happen at a major version boundary, like Awkward v3.
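
To illustrate why this needs a scan, here is a minimal pure-Python sketch. The function names `byte_offset_of_codepoint` and `codepoint_slice_from` are hypothetical, the input is assumed to be valid UTF-8, and a real implementation would presumably live in compiled code. Finding where codepoint n starts means walking the bytes and counting the ones that are not continuation bytes.

```python
def byte_offset_of_codepoint(data: bytes, n: int) -> int:
    """Return the byte index at which codepoint number n starts.

    Assumes data is valid UTF-8. Passing n equal to the number of
    codepoints returns len(data).
    """
    seen = 0
    for i, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a codepoint.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return i
            seen += 1
    if seen == n:
        return len(data)
    raise IndexError("codepoint index out of range")


def codepoint_slice_from(data: bytes, start: int) -> bytes:
    # Translate the codepoint index to a byte offset, then slice by bytes.
    return data[byte_offset_of_codepoint(data, start):]


encoded = "怎么样了".encode("utf-8")
print(byte_offset_of_codepoint(encoded, 1))  # 3
print(codepoint_slice_from(encoded, 1) == "么样了".encode("utf-8"))  # True
```

The cost is linear in the length of each string, which is the reason random access by codepoint is not available for free on a UTF-8 buffer.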

martindurant · 2022-09-06T18:13:15Z

whether strings should have codepoint-aware slicing

Please, no! You should have different methods or a whole namespace for string-specific operations. string_len, string_slice or something (ak.str?).

Aside: the current hacky implementation of string functions in awkward-pandas does push all the ak.* (array) functions and the utf8 and ascii functions into one namespace, but the description of each method, and often its name, makes it clear what the expected input is. This is just because series.ak.str.string_op seems too long to type.

ivirshup added the bug (unverified) label Sep 5, 2022

ivirshup mentioned this issue Sep 5, 2022

transform, but not caring about broadcasting regular dimensions #1668

Closed

agoose77 added the bug label and removed the bug (unverified) label Sep 5, 2022

agoose77 self-assigned this Sep 7, 2022

agoose77 mentioned this issue Sep 7, 2022

fix: carry parameters through broadcasting #1679

Merged

agoose77 closed this as completed in #1679 Sep 7, 2022
