
improper types when using new DataFrames #30

Closed
ExpandingMan opened this issue Oct 10, 2016 · 17 comments

@ExpandingMan
Collaborator

When using DataFrames with NullableArrays (currently on their master branch), a DataFrame that is saved and reloaded with Feather comes back with the wrong eltypes.

For example, a DataFrame with these eltypes:

```julia
julia> eltypes(df)
5-element Array{Type,1}:
 Nullable{Float64}
 Nullable{Int64}
 Nullable{String}
 Nullable{DateTime}
 Nullable{Date}
```

becomes a DataFrame with these eltypes:

```julia
julia> eltypes(df_test)
5-element Array{Type,1}:
 Float64
 Int64
 Nullable{WeakRefString{UInt8}}
 DateTime
 Date
```

So, as far as I can tell, the only problems are that no column comes back as a NullableArray except the strings, and those have the ("wrong") element type Nullable{WeakRefString{UInt8}}.

@quinnj
Member

quinnj commented Oct 10, 2016

This is due to the fact that your first 2 columns don't have any null values, so when they get deserialized, they come back as regular Vector{T} instead of NullableVector{T}. The Nullable{WeakRefString{UInt8}} is likewise an optimization: WeakRefStrings are used for performance, because materializing Strings can currently kill the performance of Feather. Nothing unexpected here, but we could probably use better docs around this.
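For anyone who wants plain Strings back, a minimal post-processing sketch against the 2016-era DataFrames/NullableArrays/WeakRefStrings API might look like this. The function name `materialize_strings!` is illustrative and not part of Feather.jl:

```julia
using Feather, DataFrames, NullableArrays, WeakRefStrings

# Hypothetical helper (not part of Feather.jl): copy every
# WeakRefString column into an owned-String NullableArray.
function materialize_strings!(df::DataFrame)
    for name in names(df)
        col = df[name]
        if eltype(col) <: Nullable{WeakRefString{UInt8}}
            df[name] = NullableArray([isnull(s) ? Nullable{String}() :
                                      Nullable(string(get(s))) for s in col])
        end
    end
    return df
end

df = materialize_strings!(Feather.read("data.feather"))
```

This pays the String construction cost once, up front, instead of on every later access.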

@ExpandingMan
Collaborator Author

I see. It seems like a problem that deserialized DataFrames behave significantly differently from the serialized ones they came from. The first issue (whether the new DataFrame uses NullableArrays or not) seems like the responsibility of the DataFrames package. The latter seems like something that should be addressed here, perhaps with a flag to Feather.read; I'm not sure what the best answer is.

@nalimilan
Member

I don't really like the idea that the type changes depending on whether or not there are null values in the data (RCall has the same issue). Since Feather only supports nullable arrays, maybe the default should be to always return a NullableArray, with an option to get a standard array when there are no nulls?

@ExpandingMan
Collaborator Author

> I don't really like the idea that the type changes depending on whether or not there are null values in the data

This is definitely a big issue. Currently, however, it is sort of baked into DataFrames, or else this wouldn't even be possible. The way things stand right now, I'd opt for everything always being a NullableArray.

I think what needs to happen is that there should be two different types beneath AbstractDataFrame: one backed by NullableArrays and one that is not allowed to hold null values at all. Then there'd be a Type argument to Feather.read specifying which one you want.

Also, I'm wondering under just what circumstances WeakRefString is actually usable. If you always have to convert to String to actually do anything, that conversion has to happen regardless of how expensive it is.

@quinnj
Member

quinnj commented Oct 19, 2016

Ok, so the new behavior (as of 2135bf6) is that all columns are returned as NullableVector{T}. There's now a keyword argument, nullable::Bool=true, which, if set to false, returns any column with null_count == 0 as a regular Vector{T}. I like the consistency and predictability of returning NullableVectors all the time, while still providing a way for purists to get their vectors out if they want.
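Based on that description, usage of the new keyword would look like this (file name is illustrative):

```julia
using Feather

# Default (as of 2135bf6): every column comes back as a NullableVector{T}
df_nullable = Feather.read("data.feather")

# With nullable=false, any column whose null_count == 0 is instead
# returned as a plain Vector{T}
df_plain = Feather.read("data.feather"; nullable=false)
```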

@quinnj quinnj closed this as completed Oct 19, 2016
@ExpandingMan
Collaborator Author

That seems ideal.

I think we need something analogous for converting the WeakRefStrings, as they wreak havoc on anything that expects String; I haven't found it possible to actually use them.

@quinnj
Member

quinnj commented Oct 19, 2016

Can you give me cases where you ran into problems with WeakRefStrings? I'd definitely be interested to hear about cases where they're causing problems.

@ExpandingMan
Collaborator Author

To be honest, I don't have a record of the exact cases where I ran into issues; if I encounter more I'll keep track of them and let you know. Frankly, I've been converting them for a while now, so I haven't seen any more.

As I recall, none of the problems seemed like bugs to me, so I'm not sure there'd be anything to actually fix; it was more of an interface issue. If I think of anything really specific I'll let you know.

@jpfairbanks

I just ran into this problem. If you join two DataFrames on a column that holds WeakRefStrings in one and Strings in the other, the join fails with an ArgumentError because the two key columns have different types.

I guess your solution to this is to use WeakRefStrings in the second DataFrame.
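The other direction also works: convert the Feather-produced key column to owned Strings before joining, so both sides share an eltype. A sketch, assuming non-nullable columns and the 2016-era DataFrames indexing API (`:key` is an illustrative column name):

```julia
using DataFrames, WeakRefStrings

# Copy each WeakRefString in the key column into an owned String,
# so its eltype matches the String column on the other side.
df_feather[:key] = String[string(s) for s in df_feather[:key]]

joined = join(df_feather, df_other, on = :key)
```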

@ExpandingMan
Collaborator Author

This issue is actually a really significant performance barrier. Deserialization into normal strings is much faster in the Python implementation as things stand right now, and that seems like a problem; why should that be the case, does it have something to do with the way Julia strings work? Having a separate string type for this is definitely not ideal: there is always the potential for it to behave differently from String somehow (the join failure is a good example), and I'm guessing that any operation which forces conversion to String would be (perhaps unexpectedly) slow. It makes more sense to me to absorb all the slowness into the (somewhat expectedly) slow deserialization process.

@jpfairbanks

Is the problem with deserialization at the time of reading the file, or at any time? Is it possible to deserialize the file into WeakRefStrings and then later bulk convert them to regular strings faster than the time it takes to read from the file into regular strings? Or is this a problem with the performance of regular strings in general?

@dmbates
Contributor

dmbates commented Nov 17, 2016

I think the use of WeakRefStrings may be because of terminating NULL characters. The Julia String type appends a NULL; the Feather specification doesn't. A Vector{String} is stored as a single long Vector{UInt8} plus a Vector{Int32} of offsets, as described in https://github.com/wesm/feather/blob/master/doc/FORMAT.md

<null bitmask, optional> <int32_t* value offsets> <uint8_t* value data>

There are no NULLs between adjacent strings in Feather.
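To make the layout concrete, here is a small sketch of decoding such a column: one contiguous byte buffer plus an offsets vector, where offsets[i] and offsets[i+1] delimit string i (the on-disk offsets are 0-based; Julia indexing is 1-based, hence the +1). The literal data here is made up for illustration:

```julia
# Two strings, "feather" and "column", packed back-to-back with no
# NUL bytes between them, exactly as the Feather format stores them.
data    = Vector{UInt8}("feathercolumn")
offsets = Int32[0, 7, 13]   # start of each string, then total length

nstrings = length(offsets) - 1
strings = [String(data[offsets[i]+1 : offsets[i+1]]) for i in 1:nstrings]
# strings == ["feather", "column"]
```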

@ExpandingMan
Collaborator Author

It seems likely to me that, since CPython is written in C, Python strings are null-terminated internally. If that's the case, then the NULL-termination difference @dmbates described shouldn't make the Julia deserialization of strings inherently slower.

@quinnj
Member

quinnj commented Nov 17, 2016

No, that's incorrect.

The use of WeakRefStrings is purely a performance optimization. Currently, a Julia String is defined to hold a Vector{UInt8} internally, and a Vector{UInt8} comes with non-trivial overhead (something like ~80 bytes, last I heard), not to mention the construction/destruction costs. It really boils down to the fact that an immutable type with a mutable field ends up not being as efficient as it could be right now in Julia.

WeakRefStrings solve these issues by being a truly immutable type that can "point" at existing data. Their overhead is exactly 24 bytes, and they can be seen as a "StringView" of sorts, similar to a SubArray.

And we really need to squash the myth that's been going around that the Julia String type has a terminating NULL character; this IS NOT THE CASE. In certain ccalls, a conversion to a Cstring is inserted, which appends a NULL character for compatibility with C APIs, but in general the Julia String type has no need for NULL terminators.
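The "pointer plus length" idea behind WeakRefString can be sketched in a few lines of modern Julia syntax. This is a simplified illustration, not the actual WeakRefStrings.jl definition (the real type carries additional fields):

```julia
# Minimal sketch of a string "view": an immutable pair of raw pointer
# and length, referencing bytes owned by some other buffer. No copy is
# made until the view is materialized into a real String.
struct StringView
    ptr::Ptr{UInt8}
    len::Int
end

materialize(s::StringView) = unsafe_string(s.ptr, s.len)

buf = Vector{UInt8}("hello world")
v = StringView(pointer(buf), 5)   # view of the first 5 bytes
materialize(v)                    # "hello"
```

Note the caveat implied by "weak ref": the view is only valid while the underlying buffer stays alive and unmoved, which is why materializing to String is needed before the buffer can be released.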

@dmbates
Contributor

dmbates commented Nov 17, 2016

@quinnj Okay. I was wrong.

@ExpandingMan
Collaborator Author

That makes it sound like there's really no choice but to wait for Julia to improve its String implementation. As a general observation, I tend to notice that the speed-up of Julia over Python on numerics tends to be a factor of 20 to 30, while with strings it tends to be more like 10 (of course that's an absurdly broad statement).

Anyway, do you know if the String overhead issue is something that's being actively worked on for v0.6? I'm assuming that the fact that Julia strings are not null-terminated is good news for the ultimate attainable speed of feather deserialization, though I guess it depends on how exactly it's possible to store them in memory.

@quinnj
Member

quinnj commented Nov 17, 2016

Yes, we'd need JuliaLang/julia#18632 to be cleaned up/merged (probably non-trivial). Ultimately, we'd also want JuliaLang/julia#12447 to ensure that the Julia String type was backed by the lowest-level/lowest-overhead "raw bytes" type possible.
