
improper types when using new DataFrames #30

Closed
ExpandingMan opened this issue Oct 10, 2016 · 17 comments

@ExpandingMan
Collaborator

When using DataFrames with NullableArrays (currently on their master branch), a DataFrame that is saved and reloaded with Feather comes back with the wrong eltypes.

For example, a DataFrame with these eltypes:

```julia
julia> eltypes(df)
5-element Array{Type,1}:
 Nullable{Float64}
 Nullable{Int64}
 Nullable{String}
 Nullable{DateTime}
 Nullable{Date}
```

becomes a DataFrame with these eltypes:

```julia
julia> eltypes(df_test)
5-element Array{Type,1}:
 Float64
 Int64
 Nullable{WeakRefString{UInt8}}
 DateTime
 Date
```

So, as far as I can tell, the only problems are that no column comes back as a NullableArray except the strings, and those have the ("wrong") element type Nullable{WeakRefString{UInt8}}.

@quinnj
Member

quinnj commented Oct 10, 2016

This is due to the fact that your first 2 columns don't have any null values, so when they get deserialized, they come back as regular Vector{T} instead of NullableVector{T}. The Nullable{WeakRefString{UInt8}} is likewise an optimization: WeakRefStrings are used for performance, because materializing Strings can currently kill the performance of Feather. Nothing unexpected here, but we could probably use better docs around this.
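For anyone who wants plain Strings back, a minimal post-processing sketch against the 2016-era DataFrames/NullableArrays/WeakRefStrings API might look like this. The function name `materialize_strings!` is illustrative and not part of Feather.jl:

```julia
using Feather, DataFrames, NullableArrays, WeakRefStrings

# Hypothetical helper (not part of Feather.jl): copy every
# WeakRefString column into an owned-String NullableArray.
function materialize_strings!(df::DataFrame)
    for name in names(df)
        col = df[name]
        if eltype(col) <: Nullable{WeakRefString{UInt8}}
            df[name] = NullableArray([isnull(s) ? Nullable{String}() :
                                      Nullable(string(get(s))) for s in col])
        end
    end
    return df
end

df = materialize_strings!(Feather.read("data.feather"))
```

This pays the String construction cost once, up front, instead of on every later access.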

@ExpandingMan
Collaborator Author

I see. It seems like a problem that deserialized DataFrames behave significantly differently from the serialized ones they came from. The first issue (whether the new DataFrame uses NullableArrays or not) seems like the responsibility of the DataFrames package. The latter seems like something that should be addressed here, perhaps with a flag to Feather.read; I'm not sure what the best answer is.

@nalimilan
Member

I don't really like the idea that the type changes depending on whether or not there are null values in the data (RCall has the same issue). Since Feather only supports nullable arrays, maybe the default should be to always return a NullableArray, with an option to get a standard array when there are no nulls?

@ExpandingMan
Collaborator Author

> I don't really like the idea that the type changes depending on whether or not there are null values in the data

This is definitely a big issue. Currently, however, it is sort of baked into DataFrames, or else this wouldn't even be possible. The way things stand right now, I'd opt for everything always being a NullableArray.

I think what needs to happen is that there should be two different types beneath AbstractDataFrame: one backed by NullableArrays and one that is not allowed to hold null values at all. Then there'd be a Type argument to Feather.read specifying which one you want.

Also, I'm wondering under just what circumstances WeakRefString is actually usable. If you always have to convert to String to actually do anything, that conversion has to happen regardless of how expensive it is.

@quinnj
Member

quinnj commented Oct 19, 2016

Ok, so the new behavior (as of 2135bf6) is that all columns are returned as NullableVector{T}. There's now a keyword argument, nullable::Bool=true, which, if set to false, returns any column with null_count == 0 as a regular Vector{T}. I like the consistency and predictability of returning NullableVectors all the time, while still providing a way for purists to get their vectors out if they want.
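Based on that description, usage of the new keyword would look like this (file name is illustrative):

```julia
using Feather

# Default (as of 2135bf6): every column comes back as a NullableVector{T}
df_nullable = Feather.read("data.feather")

# With nullable=false, any column whose null_count == 0 is instead
# returned as a plain Vector{T}
df_plain = Feather.read("data.feather"; nullable=false)
```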

@quinnj quinnj closed this as completed Oct 19, 2016
@ExpandingMan
Collaborator Author

That seems ideal.

I think we need something analogous for converting the WeakRefStrings, as they wreak havoc on anything that expects String; I haven't found it possible to actually use them.

@quinnj
Member

quinnj commented Oct 19, 2016

Can you give me cases where you ran into problems with WeakRefStrings? I'd definitely be interested to hear about cases where they're causing problems.

@ExpandingMan
Collaborator Author

To be honest, I don't have a record of the exact cases where I ran into issues; if I encounter more I'll keep track of them and let you know. Frankly, I've been converting them for a while now, so I haven't seen any more.

As I recall, none of the problems seemed like bugs to me, so I'm not sure there'd be anything to actually fix; it was more of an interface issue. If I think of anything really specific I'll let you know.

@jpfairbanks

I just ran into this problem. If you join two DataFrames on a column that holds WeakRefStrings in one and Strings in the other, the join fails with an ArgumentError because the two key columns have different types.

I guess your solution to this is to use WeakRefStrings in the second DataFrame.
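The other direction also works: convert the Feather-produced key column to owned Strings before joining, so both sides share an eltype. A sketch, assuming non-nullable columns and the 2016-era DataFrames indexing API (`:key` is an illustrative column name):

```julia
using DataFrames, WeakRefStrings

# Copy each WeakRefString in the key column into an owned String,
# so its eltype matches the String column on the other side.
df_feather[:key] = String[string(s) for s in df_feather[:key]]

joined = join(df_feather, df_other, on = :key)
```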

@ExpandingMan
Collaborator Author

This issue is actually a really significant performance barrier. Deserialization into normal strings is much faster in the Python implementation as things stand right now, and that seems like a problem; why should that be the case, does it have something to do with the way Julia strings work? Having a separate string type for this is definitely not ideal: there is always the potential for it to behave differently from String somehow (the join failure is a good example), and I'm guessing that any operation which forces conversion to String would be (perhaps unexpectedly) slow. It makes more sense to me to absorb all the slowness into the (somewhat expectedly) slow deserialization process.

@jpfairbanks

Is the problem with deserialization at the time of reading the file, or at any time? Is it possible to deserialize the file into WeakRefStrings and then later bulk convert them to regular strings faster than the time it takes to read from the file into regular strings? Or is this a problem with the performance of regular strings in general?

@dmbates
Contributor

dmbates commented Nov 17, 2016

I think the use of WeakRefStrings may be because of terminating NULL characters. The Julia String type appends a NULL; the Feather specification doesn't. A Vector{String} is stored as a single long Vector{UInt8} plus a Vector{Int32} of offsets, as described in https://github.com/wesm/feather/blob/master/doc/FORMAT.md

<null bitmask, optional> <int32_t* value offsets> <uint8_t* value data>

There are no NULLs between adjacent strings in Feather.
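To make the layout concrete, here is a small sketch of decoding such a column: one contiguous byte buffer plus an offsets vector, where offsets[i] and offsets[i+1] delimit string i (the on-disk offsets are 0-based; Julia indexing is 1-based, hence the +1). The literal data here is made up for illustration:

```julia
# Two strings, "feather" and "column", packed back-to-back with no
# NUL bytes between them, exactly as the Feather format stores them.
data    = Vector{UInt8}("feathercolumn")
offsets = Int32[0, 7, 13]   # start of each string, then total length

nstrings = length(offsets) - 1
strings = [String(data[offsets[i]+1 : offsets[i+1]]) for i in 1:nstrings]
# strings == ["feather", "column"]
```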

@ExpandingMan
Collaborator Author

It seems likely to me that, since CPython is written in C, Python strings are null-terminated internally. If that's the case, then the NULL-termination difference @dmbates described shouldn't make the Julia deserialization of strings inherently slower.

@quinnj
Member

quinnj commented Nov 17, 2016

No, that's incorrect.

The use of WeakRefStrings is purely a performance optimization. Currently, a Julia String is defined to hold a Vector{UInt8} internally, and a Vector{UInt8} comes with non-trivial overhead (something like ~80 bytes, last I heard), not to mention the construction/destruction costs. It really boils down to the fact that an immutable type with a mutable field ends up not being as efficient as it could be right now in Julia.

WeakRefStrings solve these issues by being a truly immutable type that can "point" at existing data. Their overhead is exactly 24 bytes, and they can be seen as a "StringView" of sorts, similar to a SubArray.

And we really need to squash the myth that's been going around that the Julia String type has a terminating NULL character; this IS NOT THE CASE. In certain ccalls, a conversion to a Cstring is inserted, which appends a NULL character for compatibility with C APIs, but in general the Julia String type has no need for NULL terminators.
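The "pointer plus length" idea behind WeakRefString can be sketched in a few lines of modern Julia syntax. This is a simplified illustration, not the actual WeakRefStrings.jl definition (the real type carries additional fields):

```julia
# Minimal sketch of a string "view": an immutable pair of raw pointer
# and length, referencing bytes owned by some other buffer. No copy is
# made until the view is materialized into a real String.
struct StringView
    ptr::Ptr{UInt8}
    len::Int
end

materialize(s::StringView) = unsafe_string(s.ptr, s.len)

buf = Vector{UInt8}("hello world")
v = StringView(pointer(buf), 5)   # view of the first 5 bytes
materialize(v)                    # "hello"
```

Note the caveat implied by "weak ref": the view is only valid while the underlying buffer stays alive and unmoved, which is why materializing to String is needed before the buffer can be released.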

@dmbates
Contributor

dmbates commented Nov 17, 2016

@quinnj Okay. I was wrong.

@ExpandingMan
Collaborator Author

That makes it sound like there's really no choice but to wait for Julia to improve its String implementation. As a general observation, I tend to notice that the speed-up of Julia over Python on numerics tends to be a factor of 20 to 30, while with strings it tends to be more like 10 (of course that's an absurdly broad statement).

Anyway, do you know if the String overhead issue is something that's being actively worked on for v0.6? I'm assuming that the fact that Julia strings are not null-terminated is good news for the ultimate attainable speed of feather deserialization, though I guess it depends on how exactly it's possible to store them in memory.

@quinnj
Member

quinnj commented Nov 17, 2016

Yes, we'd need JuliaLang/julia#18632 to be cleaned up/merged (probably non-trivial). Ultimately, we'd also want JuliaLang/julia#12447 to ensure that the Julia String type was backed by the lowest-level/lowest-overhead "raw bytes" type possible.
