Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base DataFrame on NamedTuple #1335

Closed
nalimilan opened this issue Dec 30, 2017 · 47 comments
Closed

Base DataFrame on NamedTuple #1335

nalimilan opened this issue Dec 30, 2017 · 47 comments
Labels

Comments

@nalimilan
Copy link
Member

Now that NamedTuple is in Julia Base (on 0.7), we should use it in DataFrames. A possible approach which has been discussed on Slack would be to define a new TypedDataFrame{C<:NamedTuple} <: AbstractDataFrame type, which would just wrap a NamedTuple, and include information regarding the column names and types in its type parameter C. Then DataFrame would be a convenience type wrapping a TypedDataFrame object, with the advantages that 1) columns can be added/removed/renamed, and 2) functions do not need to be compiled for each combination of column types when it doesn't improve performance. TypedDataFrame would be useful to improve performance where the user provides a custom function operating on each row (notably groupby), as specialization is essential for performance in that case.

I have experimented this approach in the nl/typed branch, using the NamedTuples.jl package (since I used Julia 0.6 at first). It's mostly working (and passes some tests), but lots of fixes are needed to pass all tests. In particular, I originally removed the Index type since I thought NamedTuple could replace it (hence the sym_colinds and int_colinds attempts), but I'm not sure that's possible since we sometimes need to convert symbols to integer indices and vice-versa (cf. #1305). Handling of duplicate column names need to be improved (rebased on top of #1308).

@bkamins This touches areas you improved recently, and I'm not sure I'll be able to find the time to finish this right now. To avoid stepping on each other's toes (since this is required to implement the getindex improvements you've mentioned), if you want to take this up just let me know.

@quinnj
Copy link
Member

quinnj commented Dec 30, 2017

but I'm not sure that's possible since we sometimes need to convert symbols to integer indices and vice-versa

Can't you just rely on the fact that getfield(::NamedTuple, i) works w/ i being a Symbol or Integer? What are the cases where you actually need to convert between?

@nalimilan
Copy link
Member Author

Can't you just rely on the fact that getfield(::NamedTuple, i) works w/ i being a Symbol or Integer? What are the cases where you actually need to convert between?

The way the code is currently written, we sometimes need to convert all indices from either symbols or integers to a common representation since the function accepts both, and we don't want to write it twice. The lookup operation should be fast, so unless NamedTuple can provide us some way of doing it, we have to keep a dictionary mapping names to indices.

@quinnj
Copy link
Member

quinnj commented Dec 31, 2017

Wait, are talking about cases where the user is providing both symbols and integers? Why even allow that? It seems totally reasonable to just say, use symbols, or use integers, but you can't mix both.

@nalimilan
Copy link
Member Author

No, I mean that functions allow passing either symbols or integer, but internally they work with only one type of index. For example, that's the case of the stack/unstack functions. Maybe they could be refactored to support both without any conversion, but for now that's not how they work.

@bkamins
Copy link
Member

bkamins commented Dec 31, 2017

@nalimilan I would help - but because of the size of the PR we have to discuss how to work with it.

I would start with the performance - this relates to the question if it is worth to switch to NamedTuples. This again has several layers. Let me start with the most basic one (I do not want to start too many topics at the same time).

Field access in NamedTuple using Symbol has O(n) complexity. Similarly checking if field is present (haskey) or field index (this can be verified using Base.fieldindex).

This means that we get type stability of the return value of the function:
https://github.com/JuliaData/DataFrames.jl/compare/nl/typed#diff-ba2bdc24748beed6a8c4e5b45ab5d644R266

at the cost of performance. To give you the scale. Accessing column of DataFrame with Symbol that is 1000th column is 100x slower using NamedTuple implementation than using current Dict implementation (still both operations are reasonably fast - we would get into problems if we created DataFrames having millions of columns - which is not unthinkable, but not a common use case).

If we accept this penalty then we do not need internal Index as we can use Base.fieldindex to get the index (just we have to check with Julia core what are the plans for this function as it is currently not exported).

@nalimilan
Copy link
Member Author

The interest of using NamedTuple is mainly to get type-stability regarding the columns, which is essential in particular for groupby. In that case, the cost of resolving the name of the column is only paid once, then on each iteration the compiler should optimize out that lookup and turn it into a simpler getindex call on the corresponding column vector.

In general, the cost of looking up a column should never be significant in the computing time, or the code is not designed correctly. Indeed for efficient operation you really need to work directly with column vectors.

That said, it's always good to make an operation as efficient as possible, so I think we should keep Index if it really makes a performance difference when accessing a data frame column for reasonably-sized data frames. I'd say "reasonable" can go up to a few thousand columns, probably not more than that. When you say "to give you the scale", do you mean these are roughly the numbers you obtained with benchmarking? That would sound quite large indeed.

@bkamins
Copy link
Member

bkamins commented Jan 2, 2018

OK. I have done some more careful tests - earlier I was mixing different things when timing.
I use @time to get realistic user experience.

Conclusions:

  • In global scope access to NamedTuple is slower than to DataFrame (this problem should not be significant inside a function as probably this can be optimized by the compiler);
  • Construction of NamedTuple on 0.6.2 is prohibitively slow (not an issue in the long term);
  • Construction of NamedTuple on 0.7 is slow, it seems that complexity is O(n^2), but I have not investigated it in detail; but for small DataFrames it should be acceptable.

As a comment I think we should target the performance up to 100 000 number of columns, as those are the numbers of explanatory variables you can meet in practice in modeling in analytical data set. But this judgement is conditional on target audience of DataFrames package - in particular how it should be positioned against JuliaDB (for what usage scenarios which package should be recommended). I am not 100% clear about this.

Benchmark results below:

On 0.6.2:

julia> using NamedTuples
julia> using DataFrames
julia> cnames = [Symbol(string("a", i)) for i in 1:5000];
julia> columns = [[1] for i in 1:5000];
julia> @time C = eval(:(@NT($(cnames...)))){map(typeof, columns)...};
  1.756493 seconds (284.89 k allocations: 14.464 MiB, 0.47% gc time)
julia> @time nt = C(columns...);
 26.728015 seconds (1.99 M allocations: 105.869 MiB, 0.16% gc time)
julia> @time df = DataFrame(columns, cnames)
  0.889981 seconds (479.70 k allocations: 25.268 MiB, 0.82% gc time)
julia> # additionally check direct Dict lookup times
julia> z = Dict{Symbol,Vector{Int}}()
Dict{Symbol,Array{Int64,1}} with 0 entries
julia> for i in 1:5000 z[cnames[i]] = columns[i] end
julia> @time for i in 1:10^6 nt[:a5000] end;
  2.774900 seconds (11 allocations: 912 bytes)
julia> @time for i in 1:10^6 df[:a5000] end;
  0.077999 seconds (1.00 M allocations: 15.260 MiB, 3.94% gc time)
julia> @time for i in 1:10^6 z[:a5000] end;
  0.049605 seconds (32 allocations: 2.141 KiB)

On 0.7:

julia> @time nt = NamedTuple{(cnames...,)}((columns...,));
  3.819478 seconds (590.83 k allocations: 30.013 MiB, 0.23% gc time)

but for 20000 columns (which is not unthinkable):

julia> @time nt = NamedTuple{(cnames...,)}((columns...,));
 49.894595 seconds (2.40 M allocations: 157.249 MiB, 0.17% gc time)

The same on 0.6.2 is probably unacceptable:

julia> cnames = [Symbol(string("a", i)) for i in 1:20000];

julia> columns = [[1] for i in 1:20000];

julia> @time C = eval(:(@NT($(cnames...)))){map(typeof, columns)...};
 19.307509 seconds (898.39 k allocations: 44.163 MiB, 0.10% gc time)

julia> @time nt = C(columns...);
492.488972 seconds (7.32 M allocations: 392.716 MiB, 0.04% gc time)

@nalimilan
Copy link
Member Author

Don't worry about 0.6.2, we already have enough work on our plate. Also, note that most of the time is only needed the first time the type is created, i.e. the first time a table with specific column names and types is encountered. The second call is very fast.

@bkamins
Copy link
Member

bkamins commented Jan 2, 2018

I know, but the problem with DataFrame is that the chance to encounter the same name-type combination is very low.
Actually maybe it is possible to speedup the construction of large named tuples in Base - do you know who would be best to ask?

As a side note - your current proposed implementation is for 0.6.2 and will not work on 0.7.

@nalimilan
Copy link
Member Author

I know, but the problem with DataFrame is that the chance to encounter the same name-type combination is very low.

Yes. But 3 seconds for 5000 columns isn't too bad, anyway with that many columns many operations are going to be slow. For 500 columns it's already much more reasonable.

Actually maybe it is possible to speedup the construction of large named tuples in Base - do you know who would be best to ask?

The best person is probably Jeff, I can ask him on Slack.

As a side note - your current proposed implementation is for 0.6.2 and will not work on 0.7.

Yes, unfortunately I started working on that before NamedTuple was merged in Base. But it would be OK to stop using NamedTuples.jl and require 0.7.

@nalimilan
Copy link
Member Author

BTW, Jameson says we'd better not use NamedTuple at all for TypedDataFrame, and instead store of tuple of column names and a tuple of column types as two type parameters. I haven't investigated this yet, but he pointed at an example implementation similar to NamedTuple at https://github.com/bramtayl/Keys.jl/blob/master/src/Keys.jl.

NamedTuple would then only be used to pass a row in the public API (#1337).

@davidanthoff
Copy link
Contributor

@nalimilan Did Jameson say why that would be a better option? Just curios. Or is that discussion somewhere in the open so that I can read up on it?

@ararslan
Copy link
Member

ararslan commented Jan 2, 2018

Did Jameson say why that would be a better option?

Not really

@bkamins
Copy link
Member

bkamins commented Jan 2, 2018

I do not know Jameson's reason, but it seems to me that working this way would allow to create the required types faster. The question is how getindex_unrolled would behave for really large number of columns.

Additionally given discussion in #1337 I have (again - sorry) a question for the reasons we need a DataFrame as a wrapper of TypedDataFrame. Let me sum up what I understood as the reason:

  • in general we want to pass around DataFrame in high level and not performance critical functions - to save compilation time (we compile everything once - possibly even precompile everything to give a smooth experience for the users);
  • in performance critical parts we use TypedDataFrame as here speed is of crucial importance and compilation time each time a different TypedDataFrame is passed can be ignored (here precompilation is probably not possible).
  • we make sure that the whole codebase of DataFrames package in general accepts AbstractDataFrame as an argument and the user has three options:
    1. create DataFrame and keep using it everywhere for the smooth experience;
    2. create TypedDataFrame to have speed everywhere at the cost of precompilation;
    3. create DataFrame and generally use it but have convert function allowing to get TypedDataFrame when speed is needed.

Is this the line of thinking along which we want to go? If yes there are in general two possible designs:

  1. leave DataFrame as is and create TypedDataFrame as a separate type;
  2. have DataFrame wrap TypedDataFrame.

I understand that @nalimilan implemented the second option in his proposal. And I think it is better if we are sure that (I have not benchmarked it all):

  1. Creation of TypedDataFrame can be reasonably fast (this is what I have benchmarked);
  2. getindex/setindex methods for TypedDataFrame recompile fast (as they will have to recompile every time a new names/types structure of DataFrame is encountered) - this is probably true as those are small;
  3. functions like names! and insert! have a property that their recompilation+execution is fast against current implementation of DataFrame (as again they have to be recompiled every time and possibly what they do can be more time consuming as a new type has to be created inside).

@nalimilan
Copy link
Member Author

I think whether or not we use NamedTuple under the hood is relatively secondary, as most of the implementation will be the same anyway.

Your summary sounds good. The reason why I decided to implement DataFrame in terms of TypedDataFrame is that it allows using specialized functions automatically without code duplication, so that we just need to override functions that we don't want to be specialized. If we made TypedDataFrame a completely separate type, we would have to duplicate many functions.

It also means it is cheap to work on the TypedDataFrame object, since it's stored inside the DataFrame. This will be useful if we decide to encourage a public API where the user passes an anonymous function operating over rows (since type-stability is required for efficiency).

Regarding names! and insert!, these functions can only work on DataFrame and imply replacing the wrapped TypedDataFrame object, so they represent the case where storing a TypedDataFrame is not useful at all. Basing DataFrame on it relies on the assumption that you typically perform several operations without altering the columns; else, creating the TypedDataFrame on the fly for each operation would be as efficient.

The case of small getindex and setindex functions deserves some consideration, as they are so small and by definition type-unstable that delegating them to the TypedDataFrame object cannot make them faster. If they are inlined (as they probably should), the runtime overhead should be zero, but there can be a significant compilation overhead which we could avoid by not delegating them. Indeed it makes no point to specialize inherently type-unstable functions, forcing recompilation when it makes no difference.

Anyway, the decision to base DataFrame on TypedDataFrame can be changed at any point as it's purely an internal detail.

@bkamins
Copy link
Member

bkamins commented Jan 2, 2018

Thank you. Your comment actually clarified my thinking how I would approach this change in development process.

I would do the following to split the work:

  1. concentrate on developing TypedDataFrame and have it working and efficient - targeting Julia 0.7; at this stage we will have two AbstractDataFrame types - it will be easy to compare them in tests (especially benchmarks) and it will be clear how much duplicate code we have;
  2. next make a decision how to change DataFrame given what we learn from the stage one.

Looking at your branch you have more or less both steps covered. However, at least for me splitting it to two stages would simplify work on it.

Additionally - just to clarify - did you plan to expose TypedDataFrame as public. It is not in your https://github.com/JuliaData/DataFrames.jl/compare/nl/typed branch. Was this on purpose? My thinking is that there would be many use cases when people would want to work with it directly. This is - as I understand it - e.g. the reason why CompositeDataFrame in DataFramesMeta was introduced.

@nalimilan
Copy link
Member Author

I think it makes sense to do both steps as the same time, because most of the work consists in moving functions from dataframe.jl to typeddataframe.jl, adapting them, and writing a wrapper function in dataframe.jl to call the TypedDataFrame method. If you start implementing TypedDataFrame first, then you need to duplicate functions, which will not be covered by tests unless you duplicate lots of tests for DataFrame and TypedDataFrame. By basing DataFrame on TypedDataFrame, you also ensure tests cover both types at the same time (at least for common code). I'm not really concerned about benchmarking, since it's easy to switch git branches to compare the current and the new implementation.

I think I didn't export TypedDataFrame just because my first goal was to get DataFrame to pass the tests after basing it on TypedDataFrame. Then it would make sense to export it, but it would have to be documented, and the API might need some work.

@bkamins
Copy link
Member

bkamins commented Jan 3, 2018

OK. I start with a detailed look at typeddataframe.jl in your branch first anyway 😄.

@nalimilan
Copy link
Member Author

OK. The strategy is open for discussion of course, as this is very exploratory. We can change it if it turns out it doesn't work well.

There's also a question to investigate regarding the possible relationship with IndexedTable and it's Columns type, which is similar to TypedDataFrame.

@bkamins
Copy link
Member

bkamins commented Jan 3, 2018

Columns type is why I have asked earlier how we want to position DataFrames vs JuliaDB (can you please give your point of view on it as probably there were some discussions about it). Its design is pretty similar.

And has the problems I have encountered in my recent investigations on 0.6.2 like:

julia> z = Columns([[1] for i in 1:10000]...);
ERROR: StackOverflowError:

(probably this will not happen on 0.7 but the current implementation does not support 0.7 yet).

@nalimilan
Copy link
Member Author

I don't have a clear idea about the relationship with JuliaDB yet. We should discuss this with JuliaDB people. A first idea which comes to mind would be to have DataFrame work as it currently does, and IndexedTable/TypedDataFrame be the type-stable variant. All types could use a common API, at least everywhere it's possible.

@bkamins
Copy link
Member

bkamins commented Jan 3, 2018

I am asking because - at least for me - it is crucial for design decisions (i.e. to concentrate on excelling at the important use-cases of the given type) and to make sure that the efforts are coordinated.

My thinking was the following (but I might be missing something important here):

  1. JuliaDB:
    • out-of-core processing of huge data volumes; especially with very large number of rows;
    • the drawback is that mutating table is expensive (currently a new table is created every time);
    • target usage: queries/joins/aggregation;
    • parallelization via processes;
  2. DataFrames:
    • lightweight type for in-core computations; data fitting in RAM, but possibly large number of columns;
    • very cheap (in-place) mutating of DataFrame;
    • target usage: preparation of analytical data sets for modeling purposes; storage of low-volume (in memory) data created dynamically during execution of Julia program;
    • parallelization via threads (this is something that I guess has not started yet?);

@shashi
Copy link

shashi commented Jan 4, 2018

We are yet to port IndexedTables to Julia 0.7. I think that would be my first step in debugging JuliaData/IndexedTables.jl#81 ... Switching to Tuples/NamedTuples as a representation of rows will have this problem at least on 0.6. But I think since datasets only have a handful of distinct types it is possible to create large tuple types that have some kind of "compressed" type representation.

the drawback is that mutating table is expensive (currently a new table is created every time);

This is not necessarily expensive since a table is a cheap wrapper around data. The data will only be copied if you mutate / add / remove an indexed column.

@bkamins
Copy link
Member

bkamins commented Jan 4, 2018

@shashi Thank you for the response. The "compressed" type representation seems very promising the challenge is probably to ensure return type inference when allowing indexing both by column number or column name.

Regarding table mutation performance - the question is what would be time complexity of operation like:

  • start with table with 1000 variables
  • iteratively add additional derived 10000 variables (in the simplest scenario - adding one variable in one step)

My fear was that this will be O(n^2).

Look at a typical scenario (during last 20 years I have been in hundreds of projects where something like this was required, especially in exploratory data analysis, where you do not know upfront which variables and what transformations you want):

using JuliaDB
using DataFrames

function test_table(lead)
    tbl = table([1], names=[Symbol(lead*"1")])

    for i in 2:100
        tbl = pushcol(tbl, Symbol(lead*"$i"), [i])
    end
    tbl
end

function test_df(lead)
    df = DataFrame([[1]], [Symbol(lead*"1")])

    for i in 2:100
        df[Symbol(lead*"$i")] = [i]
    end
    df
end

And now timing (I use @time on purpose as this is what users will see in real life), if you want to stress test this you can increase 100 to 1000 (which is still small by current industry practice) and probably check the results the next day (and 10000 is probably totally prohibitive):

julia> @time test_table("a");
 57.216356 seconds (34.81 M allocations: 1.965 GiB, 1.25% gc time)

julia> @time test_table("b");
 45.485015 seconds (29.35 M allocations: 1.664 GiB, 1.55% gc time)

julia> @time test_df("a");
  0.489610 seconds (59.30 k allocations: 3.147 MiB)

julia> @time test_df("b");
  0.000161 seconds (455 allocations: 35.078 KiB)

another example with column renaming (a bit less extreme and artificial but rooted in the same problems):

function n_table()
    t = table([0.01, 0.05], names=[:t1])
    for i in 1:100
        old = Symbol("t"*"$i")
        new = Symbol("t"*"$(i+1)")
        t = renamecol(t, old, new)
    end
    t
end

function n_df()
    df = DataFrame([[0.01, 0.05]], [:t1])
    for i in 1:100
        new = Symbol("t"*"$(i+1)")
        names!(df, [new])
    end
    df
end

and timing:

julia> @time n_table();
 19.575409 seconds (7.75 M allocations: 423.706 MiB, 0.84% gc time)

julia> @time n_df();
  0.489630 seconds (133.28 k allocations: 6.840 MiB)

This time complexity is linear but simply the operation is much more expensive.

In consequence there are the following possibilities:

  • the above code can be dramatically faster on table on 0.7 - then we have a simple situation - also in relation to DataFrames as we get what we want with an additional benefit of indexing, which we do not have now in DataFrames;
  • it is impossible to have such code work fast interactively - then in my opinion we need two kinds of data structures fitting different use cases and efficient methods allowing conversions.
  • the third option is to teach people to have a different analytical workflow - but this will be problematic, because this will require changing habits.

What I want to avoid is the pain currently people have with plotting times (and for plots I believe that this is solvable with precompilation, but here I am not 100% sure it is possible as each NamedTuple is a new type every time e.g. name of variable changes - but maybe I am wrong).

If we want Julia ecosystem to be appealing to a casual user (which I believe is crucial for large-scale adoption) we have to make operations as above fast in interactive use.

@shashi
Copy link

shashi commented Jan 4, 2018

Bummer. Most slowdown in those examples seems to come from compilation.

There's a type in JuliaDB named ColDict which lets you represent a table a dictionary of columns and do these things much faster, constructing the typed version of the table only once:

julia> function test_coldict(lead)
           tbl = table([1], names=[Symbol(lead*"1")])
           cdict = ColDict(tbl)
           for i in 2:100
               cdict[Symbol(lead*"$i")] = [i]
           end
           cdict[]
       end
test_coldict (generic function with 1 method)

julia> @time test_coldict("b");
  0.022794 seconds (3.46 k allocations: 336.899 KiB)
julia> function n_coldict()
           t = table([0.01, 0.05], names=[:t1]); d = ColDict(t);
           for i in 1:100
               old = Symbol("t"*"$i")
               new = Symbol("t"*"$(i+1)")
               IndexedTables.rename!(d, old, new)
           end
           d[]
       end
n_coldict (generic function with 1 method)

julia> @time n_coldict()
  0.016808 seconds (3.51 k allocations: 178.416 KiB)

It would be nice if one day ColDict could just be DataFrame. But I'd hate to have different implementations of all operations for the two types.

Edit: update timing to be of first run

@bkamins
Copy link
Member

bkamins commented Jan 4, 2018

Excellent! I was not aware of ColDict but this is exactly what I had in mind.

The issues are (but they are solvable I suppose):

  • for 1000 columns we still get StackOverflow on Julia 0.6.2 (a known problem);
  • you are reporting the time after compilation and exactly compilation time is key here; it is now around 5 seconds for several hundreds of columns and it is acceptable I would say (with 0.7 maybe it will get faster) - we have to wait what happens till we can test 10_000 or 100_000 columns.

The only difference is that ColDict holds reference to parent Table and DataFrame does not.

@nalimilan - any thoughts on this?

@shashi
Copy link

shashi commented Jan 4, 2018

Updated the timings to reflect compilation overhead.

I think it would be good to document the ColDict API although it is redundant with the functional counterparts.

@shashi
Copy link

shashi commented Jan 4, 2018

I have discussed this with @quinnj once. One path for merging the JuliaDB and DataFrames effort is this:

  1. Move Columns out into a package (Jacob prefers just doing this with NamedTuples, I'm open to that)
  2. Move out the groupreduce, groupby and join kernels into a RelationalKernels.jl package: all optimizations benefit everyone using this package. Jacob is working on implementing these kernels in a streaming fashion in https://github.com/JuliaData/DataStreams.jl/blob/master/src/query.jl -- I think it would need specialized methods for DataStreams where all the columns are in memory (rather than stream) -- it seems this leads to algorithm with better memory access patterns.
  3. Get IndexedTables and DataFrames to use RelationalKernels.jl
  4. Make DataFrames <-> IndexedTables.Table conversion straightforward and cheap. (Here is where ColDict and DataFrame can be made to be the same thing)
  5. Profit

@nalimilan
Copy link
Member Author

It's great to have this discussion. A few comments @shashi's points:

  1. What does Columns offer compared with NamedTuple? It looks like it's an intermediate type which provides a few more features than NamedTuples, but which is much more limited than Table/DataFrame. Wouldn't it make sense to get rid of that intermediate type and use either (depending on cases) a very simple NamedTuple, or a full-fledged table implementing the IndexedTablesTable/DataFrame interface?
  2. How would RelationalKernels.jl relate to SplitApplyCombine.jl? Would it make sense to move these functions there? Note that there are implementations in DataFrames too, probably with different algorithms.
  3. What's the fundamental difference between IndexedTables.Table and a type-stable variant of DataFrame like TypedDataFrame? Is it just that the former allows using the indexed columns with getindex? I'm asking because I think DataFrame would benefit from supporting indexed columns, just like databases. So we could use the same structure under the hood, and just using slightly different interfaces where needed.

Regarding the performance of adding/renaming columns, in think we should benchmark the different possible solutions in 0.7 (no need to care about 0.6 at this point) to see whether creating the underlying TypedDataFrame (in the context of the implementation discussed at the top of this issue) would be too expensive or not. There are at least two different possible implementations: if storing a tuple of names and a tuple of column types is faster than storing a NamedTuple, we can do that (as Jameson recommended). Finally, if that's too slow, we can change the approach I proposed above and only create the TypedDataFrame object right before performing operations which would benefit from type stability (e.g. filter): you wouldn't have to pay the cost each time you add or rename a column if you don't perform these operations in the mean time.

@shashi
Copy link

shashi commented Jan 4, 2018

What does Columns offer compared with NamedTuple?

Columns is an iterable of named tuples but internally holds a named tuple of columns. It's at best 100 lines of wrapper. I think the main benefit is being able implement operations using AbstractArray interface. e.g. iteration, getindex, setindex, length etc.

How would RelationalKernels.jl relate to SplitApplyCombine.jl? Would it make sense to move these functions there?

It would, for now the point is keeping the kernels in a common place that other packages can use. Jacob has been prototyping this in DataStreams as well -- although I think it could be cleanly moved to a different package which depends on DataStreams.

What's the fundamental difference between IndexedTables.Table and a type-stable variant of DataFrame like TypedDataFrame? Is it just that the former allows using the indexed columns with getindex?

Not much of a difference except df[1] returns the first row rather than the first column. It's also not indexed by the keys, that datastructure was renamed back to NDSparse (see JuliaDB datastructures)

I'm asking because I think DataFrame would benefit from supporting indexed columns, just like databases. So we could use the same structure under the hood, and just using slightly different interfaces where needed.

I think IndexedTables.Table can definitely play the role of the backing store. It's also possible to define like 3 methods to get all of the indexing machinery in IndexedTables hooked up with dataframes.

@cstjean
Copy link
Contributor

cstjean commented Jan 24, 2018

BTW, Jameson says we'd better not use NamedTuple at all for TypedDataFrame, and instead store of tuple of column names and a tuple of column types as two type parameters.

That's what TypedTables.jl does. It works ok.

@bkamins
Copy link
Member

bkamins commented Jan 26, 2018

Yes, the challenge is syntax, as you have to use Val there.

@shashi
Copy link

shashi commented Jan 26, 2018

BTW, Jameson says we'd better not use NamedTuple at all for TypedDataFrame, and instead store of tuple of column names and a tuple of column types as two type parameters.

@vtjnash Why is this?

Yes, the challenge is syntax, as you have to use Val there.

Maybe constant propagation would make that unnecessary?

@bkamins
Copy link
Member

bkamins commented Dec 20, 2018

Continuing from #1643: we can do either what you propose:

it would have to be part of the type

or exactly the opposite - do not store it as a part of the type (this is what I was thinking of). Then we can freely add/remove/change type of columns in the DataFrame and this will be cheap (no need to recalculate typed as long as we do not read it). The drawback is that we have to do one dynamic dispatch to resolve the exact type of typed via a function barrier (this will be a problem only when we do many unrelated very atomic operations), but the benefit is that we have it already materialized so if we do multiple references to typed it is already there and we pay only the cost of this one multiple dispatch.

@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@bkamins
Copy link
Member

bkamins commented Feb 5, 2019

@nalimilan I have investigated the current status of NamedTuple solution. Here are the benchmarks:

julia> v = [[1,2,3], ["1","2","3"], [1.0,2,3], [true, false, true], ['a','b','c'], ["a", missing, "b"], [missing, 1,2], categorical([1,2,3]), categorical('a':'c')];

julia> using Tables

julia> df = DataFrame(rand(v, 100)); @time Tables.columntable(df);
  0.443643 seconds (1.26 M allocations: 64.079 MiB, 3.31% gc time)

julia> df = DataFrame(rand(v, 1000)); @time Tables.columntable(df);
  1.443612 seconds (2.11 M allocations: 119.052 MiB, 5.88% gc time)

julia> df = DataFrame(rand(v, 2000)); @time Tables.columntable(df);
  3.745693 seconds (4.24 M allocations: 250.952 MiB, 2.31% gc time)

julia> df = DataFrame(rand(v, 4000)); @time Tables.columntable(df);
 54.084183 seconds (8.50 M allocations: 560.535 MiB, 0.46% gc time)

julia> df = DataFrame(rand(v, 3000)); @time Tables.columntable(df);
 33.326847 seconds (6.37 M allocations: 398.119 MiB, 0.61% gc time)

julia> df = DataFrame(rand(v, 2500)); @time Tables.columntable(df);
  5.436689 seconds (5.30 M allocations: 322.788 MiB, 2.15% gc time)

julia> df = DataFrame(rand(v, 2750)); @time Tables.columntable(df);
  6.554823 seconds (5.84 M allocations: 359.988 MiB, 2.85% gc time)

julia> df = DataFrame(rand(v, 2900)); @time Tables.columntable(df);
 32.121291 seconds (6.21 M allocations: 385.672 MiB, 0.51% gc time)

julia> df = DataFrame(rand(v, 2800)); @time Tables.columntable(df);
 30.857733 seconds (5.94 M allocations: 367.524 MiB, 0.67% gc time)

It seems we are hitting some wall around 2800 columns in this case. Not sure why it starts exploding. My simple conclusion from this (maybe oversimplified) is the following (given our recent discussions):

  • keep index as is
  • add nt field that is an evaluated version of Tables.columntable(df) (this field will not be type stable)
  • some operations will populate nt by default (e.g. groupby) conditional on some threshold (i.e. if there are too many columns then this would not happen); probably it should be possible to force creation of this nt field by some custom method to override automatic behavior
  • have a dirty flag keeps track if nt is valid (i.e. if we mutated Index in a way that invalidates nt if it was created earlier)

In this way we would be only one-step (one barier function) away from a type stable NamedTuple where needed (e.g. combine) while being fast on column add/delete/mutate name/mutate type.

@nalimilan
Copy link
Member Author

Interesting. But even for 100 columns, isn't the overhead already quite large? Some operations are really fast currently, adding a half a second delay doesn't sound great.

@bkamins
Copy link
Member

bkamins commented Feb 7, 2019

I agree. But I assume that we would not be materializing nt with every operation.
The simplest design would be to:

  • never materialize nt by default
  • have a function that forces this materialization
  • then some functions, if nt is mateialized and is not dirty could use it

Alternatively we could use some threshold value to decide when to do automatic nt creation (of course it is not easy to do and would depend on many different things, that is why the path outlined in points above is a safe opt-in approach).

@nalimilan
Copy link
Member Author

Yeah, it would be interesting to experiment with adding a type parameter which would default to Nothing or something like that, and be able to use it to store column types when needed. That would be a first step and then we could experiment further with doing that automatically. But I guess that would require adapting getindex (and other functions) quite significantly for it to have any effect on performance.

@bkamins
Copy link
Member

bkamins commented Feb 7, 2019

that would require adapting getindex (and other functions) quite significantly for it to have any effect on performance

Actually I would not touch these standard "cheap" functions at all, as the effect of having nt type can be seen only after a function barier anyway. What we could have is:

  • the nt materializer will probably return the nt, so user can then work on it with getindex etc. beyond a barrier function when wanting to work with a NamedTuple of vectors
  • actually the only functions that would use it by default would be the ones that do heavy-lifting stuff (like sorting or grouping)

The only significant problem with this approach I see is that SubDataFrame creation would require a careful thought, because we could not reuse nt from a parent.

@nalimilan
Copy link
Member Author

Actually what I suggested was that by storing column types in a type parameter, we could make a function such as f(df::DataFrame) = df[:col] type-stable. Of course we could just provide a function to get a named tuple, but AFAIK Tables.columntable already does that.

@bkamins
Copy link
Member

bkamins commented Feb 8, 2019

My point was that this type stability is not something we will get easily anyway:

julia> nt = (a=1,b=1.0,c=true,d="1",e='1')
(a = 1, b = 1.0, c = true, d = "1", e = '1')

julia> f(nt, symb) = nt[symb]
f (generic function with 1 method)

julia> @code_warntype f(nt, :a)
Body::Any
1 ─ %1 = (Base.getfield)(nt, symb)::Any
└──      return %1

From my experience it is only possible to get if what you index with is a constant that is known at compilation time, so you need a function barier with a dynamic dispatch anyway (unless I do not understand something here).

My idea was rather to have Tables.columntable be called automatically (or on request) and the result be cached. In this way you can save time calling and constructing Tables.columntable several times if you perform sequentially some operations on a data frame that would benefit from it.

@cstjean
Copy link
Contributor

cstjean commented Feb 8, 2019

My understanding is that constant propagation might make f(nt, :a) efficient in practical situations.

julia> const nt = (a=1,b=1.0,c=true,d="1",e='1')
(a = 1, b = 1.0, c = true, d = "1", e = '1')

julia> f(nt, symb) = nt[symb]
f (generic function with 1 method)

julia> g() = f(nt, :b)
g (generic function with 1 method)

julia> @code_warntype g()
Body::Float64
1 1return 1.0

I wish there were clear guidelines about when it can be relied upon.

@bkamins
Copy link
Member

bkamins commented Feb 8, 2019

Yes - and this is exactly what I want to say. More generally:

  1. you need a function wrapper for the constant to get propagated
  2. if variable name is not a constant but a true variable you need a barrier function inside to get type stability for heavy-duty work

In both cases you have one dynamic dispatch and might hit a precompilation lag.

I wish there were clear guidelines about when it can be relied upon.

From my experience the two behaviors above work reliably (although @nalimilan noted that sometimes, when inlining happens, constant propagation might not be performed - but in my experience I did not have a situation when compiler performed inlining in such cases).

@jlumpe
Copy link
Contributor

jlumpe commented Sep 16, 2019

What's the current status of this project?

@nalimilan
Copy link
Member Author

That's on hold, at least until we have tagged 1.0.

@bkamins
Copy link
Member

bkamins commented Aug 9, 2020

@nalimilan - do you believe there will be one day when we go back to this discussion? If not I would close it. I think that the current consensus is that DataFrames.jl will just stay type unstable.

@nalimilan
Copy link
Member Author

Yes we're not going to do that in the 1.x series anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants