Add TableTraits.jl integration #76

Merged Aug 30, 2017 (2 commits)

Conversation

davidanthoff
Contributor

No description provided.

@codecov-io

codecov-io commented Aug 23, 2017

Codecov Report

Merging #76 into master will decrease coverage by 0.1%.
The diff coverage is 76.92%.

@@            Coverage Diff            @@
##           master     #76      +/-   ##
=========================================
- Coverage   92.61%   92.5%   -0.11%     
=========================================
  Files           6       7       +1     
  Lines        1042    1068      +26     
=========================================
+ Hits          965     988      +23     
- Misses         77      80       +3

Impacted Files          Coverage Δ
src/IndexedTables.jl    93.95% <ø> (+0.67%) ⬆️
src/tabletraits.jl      76.92% <76.92%> (ø)
src/utils.jl            92.89% <0%> (+0.59%) ⬆️
src/columns.jl          93.12% <0%> (+0.76%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 565ce82...1d83fc3.

Collaborator

@shashi left a comment

Left some minor comments. This is nice! 👍

end
end

function IndexedTable(x; idxcols::Union{Void,Vector{Symbol}}=nothing, datacols::Union{Void,Vector{Symbol}}=nothing)
Collaborator

Nice! This could also use an optimized method when x is Columns!

Contributor Author

Yes, you should be able to add another method that handles that case, right? It would be good if the keyword arguments had the same semantics, of course.

I'm also not sure this is the right API, I just was loosely inspired by this.

Contributor Author

Oh, one more thing: we should also add a code path to this method that handles an iterator whose element type is Pair{X,S}. For a plain Pair, it would create one unnamed index column and one unnamed data column. If either X or S is a NamedTuple, it would create named columns for the corresponding index or data columns. At that point the following would automatically work:

@from i in source begin
    @select {i.a, i.b} => {i.c,i.d}
    @collect IndexedTable
end

Not in this PR, but could be added later.
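
A rough sketch of the Pair code path described above. It assumes Julia 1.x with built-in NamedTuples, which postdate this thread. `split_pairs` and `columnize` are hypothetical names for illustration, not part of IndexedTables.jl or TableTraits.jl, and the result is a plain NamedTuple of columns rather than an IndexedTable.

```julia
# Hypothetical helpers, for illustration only.

# Rows that are NamedTuples become one named column per field.
function columnize(v::AbstractVector{T}) where {T<:NamedTuple}
    colnames = fieldnames(T)
    return NamedTuple{colnames}(map(n -> [getfield(r, n) for r in v], colnames))
end

# Anything else stays a single unnamed column.
columnize(v::AbstractVector) = v

# Split an iterator of `index => data` pairs into index and data columns.
function split_pairs(pairs)
    idx  = [first(p) for p in pairs]   # left-hand sides: the index part
    data = [last(p) for p in pairs]    # right-hand sides: the data part
    return (index = columnize(idx), data = columnize(data))
end

named = split_pairs([(a=1, b=2) => (c=1.0,), (a=1, b=3) => (c=2.5,)])
plain = split_pairs([1 => "x", 2 => "y"])
```

Here `named.index` has columns `a` and `b` and `named.data` has column `c`, while `plain.index` and `plain.data` are bare vectors.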

source_colnames = TableTraits.column_names(iter)
source_coltypes = TableTraits.column_types(iter)

if idxcols==nothing && datacols==nothing
Collaborator

This case should probably result in a Columns(1:n) column as the index, mirroring the behavior of loadfiles in JuliaDB.

Collaborator

But then that's not what IndexedTable(xs::Vector...) does...! Maybe it should?

Collaborator

It seems a natural conversion of Columns to IndexedTable is to have a 1:n index: it's the same as a 1-d array of named tuples.

Contributor Author

Index columns need unique values, right? So one could also say that in this case it should create an IndexedTable without any index, so that this conversion works for any table data without having things like a unique requirement. But I'm not sure, I think the main thing is to be consistent across the different ways to create things both in IndexedTable and JuliaDB.
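
To make the two options in this thread concrete, here is a sketch in Julia 1.x. It assumes, as the comment above supposes, that index keys must be unique. `choose_index` is a hypothetical helper operating on a plain NamedTuple of column vectors. It is not the IndexedTable constructor and does not settle which default is right.

```julia
# Hypothetical helper, for illustration only.
function choose_index(cols::NamedTuple, idxcols::Union{Nothing,Vector{Symbol}}=nothing)
    n = length(first(cols))
    # No index columns requested: fall back to an implicit 1:n row index,
    # which is unique by construction.
    idxcols === nothing && return (index = 1:n, data = cols)

    # Explicit index columns: reject duplicate keys (assumed requirement).
    rowkeys = collect(zip((cols[c] for c in idxcols)...))
    allunique(rowkeys) ||
        throw(ArgumentError("index columns $(idxcols) contain duplicate keys"))

    datanames = filter(c -> !(c in idxcols), keys(cols))
    return (index = NamedTuple{Tuple(idxcols)}(cols),
            data  = NamedTuple{datanames}(cols))
end

cols = (a = [1, 2, 3], b = ["x", "y", "z"])
implicit = choose_index(cols)          # index is 1:3, data is all columns
explicit = choose_index(cols, [:a])    # index is column a, data is column b
```

With the 1:n default, any table-like source converts without a uniqueness check. With explicit index columns, the conversion can fail on duplicate keys.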

@JeffBezanson
Contributor

Cool, thanks David! Will there eventually be a way to get a length (or estimate) from the iterator, or better yet reuse whole columns in place?

@davidanthoff
Contributor Author

> Will there eventually be a way to get a length (or estimate) from the iterator

Yes, the whole design here just follows the iterator interface in base. So if the source returns HasLength() from iteratorsize, length will work. In fact, I think almost all the table sources support that, and in Query.jl queries the HasLength() property also gets preserved if possible (i.e. operations like @select preserve it, @filter obviously doesn't).

So the issue here is just that this PR is not the most efficient implementation :) AND I just realized that this PR is a bit silly. https://github.com/davidanthoff/TableTraitsUtils.jl has a default implementation for the iterable tables trait that handles all these things properly, and there is actually no reason why I couldn't just use that implementation here as well. I just overlooked this, the code in this PR predates those efficient implementations and I forgot to update things. I'll update this PR tomorrow or so to use all of that.

One thing that would help even more is JuliaLang/julia#22467, in that case even more Query.jl queries could use preallocated arrays of the right size when they get materialized. I had a very hackish version of something like that and it allowed me to get my @groupby implementation to within a factor of 2x of the pandas code in terms of speed.

> , or better yet reuse whole columns in place?

Not yet, but I'm mulling some designs for that. I think that can come later, though, I would add it as an optional, more performant way to the table traits interface.

@davidanthoff
Contributor Author

Alright, the new commit now uses the TableTraitsUtils.jl implementation to generate the vectors that hold the data, and that implementation uses an optimized codepath for sources that return Base.HasLength().
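
A minimal sketch of the kind of optimization described here, written for Julia 1.x, where the trait is queried with `Base.IteratorSize` rather than the `iteratorsize` function mentioned earlier in this thread. `collect_columns` is a hypothetical helper and not the TableTraitsUtils.jl implementation. It assumes the iterator's `eltype` is a concrete NamedTuple type.

```julia
# Hypothetical helper, for illustration only.
function collect_columns(iter)
    T = eltype(iter)
    (T <: NamedTuple && isconcretetype(T)) ||
        throw(ArgumentError("expected an iterator of concrete NamedTuples"))
    colnames = fieldnames(T)
    coltypes = fieldtypes(T)

    if Base.IteratorSize(iter) isa Union{Base.HasLength,Base.HasShape}
        # Length is known up front: allocate each column once and fill it.
        n = length(iter)
        cols = map(S -> Vector{S}(undef, n), coltypes)
        for (i, row) in enumerate(iter)
            for j in 1:length(cols)
                cols[j][i] = row[j]
            end
        end
    else
        # Unknown length (for example after a filter): grow with push!.
        cols = map(S -> S[], coltypes)
        for row in iter
            for j in 1:length(cols)
                push!(cols[j], row[j])
            end
        end
    end
    return NamedTuple{colnames}(cols)
end

rows = [(a = 1, b = "x"), (a = 2, b = "y"), (a = 3, b = "z")]
all_cols = collect_columns(rows)                                   # preallocated path
some_cols = collect_columns(Iterators.filter(r -> r.a > 1, rows))  # push! path
```

A `Vector` reports a known size, so the first call takes the preallocating branch. `Iterators.filter` reports `Base.SizeUnknown()`, so the second call falls back to `push!`.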

@shashi
Collaborator

shashi commented Aug 30, 2017

LGTM.

@shashi shashi merged commit 03c9473 into JuliaData:master Aug 30, 2017
@davidanthoff davidanthoff deleted the tabletraits branch August 30, 2017 15:20
@davidanthoff
Contributor Author

Cool.

What is your plan for tagging a new version? I should try to release a version of IterableTables.jl (that no longer has the integration code in it) ideally more or less simultaneously so that folks don't get override warnings.

@shashi
Collaborator

shashi commented Aug 31, 2017 via email
