Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add vcat with source; deprecate indicator in joins in favor of source #2649

Merged
merged 16 commits into from
Mar 15, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Mar 9, 2021

Fixes #659

I was prompted by https://discourse.julialang.org/t/would-it-help-to-have-a-tool-that-automatically-determines-which-issues-should-be-closed/56274/16 :).

The PR is relatively simple so we can add it in 1.0 release I think.

@bkamins bkamins added feature non-breaking The proposed change is not breaking labels Mar 9, 2021
@bkamins bkamins added this to the 1.0 milestone Mar 9, 2021
@bkamins
Copy link
Member Author

bkamins commented Mar 9, 2021

CI error is unrelated and is fixed in #2648

@pdeffebach
Copy link
Contributor

Should this be called indicator for consistency with joins? See here

@bkamins
Copy link
Member Author

bkamins commented Mar 10, 2021

Should this be called indicator for consistency with joins?

I am OK with indicator. @nalimilan - what do you think?

@nalimilan
Copy link
Member

Consistency with joins sounds like a good reason to use indicator, even though it isn't super explicit. For now these are a bit different, but I imagine at some point we could allow passing a collection of names to *join to choose the values to use, and extend support to more than two data frames, which would make it more similar to this PR.

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
Comment on lines 1627 to 1630
dfs′ = Vector{AbstractDataFrame}(undef, len)
for (i, (v, df)) in enumerate(zip(vals, dfs))
dfs′[i] = insertcols!(copy(df, copycols=false), 1, col => Ref(v))
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not a big deal, but wouldn't it be more efficient (and not really more complex) to create the column only after concatenating?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a design decision related with an empty data frame. As you can see in tests and examples, if you pass DataFrame() to vcat now a single row is created with all missing values except indicator column. Otherwise we would drop such a data frame.

But maybe dropping it is preferable. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about it and concluded that it is better to drop it. I will change the implementation.

test/dataframe.jl Show resolved Hide resolved
test/dataframe.jl Show resolved Hide resolved
@nalimilan
Copy link
Member

Regarding the API, have you considered something like vcat(:A => df1, :B => df2, :C => df3, indicator=:source) instead? I guess which version is most convenient depends on whether you have all the names in a vector already or whether you write them by hand. We could support both if needed.

@bkamins
Copy link
Member Author

bkamins commented Mar 10, 2021

I analyzed it before making the PR.

We could allow:

vcat(:A => df1, :B => df2, :C => df3, indicator=:source)

but I am hesitant to add it as currently:

julia> vcat(:a => DataFrame(a=1), :b => DataFrame(b=2))
2-element Vector{Pair{Symbol, DataFrame}}:
 :a => 1×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
 :b => 1×1 DataFrame
 Row │ b     
     │ Int64 
─────┼───────
   1 │     2

(and this makes sense and is expected)

So adding a kwarg would significantly affect the produced result, which I think is not desirable.

Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins
Copy link
Member Author

bkamins commented Mar 10, 2021

I imagine at some point we could allow passing a collection of names to *join to choose the values to use, and extend support to more than two data frames, which would make it more similar to this PR.

Not doing this in join was not laziness but the issue that it is just different. If you pass e.g. three columns you can have the following outcomes:

  • only 1
  • only 2
  • only 3
  • 1 and 2
  • 1 and 3
  • 1, 2, and 3

Which would all have to be produced, which I thought would be more confusing than clarifying.

Similarly for two columns passed we have three values (left, right, and both) so it is not the same.

But of course the function of the kwarg is similar (clearly indicate the source data frame) so I would not oppose to syncing it between join and vcat.

Given these considerations what do you think. Should we make them consistent?

@bkamins
Copy link
Member Author

bkamins commented Mar 10, 2021

and I will change the kwarg do indicator as suggested.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I kind of wish we would have used source instead of indicator for joins, as that sounds more explicit, but well...

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins
Copy link
Member Author

bkamins commented Mar 12, 2021

I kind of wish we would have used source instead of indicator for joins

Same with me. Maybe we can:

  • deprecate indicator in favor of source in joins, but keep it till 2.0 release? It should be non-problematic from performance perspective and would be clearer? (@oxinabox - could you comment as if I recall correctly I used indicator in join because you have mentioned this name at some point - so maybe you have some opinion here)
  • in this PR use only source?

@oxinabox
Copy link
Contributor

I have no recollection.
Maybe @rofinn ?

@bkamins
Copy link
Member Author

bkamins commented Mar 12, 2021

Sorry for pinging then. Do you have an opinion though 😄?

@rofinn
Copy link
Member

rofinn commented Mar 12, 2021

I also don't recall this discussion. I'm also in favour of source over indicator.

@pdeffebach
Copy link
Contributor

My only comment is that source should go at the end of the data frame, not at the front.

@bkamins
Copy link
Member Author

bkamins commented Mar 13, 2021

My only comment is that source should go at the end of the data frame, not at the front.

OK. I will change it. (it is the place where it goes in joins)

@bkamins
Copy link
Member Author

bkamins commented Mar 13, 2021

OK - I have traced the original issue here #1412. Thank you all for responding.

@bkamins bkamins changed the title add vcat with source add vcat with source; deprecate indicator in joins in favor of source Mar 13, 2021
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
src/join/composer.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Pair{<:SymbolOrString, <:AbstractVector}}=nothing) =
reduce(vcat, dfs; cols=cols, source=source)

"""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan - I have added a docstring for reduce to make this option more discoverable. Could you please have a look at it before I merge? Thank you!

@bkamins bkamins merged commit 639e3b1 into main Mar 15, 2021
@bkamins bkamins deleted the bk/vcat_with_source branch March 15, 2021 13:22
@bkamins
Copy link
Member Author

bkamins commented Mar 15, 2021

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a function for combine/vcat with source names
6 participants