Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BREAKING] Multicolumn transformations for GoupedDataFrame #2481

Merged
merged 28 commits into from
Nov 1, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Oct 12, 2020

Fixes #2410

I am opening now to make sure that 0.22 branch of a13db50 does not get lost.

This PR is not finished. I have implemented all (in particular allowing returning multiple columns from functions) except handling AsTable and multipe Symbols as destination columns (this still needs to be implemented as it requires new logic)

@bkamins bkamins added this to the 1.0 milestone Oct 12, 2020
@bkamins bkamins added breaking The proposed change is breaking. grouping priority labels Oct 12, 2020
@bkamins
Copy link
Member Author

bkamins commented Oct 13, 2020

OK - this PR is not cleaned, tested, nor documented yet, but the promised functionality seems (remember - not tested 😄)) to work, so please feel free to experiment with it and comment if you catch something surprising. Thank you!

@bkamins bkamins mentioned this pull request Oct 13, 2020
20 tasks
@bkamins
Copy link
Member Author

bkamins commented Oct 15, 2020

@pdeffebach - with this PR we ensure the following invariant (which I think is relevant for DataFramesMeta.jl design). The result of:

transform(df, args...)

is always the same as the result of

transform(groupby(df, cols), args...)

(also up to errors - if one errors the other also errors)

if args is ByRow transformation.

@pdeffebach
Copy link
Contributor

Wait, why would we want that?

What if I want to perform the operation x .- mean(x)? We put a lot of effort into ensuring transform works by groups.

@bkamins
Copy link
Member Author

bkamins commented Oct 15, 2020

Right - I forgotten to add that args must be ByRow. I will update.

@pdeffebach
Copy link
Contributor

Oh good. You scared me! Yes having the same with ByRow is the correct behavior. Thanks!

@bkamins
Copy link
Member Author

bkamins commented Oct 15, 2020

I still have a lot of tests to write (but now the PR should pass the tests) and documentation to update.
However, @nalimilan - can you please review the changes in docs/src/man/split_apply_combine.md as correcting the documentation is usually a lengthy process, and probably we can reuse things we work-out there in the docstrings later. Thank you!

(all others: feel free to have a look at docs/src/man/split_apply_combine.md if you are interested as it specifies how new rules work)

src/groupeddataframe/splitapplycombine.jl Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Oct 18, 2020

@nalimilan - documentation updates part 2 (and hopefully final) is pushed here. I made adjustments to the manual per the comments given + I have refactored the docstrings. Now they are consistent with the manual and reuse the same template. This has three benefits I believe:

  1. it will be easier to maintain
  2. the docstrings take less space (which is relevant as the files are very large anyway)
  3. the docstrings show a comparison of similarities and differences between available functions

Another review of documentation would be appreciated (after it is done I will review the tests of the functionality to make sure we cover everything and the PR then will be good for a final review).

@bkamins bkamins marked this pull request as ready for review October 19, 2020 15:39
@bkamins
Copy link
Member Author

bkamins commented Oct 19, 2020

I have added additional tests of correctness of the functionality we expose. This should be good for a whole code review.

@bkamins
Copy link
Member Author

bkamins commented Oct 22, 2020

This PR is holding implementation of where PR.

EDIT: I managed to get it without needing this, see #2496

docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
test/grouping.jl Show resolved Hide resolved
test/grouping.jl Outdated Show resolved Hide resolved
test/grouping.jl Outdated Show resolved Hide resolved
test/grouping.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Oct 28, 2020

@nalimilan I have incorportated all the recommended changes apart from syncing manual and docs - I will do it when manual is finalized. It should be ready for another round of reviews. Thank you!

docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
Comment on lines 49 to 52
is undefined (a typical case is that they follow the order of appreance of
respecive values in the grouping columns, but a notable exception is when the
columns are `PooledVector`s, in which case they are ordered accoring to the `pool`
field in these vectors)
Copy link
Member

@nalimilan nalimilan Oct 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's true: in general the order is really undefined due to the dict-like grouping fallback. And the pool of PooledDataArray isn't exposed to users so it's not really useful to tell give them this information.

EDIT: I was wrong, for some reason I hadn't realize that the fallback grouping method uses the order of appearance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the description to make it more precise.

src/groupeddataframe/splitapplycombine.jl Outdated Show resolved Hide resolved
Comment on lines 143 to 145
(not all keyword arguments are supported in all cases; in general they are allowed
in situations when they are meaningful, see the documentation of the specific functions
for details):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how about making this part specific to each function's docstring so that we never mention an argument that a function doesn't support?

(BTW I don't see the change to "signature".)

@bkamins
Copy link
Member Author

bkamins commented Oct 30, 2020

@nalimilan - I have incorporated the comments. Let me know if it is OK to move the descriptions from the manual to the docstring. Thank you!

@bkamins
Copy link
Member Author

bkamins commented Nov 1, 2020

TODO: add NEWS.md entry (when @nalimilan approves moving the manual text into the docstrings).

src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Nov 1, 2020

All suggestions are applied, manual and dosstrings are synchronized, and NEWS.md is updated.

src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins bkamins merged commit 5224175 into JuliaData:master Nov 1, 2020
@bkamins bkamins deleted the multicolumn_grouped_transform branch November 1, 2020 18:22
@bkamins
Copy link
Member Author

bkamins commented Nov 1, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. grouping priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BREAKING] Handling of ByRow
3 participants