request - @chinWithMutate #114

Open
Lincoln-Hannah opened this issue Aug 28, 2024 · 4 comments

@Lincoln-Hannah

Lincoln-Hannah commented Aug 28, 2024

Would you consider creating a @chainWithMutate macro that differs from the standard @chain macro in one way?
If a line begins with variablename = and a DataFrame is passed from the line above, the line is treated as if it started with @mutate. So instead of writing:

@chain begin
    DataFrame(a=1:10)

    @mutate  b = 2a
    @mutate  c = 3b
end

one could just write

@chainWithMutate begin
    DataFrame(a=1:10)

    b = 2a
    c = 3b
end

Lincoln-Hannah changed the title from "some requests" to "request - using calculated columns" on Aug 30, 2024
Lincoln-Hannah changed the title from "request - using calculated columns" to "request - @chinWithMutate" on Sep 9, 2024
@kdpsingh
Member

kdpsingh commented Sep 9, 2024

Hi @Lincoln-Hannah, sorry for the delay in getting back to you. This is a solid idea - I want to share some initial thoughts on why @mutate() currently functions the way it does, and how we might get closer to what you are looking for.

Right now, @mutate() supports the multi-line syntax you propose here, but it doesn't support cases where one argument relies on a variable created by a previous argument. In the above example, c = 3b relies on the existence of b, which was created by the previous argument. The current behavior is intentional: the limitation comes from DataFrames.transform(), which is designed this way for performance. Namely, DataFrames assumes that the arguments are independent and can be parallelized, and thus run faster.

There are 2 ways that we could fix this:

  1. Implement the @chainWithMutate macro you propose above. I don't like the name (because it would be used inside an existing @chain macro), but we could consider an alternative name like @mutates(), where the s makes it look plural and stands for "sequential".
  2. The second approach, which I would strongly prefer, is for the @mutate() macro to analyze the variables being created (e.g., b and c) and the variables being used (e.g., a and b), and to automatically run them sequentially in separate calls to DataFrames.transform() if a dependency is detected.

This would be more of a new feature than a bug-fix, so it's slightly lower priority, but I think that option 2 is doable and is something we should pursue.
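
A minimal sketch of the dependency check described in option 2, written against plain Julia expressions. This is not how @mutate() is implemented in the package; the helper names `symbols_in` and `sequential_batches` are hypothetical, and the sketch only groups the assignments into batches that could each go into a separate DataFrames.transform() call.

```julia
# Collect every symbol that appears in an expression. This is a rough stand-in
# for "columns used": it also picks up function names such as :+ or :*.
function symbols_in(ex)
    ex isa Symbol && return Set([ex])
    ex isa Expr || return Set{Symbol}()
    return mapreduce(symbols_in, union, ex.args; init = Set{Symbol}())
end

# Split assignments such as :(b = 2a) into batches so that no assignment in a
# batch reads a column created earlier in that same batch.
function sequential_batches(assignments::Vector{Expr})
    batches = Vector{Vector{Expr}}()
    created = Set{Symbol}()              # columns created in the current batch
    for ex in assignments
        Meta.isexpr(ex, :(=)) || throw(ArgumentError("expected `name = expression`"))
        lhs, rhs = ex.args
        if isempty(batches) || !isempty(intersect(symbols_in(rhs), created))
            push!(batches, Expr[])       # dependency found: start a new batch
            empty!(created)
        end
        push!(batches[end], ex)
        push!(created, lhs)
    end
    return batches
end

# c depends on b, so it starts a second batch; d only needs a, so it joins c.
batches = sequential_batches([:(b = 2a), :(c = 3b), :(d = a + 1)])
```

The check is deliberately conservative: a function name that happens to match a newly created column would also force a new batch, which costs some parallelism but never produces a wrong result.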

@Lincoln-Hannah
Author

Option 2 is fine. Thank you for considering it.

@chainWithMutate would be more difficult to implement. The idea is that it would be used instead of the @chain macro (not sit within one). All the other macros would work within it, but a line starting with variable = would be treated as a @mutate line. I find that about two thirds of the lines I write within a @chain block are @mutate lines, often interspersed with @filter and other macros. It would just be cleaner if I didn't have to keep repeating @mutate.

@chainWithMutate begin
    DataFrame(a=1:10)

    b = 2a

    @filter b > 10

    c = 2b
end

@kdpsingh
Member

Ah I see what you mean. We probably won't add this macro to the package but it's definitely doable. I can try to put together a code snippet as a starting point if that would be of interest.
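
A minimal sketch of one possible starting point, offered as an assumption rather than code from the package or its maintainers. The helper `add_mutate` is a hypothetical name; it rewrites every top-level `name = expression` line into `@mutate name = expression`, and the macro then hands the rewritten block to @chain. A macro cannot know at expansion time whether a DataFrame is being passed from the line above, so this sketch treats every top-level assignment as a @mutate line.

```julia
# Turn each top-level `name = expression` line of a `begin ... end` block into
# `@mutate name = expression`, leaving all other lines untouched.
function add_mutate(block::Expr)
    Meta.isexpr(block, :block) || throw(ArgumentError("expected a begin ... end block"))
    lines = map(block.args) do line
        if Meta.isexpr(line, :(=)) && line.args[1] isa Symbol
            Expr(:macrocall, Symbol("@mutate"), nothing, line)
        else
            line
        end
    end
    return Expr(:block, lines...)
end

# Hypothetical macro: rewrite the block, then pass it to `@chain`.
# `@chain` and `@mutate` are resolved where the macro is used, so the packages
# that provide them must be loaded at the call site.
macro chainWithMutate(block)
    rewritten = add_mutate(block)
    return esc(Expr(:macrocall, Symbol("@chain"), __source__, rewritten))
end

# Inspect the rewrite on its own; quoting does not expand the inner macros.
example = add_mutate(quote
    DataFrame(a = 1:10)
    b = 2a
    @filter b > 10
    c = 2b
end)
```

Printing `example` shows the block that @chain would receive. Whether dependent columns work across lines is not an issue here, because each rewritten line becomes its own @mutate call.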

@Lincoln-Hannah
Author

Lincoln-Hannah commented Sep 12, 2024

Very much so.
I really think if people used it, they would like it.
There are so many @chain blocks I've written with lots of @mutate lines interspersed with @filter, @pivot, and @join lines. Not having to write @mutate every time would save a lot of code.
