broadcast in DataFrames.jl #1643
This PR will pass tests after #1637 is merged.
Thanks. I have a few hesitations about this.
First, I wonder whether returning a vector when broadcasting over a `DataFrame` is the most useful option. We could implement custom methods which return a `DataFrame`, and allow returning a named tuple to create columns. This would be essentially `transmute` in the dplyr terminology. Or a possible intermediate approach would be to return a vector if the function returns a single value, but a data frame if it returns a named tuple.
The other annoying point is that this syntax will necessarily be slow, and we don't provide a fast alternative (like we do for `by`). Providing a syntax which is convenient but slow could be worse than not providing it at all.
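To make the named-tuple idea concrete, here is a minimal sketch of what it could look like from the user's side. It is not what this PR implements: it assumes a recent DataFrames.jl release, broadcasts explicitly over `eachrow(df)`, and builds the data frame in a separate step. The helper `summarize` and the column names are hypothetical.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# Hypothetical per-row function that returns a named tuple;
# each field of the tuple becomes a column of the result.
summarize(row) = (total = row.a + row.b, diff = row.b - row.a)

# Broadcasting over the rows yields a vector of named tuples ...
rows = summarize.(eachrow(df))

# ... which the DataFrame constructor accepts as a table of rows.
out = DataFrame(rows)
```

With the intermediate approach described above, a function returning a single value would give a plain vector instead, so the shape of the result would depend on what the function returns.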
@@ -0,0 +1,9 @@
module TestDataFrame
Can you put this in an existing file?
OK - I will move it to indexing.jl.
@@ -471,3 +471,70 @@ CSV.write(output, df)
``` | |||

The behavior of CSV functions can be adapted via keyword arguments. For more information, see `?CSV.read` and `?CSV.write`, or checkout the online [CSV.jl documentation](https://juliadata.github.io/CSV.jl/stable/).

## Broadcasting
I'm not sure we should mention this here given that this syntax is inefficient. We should first show how to achieve this efficiently. Broadcasting over columns is also very useful.
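For reference, a minimal sketch of broadcasting over columns, assuming a recent DataFrames.jl release. Each column is an ordinary typed vector, so these calls behave like any other vector broadcast and do not pay the per-row cost discussed above. The column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# Broadcast over whole columns; the result is stored as a new column.
df.c = df.a .+ df.b

# Any function can be applied element-wise to a single column.
labels = string.(df.a)
```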
My point with this PR is exactly to decide what kind of broadcasting we want to provide by default. We essentially have four options for broadcasting purposes:
- make `DataFrame` be treated as scalar, then we do `Ref(df)`;
- make `DataFrame` be treated as a collection of rows, then we do `eachrow(df)`;
- make `DataFrame` be treated as a collection of columns, then we do `eachcol(df, false)`;
- make `DataFrame` be treated as a two dimensional object, like array; an inefficient way to do this would be `Matrix(df)`, probably a more efficient implementation would be needed if we decided on it.
We have to select one. The other three have to be keyed-in by the user manually when broadcasting.
So the question is - how do we want to treat an `AbstractDataFrame` in broadcasting by default (when it is not wrapped by some other object).
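To compare the four options side by side, here is a rough sketch with each wrapper written out explicitly. It assumes a recent DataFrames.jl release, where `eachcol(df)` takes no second argument (the thread uses the `eachcol(df, false)` form that was current at the time). The helper `rowtotal` is hypothetical.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# Hypothetical helper, named only for this illustration.
rowtotal(row) = row.a + row.b

# 1. Scalar: Ref(df) shields df, so only the column names are broadcast over.
as_scalar = getindex.(Ref(df), 1, [:a, :b])   # first-row value of each column

# 2. Collection of rows: the function is called once per row.
per_row = rowtotal.(eachrow(df))

# 3. Collection of columns: the function is called once per column.
per_col = sum.(eachcol(df))

# 4. Two-dimensional object: convert to a matrix first (this copies the data).
as_matrix = string.(Matrix(df))
```

Whichever option becomes the default, the other three remain available through these explicit wrappers.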
 "1"
 "3"

julia> (row -> string.(row)).(df)
I don't think this syntax is recommended (and it's quite obscure for newcomers).
OK - I will remove it.
Given (I have just noticed it as it got covered by inline comments):
I am OK to leave it undefined for now, but at least let us try to decide what should be the dimensionality of
One thing I've been thinking about is if we might want to have our own Having
could be nice. I mean, maybe not, but it's worth wondering if we should save row-iteration for something behind a few more function calls to make it performant.
Here is what my current thinking is (it would solve many challenges we face in one shot):
In this way we keep normal And for example Notice that then we would not need to write:
but simply:
would be enough and efficient. If
at the cost of generation of Do you see any disadvantages of this approach?
Interesting. I'm not sure that would work if we only stored a field: it would have to be part of the type, with e.g. But this discussion is off-topic here. Maybe continue it at #1335?
Closing this as we have a new approach to data frame broadcasting now |
This is a small PR, but a big decision towards making DataFrames.jl broadcasting-ready. What I recommend is:
- treat `AbstractDataFrame` as an `AbstractVector` of `DataFrameRow`s when broadcasting;
- treat `DataFrameRow` as a one-dimensional generic collection when broadcasting.

The major consequence is that the output of broadcasting a function over these types will be a vector in both cases.