-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Consolidate shared method between DataFrame
and Series
into Frame
#9038
Comments
As of #9021 I moved forward with the pandas approach of including both |
Partly addresses #9038 This function consolidate several (trivial) functions from `Series` and `DataFrame` into Frame. `__invert__` was consolidated to shared (more efficient) code path using factory methods. `deserialize` was not consolidated because we have to provide backward compatibility to older classes. But factory method was used for faster class construction. Authors: - Michael Wang (https://github.com/isVoid) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Sheilah Kirui (https://github.com/skirui-source) URL: #9059
Partially solves #9038 This PR moves binary ops from `Series` and `DataFrame` into `Frame` to reduce duplications. Because the arguments takes a different order in `series.py` and `dataframe.py` resort to `functools.wraps` to reorder the arguments while keeping the docstrings the same. Authors: - Michael Wang (https://github.com/isVoid) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #9357
#9378) This PR adds a new IndexedFrame subclass of Frame. For the moment this is a minimal class with one user-facing method and the indexer properties, but as we make progress on #9038 we are likely to find more methods that should live here. Additionally, once we remove references to Frame and Index types from the Cython layer of cudf we will be able to move all index-related logic from Frame into this class and provide a suitable ducktyped interface for methods that need to behave differently for frames depending on whether or not they have an index. In the process I also moved SingleColumnFrame into its own module. One class per file is a good organizational pattern, especially for huge classes like these, and it can also help reduce issues with circular imports and improve mypy runtime. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #9378
This issue has been labeled |
Partial of #9038 This is a rewrite of `reset_index` to share some common logics between `Series` and `DataFrame`. It extends it's capability to handle `level` argument for multi-level index, `col_level` and `col_fill` for multi-level column name support. And adds `name` argument support for series api. Authors: - Michael Wang (https://github.com/isVoid) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #9750
This PR contributes to #9038, merging implementations of memory_usage and __iter__. It also removes an unnecessary override of take (saves about 25% of runtime for data <1M rows) and deprecates Series.merge, which doesn't exist in pandas. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) URL: #10167
This PR builds on #10217 and #10287 to bring full ufunc support for Index types, expanding well beyond the small set previously supported in the `cudf.core.ops` namespace. By using most of the machinery introduced for IndexedFrame in the prior two PRs we avoid duplicating much logic so that all ufunc dispatches flow through a relatively standard path of known methods prior to a common cupy dispatch. With this change we are also able to deprecate the various ufunc operations defined in cudf/core/ops.py that exist only for this purpose as well as a number of Frame methods that are not defined for the corresponding pandas types. Users of those APIs are recommended to calling the corresponding numpy/cupy ufuncs instead to leverage the new dispatch. This PR also fixes a bug where index binary operations that output booleans would previously return instances of GenericIndex, whereas those pandas operations would return numpy arrays. cudf now returns cupy arrays in those cases. Resolves #9083. Contributes to #9038. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #10346
This PR contributes to #9038 by combining common code for `append`, `astype`, `drop`, and `explode`. It's not a small LOC change, but most of the changed lines are just moving code around. Some minor rewrites of `astype` in this PR do significantly speed up those calls (2x faster for <1M rows). Additionally, I've deprecated `append` since it is now deprecated in pandas. Note that it has not been removed in pandas, so we should not remove it until they do. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #10381
This PR resolves #9038. Not all the remaining functions could be consolidated (for various reasons), but the combination was explored and where necessary the reasons not to combine the functions was documented in that issue. Some additional cleanup that was done during the exploration is also included in this PR, but ultimately `groupby` is the only method that was moved up to `IndexedFrame`. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #11315
DataFrame
andSeries
shares much similarity in structure. Operations that works onDataFrame
can work (without much adaptation) inSeries
context, thus, we should consolidate these methods intoFrame
to reduce duplication.Caveats:
Frame
base class is also inherited byIndex
class. This consolidation would allow extra methods that can be operated on Index. It should also be pointed out that operation that depends on the frame having an index cannot be consolidated (such assort_index
).DataFrame
andSeries
. As pointed out by @vyasr in Refactor Frame scans #9021 (comment). Some Pandas API docs also does this by including examples for both dataframe and series in the same docstring section. It's generally more preferable to have less code duplication to have longer docstrings.Potential candidates:
__array_ufunc__
(DataFrame is by necessity substantially more complex due to the need to handle multiple columns, not all of which may exist in all inputs)_concat
_n_largest_or_smallest
add
append
argsort
astype
(Thedescribe
_describe_categorical
helper function needs access tovalue_counts
functionality. Unless we want to port something similar to the column layer (and this would be difficult because it involves a groupby), we need the objects passed to this function (and therefore all other_describe_*
helpers) to beSeries
, notColumn
, which means that thedescribe
needs to remain implemented onSeries
and haveDataFrame
call theSeries
method per column).drop
explode
floordiv
(The requirements of pulling data out of a pandas Series vs a pandas DataFrame are different enough that we can't consolidate that much common code here)from_pandas
groupby
iloc
index
(Decided against consolidation, the behavior of DataFrame is too different and Series/Index are one-liners)isin
iteritems
loc
memory_usage
mod
(Requiresmode
value_counts
, same issue asdescribe
)mul
nlargest
(merged to _n_largest_or_smallest)nsmallest (merged to _n_largest_or_smallest)
pow
(The Series method is quite simple and already usesquantile
_from_data
, and most of the logic revolves around handling the return type so there isn't much overhead added when calling it from theDataFrame
API. Attempting to combine them ends up resulting in longer code, especially since their docstrings have slightly different parameters).radd
reset_index
rfloordiv
rmod
rmul
rpow
rsub
rtruediv
sort_values
sub
to_dict
(This function relies on corresponding column methods that return Series objects. Since pandas is not based on a columnar store but rather a block manager, there is no analogous approach to initialize different type of pandas "Frame"s from a single type of data)to_pandas
truediv
values
The text was updated successfully, but these errors were encountered: