Note: Version 0.3.0 is the first big update in a while, and changes a lot of the "base" code. The `pandas-ply` package is no longer imported; I have written my own version of the "symbolic" objects that I was borrowing from `pandas-ply`. Also, I am no longer supporting Python 2, sorry!
In v0.3, `groupby` has been renamed to `group_by` to mirror the `dplyr` function. If this breaks your legacy code, one possible fix is to add `from dfply.group import group_by as groupby` to your package imports.
The `dfply` package makes it possible to do R's `dplyr`-style data manipulation with pipes in Python on pandas DataFrames. It is an alternative to `pandas-ply` and `dplython`, which both engineer `dplyr` syntax and functionality in Python. There are probably more packages that attempt to enable `dplyr`-style dataframe manipulation in Python, but those are the two I am aware of.
`dfply` uses a decorator-based architecture for the piping functionality and to "categorize" the types of data manipulation functions. The goal of this architecture is to make `dfply` concise and easily extensible, simply by chaining together different decorators that each have a distinct effect on the wrapped function. There is a more in-depth overview of the decorators and how `dfply` can be customized below.
`dfply` is intended to mimic the functionality of `dplyr`. The syntax is the same for the most part, but will vary in some cases because Python is a considerably different programming language from R.

A good amount of the core functionality of `dplyr` is complete, and the remainder is actively being added. Going forward, I hope functionality that is not directly part of `dplyr` will be incorporated into `dfply` as well. This is not intended to be an exact mimic of `dplyr`, but rather a port of the ease, convenience, and readability the `dplyr` package provides for data manipulation tasks.
Expect frequent updates to the package version as features are added and any bugs are fixed.
- Overview of functions
- Embedded column functions
- Extending `dfply` with custom functions
- Advanced: understanding base `dfply` decorators
- Contributing
`dfply` works directly on pandas DataFrames, chaining operations on the data with the `>>` operator, or alternatively starting with `>>=` for inplace operations.
from dfply import *
diamonds >> head(3)
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
You can chain piped operations, and of course assign the output to a new DataFrame.
lowprice = diamonds >> head(10) >> tail(3)
lowprice
carat cut color clarity depth table price x y z
7 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
9 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
Inplace operations are done with the first pipe as `>>=` and subsequent pipes as `>>`.
diamonds >>= head(10) >> tail(3)
diamonds
carat cut color clarity depth table price x y z
7 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
9 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
When using the inplace pipe there is no need to reassign the output (as in `diamonds = diamonds >> head(10)`); the DataFrame variable on the left of `>>=` is overwritten with the output of the operations.
The DataFrame as it is passed through the piping operations is represented by the symbol `X`. It records the actions you want to take (represented by the `Intention` class), but does not evaluate them until the appropriate time. Operations on the DataFrame are deferred. Selecting two of the columns, for example, can be done using the symbolic `X` DataFrame during the piping operations.
diamonds >> select(X.carat, X.cut) >> head(3)
carat cut
0 0.23 Ideal
1 0.21 Premium
2 0.23 Good
There are two functions for selection, inverse of each other: `select` and `drop`. The `select` and `drop` functions accept string labels, integer positions, and/or symbolically represented column names (`X.column`). They also accept symbolic "selection filter" functions, which will be covered shortly.
The example below selects "cut", "price", "x", and "y" from the diamonds dataset.
diamonds >> select(1, X.price, ['x', 'y']) >> head(2)
cut price x y
0 Ideal 326 3.95 3.98
1 Premium 326 3.89 3.84
If you were instead to use `drop`, you would get back all columns besides those specified.
diamonds >> drop(1, X.price, ['x', 'y']) >> head(2)
carat color clarity depth table z
0 0.23 E SI2 61.5 55.0 2.43
1 0.21 E SI1 59.8 61.0 2.31
One particularly nice thing about `dplyr`'s selection functions is that you can drop columns inside of a select statement by putting a subtraction sign in front, like so: `... %>% select(-col)`. The same can be done in `dfply`, but instead of the subtraction operator you use the tilde `~`.
For example, let's say I wanted to select any column except carat, color, and
clarity in my dataframe. One way to do this is to specify those for removal using
the `~` operator like so:
diamonds >> select(~X.carat, ~X.color, ~X.clarity) >> head(2)
cut depth table price x y z
0 Ideal 61.5 55.0 326 3.95 3.98 2.43
1 Premium 59.8 61.0 326 3.89 3.84 2.31
Note that if you are going to use the inversion operator, you must place it prior to the symbolic `X` (or a symbolic such as a selection filter function, covered next). For example, using the inversion operator on a list of columns will result in an error:
diamonds >> select(~[X.carat, X.color, X.clarity]) >> head(2)
TypeError: bad operand type for unary ~: 'list'
The vanilla `select` and `drop` functions are useful, but there are a variety of selection functions inspired by `dplyr` available to make selecting and dropping columns a breeze. These functions are intended to be put inside of the `select` and `drop` functions, and can be paired with the `~` inverter.
First, a quick rundown of the available functions:
- `starts_with(prefix)`: find columns that start with a string prefix.
- `ends_with(suffix)`: find columns that end with a string suffix.
- `contains(substr)`: find columns that contain a substring in their name.
- `everything()`: all columns.
- `columns_between(start_col, end_col, inclusive=True)`: find columns between a specified start and end column. The `inclusive` boolean keyword argument indicates whether the end column should be included or not.
- `columns_to(end_col, inclusive=True)`: get columns up to a specified end column. The `inclusive` argument indicates whether the ending column should be included or not.
- `columns_from(start_col)`: get the columns starting at a specified column.
The selection filter functions are best explained by example. Let's say I wanted to select only the columns that started with a "c":
diamonds >> select(starts_with('c')) >> head(2)
carat cut color clarity
0 0.23 Ideal E SI2
1 0.21 Premium E SI1
The selection filter functions are instances of the class `Intention`, just like the `X` placeholder, and so I can also use the inversion operator with them. For example, I can alternatively select the columns that do not start with "c":
diamonds >> select(~starts_with('c')) >> head(2)
depth table price x y z
0 61.5 55.0 326 3.95 3.98 2.43
1 59.8 61.0 326 3.89 3.84 2.31
They work the same inside the `drop` function, but with the intention of removal. I could, for example, use the `columns_from` selection filter to drop all columns from "price" onwards:
diamonds >> drop(columns_from(X.price)) >> head(2)
carat cut color clarity depth table
0 0.23 Ideal E SI2 61.5 55.0
1 0.21 Premium E SI1 59.8 61.0
As the example above shows, you can use symbolic column names inside of the selection filter function! You can also mix together selection filters and standard selections inside of the same `select` or `drop` command.
For my next trick, I will select the first two columns, the last two columns, and the "depth" column using a mixture of selection techniques:
diamonds >> select(columns_to(1, inclusive=True), 'depth', columns_from(-2)) >> head(2)
carat cut depth y z
0 0.23 Ideal 61.5 3.98 2.43
1 0.21 Premium 59.8 3.84 2.31
Slices of rows can be selected with the `row_slice()` function. You can pass a single integer index or a list of indices to select the corresponding rows. This is the same as using pandas' `.iloc`.
diamonds >> row_slice([10,15])
carat cut color clarity depth table price x y z
10 0.30 Good J SI1 64.0 55.0 339 4.25 4.28 2.73
15 0.32 Premium E I1 60.9 58.0 345 4.38 4.42 2.68
Note that this can also be used with the `group_by` function, and will operate like a call to `.iloc` on each group. The `group_by` pipe function is covered later, but it essentially works the same as pandas `.groupby` (with a few subtle differences).
diamonds >> group_by('cut') >> row_slice(5)
carat cut color clarity depth table price x y z
128 0.91 Fair H SI2 64.4 57.0 2763 6.11 6.09 3.93
20 0.30 Good I SI2 63.3 56.0 351 4.26 4.30 2.71
40 0.33 Ideal I SI2 61.2 56.0 403 4.49 4.50 2.75
26 0.24 Premium I VS1 62.5 57.0 355 3.97 3.94 2.47
21 0.23 Very Good E VS2 63.8 55.0 352 3.85 3.92 2.48
The `sample()` function works exactly like pandas' `.sample()` method for DataFrames. Arguments and keyword arguments are passed through to the DataFrame's sample method.
diamonds >> sample(frac=0.0001, replace=False)
carat cut color clarity depth table price x y z
19736 1.02 Ideal E VS1 62.2 54.0 8303 6.43 6.46 4.01
37159 0.32 Premium D VS2 60.3 60.0 972 4.44 4.42 2.67
1699 0.72 Very Good E VS2 63.8 57.0 3035 5.66 5.69 3.62
20955 1.71 Very Good J VS2 62.6 55.0 9170 7.58 7.65 4.77
5168 0.91 Very Good E SI2 63.0 56.0 3772 6.12 6.16 3.87
diamonds >> sample(n=3, replace=True)
carat cut color clarity depth table price x y z
52892 0.73 Very Good G SI1 60.6 59.0 2585 5.83 5.85 3.54
39454 0.57 Ideal H SI2 62.3 56.0 1077 5.31 5.28 3.30
39751 0.43 Ideal H VVS1 62.3 54.0 1094 4.84 4.85 3.02
Selection of unique rows is done with `distinct()`, which similarly passes arguments and keyword arguments through to the DataFrame's `.drop_duplicates()` method.
diamonds >> distinct(X.color)
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
7 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
12 0.22 Premium F SI1 60.4 61.0 342 3.88 3.84 2.33
25 0.23 Very Good G VVS2 60.4 58.0 354 3.97 4.01 2.41
28 0.23 Very Good D VS2 60.5 61.0 357 3.96 3.97 2.40
Filtering rows with logical criteria is done with `mask()`, which accepts boolean arrays, "masking out" rows labeled False and keeping rows labeled True. These are best created with logical statements on symbolic Series objects as shown below. Multiple criteria can be supplied as arguments, and their intersection will be used as the mask.
diamonds >> mask(X.cut == 'Ideal') >> head(4)
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
11 0.23 Ideal J VS1 62.8 56.0 340 3.93 3.90 2.46
13 0.31 Ideal J SI2 62.2 54.0 344 4.35 4.37 2.71
16 0.30 Ideal I SI2 62.0 54.0 348 4.31 4.34 2.68
diamonds >> mask(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)
carat cut color clarity depth table price x y z
26683 0.33 Ideal E SI2 62.2 54.0 427 4.44 4.46 2.77
32297 0.34 Ideal E SI2 62.4 54.0 454 4.49 4.52 2.81
40928 0.30 Ideal E SI1 61.6 54.0 499 4.32 4.35 2.67
50623 0.30 Ideal E SI2 62.1 54.0 401 4.32 4.35 2.69
50625 0.30 Ideal E SI2 62.0 54.0 401 4.33 4.35 2.69
Alternatively, `mask()` can be called using the alias `filter_by()`:
diamonds >> filter_by(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)
carat cut color clarity depth table price x y z
26683 0.33 Ideal E SI2 62.2 54.0 427 4.44 4.46 2.77
32297 0.34 Ideal E SI2 62.4 54.0 454 4.49 4.52 2.81
40928 0.30 Ideal E SI1 61.6 54.0 499 4.32 4.35 2.67
50623 0.30 Ideal E SI2 62.1 54.0 401 4.32 4.35 2.69
50625 0.30 Ideal E SI2 62.0 54.0 401 4.33 4.35 2.69
`pull` simply retrieves a column and returns it as a pandas Series, in case you only care about one particular column at the end of your pipeline. Columns can be specified either by name (string) or by integer position. The default returns the last column (on the assumption that it is the column you created most recently).

Example:
(diamonds
>> filter_by(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)
>> pull('carat'))
26683 0.33
32297 0.34
40928 0.30
50623 0.30
50625 0.30
Name: carat, dtype: float64
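Since the default is the last column, `pull()` with no arguments pairs naturally with a preceding `mutate()`. The column name below is purely illustrative:

diamonds >> mutate(price_per_carat=X.price / X.carat) >> pull()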
New variables can be created with the `mutate()` function (named that way to match `dplyr`).
diamonds >> mutate(x_plus_y=X.x + X.y) >> select(columns_from('x')) >> head(3)
x y z x_plus_y
0 3.95 3.98 2.43 7.93
1 3.89 3.84 2.31 7.73
2 4.05 4.07 2.31 8.12
Multiple variables can be created in a single call.
diamonds >> mutate(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> select(columns_from('x')) >> head(3)
x y z x_plus_y y_div_z
0 3.95 3.98 2.43 7.93 1.637860
1 3.89 3.84 2.31 7.73 1.662338
2 4.05 4.07 2.31 8.12 1.761905
Note: new variables created with `mutate()` may not be guaranteed to appear in the same order they were passed into the function call, since keyword argument ordering was not preserved before Python 3.6.
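If creation order matters, for example when one new column depends on another, a safe pattern is to chain separate `mutate()` calls so that each column exists before it is referenced. A minimal sketch:

diamonds >> mutate(x_plus_y=X.x + X.y) >> mutate(x_plus_y_sq=X.x_plus_y ** 2) >> head(3)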
The `transmute()` function is a combination of a mutate and a selection of the created variables.
diamonds >> transmute(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> head(3)
x_plus_y y_div_z
0 7.93 1.637860
1 7.73 1.662338
2 8.12 1.761905
DataFrames are grouped along variables using the `group_by()` function and ungrouped with the `ungroup()` function. Functions chained after grouping a DataFrame are applied by group until the data is returned or ungrouped. Hierarchical/multi-indexing is automatically removed.
Note: in the example below, the `lead()` and `lag()` functions are dfply convenience wrappers around the pandas `.shift()` Series method.
(diamonds >> group_by(X.cut) >>
mutate(price_lead=lead(X.price), price_lag=lag(X.price)) >>
head(2) >> select(X.cut, X.price, X.price_lead, X.price_lag))
cut price price_lead price_lag
8 Fair 337 2757.0 NaN
91 Fair 2757 2759.0 337.0
2 Good 327 335.0 NaN
4 Good 335 339.0 327.0
0 Ideal 326 340.0 NaN
11 Ideal 340 344.0 326.0
1 Premium 326 334.0 NaN
3 Premium 334 342.0 326.0
5 Very Good 336 336.0 NaN
6 Very Good 336 337.0 336.0
Sorting is done by the `arrange()` function, which wraps around the pandas `.sort_values()` DataFrame method. Arguments and keyword arguments are passed through to that method.
diamonds >> arrange(X.table, ascending=False) >> head(5)
carat cut color clarity depth table price x y z
24932 2.01 Fair F SI1 58.6 95.0 13387 8.32 8.31 4.87
50773 0.81 Fair F SI2 68.8 79.0 2301 5.26 5.20 3.58
51342 0.79 Fair G SI1 65.3 76.0 2362 5.52 5.13 3.35
52860 0.50 Fair E VS2 79.0 73.0 2579 5.21 5.18 4.09
49375 0.70 Fair H VS1 62.0 73.0 2100 5.65 5.54 3.47
(diamonds >> group_by(X.cut) >> arrange(X.price) >>
head(3) >> ungroup() >> mask(X.carat < 0.23))
carat cut color clarity depth table price x y z
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
12 0.22 Premium F SI1 60.4 61.0 342 3.88 3.84 2.33
The `rename()` function renames columns. The keyword argument keys become the new column names, and their values indicate the existing columns to rename, specified either with symbols or with string labels.
diamonds >> rename(CUT=X.cut, COLOR='color') >> head(2)
carat CUT COLOR clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
Transforming between "wide" and "long" format is a common pattern in data munging. The `gather(key, value, *columns)` function melts the specified columns of your DataFrame into two key:value columns.
diamonds >> gather('variable', 'value', ['price', 'depth','x','y','z']) >> head(5)
carat cut color clarity table variable value
0 0.23 Ideal E SI2 55.0 price 326.0
1 0.21 Premium E SI1 61.0 price 326.0
2 0.23 Good E VS1 65.0 price 327.0
3 0.29 Premium I VS2 58.0 price 334.0
4 0.31 Good J SI2 58.0 price 335.0
Without any columns specified, your entire DataFrame will be transformed into two key:value pair columns.
diamonds >> gather('variable', 'value') >> head(5)
variable value
0 carat 0.23
1 carat 0.21
2 carat 0.23
3 carat 0.29
4 carat 0.31
If the `add_id` keyword argument is set to True, an id column is added to the new elongated DataFrame that acts as a row id from the original wide DataFrame.
elongated = diamonds >> gather('variable', 'value', add_id=True)
elongated >> head(5)
_ID variable value
0 0 carat 0.23
1 1 carat 0.21
2 2 carat 0.23
3 3 carat 0.29
4 4 carat 0.31
Likewise, you can transform a "long" DataFrame into a "wide" format with the `spread(key, values)` function. Converting the previously created elongated DataFrame, for example, would be done like so.
widened = elongated >> spread(X.variable, X.value)
widened >> head(5)
_ID carat clarity color cut depth price table x y z
0 0 0.23 SI2 E Ideal 61.5 326 55 3.95 3.98 2.43
1 1 0.21 SI1 E Premium 59.8 326 61 3.89 3.84 2.31
2 10 0.3 SI1 J Good 64 339 55 4.25 4.28 2.73
3 100 0.75 SI1 D Very Good 63.2 2760 56 5.8 5.75 3.65
4 1000 0.75 SI1 D Ideal 62.3 2898 55 5.83 5.8 3.62
In this case the `_ID` column comes in handy, since spreading requires that there be no duplicated identifiers.
If you have a mixed datatype column in your long-format DataFrame then the default behavior is for the spread columns to be of type object.
widened.dtypes
_ID int64
carat object
clarity object
color object
cut object
depth object
price object
table object
x object
y object
z object
dtype: object
If you want to try to convert dtypes when spreading, you can set the `convert` keyword argument in `spread` to True, like so.
widened = elongated >> spread(X.variable, X.value, convert=True)
widened.dtypes
_ID int64
carat float64
clarity object
color object
cut object
depth float64
price int64
table float64
x float64
y float64
z float64
dtype: object
Columns can be split into multiple columns with the `separate(column, into, sep="[\W_]+", remove=True, convert=False, extra='drop', fill='right')` function. `separate()` takes a variety of arguments:
- `column`: the column to split.
- `into`: the names of the new columns.
- `sep`: either a regex string or integer positions to split the column on.
- `remove`: boolean indicating whether to remove the original column.
- `convert`: boolean indicating whether the new columns should be converted to the appropriate type (same as in `spread` above).
- `extra`: either `drop`, where split pieces beyond the specified new columns are dropped, or `merge`, where the final split piece contains the remainder of the original column.
- `fill`: either `right`, where `np.nan` values are filled into the right-most columns for missing pieces, or `left`, where `np.nan` values are filled into the left-most columns.
print(d)
a
0 1-a-3
1 1-b
2 1-c-3-4
3 9-d-1
4 10
d >> separate(X.a, ['col1', 'col2'], remove=True, convert=True,
extra='drop', fill='right')
col1 col2
0 1 a
1 1 b
2 1 c
3 9 d
4 10 NaN
d >> separate(X.a, ['col1', 'col2'], remove=True, convert=True,
extra='drop', fill='left')
col1 col2
0 1.0 a
1 1.0 b
2 1.0 c
3 9.0 d
4 NaN 10
d >> separate(X.a, ['col1', 'col2'], remove=False, convert=True,
extra='merge', fill='right')
a col1 col2
0 1-a-3 1 a-3
1 1-b 1 b
2 1-c-3-4 1 c-3-4
3 9-d-1 9 d-1
4 10 10 NaN
d >> separate(X.a, ['col1', 'col2', 'col3'], sep=[2,4], remove=True, convert=True,
extra='merge', fill='right')
col1 col2 col3
0 1- a- 3
1 1- b NaN
2 1- c- 3-4
3 9- d- 1
4 10 NaN NaN
The `unite(colname, *args, sep='_', remove=True, na_action='maintain')` function does the inverse of `separate()`, joining columns together by a separator. Any columns that are not strings will be converted to strings. The arguments for `unite()` are:
- `colname`: the name of the new joined column.
- `*args`: list of columns to be joined, which can be strings, symbolic, or integer positions.
- `sep`: the string separator to join the columns with.
- `remove`: boolean indicating whether or not to remove the original columns.
- `na_action`: can be one of `"maintain"` (the default), `"ignore"`, or `"as_string"`. The default `"maintain"` will make the new column row a `NaN` value if any of the original column cells at that row contained `NaN`. `"ignore"` will treat any `NaN` value as an empty string during joining. `"as_string"` will convert any `NaN` value to the string `"nan"` prior to joining.
print(d)
a b c
0 1 a True
1 2 b False
2 3 c NaN
d >> unite('united', X.a, 'b', 2, remove=False, na_action='maintain')
a b c united
0 1 a True 1_a_True
1 2 b False 2_b_False
2 3 c NaN NaN
d >> unite('united', ['a','b','c'], remove=True, na_action='ignore', sep='*')
united
0 1*a*True
1 2*b*False
2 3*c
d >> unite('united', d.columns, remove=True, na_action='as_string')
united
0 1_a_True
1 2_b_False
2 3_c_nan
Currently implemented joins are:

- `inner_join(other, by='column')`
- `outer_join(other, by='column')` (which works the same as `full_join()`)
- `right_join(other, by='column')`
- `left_join(other, by='column')`
- `semi_join(other, by='column')`
- `anti_join(other, by='column')`
The functionality of the join functions is outlined with the toy example DataFrames below.
a = pd.DataFrame({
'x1':['A','B','C'],
'x2':[1,2,3]
})
b = pd.DataFrame({
'x1':['A','B','D'],
'x3':[True,False,True]
})
`inner_join()` joins on values present in both DataFrames' `by` columns.
a >> inner_join(b, by='x1')
x1 x2 x3
0 A 1 True
1 B 2 False
`outer_join()` merges DataFrames together on values present in either frame's `by` columns.
a >> outer_join(b, by='x1')
x1 x2 x3
0 A 1.0 True
1 B 2.0 False
2 C 3.0 NaN
3 D NaN True
`left_join()` merges on the values present in the left DataFrame's `by` columns.
a >> left_join(b, by='x1')
x1 x2 x3
0 A 1 True
1 B 2 False
2 C 3 NaN
`right_join()` merges on the values present in the right DataFrame's `by` columns.
a >> right_join(b, by='x1')
x1 x2 x3
0 A 1.0 True
1 B 2.0 False
2 D NaN True
`semi_join()` returns all of the rows in the left DataFrame that have a match in the right DataFrame in the `by` columns.
a >> semi_join(b, by='x1')
x1 x2
0 A 1
1 B 2
`anti_join()` returns all of the rows in the left DataFrame that do not have a match in the right DataFrame within the `by` columns.
a >> anti_join(b, by='x1')
x1 x2
2 C 3
The set operation functions filter a DataFrame based on row comparisons with another DataFrame.
Each of the set operation functions `union()`, `intersect()`, and `set_diff()` takes the same arguments:

- `other`: the DataFrame to compare to.
- `index`: a boolean (default `False`) indicating whether to consider the pandas index during comparison.
- `keep`: a string (default `"first"`) passed through to `.drop_duplicates()`, controlling how duplicate rows are handled.

With set operations, columns are expected to be in the same order in both DataFrames. The function examples use the following two toy DataFrames.
a = pd.DataFrame({
'x1':['A','B','C'],
'x2':[1,2,3]
})
c = pd.DataFrame({
'x1':['B','C','D'],
'x2':[2,3,4]
})
The `union()` function returns rows that appear in either DataFrame.
a >> union(c)
x1 x2
0 A 1
1 B 2
2 C 3
2 D 4
`intersect()` returns rows that appear in both DataFrames.
a >> intersect(c)
x1 x2
0 B 2
1 C 3
`set_diff()` returns the rows in the left DataFrame that do not appear in the right DataFrame.
a >> set_diff(c)
x1 x2
0 A 1
`dfply` comes with convenience wrappers around `pandas.concat()` for joining DataFrames by rows or by columns. The toy DataFrames below (`a` and `b`) are the same as the ones used to demonstrate the join functions above.
The `bind_rows(other, join='outer', ignore_index=False)` function is an exact call to `pandas.concat([df, other], join=join, ignore_index=ignore_index, axis=0)`, joining two DataFrames "vertically".
a >> bind_rows(b, join='inner')
x1
0 A
1 B
2 C
0 A
1 B
2 D
a >> bind_rows(b, join='outer')
x1 x2 x3
0 A 1.0 NaN
1 B 2.0 NaN
2 C 3.0 NaN
0 A NaN True
1 B NaN False
2 D NaN True
Note that `bind_rows()` does not reset the index for you!
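If you do want a fresh integer index on the combined DataFrame, one option is to reset it afterward with the ordinary pandas method (plain pandas here, not a dfply verb):

combined = a >> bind_rows(b, join='outer')
combined = combined.reset_index(drop=True)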
The `bind_cols(other, join='outer', ignore_index=False)` function is likewise just a call to `pandas.concat([df, other], join=join, ignore_index=ignore_index, axis=1)`, joining DataFrames "horizontally".
a >> bind_cols(b)
x1 x2 x1 x3
0 A 1 A True
1 B 2 B False
2 C 3 D True
Note that you may well end up with duplicate column labels after binding columns, as can be seen above.
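If the duplicated labels cause trouble downstream, one manual workaround is to assign unique labels on the result; the suffix scheme below is just an illustrative choice:

bound = a >> bind_cols(b)
bound.columns = ['x1_a', 'x2', 'x1_b', 'x3']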
There are two summarization functions in `dfply` that match `dplyr`: `summarize` and `summarize_each` (though these functions use the 'z' spelling rather than 's').
`summarize(**kwargs)` takes an arbitrary number of keyword arguments; each becomes a new column, labeled with the key, containing a summary function of columns in the original DataFrame.
diamonds >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
price_mean price_std
0 3932.799722 3989.439738
`summarize()` can of course be used with groupings as well.
diamonds >> group_by('cut') >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
cut price_mean price_std
0 Fair 4358.757764 3560.386612
1 Good 3928.864452 3681.589584
2 Ideal 3457.541970 3808.401172
3 Premium 4584.257704 4349.204961
4 Very Good 3981.759891 3935.862161
`summarize_each(function_list, *columns)` is a more general summarization function. It takes a list of summary functions to apply as its first argument, followed by the columns to apply them to. For convenience, columns can be specified with symbolics, string labels, or integer positions, as in the selection functions.
diamonds >> summarize_each([np.mean, np.var], X.price, 'depth')
price_mean price_var depth_mean depth_var
0 3932.799722 1.591533e+07 61.749405 2.052366
`summarize_each()` works with groupings as well.
diamonds >> group_by(X.cut) >> summarize_each([np.mean, np.var], X.price, 4)
cut price_mean price_var depth_mean depth_var
0 Fair 4358.757764 1.266848e+07 64.041677 13.266319
1 Good 3928.864452 1.355134e+07 62.365879 4.705224
2 Ideal 3457.541970 1.450325e+07 61.709401 0.516274
3 Premium 4584.257704 1.891421e+07 61.264673 1.342755
4 Very Good 3981.759891 1.548973e+07 61.818275 1.900466
UNDER CONSTRUCTION: documentation not complete.
Like `dplyr`, the `dfply` package provides functions to perform various operations on pandas Series. These are typically window functions and summarization functions, and wrap symbolic arguments in function calls.

Window functions perform operations on vectors of values and return a vector of the same length.

The `lead(series, n)` function pushes values in a vector upward, adding `NaN` values in the end positions. Likewise, the `lag(series, n)` function pushes values downward, inserting `NaN` values in the initial positions. Both are calls to the pandas `Series.shift()` method under the hood.
(diamonds >> mutate(price_lead=lead(X.price, 2), price_lag=lag(X.price, 2)) >>
select(X.price, -2, -1) >>
head(6))
price price_lag price_lead
0 326 NaN 327.0
1 326 NaN 334.0
2 327 326.0 335.0
3 334 326.0 336.0
4 335 327.0 336.0
5 336 334.0 337.0
The `between(series, a, b, inclusive=False)` function checks whether values fall between two given bookend values.
diamonds >> select(X.price) >> mutate(price_btwn=between(X.price, 330, 340)) >> head(6)
price price_btwn
0 326 False
1 326 False
2 327 False
3 334 True
4 335 True
5 336 True
The `dense_rank(series, ascending=True)` function is a wrapper around the scipy function for calculating dense rank.
diamonds >> select(X.price) >> mutate(price_drank=dense_rank(X.price)) >> head(6)
price price_drank
0 326 1.0
1 326 1.0
2 327 2.0
3 334 3.0
4 335 4.0
5 336 5.0
Likewise, `min_rank(series, ascending=True)` is a wrapper around the scipy ranking function with min rank specified.
diamonds >> select(X.price) >> mutate(price_mrank=min_rank(X.price)) >> head(6)
price price_mrank
0 326 1.0
1 326 1.0
2 327 3.0
3 334 4.0
4 335 5.0
5 336 6.0
The `cumsum(series)` function calculates the cumulative sum of a column.
diamonds >> select(X.price) >> mutate(price_cumsum=cumsum(X.price)) >> head(6)
price price_cumsum
0 326 326
1 326 652
2 327 979
3 334 1313
4 335 1648
5 336 1984
`cummean(series)`
diamonds >> select(X.price) >> mutate(price_cummean=cummean(X.price)) >> head(6)
price price_cummean
0 326 326.000000
1 326 326.000000
2 327 326.333333
3 334 328.250000
4 335 329.600000
5 336 330.666667
`cummax(series)`
diamonds >> select(X.price) >> mutate(price_cummax=cummax(X.price)) >> head(6)
price price_cummax
0 326 326.0
1 326 326.0
2 327 327.0
3 334 334.0
4 335 335.0
5 336 336.0
`cummin(series)`
diamonds >> select(X.price) >> mutate(price_cummin=cummin(X.price)) >> head(6)
price price_cummin
0 326 326.0
1 326 326.0
2 327 326.0
3 334 326.0
4 335 326.0
5 336 326.0
`cumprod(series)`
diamonds >> select(X.price) >> mutate(price_cumprod=cumprod(X.price)) >> head(6)
price price_cumprod
0 326 326
1 326 106276
2 327 34752252
3 334 11607252168
4 335 3888429476280
5 336 1306512304030080
`mean(series)`
diamonds >> group_by(X.cut) >> summarize(price_mean=mean(X.price))
cut price_mean
0 Fair 4358.757764
1 Good 3928.864452
2 Ideal 3457.541970
3 Premium 4584.257704
4 Very Good 3981.759891
`first(series, order_by=None)`
diamonds >> group_by(X.cut) >> summarize(price_first=first(X.price))
cut price_first
0 Fair 337
1 Good 327
2 Ideal 326
3 Premium 326
4 Very Good 336
`last(series, order_by=None)`
diamonds >> group_by(X.cut) >> summarize(price_last=last(X.price))
cut price_last
0 Fair 2747
1 Good 2757
2 Ideal 2757
3 Premium 2757
4 Very Good 2757
`nth(series, n, order_by=None)`
diamonds >> group_by(X.cut) >> summarize(price_penultimate=nth(X.price, -2))
cut price_penultimate
0 Fair 2745
1 Good 2756
2 Ideal 2757
3 Premium 2757
4 Very Good 2757
`n(series)`
diamonds >> group_by(X.cut) >> summarize(price_n=n(X.price))
cut price_n
0 Fair 1610
1 Good 4906
2 Ideal 21551
3 Premium 13791
4 Very Good 12082
`n_distinct(series)`
diamonds >> group_by(X.cut) >> summarize(price_ndistinct=n_distinct(X.price))
cut price_ndistinct
0 Fair 1267
1 Good 3086
2 Ideal 7281
3 Premium 6014
4 Very Good 5840
`IQR(series)`
diamonds >> group_by(X.cut) >> summarize(price_iqr=IQR(X.price))
cut price_iqr
0 Fair 3155.25
1 Good 3883.00
2 Ideal 3800.50
3 Premium 5250.00
4 Very Good 4460.75
`colmin(series)`
diamonds >> group_by(X.cut) >> summarize(price_min=colmin(X.price))
cut price_min
0 Fair 337
1 Good 327
2 Ideal 326
3 Premium 326
4 Very Good 336
`colmax(series)`
diamonds >> group_by(X.cut) >> summarize(price_max=colmax(X.price))
cut price_max
0 Fair 18574
1 Good 18788
2 Ideal 18806
3 Premium 18823
4 Very Good 18818
`median(series)`
diamonds >> group_by(X.cut) >> summarize(price_median=median(X.price))
cut price_median
0 Fair 3282.0
1 Good 3050.5
2 Ideal 1810.0
3 Premium 3185.0
4 Very Good 2648.0
`var(series)`
diamonds >> group_by(X.cut) >> summarize(price_var=var(X.price))
cut price_var
0 Fair 1.267635e+07
1 Good 1.355410e+07
2 Ideal 1.450392e+07
3 Premium 1.891558e+07
4 Very Good 1.549101e+07
`sd(series)`
diamonds >> group_by(X.cut) >> summarize(price_sd=sd(X.price))
cut price_sd
0 Fair 3560.386612
1 Good 3681.589584
2 Ideal 3808.401172
3 Premium 4349.204961
4 Very Good 3935.862161
There are a lot of built-in functions, but you are almost certainly going to reach a point where you want to use some of your own functions with the `dfply` piping syntax. Luckily, `dfply` comes with two handy decorators to make this as easy as possible.
For a more detailed walkthrough of these two cases, see the `basics-extending-functionality.ipynb` Jupyter notebook in the examples folder.
You might want to make a custom function that can be a piece of the pipe chain. For example, say we wanted to write a `dfply` wrapper around a simplified version of `pd.crosstab`. For the most part, you'll only need to do two things to make this work:
- Make sure that your function's first argument is the dataframe passed in implicitly by the pipe.
- Decorate the function with the `@dfpipe` decorator.
Here is an example of the `dfply`-enabled crosstab function:
@dfpipe
def crosstab(df, index, columns):
    return pd.crosstab(index, columns)
Normally you could use `pd.crosstab` like so:
pd.crosstab(diamonds.cut, diamonds.color)
color D E F G H I J
cut
Fair 163 224 312 314 303 175 119
Good 662 933 909 871 702 522 307
Ideal 2834 3903 3826 4884 3115 2093 896
Premium 1603 2337 2331 2924 2360 1428 808
Very Good 1513 2400 2164 2299 1824 1204 678
The same result can be achieved now with the custom function in pipe syntax:
diamonds >> crosstab(X.cut, X.color)
color D E F G H I J
cut
Fair 163 224 312 314 303 175 119
Good 662 933 909 871 702 522 307
Ideal 2834 3903 3826 4884 3115 2093 896
Premium 1603 2337 2331 2924 2360 1428 808
Very Good 1513 2400 2164 2299 1824 1204 678
Many tasks are simpler and do not require the capacity to work as a pipe function. The dfply window functions are common examples of this: functions that take a Series (or symbolic Series) and return a modified version.

Let's say we had a dataframe with dates represented by strings that we wanted to convert to pandas datetime objects using the `pd.to_datetime` function. Below is a tiny example dataframe with this issue.
sales = pd.DataFrame(dict(date=['7/10/17','7/11/17','7/12/17','7/13/17','7/14/17'],
sales=[1220, 1592, 908, 1102, 1395]))
sales
date sales
0 7/10/17 1220
1 7/11/17 1592
2 7/12/17 908
3 7/13/17 1102
4 7/14/17 1395
Using the `pd.to_datetime` function inside of a call to `mutate` will unfortunately break:
sales >> mutate(pd_date=pd.to_datetime(X.date, infer_datetime_format=True))
...
TypeError: __index__ returned non-int (type Intention)
`dfply` functions are special in that they "know" to delay their evaluation until the data reaches that point in the chain. `pd.to_datetime` is not such a function, and will immediately try to evaluate `X.date`. With a symbolic `Intention` argument passed in (`X` is an `Intention` object), the function will fail.
Instead, we need to make a wrapper around `pd.to_datetime` that can handle these symbolic arguments and delay evaluation until the right time. It's quite simple: all you need to do is decorate a function with the `@make_symbolic` decorator:
@make_symbolic
def to_datetime(series, infer_datetime_format=True):
    return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)
Now the function can be used with symbolic arguments:
sales >> mutate(pd_date=to_datetime(X.date))
date sales pd_date
0 7/10/17 1220 2017-07-10
1 7/11/17 1592 2017-07-11
2 7/12/17 908 2017-07-12
3 7/13/17 1102 2017-07-13
4 7/14/17 1395 2017-07-14
A particularly nice thing about functions decorated with `@make_symbolic` is that they will operate normally if passed arguments that are not symbolic `Intention` objects.
For example, you can pass in the series itself and it will return the new series of converted dates:
to_datetime(sales.date)
0 2017-07-10
1 2017-07-11
2 2017-07-12
3 2017-07-13
4 2017-07-14
Name: date, dtype: datetime64[ns]
Under the hood, `dfply` functions work using a collection of different decorators and special classes. The most important ones are detailed below. Understanding them is important if you are planning on making big additions or changes to the code.
Python is not a lazily-evaluated language. Typically, something like this would not work:
diamonds >> select(X.carat) >> head(2)
The `X` is supposed to represent the current state of the data through the piping operator chain, and `X.carat` indicates "select the carat column from the current data at this point in the chain". But Python will try to evaluate what `X` is, then what `X.carat` is, then what `select(X.carat)` is, all before the diamonds dataset is ever evaluated.
The solution to this is to delay the evaluation until the appropriate time. I will not get into the granular details here (but feel free to check it out for yourself in `base.py`). The gist is that things to be delayed are represented by a special `Intention` class that "waits" until it is time to evaluate the stored commands with a given dataframe. This is the core of how `dplyr` data manipulation syntax is made possible in `dfply`.
(Thanks to the creators of `dplython` and `pandas-ply` for trailblazing a lot of this before I made this package.)
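To make the idea concrete, here is a toy sketch of the deferred-evaluation pattern. This is an illustration only, not dfply's actual `Intention` implementation:

class DeferredAttr:
    # Toy stand-in for an Intention-like object: it records an attribute
    # lookup now and only performs it later, when handed a real DataFrame.
    def __init__(self, attr=None):
        self.attr = attr

    def __getattr__(self, name):
        # X_toy.carat returns a new deferred object remembering 'carat'
        return DeferredAttr(name)

    def evaluate(self, df):
        # only now is the stored attribute access actually performed
        return getattr(df, self.attr)

X_toy = DeferredAttr()
expr = X_toy.carat          # nothing touches a DataFrame yet
# expr.evaluate(diamonds)   # -> diamonds.carat, evaluated on demand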
The primary decorator that enables chaining functions with the `>>` operator is `@pipe`. For functions to work with the piping syntax, they must be decorated with `@pipe`.
Any function decorated with `@pipe` implicitly receives a single first argument expected to be a pandas DataFrame: the DataFrame being passed through the pipe. For example, `mutate` and `select` have function signatures `mutate(df, **kwargs)` and `select(df, *args, **kwargs)`, but when used do not require the user to insert the DataFrame as an argument.
# the DataFrame is implicitly passed as the first argument
diamonds >> mutate(new_var=X.price + X.depth) >> select(X.new_var)
If you create a new function decorated with `@pipe`, the function definition should contain an initial argument that represents the DataFrame being passed through the piping operations.
@pipe
def myfunc(df, *args, **kwargs):
    # code
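As a concrete (hypothetical) example, here is a minimal pipe-able function that simply reverses the row order. Note that `@pipe` alone handles neither groupings nor symbolic arguments:

@pipe
def reverse_rows(df):
    # df is the DataFrame flowing through the pipe
    return df.iloc[::-1]

diamonds >> reverse_rows() >> head(2)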
In order to delegate a function across specified groupings (assigned by the `group_by()` function), decorate the function with the `@group_delegation` decorator. This decorator queries the DataFrame for assigned groupings and applies the function to those groups individually.
Groupings are assigned by `dfply` as a `._grouped_by` attribute on the DataFrame proceeding through the piped functions. `@group_delegation` checks for this attribute and applies the function by group if groups exist. Any hierarchical indexing is removed by the decorator as well.
Decoration with `@group_delegation` should come after (inside of) the `@pipe` decorator to function as intended.
@pipe
@group_delegation
def myfunc(df, *args, **kwargs):
    # code
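As a (hypothetical) sketch, a function decorated this way is applied once per group whenever groupings are active on the incoming DataFrame:

@pipe
@group_delegation
def first_row(df):
    # with groupings assigned, this runs separately on each group
    return df.head(1)

diamonds >> group_by(X.cut) >> first_row()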
Evaluation of any `Intention`-class symbolic object (such as `X`) is handled by the `@symbolic_evaluation` decorator. For example, when calling `mutate(new_price=X.price * 2.5)`, the `X.price` symbolic representation of the price column in the DataFrame is evaluated to the actual Series by this decorator.
The behavior of the `@symbolic_evaluation` decorator can be modified with optional keyword arguments:
@symbolic_evaluation(eval_symbols=False)
def my_function(df, arg1, arg2):
    ...
If the `eval_symbols` argument is `True`, all symbolics will be evaluated with the passed-in dataframe. If it is `False` or `None`, there will be no attempt to evaluate symbolics.
A list can also be passed in, containing a mix of positional integers and string keywords that reference positional arguments and keyword arguments respectively. This targets specific arguments and keyword arguments for symbolic evaluation:
# This indicates that arg1, arg2, and kw1 should be targeted for symbolic
# evaluation, but not the other arguments.
# Note that positional indexes reference arguments AFTER the passed-in dataframe.
# For example, 0 refers to arg1, not df.
@symbolic_evaluation(eval_symbols=[0, 1, 'kw1'])
def my_function(df, arg1, arg2, arg3, kw1=True, kw2=False):
    ...
In practice, you are unlikely to need this behavior unless you really want to prevent `dfply` from trying to evaluate symbolic arguments. Remember that if an argument is not symbolic it will be evaluated as normal, so there shouldn't be much harm in leaving the default, aside from a little computational overhead.
Most new or custom functions for `dfply` will be decorated with the pattern:
@pipe
@group_delegation
@symbolic_evaluation
def myfunc(df, *args, **kwargs):
    # code
Because of this, the `@dfpipe` decorator is defined as exactly this combination of decorators for your convenience. The decoration pattern above can simply be written as:
@dfpipe
def myfunc(df, *args, **kwargs):
    # code
This allows you to easily create new functions that can be chained together with pipes, respect grouping, and evaluate symbolic DataFrames and Series correctly.
Sometimes, like in the window and summary functions that operate on series, it is necessary to defer the evaluation of a function. For example, in the code below:
diamonds >> summarize(price_third=nth(X.price, 3))
The `nth()` function would typically be evaluated before `summarize()`, and the symbolic argument would not be evaluated at the right time.
The `@make_symbolic` decorator can be placed above functions to convert them into symbolic functions that wait to evaluate. Again, this is used primarily for functions that are embedded inside a function call within the piping syntax.
The `nth()` code, for example, is below:
@make_symbolic
def nth(series, n, order_by=None):
    if order_by is not None:
        series = order_series_by(series, order_by)
    try:
        return series.iloc[n]
    except:
        # out-of-bounds positions return NaN rather than raising
        return np.nan
Functions you write that you want to be able to embed as arguments can use `@make_symbolic` to wait until they have access to the DataFrame to evaluate.
By all means please feel free to comment or contribute to the package. The more people adding code the better. If you submit an issue, pull request, or ask for something to be added I will do my best to respond promptly.
The TODO list (now located in the "Projects" section of the repo) has an ongoing list of things that still need to be resolved and features to be added.
If you submit a pull request with features or bugfixes, please target the "develop" branch rather than the "master" branch.