Commit 84a60db

DOC: Harmonize column selection to bracket notation (pandas-dev#27562)
* Harmonize column selection to bracket notation

  As suggested by https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428#46f9

katrinleinweber authored and proost committed Dec 19, 2019
1 parent 892233e commit 84a60db
Showing 9 changed files with 54 additions and 51 deletions.
2 changes: 1 addition & 1 deletion doc/source/getting_started/10min.rst

@@ -278,7 +278,7 @@ Using a single column's values to select data.

    .. ipython:: python

 -      df[df.A > 0]
 +      df[df['A'] > 0]

    Selecting values from a DataFrame where a boolean condition is met.
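The hunk above swaps attribute access for brackets inside a boolean mask. A minimal sketch of why the bracket form is the safer habit (the data here is invented for illustration, not part of the commit):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3], "B": [10, 20, 30]})

# df["A"] works for any column label; df.A breaks for labels with
# spaces or labels that shadow DataFrame attributes.
positive = df[df["A"] > 0]
print(positive["B"].tolist())
```

Both spellings select the same column here, but only the bracket form survives a rename to, say, `"A col"`.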
12 changes: 6 additions & 6 deletions doc/source/getting_started/basics.rst

@@ -926,7 +926,7 @@ Single aggregations on a ``Series`` this will return a scalar value:

    .. ipython:: python

 -      tsdf.A.agg('sum')
 +      tsdf['A'].agg('sum')

    Aggregating with multiple functions

@@ -950,13 +950,13 @@ On a ``Series``, multiple functions return a ``Series``, indexed by the function

    .. ipython:: python

 -      tsdf.A.agg(['sum', 'mean'])
 +      tsdf['A'].agg(['sum', 'mean'])

    Passing a ``lambda`` function will yield a ``<lambda>`` named row:

    .. ipython:: python

 -      tsdf.A.agg(['sum', lambda x: x.mean()])
 +      tsdf['A'].agg(['sum', lambda x: x.mean()])

    Passing a named function will yield that name for the row:

@@ -965,7 +965,7 @@ Passing a named function will yield that name for the row:

        def mymean(x):
            return x.mean()

 -      tsdf.A.agg(['sum', mymean])
 +      tsdf['A'].agg(['sum', mymean])

    Aggregating with a dict
    +++++++++++++++++++++++

@@ -1065,7 +1065,7 @@ Passing a single function to ``.transform()`` with a ``Series`` will yield a sin

    .. ipython:: python

 -      tsdf.A.transform(np.abs)
 +      tsdf['A'].transform(np.abs)

    Transform with multiple functions

@@ -1084,7 +1084,7 @@ resulting column names will be the transforming functions.

    .. ipython:: python

 -      tsdf.A.transform([np.abs, lambda x: x + 1])
 +      tsdf['A'].transform([np.abs, lambda x: x + 1])

    Transforming with a dict
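The ``agg`` hunks change only spelling, not behavior. A quick sketch showing the two forms agree, and where brackets are strictly more general (``tsdf`` and the ``"B col"`` label are invented for illustration):

```python
import pandas as pd

tsdf = pd.DataFrame({"A": [1.0, -2.0, 3.0]})

# Attribute and bracket access hit the same column...
assert tsdf.A.agg("sum") == tsdf["A"].agg("sum")

# ...but only brackets work for labels that aren't valid identifiers.
tsdf["B col"] = [4.0, 5.0, 6.0]
print(tsdf["B col"].agg(["sum", "mean"]))
```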
8 changes: 4 additions & 4 deletions doc/source/getting_started/comparison/comparison_with_r.rst

@@ -81,7 +81,7 @@ R pandas

    =========================================== ===========================================
    ``select(df, col_one = col1)``              ``df.rename(columns={'col1': 'col_one'})['col_one']``
    ``rename(df, col_one = col1)``              ``df.rename(columns={'col1': 'col_one'})``
 -  ``mutate(df, c=a-b)``                       ``df.assign(c=df.a-df.b)``
 +  ``mutate(df, c=a-b)``                       ``df.assign(c=df['a']-df['b'])``
    =========================================== ===========================================

@@ -258,8 +258,8 @@ index/slice as well as standard boolean indexing:

       df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
       df.query('a <= b')
 -     df[df.a <= df.b]
 -     df.loc[df.a <= df.b]
 +     df[df['a'] <= df['b']]
 +     df.loc[df['a'] <= df['b']]

    For more details and examples see :ref:`the query documentation
    <indexing.query>`.

@@ -284,7 +284,7 @@ In ``pandas`` the equivalent expression, using the

       df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
       df.eval('a + b')
 -     df.a + df.b # same as the previous expression
 +     df['a'] + df['b'] # same as the previous expression

    In certain cases :meth:`~pandas.DataFrame.eval` will be much faster than
    evaluation in pure Python. For more details and examples see :ref:`the eval
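The R-comparison hunks assert that the string forms and the bracket forms are interchangeable; that claim is easy to check directly (deterministic data substituted for the ``randn`` calls so the result is reproducible):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 4.0, 2.0], "b": [3.0, 3.0, 3.0]})

# query() parses the string; brackets are the plain-Python equivalent.
assert df.query("a <= b").equals(df[df["a"] <= df["b"]])

# eval() likewise mirrors column arithmetic done with brackets.
assert df.eval("a + b").equals(df["a"] + df["b"])
```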
2 changes: 1 addition & 1 deletion doc/source/user_guide/advanced.rst

@@ -738,7 +738,7 @@ and allows efficient indexing and storage of an index with a large number of dup

       df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
       df
       df.dtypes
 -     df.B.cat.categories
 +     df['B'].cat.categories

    Setting the index will create a ``CategoricalIndex``.
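The ``.cat`` accessor chains onto either selection style; a self-contained sketch of the snippet above (the column data is invented):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"B": list("abca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

# The categories keep the order given to CategoricalDtype.
print(list(df["B"].cat.categories))
```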
6 changes: 3 additions & 3 deletions doc/source/user_guide/cookbook.rst

@@ -592,8 +592,8 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to

    .. ipython:: python

       df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])
 -     df.A.groupby((df.A != df.A.shift()).cumsum()).groups
 -     df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
 +     df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).groups
 +     df['A'].groupby((df['A'] != df['A'].shift()).cumsum()).cumsum()

    Expanding data
    **************

@@ -719,7 +719,7 @@ Rolling Apply to multiple columns where function calculates a Series before a Sc

       df

       def gm(df, const):
 -        v = ((((df.A + df.B) + 1).cumprod()) - 1) * const
 +        v = ((((df['A'] + df['B']) + 1).cumprod()) - 1) * const
          return v.iloc[-1]

       s = pd.Series({df.index[i]: gm(df.iloc[i:min(i + 51, len(df) - 1)], 5)
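The first cookbook hunk is the classic consecutive-run idiom, which is easier to follow in isolation: comparing each value with its shifted self marks run boundaries, and ``cumsum`` turns those marks into run ids. A sketch using the same data as the doc:

```python
import pandas as pd

df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=["A"])

# (value != previous value) is True at the start of each run of equal
# values; cumsum() then assigns one integer id per run.
run_id = (df["A"] != df["A"].shift()).cumsum()

# Cumulative sum restarts inside each run.
within_run = df["A"].groupby(run_id).cumsum()
print(within_run.tolist())
```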
12 changes: 6 additions & 6 deletions doc/source/user_guide/enhancingperf.rst

@@ -393,15 +393,15 @@ Consider the following toy example of doubling each observation:

    .. code-block:: ipython

       # Custom function without numba
 -     In [5]: %timeit df['col1_doubled'] = df.a.apply(double_every_value_nonumba) # noqa E501
 +     In [5]: %timeit df['col1_doubled'] = df['a'].apply(double_every_value_nonumba) # noqa E501
       1000 loops, best of 3: 797 us per loop

       # Standard implementation (faster than a custom function)
 -     In [6]: %timeit df['col1_doubled'] = df.a * 2
 +     In [6]: %timeit df['col1_doubled'] = df['a'] * 2
       1000 loops, best of 3: 233 us per loop

       # Custom function with numba
 -     In [7]: %timeit df['col1_doubled'] = double_every_value_withnumba(df.a.to_numpy())
 +     In [7]: %timeit df['col1_doubled'] = double_every_value_withnumba(df['a'].to_numpy())
       1000 loops, best of 3: 145 us per loop

    Caveats

@@ -643,8 +643,8 @@ The equivalent in standard Python would be

    .. ipython:: python

       df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
 -     df['c'] = df.a + df.b
 -     df['d'] = df.a + df.b + df.c
 +     df['c'] = df['a'] + df['b']
 +     df['d'] = df['a'] + df['b'] + df['c']
       df['a'] = 1
       df

@@ -688,7 +688,7 @@ name in an expression.

       a = np.random.randn()
       df.query('@a < a')
 -     df.loc[a < df.a] # same as the previous expression
 +     df.loc[a < df['a']] # same as the previous expression

    With :func:`pandas.eval` you cannot use the ``@`` prefix *at all*, because it
    isn't defined in that context. ``pandas`` will let you know this if you try to
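The last hunk touches the ``@`` prefix, which disambiguates a local Python variable from a column of the same name inside ``query``. A runnable sketch with deterministic data in place of ``np.random.randn()``:

```python
import pandas as pd

df = pd.DataFrame({"a": [0.5, 1.5, 2.5]})
a = 1.0  # local variable, referenced as @a inside query()

# '@a < a' compares the local variable with column 'a'.
assert df.query("@a < a").equals(df.loc[a < df["a"]])
print(df.query("@a < a")["a"].tolist())
```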
39 changes: 21 additions & 18 deletions doc/source/user_guide/indexing.rst

@@ -210,7 +210,7 @@ as an attribute:

    See `here for an explanation of valid identifiers
    <https://docs.python.org/3/reference/lexical_analysis.html#identifiers>`__.

 -  - The attribute will not be available if it conflicts with an existing method name, e.g. ``s.min`` is not allowed.
 +  - The attribute will not be available if it conflicts with an existing method name, e.g. ``s.min`` is not allowed, but ``s['min']`` is possible.

    - Similarly, the attribute will not be available if it conflicts with any of the following list: ``index``,
      ``major_axis``, ``minor_axis``, ``items``.
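The added clause ("but ``s['min']`` is possible") is the heart of the whole commit; a two-line demonstration (the index labels are invented):

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["min", "max", "mean"])

# s.min is the Series *method*, not the row labelled 'min'...
assert callable(s.min)

# ...while bracket selection always means the label.
assert s["min"] == 3
```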
@@ -540,7 +540,7 @@ The ``callable`` must be a function with one argument (the calling Series or Dat

                         columns=list('ABCD'))
       df1
 -     df1.loc[lambda df: df.A > 0, :]
 +     df1.loc[lambda df: df['A'] > 0, :]
       df1.loc[:, lambda df: ['A', 'B']]
       df1.iloc[:, lambda df: [0, 1]]

@@ -552,7 +552,7 @@ You can use callable indexing in ``Series``.

    .. ipython:: python

 -     df1.A.loc[lambda s: s > 0]
 +     df1['A'].loc[lambda s: s > 0]

    Using these methods / indexers, you can chain data selection operations
    without using a temporary variable.

@@ -561,7 +561,7 @@ without using a temporary variable.

       bb = pd.read_csv('data/baseball.csv', index_col='id')
       (bb.groupby(['year', 'team']).sum()
 -        .loc[lambda df: df.r > 100])
 +        .loc[lambda df: df['r'] > 100])

.. _indexing.deprecate_ix:
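Callable indexers, as changed above, receive the object being sliced, which is what lets selections chain without a temporary variable. A self-contained sketch (``df1`` built with deterministic values rather than the doc's random data):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape(3, 4) - 5, columns=list("ABCD"))

# The lambda is called with df1 itself, so brackets inside it
# select columns exactly as they would outside.
out = df1.loc[lambda df: df["A"] > 0, :]
print(out)
```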

@@ -871,9 +871,9 @@ Boolean indexing

    Another common operation is the use of boolean vectors to filter the data.
    The operators are: ``|`` for ``or``, ``&`` for ``and``, and ``~`` for ``not``.
    These **must** be grouped by using parentheses, since by default Python will
 -  evaluate an expression such as ``df.A > 2 & df.B < 3`` as
 -  ``df.A > (2 & df.B) < 3``, while the desired evaluation order is
 -  ``(df.A > 2) & (df.B < 3)``.
 +  evaluate an expression such as ``df['A'] > 2 & df['B'] < 3`` as
 +  ``df['A'] > (2 & df['B']) < 3``, while the desired evaluation order is
 +  ``(df['A'] > 2) & (df['B'] < 3)``.

    Using a boolean vector to index a Series works exactly as in a NumPy ndarray:

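The precedence rule described above is easy to get wrong, so a minimal demonstration of the parenthesized form (column values invented):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5], "B": [2, 2, 4]})

# & binds tighter than > in Python, so each comparison
# must be parenthesized before combining.
mask = (df["A"] > 2) & (df["B"] < 3)
print(df[mask]["A"].tolist())
```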
@@ -1134,7 +1134,7 @@ between the values of columns ``a`` and ``c``. For example:

       df

       # pure python
 -     df[(df.a < df.b) & (df.b < df.c)]
 +     df[(df['a'] < df['b']) & (df['b'] < df['c'])]

       # query
       df.query('(a < b) & (b < c)')

@@ -1241,7 +1241,7 @@ Full numpy-like syntax:

       df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))
       df
       df.query('(a < b) & (b < c)')
 -     df[(df.a < df.b) & (df.b < df.c)]
 +     df[(df['a'] < df['b']) & (df['b'] < df['c'])]

    Slightly nicer by removing the parentheses (``query`` makes comparison
    operators bind tighter than ``&`` and ``|``).
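That precedence remark can be verified: inside a ``query`` string the parentheses are optional, and chained comparisons work too. A sketch with fixed data in place of the doc's random ints:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 2], "b": [2, 1, 3], "c": [3, 0, 1]})

# query's parser gives comparisons higher precedence than &,
# so all three spellings agree.
r1 = df.query("(a < b) & (b < c)")
r2 = df.query("a < b & b < c")
r3 = df.query("a < b < c")
assert r1.equals(r2) and r2.equals(r3)
print(r1.index.tolist())
```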
@@ -1279,12 +1279,12 @@ The ``in`` and ``not in`` operators

       df.query('a in b')

       # How you'd do it in pure Python
 -     df[df.a.isin(df.b)]
 +     df[df['a'].isin(df['b'])]

       df.query('a not in b')

       # pure Python
 -     df[~df.a.isin(df.b)]
 +     df[~df['a'].isin(df['b'])]

    You can combine this with other expressions for very succinct queries:

@@ -1297,7 +1297,7 @@ You can combine this with other expressions for very succinct queries:

       df.query('a in b and c < d')

       # pure Python
 -     df[df.b.isin(df.a) & (df.c < df.d)]
 +     df[df['b'].isin(df['a']) & (df['c'] < df['d'])]

    .. note::

@@ -1326,7 +1326,7 @@ to ``in``/``not in``.

       df.query('b == ["a", "b", "c"]')

       # pure Python
 -     df[df.b.isin(["a", "b", "c"])]
 +     df[df['b'].isin(["a", "b", "c"])]

       df.query('c == [1, 2]')

@@ -1338,7 +1338,7 @@ to ``in``/``not in``.

       df.query('[1, 2] not in c')

       # pure Python
 -     df[df.c.isin([1, 2])]
 +     df[df['c'].isin([1, 2])]
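The ``in``/``not in`` equivalence asserted in these hunks checks out directly; a sketch with small invented data:

```python
import pandas as pd

df = pd.DataFrame({"a": list("aabbccddeeff"),
                   "b": list("aaaabbbbcccc")})

# 'a in b' means: keep rows whose df['a'] value occurs
# anywhere in df['b']; isin is the pure-Python spelling.
assert df.query("a in b").equals(df[df["a"].isin(df["b"])])
assert df.query("a not in b").equals(df[~df["a"].isin(df["b"])])
```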
    Boolean operators

@@ -1352,7 +1352,7 @@ You can negate boolean expressions with the word ``not`` or the ``~`` operator.

       df['bools'] = np.random.rand(len(df)) > 0.5
       df.query('~bools')
       df.query('not bools')
 -     df.query('not bools') == df[~df.bools]
 +     df.query('not bools') == df[~df['bools']]

    Of course, expressions can be arbitrarily complex too:

@@ -1362,7 +1362,10 @@ Of course, expressions can be arbitrarily complex too:

       shorter = df.query('a < b < c and (not bools) or bools > 2')

       # equivalent in pure Python
 -     longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]
 +     longer = df[(df['a'] < df['b'])
 +                 & (df['b'] < df['c'])
 +                 & (~df['bools'])
 +                 | (df['bools'] > 2)]

       shorter
       longer
@@ -1835,14 +1838,14 @@ chained indexing expression, you can set the :ref:`option <options>`

       # This will show the SettingWithCopyWarning
       # but the frame values will be set
 -     dfb['c'][dfb.a.str.startswith('o')] = 42
 +     dfb['c'][dfb['a'].str.startswith('o')] = 42

    This however is operating on a copy and will not work.

    ::

       >>> pd.set_option('mode.chained_assignment','warn')
 -     >>> dfb[dfb.a.str.startswith('o')]['c'] = 42
 +     >>> dfb[dfb['a'].str.startswith('o')]['c'] = 42
       Traceback (most recent call last)
       ...
       SettingWithCopyWarning:
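Both spellings in the hunk above remain chained-indexing anti-patterns; the recommended form is a single ``.loc`` call. A sketch with invented data (the doc's ``dfb`` is not shown in this diff):

```python
import pandas as pd

dfb = pd.DataFrame({"a": ["one", "two", "ohio"], "c": [0, 0, 0]})

# One indexing operation: no intermediate copy,
# no SettingWithCopyWarning.
dfb.loc[dfb["a"].str.startswith("o"), "c"] = 42
print(dfb["c"].tolist())
```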
10 changes: 5 additions & 5 deletions doc/source/user_guide/reshaping.rst

@@ -469,7 +469,7 @@ If ``crosstab`` receives only two Series, it will provide a frequency table.

                         'C': [1, 1, np.nan, 1, 1]})
       df

 -     pd.crosstab(df.A, df.B)
 +     pd.crosstab(df['A'], df['B'])

    Any input passed containing ``Categorical`` data will have **all** of its
    categories included in the cross-tabulation, even if the actual data does

@@ -489,21 +489,21 @@ using the ``normalize`` argument:

    .. ipython:: python

 -     pd.crosstab(df.A, df.B, normalize=True)
 +     pd.crosstab(df['A'], df['B'], normalize=True)

    ``normalize`` can also normalize values within each row or within each column:

    .. ipython:: python

 -     pd.crosstab(df.A, df.B, normalize='columns')
 +     pd.crosstab(df['A'], df['B'], normalize='columns')

    ``crosstab`` can also be passed a third ``Series`` and an aggregation function
    (``aggfunc``) that will be applied to the values of the third ``Series`` within
    each group defined by the first two ``Series``:

    .. ipython:: python

 -     pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum)
 +     pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc=np.sum)

    Adding margins
    ~~~~~~~~~~~~~~

@@ -512,7 +512,7 @@ Finally, one can also add margins or normalize this output.

    .. ipython:: python

 -     pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum, normalize=True,
 +     pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc=np.sum, normalize=True,
                   margins=True)

.. _reshaping.tile:
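``crosstab`` takes Series positionally, so the bracket change here is purely cosmetic. A frequency-table sketch with invented data:

```python
import pandas as pd

a = pd.Series(["x", "x", "y", "y"], name="A")
b = pd.Series(["u", "v", "u", "u"], name="B")

# Two Series in, a frequency table out; the Series names
# become the row and column axis names.
tab = pd.crosstab(a, b)
print(tab)
```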
14 changes: 7 additions & 7 deletions doc/source/user_guide/visualization.rst

@@ -1148,10 +1148,10 @@ To plot data on a secondary y-axis, use the ``secondary_y`` keyword:

    .. ipython:: python

 -     df.A.plot()
 +     df['A'].plot()

       @savefig series_plot_secondary_y.png
 -     df.B.plot(secondary_y=True, style='g')
 +     df['B'].plot(secondary_y=True, style='g')

    .. ipython:: python
       :suppress:

@@ -1205,7 +1205,7 @@ Here is the default behavior, notice how the x-axis tick labeling is performed:

       plt.figure()

       @savefig ser_plot_suppress.png
 -     df.A.plot()
 +     df['A'].plot()

    .. ipython:: python
       :suppress:

@@ -1219,7 +1219,7 @@ Using the ``x_compat`` parameter, you can suppress this behavior:

       plt.figure()

       @savefig ser_plot_suppress_parm.png
 -     df.A.plot(x_compat=True)
 +     df['A'].plot(x_compat=True)

    .. ipython:: python
       :suppress:

@@ -1235,9 +1235,9 @@ in ``pandas.plotting.plot_params`` can be used in a `with statement`:

       @savefig ser_plot_suppress_context.png
       with pd.plotting.plot_params.use('x_compat', True):
 -        df.A.plot(color='r')
 -        df.B.plot(color='g')
 -        df.C.plot(color='b')
 +        df['A'].plot(color='r')
 +        df['B'].plot(color='g')
 +        df['C'].plot(color='b')

    .. ipython:: python
       :suppress:
