Skip to content

Commit

Permalink
[SPARK-24444][DOCS][PYTHON][BRANCH-2.3] Improve Pandas UDF docs to ex…
Browse files Browse the repository at this point in the history
…plain column assignment

## What changes were proposed in this pull request?
Added sections to pandas_udf docs, in the grouped map section, to indicate columns are assigned by position. Backported to branch-2.3.

## How was this patch tested?
NA

Author: Bryan Cutler <[email protected]>

Closes #21478 from BryanCutler/arrow-doc-pandas_udf-column_by_pos-2_3_1-SPARK-21427.
  • Loading branch information
BryanCutler authored and HyukjinKwon committed Jun 1, 2018
1 parent b37e76f commit e56266a
Show file tree
Hide file tree
Showing 2 changed files with 17 additions and 1 deletion.
9 changes: 9 additions & 0 deletions docs/sql-programming-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -1737,6 +1737,15 @@ To use `groupBy().apply()`, the user needs to define the following:
* A Python function that defines the computation for each group.
* A `StructType` object or a string that defines the schema of the output `DataFrame`.

The output schema will be applied to the columns of the returned `pandas.DataFrame` in order by position,
not by name. This means that the columns in the `pandas.DataFrame` must be indexed so that their
position matches the corresponding field in the schema.

Note that when creating a new `pandas.DataFrame` using a dictionary, the actual position of the column
can differ from the order that it was placed in the dictionary. It is recommended in this case to
explicitly define the column order using the `columns` keyword, e.g.
`pandas.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])`, or alternatively use an `OrderedDict`.

Note that all data for a group will be loaded into memory before the function is applied. This can
lead to out of memory exceptons, especially if the group sizes are skewed. The configuration for
[maxRecordsPerBatch](#setting-arrow-batch-size) is not applied on groups and it is up to the user
Expand Down
9 changes: 8 additions & 1 deletion python/pyspark/sql/functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -2216,7 +2216,8 @@ def pandas_udf(f=None, returnType=None, functionType=None):
A grouped map UDF defines transformation: A `pandas.DataFrame` -> A `pandas.DataFrame`
The returnType should be a :class:`StructType` describing the schema of the returned
`pandas.DataFrame`.
The length of the returned `pandas.DataFrame` can be arbitrary.
The length of the returned `pandas.DataFrame` can be arbitrary and the columns must be
indexed so that their position matches the corresponding field in the schema.
Grouped map UDFs are used with :meth:`pyspark.sql.GroupedData.apply`.
Expand All @@ -2239,6 +2240,12 @@ def pandas_udf(f=None, returnType=None, functionType=None):
| 2| 1.1094003924504583|
+---+-------------------+
.. note:: If returning a new `pandas.DataFrame` constructed with a dictionary, it is
recommended to explicitly index the columns by name to ensure the positions are correct,
or alternatively use an `OrderedDict`.
For example, `pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])` or
`pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))`.
.. seealso:: :meth:`pyspark.sql.GroupedData.apply`
.. note:: The user-defined functions are considered deterministic by default. Due to
Expand Down

0 comments on commit e56266a

Please sign in to comment.