[SPARK-23018][PYTHON] Fix createDataFrame from Pandas timestamp series assignment

## What changes were proposed in this pull request?

This fixes createDataFrame from Pandas so that a series is assigned back only when it is a modified timestamp series, and only to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another DataFrame), each series was still assigned back to the reference even if it was not a modified timestamp column. This caused the following warning: "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame."
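
The copy-once, assign-only-if-modified control flow described above can be sketched without Pandas. This is a minimal illustration, not the PySpark code: a plain dict of lists stands in for the DataFrame, and `convert_if_timestamp` and `convert_columns` are hypothetical names. The assumption is that the converter returns the same object when nothing changes, which is what makes the identity check meaningful.

```python
# Stand-in sketch: a dict of lists plays the role of the Pandas DataFrame.
# `convert_if_timestamp` is hypothetical; it mimics a converter that returns
# the very same object when no conversion is needed.

def convert_if_timestamp(name, values):
    """Return a new list for 'ts_' columns, else the input object itself."""
    if name.startswith("ts_"):
        return [v + 3600 for v in values]  # pretend timezone shift, in seconds
    return values


def convert_columns(table):
    """Copy `table` at most once, and write back only columns that changed."""
    copied = False
    for name, values in list(table.items()):
        converted = convert_if_timestamp(name, values)
        if converted is not values:
            if not copied:
                # First modification: switch to a shallow copy so the
                # caller's table is never written to.
                table = dict(table)
                copied = True
            table[name] = converted
    return table


original = {"id": [1, 2], "ts_created": [0, 60]}
result = convert_columns(original)

untouched = {"id": [1, 2]}
same = convert_columns(untouched)
```

With no timestamp-like column, `convert_columns` returns the caller's object unchanged and never writes to it. That is the property the patch restores for a DataFrame that is a slice of another.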

## How was this patch tested?

existing tests

Author: Bryan Cutler

Closes #20213 from BryanCutler/pyspark-createDataFrame-copy-slice-warn-SPARK-23018.
BryanCutler authored and ueshin committed Jan 10, 2018
1 parent 6f169ca commit 7bcc266
Showing 1 changed file with 15 additions and 13 deletions.
28 changes: 15 additions & 13 deletions python/pyspark/sql/session.py
@@ -459,21 +459,23 @@ def _convert_from_pandas(self, pdf, schema, timezone):
                     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
                     if isinstance(field.dataType, TimestampType):
                         s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
-                        if not copied and s is not pdf[field.name]:
-                            # Copy once if the series is modified to prevent the original Pandas
-                            # DataFrame from being updated
-                            pdf = pdf.copy()
-                            copied = True
-                        pdf[field.name] = s
+                        if s is not pdf[field.name]:
+                            if not copied:
+                                # Copy once if the series is modified to prevent the original
+                                # Pandas DataFrame from being updated
+                                pdf = pdf.copy()
+                                copied = True
+                            pdf[field.name] = s
             else:
                 for column, series in pdf.iteritems():
-                    s = _check_series_convert_timestamps_tz_local(pdf[column], timezone)
-                    if not copied and s is not pdf[column]:
-                        # Copy once if the series is modified to prevent the original Pandas
-                        # DataFrame from being updated
-                        pdf = pdf.copy()
-                        copied = True
-                    pdf[column] = s
+                    s = _check_series_convert_timestamps_tz_local(series, timezone)
+                    if s is not series:
+                        if not copied:
+                            # Copy once if the series is modified to prevent the original
+                            # Pandas DataFrame from being updated
+                            pdf = pdf.copy()
+                            copied = True
+                        pdf[column] = s

         # Convert pandas.DataFrame to list of numpy records
         np_records = pdf.to_records(index=False)