[SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone #19607

Closed
wants to merge 29 commits into from

Conversation

ueshin
Member

@ueshin ueshin commented Oct 30, 2017

What changes were proposed in this pull request?

When converting a Pandas DataFrame/Series from/to a Spark DataFrame using toPandas() or pandas UDFs, timestamp values respect the Python system timezone instead of the session timezone.

For example, let's say we use "America/Los_Angeles" as the session timezone and have a timestamp value "1970-01-01 00:00:01" in that timezone. Btw, I'm in Japan, so the Python timezone would be "Asia/Tokyo".

The timestamp value from the current toPandas() will be the following:

>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
>>> df.show()
+-------------------+
|                 ts|
+-------------------+
|1970-01-01 00:00:01|
+-------------------+

>>> df.toPandas()
                   ts
0 1970-01-01 17:00:01

As you can see, the value becomes "1970-01-01 17:00:01" because it respects the Python system timezone.
As we discussed in #18664, we consider this behavior a bug, and the value should be "1970-01-01 00:00:01".
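
For reference, a minimal sketch of the intended conversion (not the actual patch; the helper name is only illustrative): interpret the internal values as UTC and render them in the session timezone rather than in the Python system timezone.

import pandas as pd

def render_in_session_tz(series, session_tz):
    # Interpret the naive timestamps as UTC, convert them to the session
    # timezone, then drop the tz info so they display as plain local values.
    return series.dt.tz_localize("UTC").dt.tz_convert(session_tz).dt.tz_localize(None)

s = pd.Series(pd.to_datetime([28801], unit="s"))        # 1970-01-01 08:00:01 UTC
print(render_in_session_tz(s, "America/Los_Angeles"))   # 1970-01-01 00:00:01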

How was this patch tested?

Added new tests and ran the existing tests.

@SparkQA

SparkQA commented Oct 30, 2017

Test build #83205 has finished for PR 19607 at commit 5c08ecf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member Author

ueshin commented Oct 30, 2017

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 30, 2017

Test build #83207 has finished for PR 19607 at commit 5c08ecf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2017

Test build #83212 has finished for PR 19607 at commit 5c08ecf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2017

Test build #83211 has finished for PR 19607 at commit e28fc87.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 1, 2017

Test build #83280 has finished for PR 19607 at commit ee1a1c8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 1, 2017

Test build #83286 has finished for PR 19607 at commit b1436b8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member Author

ueshin commented Nov 1, 2017

@BryanCutler I'm fixing the behavior of toPandas() and pandas UDFs as we discussed in #18664, but I guess we still need to support old Pandas as well.
I tried to find a workaround for old Pandas, but I haven't found one yet.
Do you have any ideas for the workaround? cc @wesm @HyukjinKwon @viirya

@SparkQA

SparkQA commented Nov 1, 2017

Test build #83295 has finished for PR 19607 at commit 6872516.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member

Hi @ueshin, what is the oldest version of Pandas that we're required to support, and what exactly wasn't working with it?

@ueshin
Member Author

ueshin commented Nov 2, 2017

@BryanCutler I guess the oldest supported version of Pandas is currently 0.13.0, according to #18403. cc @HyukjinKwon.

@HyukjinKwon
Member

HyukjinKwon commented Nov 2, 2017

Yea, that was my proposal. If anything is blocked by this, I think we can consider bumping it up as an option because, IMHO, the pinned version requirement technically hasn't been released and published yet.

^ cc @cloud-fan, @srowen and @viirya

@HyukjinKwon
Member

I tried to find a workaround for old Pandas, but I haven't found one yet.

I haven't looked at this closely yet, but I will definitely try to take a look and help soon. I would appreciate it if the problem (or just the symptoms, or a pointer ..) could be described here, though, if it is not too complex.

@ueshin ueshin changed the title [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone Nov 2, 2017
@SparkQA

SparkQA commented Nov 2, 2017

Test build #83320 has finished for PR 19607 at commit 1f096bf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@ueshin
Member Author

ueshin commented Nov 2, 2017

Jenkins, retest this please.

@SparkQA

SparkQA commented Nov 2, 2017

Test build #83324 has finished for PR 19607 at commit 1f096bf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1629,35 +1629,121 @@ def to_arrow_type(dt):
return arrow_type


def _check_dataframe_localize_timestamps(pdf):
def to_arrow_schema(schema):
Contributor

where do we use this method?

Member Author

Ah, currently it isn't used. I'll remove it for now.

return s.dt.tz_convert('UTC')
else:
return s
except ImportError:
Contributor

I think we should bump up the pandas version if we can't find a workaround.

Member Author

Sure, let me look into it a little more and summarize what version we can support.
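
For context, a rough sketch of the kind of import guard being discussed (the helper names, body, and error message are illustrative, not the exact code in the patch): pandas.api.types only exists in newer Pandas, so old versions fail the import and get a clear error pointing at the minimum supported version.

def _old_pandas_exception_message(e):
    # Illustrative message; the wording in the patch may differ.
    return "Pandas >= 0.19.2 must be installed; however: %s" % str(e)

def _localize_series_timestamps(s):
    try:
        from pandas.api.types import is_datetime64tz_dtype
    except ImportError as e:
        # Old Pandas (< 0.19) has no pandas.api.types, so fail loudly here.
        raise ImportError(_old_pandas_exception_message(e))
    if is_datetime64tz_dtype(s.dtype):
        # Normalize timezone-aware values to UTC before handing them to Arrow.
        return s.dt.tz_convert('UTC')
    return s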

@@ -948,6 +948,14 @@ object SQLConf {
.intConf
.createWithDefault(10000)

val PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE =
Contributor

can we clean up the code more if we don't have this config?

Member Author

Sure, I'll try it.

@ueshin ueshin changed the title [WIP][SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone Nov 22, 2017
buildConf("spark.sql.execution.pandas.respectSessionTimeZone")
.internal()
.doc("When true, make Pandas DataFrame with timestamp type respecting session local " +
"timezone when converting to/from Pandas DataFrame.")
Member

Emphasize the conf will be deprecated?

When true, make Pandas DataFrame with timestamp type respecting session local timezone when converting to/from Pandas DataFrame. This configuration will be deprecated in the future releases.

Member Author

Sure, I'll update it.
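
For reference, a sketch of how the flag interacts with the session timezone from the user's side, using the conf names from this PR (illustrative usage, not a test from the patch):

>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
>>> df.toPandas()   # falls back to rendering timestamps in the Python system timezone
>>> spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "true")
>>> df.toPandas()   # new behavior: timestamps are rendered in the session timezone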

@@ -201,7 +201,7 @@ def _supports_symlinks():
extras_require={
'ml': ['numpy>=1.7'],
'mllib': ['numpy>=1.7'],
'sql': ['pandas>=0.13.0']
'sql': ['pandas>=0.19.2']
Member

Document this requirement and the behavior changes in the Migration Guide?

Member Author

Sure, I'll add it.
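
For context, a runtime check for such a requirement commonly looks roughly like the following (a sketch only; not necessarily how PySpark enforces it):

from distutils.version import LooseVersion
import pandas

if LooseVersion(pandas.__version__) < LooseVersion("0.19.2"):
    # Refuse to run the Pandas conversion paths on versions older than the
    # minimum declared in setup.py.
    raise ImportError("Pandas >= 0.19.2 must be installed; found %s" % pandas.__version__)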

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84205 has finished for PR 19607 at commit 40a9735.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,
  • class _ImageSchema(object):
  • raise RuntimeError(\"Creating instance of _ImageSchema class is disallowed.\")

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84204 has finished for PR 19607 at commit f92eae3.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

Member

@HyukjinKwon HyukjinKwon left a comment

Looks fine to me.

elif is_datetime64tz_dtype(s.dtype):
return s.dt.tz_convert('UTC')
else:
return s


def _check_series_convert_timestamps_localize(s, fromTimezone, toTimezone):
Member

Nit: maybe from_timezone .

Member Author

Thanks, I'll update it. Maybe toTimezone -> to_timezone as well.

from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
except ImportError as e:
raise ImportError(_old_pandas_exception_message(e))
fromTz = fromTimezone or 'tzlocal()'
Member

Ditto.

Member Author

I'll update it.
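
For reference, a sketch of the renamed helper under discussion; the parameter names follow the snake_case suggestion, and the body is illustrative rather than the merged code:

from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype

def _check_series_convert_timestamps_localize(s, from_timezone, to_timezone):
    if is_datetime64tz_dtype(s.dtype):
        # Already timezone-aware: re-render the values in the target timezone.
        return s.dt.tz_convert(to_timezone).dt.tz_localize(None)
    elif is_datetime64_dtype(s.dtype):
        # Naive values: interpret them in from_timezone, then render in to_timezone.
        return s.dt.tz_localize(from_timezone).dt.tz_convert(to_timezone).dt.tz_localize(None)
    else:
        return s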


self.assertNotEqual(result_ny, result_la)

result_la_corrected = [Row(**{k: v - timedelta(hours=3) if k == '7_timestamp_t' else v
Member

Small comments here would be helpful .. BTW, to be clear, is this 3-hour timedelta from the time difference between America/Los_Angeles and America/New_York?

Member Author

Yes, the 3-hour timedelta is the time difference between them.
I'll add some comments.
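
For reference, the kind of comment being requested could simply spell out the offset (a sketch, not the test code itself):

from datetime import datetime, timedelta

# America/New_York is 3 hours ahead of America/Los_Angeles, so the same
# instant renders 3 hours later as a New York wall-clock time.
la_rendering = datetime(1970, 1, 1, 0, 0, 1)        # 1970-01-01 00:00:01 in LA
ny_rendering = la_rendering + timedelta(hours=3)    # 1970-01-01 03:00:01 in NY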

df_la = df.withColumn("tscopy", f_timestamp_copy(col("timestamp"))) \
.withColumn("internal_value", internal_value(col("timestamp")))
result_la = df_la.select(col("idx"), col("internal_value")).collect()
diff = 3 * 60 * 60 * 1000 * 1000 * 1000
Member

Here too. It took me a while to figure out where this 3 came from ..

Member Author

Yes, I'll add some comments.
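
For reference, the comment being asked for here might read like this (a sketch):

# Pandas stores timestamps as datetime64[ns], so the 3-hour offset between
# America/Los_Angeles and America/New_York is expressed in nanoseconds when
# comparing internal values.
diff = 3 * 60 * 60 * 1000 * 1000 * 1000   # 3 hours in nanoseconds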

s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
if not copied and s is not pdf[field.name]:
pdf = pdf.copy()
copied = True
Member

Would you mind if I ask why we should copy here? Some comments explaining it would probably be helpful. To be clear, is it to prevent the original Pandas DataFrame from being updated?

Member Author

Yes, it's to prevent the original one from being updated.
I'll add some comments.
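
For reference, a sketch of the copy-once pattern under discussion (the helper name and variables come from the diff above; the loop itself is illustrative):

copied = False   # copy the caller's DataFrame at most once
for field in schema:
    s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
    if s is not pdf[field.name]:
        if not copied:
            # Only copy when a column actually changes, so the Pandas
            # DataFrame handed in by the caller is never mutated.
            pdf = pdf.copy()
            copied = True
        pdf[field.name] = s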

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84210 has finished for PR 19607 at commit 40a9735.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,
  • class _ImageSchema(object):
  • raise RuntimeError(\"Creating instance of _ImageSchema class is disallowed.\")

@@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
record_type_list.append((str(col_names[i]), curr_type))
return np.dtype(record_type_list) if has_rec_fix else None

def _convert_from_pandas(self, pdf):
def _convert_from_pandas(self, pdf, schema, timezone):
Member

Just an idea, not blocking this PR. We probably have enough code to make a separate Python file / class that puts the Pandas / Arrow stuff in one place.

Member Author

Thanks, I agree, but I'll leave those as they are in this PR.

@HyukjinKwon
Member

LGTM

@SparkQA

SparkQA commented Nov 28, 2017

Test build #84242 has finished for PR 19607 at commit 9200f38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Nov 28, 2017

I guess we should look at R to see if it should behave similarly. WDYT @HyukjinKwon?

@HyukjinKwon
Member

HyukjinKwon commented Nov 28, 2017

Yup, I think we should take a look at POSIXct / POSIXlt in R versus timestamp within Spark too. From a quick look, it doesn't seem to respect the session timezone.

@cloud-fan
Contributor

LGTM, merging to master!

Let's fix the R timestamp issue in a new ticket.

@HyukjinKwon
Member

Yup, I was testing and trying to produce details. Let me describe this in the JIRA, not here :D.

@asfgit asfgit closed this in 64817c4 Nov 28, 2017
@HyukjinKwon
Member

Sorry, but does anyone remember how we are going to deal with df.collect() in PySpark? The R fix would be more like df.collect(). It would be good to file a JIRA for df.collect() in PySpark too while we are here, if I haven't missed some discussion about it.

Filed for R anyway - https://issues.apache.org/jira/browse/SPARK-22632.

@ueshin
Member Author

ueshin commented Nov 28, 2017

Unfortunately, df.collect() is out of the scope of this PR. Its timestamp values will still respect the Python system timezone.

@HyukjinKwon
Member

HyukjinKwon commented Nov 28, 2017

Yup, it should be separate. I meant to file another JIRA while we are here if it is something we need to fix, before we forget. If df.collect() is not meant to be fixed, I think I should reread the discussion and maybe resolve the R JIRA.

@ueshin
Member Author

ueshin commented Nov 28, 2017

We'd probably need at least two external libraries that Pandas uses, like dateutil and pytz, to handle timezones properly, but I have no idea how to handle timezones properly outside of Pandas for now.
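
For context, a hedged sketch of the kind of conversion that would be needed outside Pandas, e.g. with pytz (PySpark does not do this here):

import pytz
from datetime import datetime

session_tz = pytz.timezone("America/Los_Angeles")
utc_value = datetime(1970, 1, 1, 8, 0, 1, tzinfo=pytz.utc)   # internal value 28801 seconds
print(utc_value.astimezone(session_tz))                      # 1970-01-01 00:00:01-08:00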

@HyukjinKwon
Member

Hm .. I see. But is this something we should ideally fix, though? I am asking because I am checking the R-related code now ..

@ueshin
Member Author

ueshin commented Nov 28, 2017

Ah, yes, I think so.

@HyukjinKwon
Member

Thanks. I was just worried that I had missed some discussion somewhere and wanted to double-check.
