[SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone #19607
Conversation
Test build #83205 has finished for PR 19607 at commit
Jenkins, retest this please.
Test build #83207 has finished for PR 19607 at commit
Force-pushed from e28fc87 to 5c08ecf
Test build #83212 has finished for PR 19607 at commit
Test build #83211 has finished for PR 19607 at commit
Test build #83280 has finished for PR 19607 at commit
Test build #83286 has finished for PR 19607 at commit
@BryanCutler I'm fixing the behavior of
Force-pushed from 5689d01 to 6872516
Test build #83295 has finished for PR 19607 at commit
Hi @ueshin, what is the oldest version of Pandas that we need to support, and what exactly wasn't working with it?
@BryanCutler I guess the oldest version of Pandas is
Yea, that was my proposal. If anything is blocked by this, I think we can consider bumping it up as an option because, IMHO, technically the fixed version specification is not yet released and published. ^ cc @cloud-fan, @srowen and @viirya
I haven't looked at this closely yet but will definitely try to take a look and help soon. I would appreciate it if the problem (or just symptoms, or just a pointer) could be described, though, if it is not too complex.
Test build #83320 has finished for PR 19607 at commit
retest this please
Jenkins, retest this please.
Test build #83324 has finished for PR 19607 at commit
python/pyspark/sql/types.py
Outdated
@@ -1629,35 +1629,121 @@ def to_arrow_type(dt):
     return arrow_type


-def _check_dataframe_localize_timestamps(pdf):
+def to_arrow_schema(schema):
where do we use this method?
Ah, currently it isn't used. I'll remove it for now.
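For context, a minimal sketch of what such a `to_arrow_schema` helper could look like, assuming pyarrow is available; the type mapping here is illustrative and far smaller than the real `to_arrow_type`:

```python
import pyarrow as pa
from pyspark.sql.types import LongType, TimestampType


def to_arrow_type(dt):
    # Illustrative mapping only; the real helper covers many more Spark types.
    if isinstance(dt, LongType):
        return pa.int64()
    elif isinstance(dt, TimestampType):
        # Arrow timestamps carry an explicit timezone; UTC is used here as an example.
        return pa.timestamp('us', tz='UTC')
    raise TypeError("Unsupported type: %s" % dt)


def to_arrow_schema(schema):
    # Build one Arrow field per Spark StructField, preserving nullability.
    fields = [pa.field(f.name, to_arrow_type(f.dataType), nullable=f.nullable)
              for f in schema.fields]
    return pa.schema(fields)
```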
python/pyspark/sql/types.py
Outdated
        return s.dt.tz_convert('UTC')
    else:
        return s
except ImportError:
I think we should bump up pandas version if we can't find a workaround.
Sure, let me look into it a little more and summarize what version we can support.
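As a sketch of the kind of version guard being discussed (hedged: the 0.19.2 floor comes from the setup.py change below, and `_old_pandas_exception_message` is the helper name visible in this diff; this is not the PR's final code):

```python
from distutils.version import LooseVersion


def _old_pandas_exception_message(e):
    # Wrap the original ImportError with a hint about the minimum supported version.
    return "Pandas >= 0.19.2 must be installed; however, it was not found. %s" % e


def _require_minimum_pandas_version():
    import pandas
    if LooseVersion(pandas.__version__) < LooseVersion('0.19.2'):
        raise ImportError("Pandas >= 0.19.2 must be installed; however, "
                          "your version was %s." % pandas.__version__)
```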
@@ -948,6 +948,14 @@ object SQLConf {
    .intConf
    .createWithDefault(10000)

  val PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE =
can we clean up the code more if we don't have this config?
Sure, I'll try it.
    buildConf("spark.sql.execution.pandas.respectSessionTimeZone")
      .internal()
      .doc("When true, make Pandas DataFrame with timestamp type respecting session local " +
        "timezone when converting to/from Pandas DataFrame.")
Emphasize the conf will be deprecated? e.g.:

> When true, make Pandas DataFrame with timestamp type respecting session local timezone when converting to/from Pandas DataFrame. This configuration will be deprecated in the future releases.
Sure, I'll update it.
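For reference, a usage sketch of the conf named above (assumes an existing `spark` session and a DataFrame `df`; setting it to false would restore the legacy behavior):

```python
# Opt back into the legacy behavior that uses the Python system timezone
# instead of the session timezone when converting to/from Pandas.
spark.conf.set("spark.sql.execution.pandas.respectSessionTimeZone", "false")
pdf = df.toPandas()
```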
@@ -201,7 +201,7 @@ def _supports_symlinks():
    extras_require={
        'ml': ['numpy>=1.7'],
        'mllib': ['numpy>=1.7'],
-       'sql': ['pandas>=0.13.0']
+       'sql': ['pandas>=0.19.2']
Document this requirement and behavior changes in the Migration Guide?
Sure, I'll add it.
Test build #84205 has finished for PR 19607 at commit
Test build #84204 has finished for PR 19607 at commit
retest this please
Looks fine to me.
python/pyspark/sql/types.py
Outdated
    elif is_datetime64tz_dtype(s.dtype):
        return s.dt.tz_convert('UTC')
    else:
        return s


def _check_series_convert_timestamps_localize(s, fromTimezone, toTimezone):
Nit: maybe `from_timezone`.
Thanks, I'll update it. Maybe `toTimestamp` -> `to_timestamp` as well.
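To illustrate what this helper does with the renamed snake_case parameters, a minimal sketch (assumes pandas >= 0.19.2; the `'tzlocal()'` fallback string is the one visible in this diff, and the real implementation handles more edge cases):

```python
from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype


def _check_series_convert_timestamps_localize(s, from_timezone, to_timezone):
    # Fall back to the local timezone when either side is unspecified.
    from_tz = from_timezone or 'tzlocal()'
    to_tz = to_timezone or 'tzlocal()'
    if is_datetime64tz_dtype(s.dtype):
        # tz-aware values: convert to the target timezone, then drop the tz info.
        return s.dt.tz_convert(to_tz).dt.tz_localize(None)
    elif is_datetime64_dtype(s.dtype):
        # tz-naive values: interpret them in from_tz first, then convert.
        return s.dt.tz_localize(from_tz).dt.tz_convert(to_tz).dt.tz_localize(None)
    else:
        return s
```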
python/pyspark/sql/types.py
Outdated
        from pandas.api.types import is_datetime64tz_dtype, is_datetime64_dtype
    except ImportError as e:
        raise ImportError(_old_pandas_exception_message(e))
    fromTz = fromTimezone or 'tzlocal()'
Ditto.
I'll update it.
python/pyspark/sql/tests.py
Outdated
        self.assertNotEqual(result_ny, result_la)

        result_la_corrected = [Row(**{k: v - timedelta(hours=3) if k == '7_timestamp_t' else v
Small comments here would be helpful. BTW, to be clear, is this 3-hour timedelta from the America/Los_Angeles and America/New_York time difference?
Yes, the 3-hour timedelta is the time difference. I'll add some comments.
python/pyspark/sql/tests.py
Outdated
        df_la = df.withColumn("tscopy", f_timestamp_copy(col("timestamp"))) \
            .withColumn("internal_value", internal_value(col("timestamp")))
        result_la = df_la.select(col("idx"), col("internal_value")).collect()
        diff = 3 * 60 * 60 * 1000 * 1000 * 1000
Here too. It took me a while to figure out where this 3 came from.
Yes, I'll add some comments.
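For readers following along, the 3 in both snippets is the fixed offset between the two test timezones (a quick illustration, not code from the PR):

```python
from datetime import timedelta

# America/New_York is 3 hours ahead of America/Los_Angeles, so results
# collected under the NY session need this correction to match the LA run.
offset = timedelta(hours=3)

# The same offset in nanoseconds, as used by the internal_value check:
diff = 3 * 60 * 60 * 1000 * 1000 * 1000  # 10,800,000,000,000 ns
```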
python/pyspark/sql/session.py
Outdated
            s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
            if not copied and s is not pdf[field.name]:
                pdf = pdf.copy()
                copied = True
Would you mind if I ask why we should copy here? Probably, some comments explaining it would be helpful. To be clear, is it to prevent the original Pandas DataFrame from being updated?
Yes, it's to prevent the original one from being updated.
I'll add some comments.
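The copy-before-mutate pattern under discussion, as a small standalone sketch (hypothetical helper name and conversion callback; the point is that the caller's DataFrame is never modified in place, and unchanged frames are never copied):

```python
def _convert_timestamp_columns(pdf, convert):
    # Copy lazily: only when a column actually changes, so the caller's
    # DataFrame is left untouched and no-op conversions stay cheap.
    copied = False
    for name in pdf.columns:
        s = convert(pdf[name])
        if s is not pdf[name]:
            if not copied:
                pdf = pdf.copy()  # protect the original from in-place updates
                copied = True
            pdf[name] = s
    return pdf
```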
Test build #84210 has finished for PR 19607 at commit
@@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
        record_type_list.append((str(col_names[i]), curr_type))
    return np.dtype(record_type_list) if has_rec_fix else None

-    def _convert_from_pandas(self, pdf):
+    def _convert_from_pandas(self, pdf, schema, timezone):
Just an idea, not blocking this PR: we probably have enough code now to make a separate Python file / class to put the Pandas / Arrow stuff in one place.
Thanks, I agree, but I'll leave those as they are in this PR.
LGTM
Test build #84242 has finished for PR 19607 at commit
I guess we should look at R to see if it should behave similarly? WDYT @HyukjinKwon?
Yup, I think we should take a look at POSIXct / POSIXlt in R and timestamp within Spark too. At a quick look, it seems not to respect the session timezone.
LGTM, merging to master! Let's fix the R timestamp issue in a new ticket.
Yup, I was testing and trying to produce details. Let me describe this in the JIRA, not here :D.
Sorry, but does anyone remember how we are going to deal with
Filed for R anyway - https://issues.apache.org/jira/browse/SPARK-22632.
Unfortunately,
Yup, it should be separate. I meant to file another JIRA while we are here if it is something we need to fix before forgetting. If
Maybe we need at least 2 external libraries like
Hm, I see. But is this something we should ideally fix? I am asking because I am checking R-related code now.
Ah, yes, I think so.
Thanks. I was just worried if I missed any discussion somewhere and wanted to double check.
What changes were proposed in this pull request?
When converting Pandas DataFrame/Series from/to Spark DataFrame using `toPandas()` or pandas udfs, timestamp values respect the Python system timezone instead of the session timezone.

For example, let's say we use "America/Los_Angeles" as the session timezone and have a timestamp value "1970-01-01 00:00:01" in that timezone. Btw, I'm in Japan, so the Python timezone would be "Asia/Tokyo".

The timestamp value from the current `toPandas()` will be the following:

As you can see, the value becomes "1970-01-01 17:00:01" because it respects the Python timezone.

As we discussed in #18664, we consider this behavior a bug, and the value should be "1970-01-01 00:00:01".
.How was this patch tested?
Added new tests and ran existing tests.