[SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0 #19884
Conversation
Test build #84447 has finished for PR 19884 at commit
The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:
Great, @BryanCutler. Could you put the highlights in the PR description, too?
This is a WIP to start updating Spark to use Arrow 0.8.0 which will be released soon. TODO:
pom.xml
Outdated
@@ -185,7 +185,7 @@
 <paranamer.version>2.8</paranamer.version>
 <maven-antrun.version>1.8</maven-antrun.version>
 <commons-crypto.version>1.0.0</commons-crypto.version>
-<arrow.version>0.4.0</arrow.version>
+<arrow.version>0.8.0-SNAPSHOT</arrow.version>
Is there any ETA for the official 0.8.0?
We are still wrapping a few things up, should be later this week or early next week.
Can we download the snapshot from somewhere for our local tests?
Should be able to cut an RC beginning of next week. I would suggest mvn-installing from Arrow master for the time being
Sure, thanks @dongjoon-hyun! Will do, just want to go back and check the release notes first.
cc @zsxwing as well, I saw you opened a JIRA about this - SPARK-22656
@zsxwing, fyi after applying your Netty upgrade patch to Arrow, and then your other patch for Spark, all of the Spark Scala/Java tests pass
python/pyspark/sql/types.py
Outdated
 spark_type = StringType()
 elif at == pa.date32():
 spark_type = DateType()
-elif type(at) == pa.TimestampType:
+elif pa.types.is_timestamp(at):
@icexelloss @wesm is this the recommended way to check the type id for the latest pyarrow? For types with a single bit width, I am using the is_* functions, like is_timestamp, but for others I still need to check object equality, such as t == pa.date32(), because there is no is_date32(), only is_date().
Yep, this is right. I'm opening a JIRA to add more functions for testing exact types
Sounds good, thanks for confirming!
Test build #84616 has finished for PR 19884 at commit
Test build #84663 has finished for PR 19884 at commit
When I tried to run tests locally, I got an OutOfMemoryException as follows:
[info] org.apache.arrow.memory.OutOfMemoryException:
[info] at org.apache.arrow.vector.complex.AbstractContainerVector.allocateNew(AbstractContainerVector.java:52)
[info] at org.apache.spark.sql.execution.arrow.ArrowWriter$$anonfun$1.apply(ArrowWriter.scala:40)
Shouldn't we explicitly use allocateNew() or something?
I'll wait for the next update. Thanks!
python/pyspark/serializers.py
Outdated
-mask = None if casted.dtype == 'object' else s.isnull()
-return pa.Array.from_pandas(casted, mask=mask, type=t)
+mask = s.isnull()
+# Workaround for casting timestamp units with timezone, ARROW-1906
Will the fix for this workaround be included in Arrow 0.8?
Yes, just fixed in ARROW-1906 apache/arrow#1411
pom.xml
Outdated
@@ -185,7 +185,7 @@
 <paranamer.version>2.8</paranamer.version>
 <maven-antrun.version>1.8</maven-antrun.version>
 <commons-crypto.version>1.0.0</commons-crypto.version>
-<arrow.version>0.4.0</arrow.version>
+<arrow.version>0.8.0-SNAPSHOT</arrow.version>
Please don't forget that we also need to update the dev/deps/spark-deps-hadoop-2.x files.
done
@zsxwing that's right, we will have to coordinate to make sure the Jenkins pyarrow is upgraded to version 0.8 as well. I'm not sure of the best way to coordinate all of this, because this PR, the Jenkins upgrade, and the Spark Netty upgrade all need to happen at the same time. @holdenk @shaneknapp will one of you be able to work on the pyarrow upgrade for Jenkins sometime around next week? (assuming Arrow 0.8 is released in the next day or so)
@BryanCutler could you just pull my changes into this PR since we need both changes to pass Jenkins? Thanks!
Yeah, I can do the upgrade next week. I'll be working remotely from the east coast, but will be unavailable on Monday due to travel.
The Arrow 0.8.0 release vote just started today. Assuming it passes, the earliest you could see packages pushed to PyPI or conda-forge would be sometime on Thursday evening or Friday.
Force-push: fdba406 to 46ad595
Test build #84738 has finished for PR 19884 at commit
Force-push: 46ad595 to c3d612f
Test build #84740 has finished for PR 19884 at commit
dev/deps/spark-deps-hadoop-2.6
Outdated
-netty-all-4.0.47.Final.jar
+netty-all-4.1.17.Final.jar
+netty-buffer-4.1.17.Final.jar
+netty-common-4.1.17.Final.jar
@zsxwing do you think netty-buffer and netty-common can be safely excluded in the Spark pom because the same classes are also in netty-all?
@BryanCutler Yes. It should be safe.
Cool, thx just wanted to be sure
Great, thanks @shaneknapp! I'll ping you when I think we are set to go.
Force-push: c3d612f to 3a5e3c1
Test build #84743 has finished for PR 19884 at commit
@@ -110,3 +110,12 @@
 for i in range(0, len(arr)):
 jarr[i] = arr[i]
 return jarr
+
+
+def _require_minimum_pyarrow_version():
@ueshin did we do the same thing for pandas?
No. I just checked whether ImportError occurred or not. We should do the same thing for pandas later.
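A sketch of what such a minimum-version check can look like. The names, message wording, and required version are illustrative; the `installed` parameter exists only to make the sketch testable without pyarrow present:

```python
def require_minimum_pyarrow_version(minimum="0.8.0", installed=None):
    """Raise ImportError if pyarrow is missing or older than `minimum`."""
    if installed is None:
        try:
            import pyarrow
        except ImportError:
            raise ImportError("pyarrow >= %s must be installed" % minimum)
        installed = pyarrow.__version__
    # Compare numeric dotted versions, ignoring non-numeric components.
    def as_tuple(version):
        return tuple(int(p) for p in version.split(".")[:3] if p.isdigit())
    if as_tuple(installed) < as_tuple(minimum):
        raise ImportError("pyarrow >= %s must be installed; found %s"
                          % (minimum, installed))
```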
LGTM, I'm also fine to ignore some tests if they are hard to fix, to unblock other PRs sooner.
I used a workaround for timestamp casts that allows the tests to pass for me locally, and left a note to look into the root cause later. Hopefully this should pass now and we will be good to merge.
python/pyspark/sql/functions.py
Outdated
 >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
 ...     .show()  # doctest: +SKIP
 +----------+--------------+------------+
 |slen(name)|to_upper(name)|add_one(age)|
 +----------+--------------+------------+
-|         8|      JOHN DOE|          22|
+|         8|          JOHN|          22|
nit: we should revert this too
oops, done!
Test build #85244 has finished for PR 19884 at commit
Test build #85242 has finished for PR 19884 at commit
Jenkins, retest this please.
Test build #85246 has finished for PR 19884 at commit
LGTM. Merged to master.
Hi @zsxwing, is it okay to resolve SPARK-19552?
@BryanCutler can you give me a minimal repro for the timestamp issue you cited above?
@HyukjinKwon yeah, I closed the ticket.
Thanks all for reviewing and getting the Netty upgrade in also!
Sure @wesm, I'll ping you with a repro.
What changes were proposed in this pull request? This is a follow-up PR of #19884 updating the setup.py file to add the pyarrow dependency. How was this patch tested? Existing tests. Author: Takuya UESHIN <[email protected]> Closes #20089 from ueshin/issues/SPARK-22324/fup1.
@BryanCutler, did we resolve #19884 (comment)? If not, shall we file a JIRA?
@HyukjinKwon ARROW-1949 was created to add an option to allow truncation when data will be lost. Once that is in Arrow, we can remove the workaround if we want.
@@ -91,7 +91,7 @@
 }

 @Override
-public long transfered() {
+public long transferred() {
This breaks binary compatibility. Is it intentional? @zsxwing @cloud-fan
It doesn't. The old method is implemented in AbstractFileRegion.transfered. In addition, the whole network module is private, so we don't need to maintain compatibility.
Oh, I see. AbstractFileRegion.transfered is final, so it may break binary compatibility. However, this is fine since it's a private module.
I see. Thanks!
Upgrade Spark to Arrow 0.8.0 for Java and Python. Also includes an upgrade of Netty to 4.1.17 to resolve dependency requirements. The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:
* Java refactoring for a simpler API
* Java reduced heap usage and streamlined hot code paths
* Type support for DecimalType, ArrayType
* Improved type casting support in Python
* Simplified type checking in Python
Existing tests.
Author: Bryan Cutler <[email protected]>
Author: Shixiong Zhu <[email protected]>
Closes apache#19884 from BryanCutler/arrow-upgrade-080-SPARK-22324.
What changes were proposed in this pull request?
Upgrade Spark to Arrow 0.8.0 for Java and Python. Also includes an upgrade of Netty to 4.1.17 to resolve dependency requirements.
The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:
* Java refactoring for a simpler API
* Java reduced heap usage and streamlined hot code paths
* Type support for DecimalType, ArrayType
* Improved type casting support in Python
* Simplified type checking in Python
How was this patch tested?
Existing tests