ARROW-1579: [Java] Adding containerized Spark Integration tests #1319

BryanCutler · 2017-11-14T18:52:45Z

This adds configuration files for running containerized integration tests with Spark for Java and Python. The image is based off of maven with openJDK-8 and will include a copy of Arrow when built on the parent directory to ARROW_HOME. Running the container will do the following:

Install Arrow Java artifacts to m2 repo in container
Checkout current Spark master branch
Update Spark POM to use the installed version of Arrow and package
Run Arrow related Java/Scala Spark tests
Run pyarrow related Python Spark tests

Successful completion of the running container will indicate that there are no breaking changes between current Spark code and the Arrow code

BryanCutler · 2017-11-14T18:58:16Z

This is currently a WIP, the Scala/Java tests are able to run

Left TODO:

Run PySpark Tests
Verify working with docker-compose and existing volumes in arrow/dev
Check why Zinc is unable to run in mvn build, need to enable port 3030?
Speed up pyarrow build using conda prefix as toolchain

BryanCutler · 2017-11-14T19:03:53Z

I don't prior experience with Docker, so any suggestions are welcome! @heimir-sverrisson @wesm please take a look when you have a chance. One question I have from looking at the Dask integration tests is that are we assuming that pyarrow is already built prior to creating the docker image?

wesm · 2017-11-14T19:09:26Z

I think it needs to be built as part of the Docker build. We should try to create some reusable scripts like our Travis CI setup to help with automating certain common tasks (setting up the C++ toolchain, building things, etc.)

xhochy

Small comment about better usage of docker's ability to cache layer of the image creation.

xhochy · 2017-11-15T10:35:50Z

dev/spark_integration/Dockerfile

+# limitations under the License.
+#
+FROM maven:3.5.2-jdk-8-slim
+ADD . /apache-arrow


Add apache-arrow as late as a possible. From the point on where you have added it, docker's caching will not be effective anymore. For example, once you move the apt-get line before the add, a subsequent build of the image will not run apt-get again but use the cached layer on top of the maven image that already is at the state after the apt-get call.

Great tip, thanks @xhochy ! apt-get takes a long time and I didn't know why it only sometimes ended up being cached.

wesm · 2018-01-18T16:54:17Z

@BryanCutler or @icexelloss could we get this set up? I'd like to look into setting up nightly integration test runs on a server someplace (we have a shared machine that @cpcloud and I have been using for nightly builds, and the pandas team users for benchmarks) so we can get warned immediately if something gets broken

BryanCutler · 2018-01-18T17:49:33Z

I'll try to pick this up again in the next few days @wesm. One question though, when we do an API breaking change, like rename MapVector, then this will fail until Spark is patched. Do we need to keep a branch with these kind of patches until they can be applied upstream?

wesm · 2018-01-18T18:02:22Z

I think that's fine. The failing nightly integration test will help with creating some urgency to get the downstream project fixed

…k tests

BryanCutler · 2018-01-26T01:40:27Z

This is pretty much running for me now, but I'm getting a strange test failure when running pyspark tests

======================================================================
FAIL: test_unsupported_datatype (pyspark.sql.tests.ArrowTests)
----------------------------------------------------------------------
AttributeError: 'LooseVersion' object has no attribute 'version'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apache-arrow/spark/python/pyspark/sql/tests.py", line 3404, in test_unsupported_datatype
    df.toPandas()
AssertionError: "Unsupported data type" does not match "'LooseVersion' object has no attribute 'version'"

Not sure why because the tests pass normally, maybe something with the conda setup? I'll try some more later..

xhochy · 2018-01-28T17:58:18Z

@BryanCutler I had a look into the build and found some errors with the Python/C++ scripts (will do a PR for that) but I also cannot get mvn install to work correctly. Any ideas why it installs to this location?

[INFO] Installing /apache-arrow/arrow/java/tools/target/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar to /apache-arrow/arrow/java/?/.m2/repository/org/apache/arrow/arrow-tools/0.9.0-SNAPSHOT/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar

BryanCutler · 2018-01-29T19:47:57Z

@xhochy that is strange, until I switched the image to be based on FROM maven:3.5.2-jdk-8-slim I was seeing similar problems with paths containing ? but since then it's been working fine and installs under root

[INFO] Installing /apache-arrow/arrow/java/tools/target/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar to /root/.m2/repository/org/apache/arrow/arrow-tools/0.9.0-SNAPSHOT/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar

I did follow this docker post-installation step, maybe that has something to do with it? https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user

I've also been running with my local .m2 repo mounted in the container with docker run -v /home/bryan/.m2:/root/.m2 spark-arrow to save from always downloading dependencies, but I tried skipping that still don't get any errors.

Besides the strange path, what is the exact error that it gives you for mvn install?

BryanCutler · 2018-01-29T21:20:07Z

Looks like the errors I'm getting is because pa.__version__ return None

>>> import pyarrow as pa
>>> from distutils.version import LooseVersion
>>> LooseVersion(pa.__version__) < LooseVersion('0.8.0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bryan/miniconda2/envs/pyarrow-dev/lib/python2.7/distutils/version.py", line 296, in __cmp__
    return cmp(self.version, other.version)
AttributeError: LooseVersion instance has no attribute 'version'
>>> pa.__version__
>>>

Is there something in the build I need to set to get a valid version? cc @wesm

BryanCutler · 2018-01-29T21:43:24Z

from pyarrow.__init__.py it looks like maybe I should install setuptools_scm but then I get the following version:

>>> pa.__version__
'0.1.1.dev1231+g0a49022'

I would expect something like '0.8.***', where is it getting the above number from?

wesm · 2018-01-29T23:15:39Z

Not my area of expertise (@xhochy or @cpcloud would know better) -- usually that relates to not having fetched all the tags

felixcheung

maybe some native toolchain is missing when switching to maven:3.5.2-jdk-8-slim

felixcheung · 2018-01-30T07:27:40Z

dev/spark_integration/spark_integration.sh

+popd
+
+# Build Spark with Arrow
+SPARK_REPO=https://github.com/apache/spark.git


I think we should pull from apache git?

Thanks @felixcheung ! Yeah, it would probably be best to use Apache, I'll change it

BryanCutler · 2018-01-30T21:56:49Z

@xhochy any pointers on how to get a correct pyarrow version? If you are able to do a PR to this branch for the errors you found, that would be great - I can make you a collaborator if that would be easier.

wesm · 2018-01-30T22:18:33Z

@BryanCutler I think all committers on GitBox have write access to your branch, so we can push directly

xhochy · 2018-02-04T13:51:18Z

@BryanCutler The main change I did was to move source activate pyarrow-dev to the beginning of the file, set export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX and removed LD_LIBRARY_PATH and PYTHONPATH from the file. This should pick up the conda environment and speed up Arrow builds a lot. That should actually not affect maven builds.

…thon dir

BryanCutler · 2018-02-09T01:30:54Z

Ok, I finally got this to build all and pass all tests! There are still a couple of issues to work out though, I'll discuss below..

Btw, to get the correct pyarrow.__version__ from the dev env, you do need to have all git tags fetched and install setuptools_scm from pip or conda. @xhochy , setuptools_scm wasn't listed in any of the developer docs I could find, should it be added to the list of dependent packages for setting up a conda env?

BryanCutler · 2018-02-09T01:38:12Z

python/pyarrow/__init__.py

@@ -24,7 +24,7 @@
   # package is not installed
    try:
        import setuptools_scm
-        __version__ = setuptools_scm.get_version('../')
+        __version__ = setuptools_scm.get_version(root='../../', relative_to=__file__)


@xhochy and @wesm , I needed to change this because it would only give a version if run under ARROW_HOME/python directory. So when running Spark tests, on importing pyarrow it would return None. Making it relative to the __file__ seemed to fix it for all cases. I can make this a separate JIRA if you think that would be better.

See the suggestion above to get rid of this line.

BryanCutler · 2018-02-09T01:44:55Z

@xhochy , I could not get Arrow C++ to build with export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX, I would get a linking error with gflags like "undefined reference google::FlagRegisterer::FlagRegisterer". I thought maybe it was because I wasn't using g++ 4.9, but I had no luck trying to get 4.9 installed since the base image I'm using is Ubuntu 16.04. Have you ever run into this? It seemed like it was some kind of template constructor that it couldn't find..

wesm · 2018-02-09T21:40:16Z

We'll need to add some flags to ARROW_CXXFLAGS to disable the gcc5 ABI. I can try to take a look in a bit or this weekend

xhochy

To use the conda toolchain, set "ARROW_CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0" this switches the code then to the correct C++ ABI.

xhochy · 2018-02-10T18:24:31Z

dev/spark_integration/spark_integration.sh

+# limitations under the License.
+#
+
+# Set up environment and working directory


Call set -e here, then a command failure in the script will lead to a failure of the whole script. Then you can get rid of the if [[ $? -ne 0 ]]; then blocks later on

xhochy · 2018-02-10T18:25:31Z

dev/spark_integration/spark_integration.sh

+# Build pyarrow and install inplace
+pushd arrow/python
+python setup.py clean
+python setup.py build_ext --build-type=release --inplace


Don't use --inplace but rather run python setup.py build_ext --build-type=release install to install the extension into the environment.

xhochy · 2018-02-10T18:26:22Z

dev/spark_integration/spark_integration.sh

+popd
+
+# Clean up
+echo "Cleaning up.."


No need for these two line, at the end the environment is thrown away anyways

xhochy · 2018-02-10T18:27:13Z

python/pyarrow/__init__.py

@@ -24,7 +24,7 @@
   # package is not installed
    try:
        import setuptools_scm
-        __version__ = setuptools_scm.get_version('../')
+        __version__ = setuptools_scm.get_version(root='../../', relative_to=__file__)


See the suggestion above to get rid of this line.

BryanCutler · 2018-02-13T19:42:28Z

Thanks @xhochy , I'll try all this out!

BryanCutler · 2018-02-13T19:55:56Z

I still get the linking error after setting "ARROW_CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0"

[ 78%] Linking CXX executable ../../../release/json-integration-test
CMakeFiles/json-integration-test.dir/json-integration-test.cc.o: In function `_GLOBAL__sub_I__ZN5arrow4test25MakeRandomInt32PoolBufferElPNS_10MemoryPoolEPSt10shared_ptrINS_10PoolBufferEEj':
json-integration-test.cc:(.text.startup+0x1de): undefined reference to `google::FlagRegisterer::FlagRegisterer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
json-integration-test.cc:(.text.startup+0x296): undefined reference to `google::FlagRegisterer::FlagRegisterer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
json-integration-test.cc:(.text.startup+0x34e): undefined reference to `google::FlagRegisterer::FlagRegisterer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
collect2: error: ld returned 1 exit status

wesm · 2018-02-13T21:43:32Z

Looks like that flag needs to be passed through to the ExternalProject declarations. Try using CMAKE_CXX_FLAGS instead of ARROW_CXXFLAGS? If this doesn't work I can dig in later today

BryanCutler · 2018-02-13T22:42:14Z

thanks @wesm , that seemed to do the trick by adding it to the cmake command line (it didn't work as an env var)
cmake -DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" -DARROW_PYTHON=on -DARROW_HDFS=on -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..

BryanCutler · 2018-02-14T00:46:33Z

Well, I'm able to run the build with ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX but I do need to keep -DCMAKE_INSTALL_PREFIX=$ARROW_HOME for the Python build to work (is this right?), and then LD_LIBRARY_PATH=${ARROW_HOME}/lib:${CONDA_BASE}/lib:${LD_LIBRARY_PATH} to locate libarrow.so.

Now, I get an error with import pyarrow because of some undefined symbol:
ImportError: /home/ubuntu/miniconda/envs/pyarrow-dev/lib/python2.7/site-packages/pyarrow-0.8.1.dev116+g7bf7b2e9-py2.7-linux-x86_64.egg/pyarrow/lib.so: undefined symbol: _ZN5arrow9timestampENS_8TimeUnit4typeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

I don't think this is from an arrow library, but I can't tell where it's coming from, any ideas?

wesm · 2018-02-14T01:37:07Z

Now you need to set the PYARROW_CXXFLAGS environment variable with the gcc5 abi flag

…arrow python dir" This reverts commit 3305a35.

BryanCutler · 2018-02-14T19:03:53Z

This is working using ARROW_BUILD_TOOLCHAIN set to the conda evn, and I think this could be merged now. Is somebody else able to verify that it is working before merging?

wesm · 2018-02-15T16:49:00Z

Sweet, I can take this for a spin later today, or if @cpcloud wants to look that also works

wesm · 2018-02-16T02:16:09Z

Build failed for me with

Scanning dependencies of target arrow_static
Scanning dependencies of target arrow_shared
make[2]: *** No rule to make target '/home/ubuntu/miniconda/envs/pyarrow-dev/lib/libbrotlidec.a', needed by 'release/libarrow.so.0.0.0'.  Stop.
CMakeFiles/Makefile2:693: recipe for target 'src/arrow/CMakeFiles/arrow_shared.dir/all' failed
make[1]: *** [src/arrow/CMakeFiles/arrow_shared.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 52%] Linking CXX static library ../../release/libarrow.a
[ 52%] Built target arrow_static
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2

Haven't dug too far into what's wrong. Can we document how to run this someplace other than the Dockerfile?

BryanCutler · 2018-02-16T03:20:33Z

Thanks @wesm , I think that is because the brotli libraries were recently updated and #1554 should fix that. Could you check if you have the latest master checked out? The docker script just takes what is in your current dir - not sure if that is what we want or not.

BryanCutler · 2018-02-16T03:23:18Z

yeah, we should have it better documented. I copied how the api docs script was done and it's run with docker compose, but I'm not quite sure how to run it with through that. I just used the regular docker run cmd

xhochy · 2018-02-16T17:37:07Z

As this is part of a docker-compose script, this should be run using docker-compose build && docker-compose run spark_integration. Then it will use the options declared in the docker-compose.yml (mostly the mount of the current working dir)

xhochy · 2018-02-16T17:37:41Z

To get past the build issues I had to rebase locally. Once the script has run through I will report back.

BryanCutler · 2018-02-16T18:38:31Z

I probably should have rebased here earlier, if anyone else is going to try this out let me know and I can do that first.

xhochy · 2018-02-16T19:02:22Z

Sadly I get the following error on master:

[warn] Class org.apache.avro.reflect.Stringable not found - continuing with a stub.
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:65: not found: type NullableMapVector
[error]       case (StructType(_), vector: NullableMapVector) =>
[error]                                    ^
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:66: value size is not a member of org.apache.arrow.vector.ValueVector
[error]         val children = (0 until vector.size()).map { ordinal =>
[error]                                        ^
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:67: value getChildByOrdinal is not a member of org.apache.arrow.vector.ValueVector
[error]           createFieldWriter(vector.getChildByOrdinal(ordinal))
[error]                                    ^
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:318: not found: type NullableMapVector
[error]     val valueVector: NullableMapVector,
[error]                      ^
[warn] Class org.apache.avro.reflect.Stringable not found - continuing with a stub.
[warn] two warnings found
[error] four errors found
[error] Compile failed at Feb 16, 2018 6:49:56 PM [50.120s]

BryanCutler · 2018-02-16T19:22:42Z

That error is from #1490 being merged yesterday - what I was talking about in #1319 (comment) but I was hoping to at least get this running for a little while first before breaking Spark!

I guess we could either rebase this just before #1490 to ensure it works or I can provide a patch for Spark to update it for that breaking change in Java?

xhochy · 2018-02-16T22:00:59Z

We definitely need an upstream patch for Spark but I'm happy with merging this PR as-is as it shows that we can now test Spark with it. @wesm is this ok for you?

wesm · 2018-02-17T16:42:53Z

Works for me. Merging now

BryanCutler · 2018-02-19T19:55:26Z

Thanks @wesm @xhochy and @felixcheung ! Since it can sometimes take a while to get Spark updated, ff we get to the point were this is ready to be put in the nightly builds, maybe I could submit a PR to patch Spark and we could configure the docker build to point to that.

xhochy reviewed Nov 15, 2017

View reviewed changes

wesm changed the title ~~[WIP] ARROW-1579: [TESTS] Adding containerized Spark Integration tests~~ [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests Nov 18, 2017

BryanCutler added 4 commits January 24, 2018 16:27

added initial docker files

09c961e

docker running Scala Arrow tests

b04b0e5

using build/mvn script for spark builds

e38d43d

running python tests now and building pyarrow, not yet passing pyspar…

95eb22a

…k tests

BryanCutler force-pushed the docker-spark-integration-ARROW-1579 branch from 5d398e8 to 95eb22a Compare January 26, 2018 01:37

felixcheung reviewed Jan 30, 2018

View reviewed changes

BryanCutler added 2 commits February 8, 2018 17:21

added fix for using setuptools_scm to get version outside of arrow py…

3305a35

…thon dir

docker container now running and passing all tests

b2182d3

BryanCutler commented Feb 9, 2018

View reviewed changes

xhochy reviewed Feb 10, 2018

View reviewed changes

BryanCutler added 2 commits February 14, 2018 10:32

Revert "added fix for using setuptools_scm to get version outside of …

f962e38

…arrow python dir" This reverts commit 3305a35.

now building with ARROW_BUILD_TOOLCHAIN set to conda env

3f9f483

BryanCutler changed the title ~~[WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests~~ ARROW-1579: [Java] Adding containerized Spark Integration tests Feb 14, 2018

wesm mentioned this pull request Feb 15, 2018

Importing parquet fails in debug mode #1612

Closed

wesm closed this in 62b9eb2 Feb 17, 2018

jameslamb mentioned this pull request Oct 27, 2018

ARROW-3366: [R] Dockerfile for docker-compose setup #2770

Closed

BryanCutler deleted the docker-spark-integration-ARROW-1579 branch November 19, 2018 05:49

ARROW-1579: [Java] Adding containerized Spark Integration tests #1319

ARROW-1579: [Java] Adding containerized Spark Integration tests #1319

Conversation

BryanCutler commented Nov 14, 2017

BryanCutler commented Nov 14, 2017 • edited Loading

BryanCutler commented Nov 14, 2017

wesm commented Nov 14, 2017

xhochy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wesm commented Jan 18, 2018

BryanCutler commented Jan 18, 2018

wesm commented Jan 18, 2018

BryanCutler commented Jan 26, 2018

xhochy commented Jan 28, 2018

BryanCutler commented Jan 29, 2018

BryanCutler commented Jan 29, 2018

BryanCutler commented Jan 29, 2018 • edited Loading

wesm commented Jan 29, 2018

felixcheung left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BryanCutler commented Jan 30, 2018

wesm commented Jan 30, 2018

xhochy commented Feb 4, 2018

BryanCutler commented Feb 9, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BryanCutler commented Feb 9, 2018

wesm commented Feb 9, 2018

xhochy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BryanCutler commented Feb 13, 2018

BryanCutler commented Feb 13, 2018

wesm commented Feb 13, 2018

BryanCutler commented Feb 13, 2018

BryanCutler commented Feb 14, 2018 • edited Loading

wesm commented Feb 14, 2018

BryanCutler commented Feb 14, 2018

wesm commented Feb 15, 2018

wesm commented Feb 16, 2018 • edited Loading

BryanCutler commented Feb 16, 2018

BryanCutler commented Feb 16, 2018

xhochy commented Feb 16, 2018

xhochy commented Feb 16, 2018

BryanCutler commented Feb 16, 2018

xhochy commented Feb 16, 2018

BryanCutler commented Feb 16, 2018

xhochy commented Feb 16, 2018

wesm commented Feb 17, 2018

BryanCutler commented Feb 19, 2018

BryanCutler commented Nov 14, 2017 •

edited

Loading

BryanCutler commented Jan 29, 2018 •

edited

Loading

BryanCutler commented Feb 9, 2018 •

edited

Loading

BryanCutler commented Feb 14, 2018 •

edited

Loading

wesm commented Feb 16, 2018 •

edited

Loading