[SPARK-19613][SS][TEST] Random.nextString is not safe for directory namePrefix #21446

dongjoon-hyun · 2018-05-28T16:02:30Z

What changes were proposed in this pull request?

Random.nextString is good for generating random string data, but it's not proper for directory name prefix in Utils.createDirectory(tempDir, Random.nextString(10)). This PR uses more safe directory namePrefix.

scala> scala.util.Random.nextString(10)
res0: String = 馨쭔ᎰႻ穚䃈兩㻞藑並

StateStoreRDDSuite:
- versioning and immutability
- recovering from files
- usage with iterators - only gets and only puts
- preferred locations using StateStoreCoordinator *** FAILED ***
  java.io.IOException: Failed to create a temp directory (under /.../spark/sql/core/target/tmp/StateStoreRDDSuite8712796397908632676) after 10 attempts!
  at org.apache.spark.util.Utils$.createDirectory(Utils.scala:295)
  at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13$$anonfun$apply$6.apply(StateStoreRDDSuite.scala:152)
  at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13$$anonfun$apply$6.apply(StateStoreRDDSuite.scala:149)
  at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
  at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13.apply(StateStoreRDDSuite.scala:149)
  at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13.apply(StateStoreRDDSuite.scala:149)
...
- distributed test *** FAILED ***
  java.io.IOException: Failed to create a temp directory (under /.../spark/sql/core/target/tmp/StateStoreRDDSuite8712796397908632676) after 10 attempts!
  at org.apache.spark.util.Utils$.createDirectory(Utils.scala:295)

How was this patch tested?

Pass the existing tests.StateStoreRDDSuite:

…amePrefix

SparkQA · 2018-05-28T19:50:55Z

Test build #91231 has finished for PR 21446 at commit a46dc70.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-28T20:12:35Z

Test build #91232 has finished for PR 21446 at commit 1412abd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung

good point, LGTM.

HeartSaVioR

LGTM, nice finding!

dongjoon-hyun · 2018-05-29T00:37:17Z

Thank you for reviewing, @felixcheung and @HeartSaVioR.

HyukjinKwon · 2018-05-29T02:26:43Z

Yea, I was facing this problem too. Thanks for fixing this.

HyukjinKwon

LGTM

…amePrefix ## What changes were proposed in this pull request? `Random.nextString` is good for generating random string data, but it's not proper for directory name prefix in `Utils.createDirectory(tempDir, Random.nextString(10))`. This PR uses more safe directory namePrefix. ```scala scala> scala.util.Random.nextString(10) res0: String = 馨쭔ᎰႻ穚䃈兩㻞藑並 ``` ```scala StateStoreRDDSuite: - versioning and immutability - recovering from files - usage with iterators - only gets and only puts - preferred locations using StateStoreCoordinator *** FAILED *** java.io.IOException: Failed to create a temp directory (under /.../spark/sql/core/target/tmp/StateStoreRDDSuite8712796397908632676) after 10 attempts! at org.apache.spark.util.Utils$.createDirectory(Utils.scala:295) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13$$anonfun$apply$6.apply(StateStoreRDDSuite.scala:152) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13$$anonfun$apply$6.apply(StateStoreRDDSuite.scala:149) at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13.apply(StateStoreRDDSuite.scala:149) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13.apply(StateStoreRDDSuite.scala:149) ... - distributed test *** FAILED *** java.io.IOException: Failed to create a temp directory (under /.../spark/sql/core/target/tmp/StateStoreRDDSuite8712796397908632676) after 10 attempts! at org.apache.spark.util.Utils$.createDirectory(Utils.scala:295) ``` ## How was this patch tested? Pass the existing tests.StateStoreRDDSuite: Author: Dongjoon Hyun <[email protected]> Closes #21446 from dongjoon-hyun/SPARK-19613. (cherry picked from commit b31b587) Signed-off-by: hyukjinkwon <[email protected]>

HyukjinKwon · 2018-05-29T02:36:16Z

Merged to master and branch-2.3.

dongjoon-hyun · 2018-05-29T04:34:31Z

Thank you for review and merging, @HyukjinKwon . Thank you all!

* [SPARK-24381][TESTING] Add unit tests for NOT IN subquery around null values ## What changes were proposed in this pull request? This PR adds several unit tests along the `cols NOT IN (subquery)` pathway. There are a scattering of tests here and there which cover this codepath, but there doesn't seem to be a unified unit test of the correctness of null-aware anti joins anywhere. I have also added a brief explanation of how this expression behaves in SubquerySuite. Lastly, I made some clarifying changes in the NOT IN pathway in RewritePredicateSubquery. ## How was this patch tested? Added unit tests! There should be no behavioral change in this PR. Author: Miles Yucht <[email protected]> Closes #21425 from mgyucht/spark-24381. * [SPARK-24334] Fix race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator ## What changes were proposed in this pull request? There is a race condition of closing Arrow VectorSchemaRoot and Allocator in the writer thread of ArrowPythonRunner. The race results in memory leak exception when closing the allocator. This patch removes the closing routine from the TaskCompletionListener and make the writer thread responsible for cleaning up the Arrow memory. This issue be reproduced by this test: ``` def test_memory_leak(self): from pyspark.sql.functions import pandas_udf, col, PandasUDFType, array, lit, explode # Have all data in a single executor thread so it can trigger the race condition easier with self.sql_conf({'spark.sql.shuffle.partitions': 1}): df = self.spark.range(0, 1000) df = df.withColumn('id', array([lit(i) for i in range(0, 300)])) \ .withColumn('id', explode(col('id'))) \ .withColumn('v', array([lit(i) for i in range(0, 1000)])) pandas_udf(df.schema, PandasUDFType.GROUPED_MAP) def foo(pdf): xxx return pdf result = df.groupby('id').apply(foo) with QuietTest(self.sc): with self.assertRaises(py4j.protocol.Py4JJavaError) as context: result.count() self.assertTrue('Memory leaked' not in str(context.exception)) ``` Note: Because of the race condition, the test case cannot reproduce the issue reliably so it's not added to test cases. ## How was this patch tested? Because of the race condition, the bug cannot be unit test easily. So far it has only happens on large amount of data. This is currently tested manually. Author: Li Jin <[email protected]> Closes #21397 from icexelloss/SPARK-24334-arrow-memory-leak. * [SPARK-24373][SQL] Add AnalysisBarrier to RelationalGroupedDataset's and KeyValueGroupedDataset's child ## What changes were proposed in this pull request? When we create a `RelationalGroupedDataset` or a `KeyValueGroupedDataset` we set its child to the `logicalPlan` of the `DataFrame` we need to aggregate. Since the `logicalPlan` is already analyzed, we should not analyze it again. But this happens when the new plan of the aggregate is analyzed. The current behavior in most of the cases is likely to produce no harm, but in other cases re-analyzing an analyzed plan can change it, since the analysis is not idempotent. This can cause issues like the one described in the JIRA (missing to find a cached plan). The PR adds an `AnalysisBarrier` to the `logicalPlan` which is used as child of `RelationalGroupedDataset` or a `KeyValueGroupedDataset`. ## How was this patch tested? added UT Author: Marco Gaido <[email protected]> Closes #21432 from mgaido91/SPARK-24373. * [SPARK-24392][PYTHON] Label pandas_udf as Experimental ## What changes were proposed in this pull request? The pandas_udf functionality was introduced in 2.3.0, but is not completely stable and still evolving. This adds a label to indicate it is still an experimental API. ## How was this patch tested? NA Author: Bryan Cutler <[email protected]> Closes #21435 from BryanCutler/arrow-pandas_udf-experimental-SPARK-24392. * [SPARK-19613][SS][TEST] Random.nextString is not safe for directory namePrefix ## What changes were proposed in this pull request? `Random.nextString` is good for generating random string data, but it's not proper for directory name prefix in `Utils.createDirectory(tempDir, Random.nextString(10))`. This PR uses more safe directory namePrefix. ```scala scala> scala.util.Random.nextString(10) res0: String = 馨쭔ᎰႻ穚䃈兩㻞藑並 ``` ```scala StateStoreRDDSuite: - versioning and immutability - recovering from files - usage with iterators - only gets and only puts - preferred locations using StateStoreCoordinator *** FAILED *** java.io.IOException: Failed to create a temp directory (under /.../spark/sql/core/target/tmp/StateStoreRDDSuite8712796397908632676) after 10 attempts! at org.apache.spark.util.Utils$.createDirectory(Utils.scala:295) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13$$anonfun$apply$6.apply(StateStoreRDDSuite.scala:152) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13$$anonfun$apply$6.apply(StateStoreRDDSuite.scala:149) at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13.apply(StateStoreRDDSuite.scala:149) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13.apply(StateStoreRDDSuite.scala:149) ... - distributed test *** FAILED *** java.io.IOException: Failed to create a temp directory (under /.../spark/sql/core/target/tmp/StateStoreRDDSuite8712796397908632676) after 10 attempts! at org.apache.spark.util.Utils$.createDirectory(Utils.scala:295) ``` ## How was this patch tested? Pass the existing tests.StateStoreRDDSuite: Author: Dongjoon Hyun <[email protected]> Closes #21446 from dongjoon-hyun/SPARK-19613. * [SPARK-24377][SPARK SUBMIT] make --py-files work in non pyspark application ## What changes were proposed in this pull request? For some Spark applications, though they're a java program, they require not only jar dependencies, but also python dependencies. One example is Livy remote SparkContext application, this application is actually an embedded REPL for Scala/Python/R, it will not only load in jar dependencies, but also python and R deps, so we should specify not only "--jars", but also "--py-files". Currently for a Spark application, --py-files can only be worked for a pyspark application, so it will not be worked in the above case. So here propose to remove such restriction. Also we tested that "spark.submit.pyFiles" only supports quite limited scenario (client mode with local deps), so here also expand the usage of "spark.submit.pyFiles" to be alternative of --py-files. ## How was this patch tested? UT added. Author: jerryshao <[email protected]> Closes #21420 from jerryshao/SPARK-24377. * [SPARK-24250][SQL][FOLLOW-UP] support accessing SQLConf inside tasks ## What changes were proposed in this pull request? We should not stop users from calling `getActiveSession` and `getDefaultSession` in executors. To not break the existing behaviors, we should simply return None. ## How was this patch tested? N/A Author: Xiao Li <[email protected]> Closes #21436 from gatorsmile/followUpSPARK-24250. * [SPARK-23991][DSTREAMS] Fix data loss when WAL write fails in allocateBlocksToBatch When blocks tried to get allocated to a batch and WAL write fails then the blocks will be removed from the received block queue. This fact simply produces data loss because the next allocation will not find the mentioned blocks in the queue. In this PR blocks will be removed from the received queue only if WAL write succeded. Additional unit test. Author: Gabor Somogyi <[email protected]> Closes #21430 from gaborgsomogyi/SPARK-23991. Change-Id: I5ead84f0233f0c95e6d9f2854ac2ff6906f6b341 * [SPARK-24371][SQL] Added isInCollection in DataFrame API for Scala and Java. ## What changes were proposed in this pull request? Implemented **`isInCollection `** in DataFrame API for both Scala and Java, so users can do ```scala val profileDF = Seq( Some(1), Some(2), Some(3), Some(4), Some(5), Some(6), Some(7), None ).toDF("profileID") val validUsers: Seq[Any] = Seq(6, 7.toShort, 8L, "3") val result = profileDF.withColumn("isValid", $"profileID". isInCollection(validUsers)) result.show(10) """ +---------+-------+ |profileID|isValid| +---------+-------+ | 1| false| | 2| false| | 3| true| | 4| false| | 5| false| | 6| true| | 7| true| | null| null| +---------+-------+ """.stripMargin ``` ## How was this patch tested? Several unit tests are added. Author: DB Tsai <[email protected]> Closes #21416 from dbtsai/optimize-set. * [SPARK-24365][SQL] Add Data Source write benchmark ## What changes were proposed in this pull request? Add Data Source write benchmark. So that it would be easier to measure the writer performance. Author: Gengliang Wang <[email protected]> Closes #21409 from gengliangwang/parquetWriteBenchmark. * [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR ## What changes were proposed in this pull request? The PR adds functions `arrays_overlap`, `array_repeat`, `map_entries` to SparkR. ## How was this patch tested? Tests added into R/pkg/tests/fulltests/test_sparkSQL.R ## Examples ### arrays_overlap ``` df <- createDataFrame(list(list(list(1L, 2L), list(3L, 1L)), list(list(1L, 2L), list(3L, 4L)), list(list(1L, NA), list(3L, 4L)))) collect(select(df, arrays_overlap(df[[1]], df[[2]]))) ``` ``` arrays_overlap(_1, _2) 1 TRUE 2 FALSE 3 NA ``` ### array_repeat ``` df <- createDataFrame(list(list("a", 3L), list("b", 2L))) collect(select(df, array_repeat(df[[1]], df[[2]]))) ``` ``` array_repeat(_1, _2) 1 a, a, a 2 b, b ``` ``` collect(select(df, array_repeat(df[[1]], 2L))) ``` ``` array_repeat(_1, 2) 1 a, a 2 b, b ``` ### map_entries ``` df <- createDataFrame(list(list(map = as.environment(list(x = 1, y = 2))))) collect(select(df, map_entries(df$map))) ``` ``` map_entries(map) 1 x, 1, y, 2 ``` Author: Marek Novotny <[email protected]> Closes #21434 from mn-mikke/SPARK-24331. * [SPARK-23754][PYTHON] Re-raising StopIteration in client code ## What changes were proposed in this pull request? Make sure that `StopIteration`s raised in users' code do not silently interrupt processing by spark, but are raised as exceptions to the users. The users' functions are wrapped in `safe_iter` (in `shuffle.py`), which re-raises `StopIteration`s as `RuntimeError`s ## How was this patch tested? Unit tests, making sure that the exceptions are indeed raised. I am not sure how to check whether a `Py4JJavaError` contains my exception, so I simply looked for the exception message in the java exception's `toString`. Can you propose a better way? ## License This is my original work, licensed in the same way as spark Author: e-dorigatti <[email protected]> Author: edorigatti <[email protected]> Closes #21383 from e-dorigatti/fix_spark_23754. * [SPARK-24419][BUILD] Upgrade SBT to 0.13.17 with Scala 2.10.7 for JDK9+ ## What changes were proposed in this pull request? Upgrade SBT to 0.13.17 with Scala 2.10.7 for JDK9+ ## How was this patch tested? Existing tests Author: DB Tsai <[email protected]> Closes #21458 from dbtsai/sbt. * [SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set ## What changes were proposed in this pull request? This pr fixed an issue when having multiple distinct aggregations having the same argument set, e.g., ``` scala>: paste val df = sql( s"""SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) | FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y) """.stripMargin) java.lang.RuntimeException You hit a query analyzer bug. Please report your query to Spark user mailing list. ``` The root cause is that `RewriteDistinctAggregates` can't detect multiple distinct aggregations if they have the same argument set. This pr modified code so that `RewriteDistinctAggregates` could count the number of aggregate expressions with `isDistinct=true`. ## How was this patch tested? Added tests in `DataFrameAggregateSuite`. Author: Takeshi Yamamuro <[email protected]> Closes #21443 from maropu/SPARK-24369. * [SPARK-24384][PYTHON][SPARK SUBMIT] Add .py files correctly into PythonRunner in submit with client mode in spark-submit ## What changes were proposed in this pull request? In client side before context initialization specifically, .py file doesn't work in client side before context initialization when the application is a Python file. See below: ``` $ cat /home/spark/tmp.py def testtest(): return 1 ``` This works: ``` $ cat app.py import pyspark pyspark.sql.SparkSession.builder.getOrCreate() import tmp print("************************%s" % tmp.testtest()) $ ./bin/spark-submit --master yarn --deploy-mode client --py-files /home/spark/tmp.py app.py ... ************************1 ``` but this doesn't: ``` $ cat app.py import pyspark import tmp pyspark.sql.SparkSession.builder.getOrCreate() print("************************%s" % tmp.testtest()) $ ./bin/spark-submit --master yarn --deploy-mode client --py-files /home/spark/tmp.py app.py Traceback (most recent call last): File "/home/spark/spark/app.py", line 2, in <module> import tmp ImportError: No module named tmp ``` ### How did it happen? In client mode specifically, the paths are being added into PythonRunner as are: https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L430 https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L49-L88 The problem here is, .py file shouldn't be added as are since `PYTHONPATH` expects a directory or an archive like zip or egg. ### How does this PR fix? We shouldn't simply just add its parent directory because other files in the parent directory could also be added into the `PYTHONPATH` in client mode before context initialization. Therefore, we copy .py files into a temp directory for .py files and add it to `PYTHONPATH`. ## How was this patch tested? Unit tests are added and manually tested in both standalond and yarn client modes with submit. Author: hyukjinkwon <[email protected]> Closes #21426 from HyukjinKwon/SPARK-24384. * [SPARK-23161][PYSPARK][ML] Add missing APIs to Python GBTClassifier ## What changes were proposed in this pull request? Add featureSubsetStrategy in GBTClassifier and GBTRegressor. Also make GBTClassificationModel inherit from JavaClassificationModel instead of prediction model so it will have numClasses. ## How was this patch tested? Add tests in doctest Author: Huaxin Gao <[email protected]> Closes #21413 from huaxingao/spark-23161. * [SPARK-23901][SQL] Add masking functions ## What changes were proposed in this pull request? The PR adds the masking function as they are described in Hive's documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions. This means that only `string`s are accepted as parameter for the masking functions. ## How was this patch tested? added UTs Author: Marco Gaido <[email protected]> Closes #21246 from mgaido91/SPARK-23901. * [SPARK-24276][SQL] Order of literals in IN should not affect semantic equality ## What changes were proposed in this pull request? When two `In` operators are created with the same list of values, but different order, we are considering them as semantically different. This is wrong, since they have the same semantic meaning. The PR adds a canonicalization rule which orders the literals in the `In` operator so the semantic equality works properly. ## How was this patch tested? added UT Author: Marco Gaido <[email protected]> Closes #21331 from mgaido91/SPARK-24276. * [SPARK-24337][CORE] Improve error messages for Spark conf values ## What changes were proposed in this pull request? Improve the exception messages when retrieving Spark conf values to include the key name when the value is invalid. ## How was this patch tested? Unit tests for all get* operations in SparkConf that require a specific value format Author: William Sheu <[email protected]> Closes #21454 from PenguinToast/SPARK-24337-spark-config-errors. * [SPARK-24146][PYSPARK][ML] spark.ml parity for sequential pattern mining - PrefixSpan: Python API ## What changes were proposed in this pull request? spark.ml parity for sequential pattern mining - PrefixSpan: Python API ## How was this patch tested? doctests Author: WeichenXu <[email protected]> Closes #21265 from WeichenXu123/prefix_span_py. * [WEBUI] Avoid possibility of script in query param keys As discussed separately, this avoids the possibility of XSS on certain request param keys. CC vanzin Author: Sean Owen <[email protected]> Closes #21464 from srowen/XSS2. * [SPARK-24414][UI] Calculate the correct number of tasks for a stage. This change takes into account all non-pending tasks when calculating the number of tasks to be shown. This also means that when the stage is pending, the task table (or, in fact, most of the data in the stage page) will not be rendered. I also fixed the label when the known number of tasks is larger than the recorded number of tasks (it was inverted). Author: Marcelo Vanzin <[email protected]> Closes #21457 from vanzin/SPARK-24414. * [SPARK-24397][PYSPARK] Added TaskContext.getLocalProperty(key) in Python ## What changes were proposed in this pull request? This adds a new API `TaskContext.getLocalProperty(key)` to the Python TaskContext. It mirrors the Java TaskContext API of returning a string value if the key exists, or None if the key does not exist. ## How was this patch tested? New test added. Author: Tathagata Das <[email protected]> Closes #21437 from tdas/SPARK-24397. * [SPARK-23900][SQL] format_number support user specifed format as argument ## What changes were proposed in this pull request? `format_number` support user specifed format as argument. For example: ```sql spark-sql> SELECT format_number(12332.123456, '##################.###'); 12332.123 ``` ## How was this patch tested? unit test Author: Yuming Wang <[email protected]> Closes #21010 from wangyum/SPARK-23900. * [SPARK-24232][K8S] Add support for secret env vars ## What changes were proposed in this pull request? * Allows to refer a secret as an env var. * Introduces new config properties in the form: spark.kubernetes{driver,executor}.secretKeyRef.ENV_NAME=name:key ENV_NAME is case sensitive. * Updates docs. * Adds required unit tests. ## How was this patch tested? Manually tested and confirmed that the secrets exist in driver's and executor's container env. Also job finished successfully. First created a secret with the following yaml: ``` apiVersion: v1 kind: Secret metadata: name: test-secret data: username: c3RhdnJvcwo= password: Mzk1MjgkdmRnN0pi ------- $ echo -n 'stavros' | base64 c3RhdnJvcw== $ echo -n '39528$vdg7Jb' | base64 MWYyZDFlMmU2N2Rm ``` Run a job as follows: ```./bin/spark-submit \ --master k8s://http://localhost:9000 \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image=skonto/spark:k8envs3 \ --conf spark.kubernetes.driver.secretKeyRef.MY_USERNAME=test-secret:username \ --conf spark.kubernetes.driver.secretKeyRef.My_password=test-secret:password \ --conf spark.kubernetes.executor.secretKeyRef.MY_USERNAME=test-secret:username \ --conf spark.kubernetes.executor.secretKeyRef.My_password=test-secret:password \ local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar 10000 ``` Secret loaded correctly at the driver container: ![image](https://user-images.githubusercontent.com/7945591/40174346-7fee70c8-59dd-11e8-8705-995a5472716f.png) Also if I log into the exec container: kubectl exec -it spark-pi-1526555613156-exec-1 bash bash-4.4# env > SPARK_EXECUTOR_MEMORY=1g > SPARK_EXECUTOR_CORES=1 > LANG=C.UTF-8 > HOSTNAME=spark-pi-1526555613156-exec-1 > SPARK_APPLICATION_ID=spark-application-1526555618626 > **MY_USERNAME=stavros** > > JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk > KUBERNETES_PORT_443_TCP_PROTO=tcp > KUBERNETES_PORT_443_TCP_ADDR=10.100.0.1 > JAVA_VERSION=8u151 > KUBERNETES_PORT=tcp://10.100.0.1:443 > PWD=/opt/spark/work-dir > HOME=/root > SPARK_LOCAL_DIRS=/var/data/spark-b569b0ae-b7ef-4f91-bcd5-0f55535d3564 > KUBERNETES_SERVICE_PORT_HTTPS=443 > KUBERNETES_PORT_443_TCP_PORT=443 > SPARK_HOME=/opt/spark > SPARK_DRIVER_URL=spark://CoarseGrainedSchedulerspark-pi-1526555613156-driver-svc.default.svc:7078 > KUBERNETES_PORT_443_TCP=tcp://10.100.0.1:443 > SPARK_EXECUTOR_POD_IP=9.0.9.77 > TERM=xterm > SPARK_EXECUTOR_ID=1 > SHLVL=1 > KUBERNETES_SERVICE_PORT=443 > SPARK_CONF_DIR=/opt/spark/conf > PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/jvm/java-1.8-openjdk/jre/bin:/usr/lib/jvm/java-1.8-openjdk/bin > JAVA_ALPINE_VERSION=8.151.12-r0 > KUBERNETES_SERVICE_HOST=10.100.0.1 > **My_password=39528$vdg7Jb** > _=/usr/bin/env > Author: Stavros Kontopoulos <[email protected]> Closes #21317 from skonto/k8s-fix-env-secrets. * [MINOR][YARN] Add YARN-specific credential providers in debug logging message This PR adds a debugging log for YARN-specific credential providers which is loaded by service loader mechanism. It took me a while to debug if it's actually loaded or not. I had to explicitly set the deprecated configuration and check if that's actually being loaded. The change scope is manually tested. Logs are like: ``` Using the following builtin delegation token providers: hadoopfs, hive, hbase. Using the following YARN-specific credential providers: yarn-test. ``` Author: hyukjinkwon <[email protected]> Closes #21466 from HyukjinKwon/minor-log. Change-Id: I18e2fb8eeb3289b148f24c47bb3130a560a881cf * [SPARK-24330][SQL] Refactor ExecuteWriteTask and Use `while` in writing files ## What changes were proposed in this pull request? 1. Refactor ExecuteWriteTask in FileFormatWriter to reduce common logic and improve readability. After the change, callers only need to call `commit()` or `abort` at the end of task. Also there is less code in `SingleDirectoryWriteTask` and `DynamicPartitionWriteTask`. Definitions of related classes are moved to a new file, and `ExecuteWriteTask` is renamed to `FileFormatDataWriter`. 2. As per code style guide: https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex , we avoid using `for` for looping in [FileFormatWriter](https://github.com/apache/spark/pull/21381/files#diff-3b69eb0963b68c65cfe8075f8a42e850L536) , or `foreach` in [WriteToDataSourceV2Exec](https://github.com/apache/spark/pull/21381/files#diff-6fbe10db766049a395bae2e785e9d56eL119). In such critical code path, using `while` is good for performance. ## How was this patch tested? Existing unit test. I tried the microbenchmark in https://github.com/apache/spark/pull/21409 | Workload | Before changes(Best/Avg Time(ms)) | After changes(Best/Avg Time(ms)) | | --- | --- | -- | |Output Single Int Column| 2018 / 2043 | 2096 / 2236 | |Output Single Double Column| 1978 / 2043 | 2013 / 2018 | |Output Int and String Column| 6332 / 6706 | 6162 / 6298 | |Output Partitions| 4458 / 5094 | 3792 / 4008 | |Output Buckets| 5695 / 6102 | 5120 / 5154 | Also a microbenchmark on my laptop for general comparison among while/foreach/for : ``` class Writer { var sum = 0L def write(l: Long): Unit = sum += l } def testWhile(iterator: Iterator[Long]): Long = { val w = new Writer while (iterator.hasNext) { w.write(iterator.next()) } w.sum } def testForeach(iterator: Iterator[Long]): Long = { val w = new Writer iterator.foreach(w.write) w.sum } def testFor(iterator: Iterator[Long]): Long = { val w = new Writer for (x <- iterator) { w.write(x) } w.sum } val data = 0L to 100000000L val start = System.nanoTime (0 to 10).foreach(_ => testWhile(data.iterator)) println("benchmark while: " + (System.nanoTime - start)/1000000) val start2 = System.nanoTime (0 to 10).foreach(_ => testForeach(data.iterator)) println("benchmark foreach: " + (System.nanoTime - start2)/1000000) val start3 = System.nanoTime (0 to 10).foreach(_ => testForeach(data.iterator)) println("benchmark for: " + (System.nanoTime - start3)/1000000) ``` Benchmark result: `while`: 15401 ms `foreach`: 43034 ms `for`: 41279 ms Author: Gengliang Wang <[email protected]> Closes #21381 from gengliangwang/refactorExecuteWriteTask. * [SPARK-24444][DOCS][PYTHON] Improve Pandas UDF docs to explain column assignment ## What changes were proposed in this pull request? Added sections to pandas_udf docs, in the grouped map section, to indicate columns are assigned by position. ## How was this patch tested? NA Author: Bryan Cutler <[email protected]> Closes #21471 from BryanCutler/arrow-doc-pandas_udf-column_by_pos-SPARK-21427. * [SPARK-24326][MESOS] add support for local:// scheme for the app jar ## What changes were proposed in this pull request? * Adds support for local:// scheme like in k8s case for image based deployments where the jar is already in the image. Affects cluster mode and the mesos dispatcher. Covers also file:// scheme. Keeps the default case where jar resolution happens on the host. ## How was this patch tested? Dispatcher image with the patch, use it to start DC/OS Spark service: skonto/spark-local-disp:test Test image with my application jar located at the root folder: skonto/spark-local:test Dockerfile for that image. From mesosphere/spark:2.3.0-2.2.1-2-hadoop-2.6 COPY spark-examples_2.11-2.2.1.jar / WORKDIR /opt/spark/dist Tests: The following work as expected: * local normal example ``` dcos spark run --submit-args="--conf spark.mesos.appJar.local.resolution.mode=container --conf spark.executor.memory=1g --conf spark.mesos.executor.docker.image=skonto/spark-local:test --conf spark.executor.cores=2 --conf spark.cores.max=8 --class org.apache.spark.examples.SparkPi local:///spark-examples_2.11-2.2.1.jar" ``` * make sure the flag does not affect other uris ``` dcos spark run --submit-args="--conf spark.mesos.appJar.local.resolution.mode=container --conf spark.executor.memory=1g --conf spark.executor.cores=2 --conf spark.cores.max=8 --class org.apache.spark.examples.SparkPi https://s3-eu-west-1.amazonaws.com/fdp-stavros-test/spark-examples_2.11-2.1.1.jar" ``` * normal example no local ``` dcos spark run --submit-args="--conf spark.executor.memory=1g --conf spark.executor.cores=2 --conf spark.cores.max=8 --class org.apache.spark.examples.SparkPi https://s3-eu-west-1.amazonaws.com/fdp-stavros-test/spark-examples_2.11-2.1.1.jar" ``` The following fails * uses local with no setting, default is host. ``` dcos spark run --submit-args="--conf spark.executor.memory=1g --conf spark.mesos.executor.docker.image=skonto/spark-local:test --conf spark.executor.cores=2 --conf spark.cores.max=8 --class org.apache.spark.examples.SparkPi local:///spark-examples_2.11-2.2.1.jar" ``` ![image](https://user-images.githubusercontent.com/7945591/40283021-8d349762-5c80-11e8-9d62-2a61a4318fd5.png) Author: Stavros Kontopoulos <[email protected]> Closes #21378 from skonto/local-upstream. * [SPARK-23920][SQL] add array_remove to remove all elements that equal element from array ## What changes were proposed in this pull request? add array_remove to remove all elements that equal element from array ## How was this patch tested? add unit tests Author: Huaxin Gao <[email protected]> Closes #21069 from huaxingao/spark-23920. * [SPARK-24351][SS] offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode ## What changes were proposed in this pull request? Compute the thresholdBatchId to purge metadata based on current committed epoch instead of currentBatchId in CP mode to avoid cleaning all the committed metadata in some case as described in the jira [SPARK-24351](https://issues.apache.org/jira/browse/SPARK-24351). ## How was this patch tested? Add new unit test. Author: Huang Tengfei <[email protected]> Closes #21400 from ivoson/branch-cp-meta. * Revert "[SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set" This reverts commit 1e46f92f956a00d04d47340489b6125d44dbd47b. * [INFRA] Close stale PRs. Closes #21444 * [SPARK-24340][CORE] Clean up non-shuffle disk block manager files following executor exits on a Standalone cluster ## What changes were proposed in this pull request? Currently we only clean up the local directories on application removed. However, when executors die and restart repeatedly, many temp files are left untouched in the local directories, which is undesired behavior and could cause disk space used up gradually. We can detect executor death in the Worker, and clean up the non-shuffle files (files not ended with ".index" or ".data") in the local directories, we should not touch the shuffle files since they are expected to be used by the external shuffle service. Scope of this PR is limited to only implement the cleanup logic on a Standalone cluster, we defer to experts familiar with other cluster managers(YARN/Mesos/K8s) to determine whether it's worth to add similar support. ## How was this patch tested? Add new test suite to cover. Author: Xingbo Jiang <[email protected]> Closes #21390 from jiangxb1987/cleanupNonshuffleFiles. * [SPARK-23668][K8S] Added missing config property in running-on-kubernetes.md ## What changes were proposed in this pull request? PR https://github.com/apache/spark/pull/20811 introduced a new Spark configuration property `spark.kubernetes.container.image.pullSecrets` for specifying image pull secrets. However, the documentation wasn't updated accordingly. This PR adds the property introduced into running-on-kubernetes.md. ## How was this patch tested? N/A. foxish mccheah please help merge this. Thanks! Author: Yinan Li <[email protected]> Closes #21480 from liyinan926/master. * [SPARK-24356][CORE] Duplicate strings in File.path managed by FileSegmentManagedBuffer This patch eliminates duplicate strings that come from the 'path' field of java.io.File objects created by FileSegmentManagedBuffer. That is, we want to avoid the situation when multiple File instances for the same pathname "foo/bar" are created, each with a separate copy of the "foo/bar" String instance. In some scenarios such duplicate strings may waste a lot of memory (~ 10% of the heap). To avoid that, we intern the pathname with String.intern(), and before that we make sure that it's in a normalized form (contains no "//", "///" etc.) Otherwise, the code in java.io.File would normalize it later, creating a new "foo/bar" String copy. Unfortunately, the normalization code that java.io.File uses internally is in the package-private class java.io.FileSystem, so we cannot call it here directly. ## What changes were proposed in this pull request? Added code to ExternalShuffleBlockResolver.getFile(), that normalizes and then interns the pathname string before passing it to the File() constructor. ## How was this patch tested? Added unit test Author: Misha Dmitriev <[email protected]> Closes #21456 from countmdm/misha/spark-24356. * [SPARK-24455][CORE] fix typo in TaskSchedulerImpl comment change runTasks to submitTasks in the TaskSchedulerImpl.scala 's comment Author: xueyu <[email protected]> Author: Xue Yu <[email protected]> Closes #21485 from xueyumusic/fixtypo1. * [SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set ## What changes were proposed in this pull request? bring back https://github.com/apache/spark/pull/21443 This is a different approach: just change the check to count distinct columns with `toSet` ## How was this patch tested? a new test to verify the planner behavior. Author: Wenchen Fan <[email protected]> Author: Takeshi Yamamuro <[email protected]> Closes #21487 from cloud-fan/back. * [SPARK-23786][SQL] Checking column names of csv headers ## What changes were proposed in this pull request? Currently column names of headers in CSV files are not checked against provided schema of CSV data. It could cause errors like showed in the [SPARK-23786](https://issues.apache.org/jira/browse/SPARK-23786) and https://github.com/apache/spark/pull/20894#issuecomment-375957777. I introduced new CSV option - `enforceSchema`. If it is enabled (by default `true`), Spark forcibly applies provided or inferred schema to CSV files. In that case, CSV headers are ignored and not checked against the schema. If `enforceSchema` is set to `false`, additional checks can be performed. For example, if column in CSV header and in the schema have different ordering, the following exception is thrown: ``` java.lang.IllegalArgumentException: CSV file header does not contain the expected fields Header: depth, temperature Schema: temperature, depth CSV file: marina.csv ``` ## How was this patch tested? The changes were tested by existing tests of CSVSuite and by 2 new tests. Author: Maxim Gekk <[email protected]> Author: Maxim Gekk <[email protected]> Closes #20894 from MaxGekk/check-column-names. * [SPARK-23903][SQL] Add support for date extract ## What changes were proposed in this pull request? Add support for date `extract` function: ```sql spark-sql> SELECT EXTRACT(YEAR FROM TIMESTAMP '2000-12-16 12:21:13'); 2000 ``` Supported field same as [Hive](https://github.com/apache/hive/blob/rel/release-2.3.3/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g#L308-L316): `YEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `DAYOFWEEK`, `HOUR`, `MINUTE`, `SECOND`. ## How was this patch tested? unit tests Author: Yuming Wang <[email protected]> Closes #21479 from wangyum/SPARK-23903. * [SPARK-21896][SQL] Fix StackOverflow caused by window functions inside aggregate functions ## What changes were proposed in this pull request? This PR explicitly prohibits window functions inside aggregates. Currently, this will cause StackOverflow during analysis. See PR #19193 for previous discussion. ## How was this patch tested? This PR comes with a dedicated unit test. Author: aokolnychyi <[email protected]> Closes #21473 from aokolnychyi/fix-stackoverflow-window-funcs. * [SPARK-24290][ML] add support for Array input for instrumentation.logNamedValue ## What changes were proposed in this pull request? Extend instrumentation.logNamedValue to support Array input change the logging for "clusterSizes" to new method ## How was this patch tested? N/A Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Lu WANG <[email protected]> Closes #21347 from ludatabricks/SPARK-24290. * [SPARK-24300][ML] change the way to set seed in ml.cluster.LDASuite.generateLDAData ## What changes were proposed in this pull request? Using different RNG in all different partitions. ## How was this patch tested? manually Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Lu WANG <[email protected]> Closes #21492 from ludatabricks/SPARK-24300. * [SPARK-24215][PYSPARK] Implement _repr_html_ for dataframes in PySpark ## What changes were proposed in this pull request? Implement `_repr_html_` for PySpark while in notebook and add config named "spark.sql.repl.eagerEval.enabled" to control this. The dev list thread for context: http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html ## How was this patch tested? New ut in DataFrameSuite and manual test in jupyter. Some screenshot below. **After:** ![image](https://user-images.githubusercontent.com/4833765/40268422-8db5bef0-5b9f-11e8-80f1-04bc654a4f2c.png) **Before:** ![image](https://user-images.githubusercontent.com/4833765/40268431-9f92c1b8-5b9f-11e8-9db9-0611f0940b26.png) Author: Yuanjian Li <[email protected]> Closes #21370 from xuanyuanking/SPARK-24215. * [SPARK-16451][REPL] Fail shell if SparkSession fails to start. Currently, in spark-shell, if the session fails to start, the user sees a bunch of unrelated errors which are caused by code in the shell initialization that references the "spark" variable, which does not exist in that case. Things like: ``` <console>:14: error: not found: value spark import spark.sql ``` The user is also left with a non-working shell (unless they want to just write non-Spark Scala or Python code, that is). This change fails the whole shell session at the point where the failure occurs, so that the last error message is the one with the actual information about the failure. For the python error handling, I moved the session initialization code to session.py, so that traceback.print_exc() only shows the last error. Otherwise, the printed exception would contain all previous exceptions with a message "During handling of the above exception, another exception occurred", making the actual error kinda hard to parse. Tested with spark-shell, pyspark (with 2.7 and 3.5), by forcing an error during SparkContext initialization. Author: Marcelo Vanzin <[email protected]> Closes #21368 from vanzin/SPARK-16451. * [SPARK-15784] Add Power Iteration Clustering to spark.ml ## What changes were proposed in this pull request? According to the discussion on JIRA. I rewrite the Power Iteration Clustering API in `spark.ml`. ## How was this patch tested? Unit test. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: WeichenXu <[email protected]> Closes #21493 from WeichenXu123/pic_api. * [SPARK-24453][SS] Fix error recovering from the failure in a no-data batch ## What changes were proposed in this pull request? The error occurs when we are recovering from a failure in a no-data batch (say X) that has been planned (i.e. written to offset log) but not executed (i.e. not written to commit log). Upon recovery the following sequence of events happen. 1. `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. Since there was no data in the batch, the `availableOffsets` is same as `committedOffsets`, so `isNewDataAvailable` is `false`. 2. When `MicroBatchExecution.constructNextBatch` is called, ideally it should immediately return true because the next batch has already been constructed. However, the check of whether the batch has been constructed was `if (isNewDataAvailable) return true`. Since the planned batch is a no-data batch, it escaped this check and proceeded to plan the same batch X *once again*. The solution is to have an explicit flag that signifies whether a batch has already been constructed or not. `populateStartOffsets` is going to set the flag appropriately. ## How was this patch tested? new unit test Author: Tathagata Das <[email protected]> Closes #21491 from tdas/SPARK-24453. * [SPARK-22384][SQL] Refine partition pruning when attribute is wrapped in Cast ## What changes were proposed in this pull request? Sql below will get all partitions from metastore, which put much burden on metastore; ``` CREATE TABLE `partition_test`(`col` int) PARTITIONED BY (`pt` byte) SELECT * FROM partition_test WHERE CAST(pt AS INT)=1 ``` The reason is that the the analyzed attribute `dt` is wrapped in `Cast` and `HiveShim` fails to generate a proper partition filter. This pr proposes to take `Cast` into consideration when generate partition filter. ## How was this patch tested? Test added. This pr proposes to use analyzed expressions in `HiveClientSuite` Author: jinxing <[email protected]> Closes #19602 from jinxing64/SPARK-22384. * [SPARK-24187][R][SQL] Add array_join function to SparkR ## What changes were proposed in this pull request? This PR adds array_join function to SparkR ## How was this patch tested? Add unit test in test_sparkSQL.R Author: Huaxin Gao <[email protected]> Closes #21313 from huaxingao/spark-24187. * [SPARK-23803][SQL] Support bucket pruning ## What changes were proposed in this pull request? support bucket pruning when filtering on a single bucketed column on the following predicates - EqualTo, EqualNullSafe, In, And/Or predicates ## How was this patch tested? refactored unit tests to test the above. based on gatorsmile work in https://github.com/apache/spark/commit/e3c75c6398b1241500343ff237e9bcf78b5396f9 Author: Asher Saban <[email protected]> Author: asaban <[email protected]> Closes #20915 from sabanas/filter-prune-buckets. * [SPARK-24119][SQL] Add interpreted execution to SortPrefix expression ## What changes were proposed in this pull request? Implemented eval in SortPrefix expression. ## How was this patch tested? - ran existing sbt SQL tests - added unit test - ran existing Python SQL tests - manual tests: disabling codegen -- patching code to disable beyond what spark.sql.codegen.wholeStage=false can do -- and running sbt SQL tests Author: Bruce Robbins <[email protected]> Closes #21231 from bersprockets/sortprefixeval. * [SPARK-24224][ML-EXAMPLES] Java example code for Power Iteration Clustering in spark.ml ## What changes were proposed in this pull request? Java example code for Power Iteration Clustering in spark.ml ## How was this patch tested? Locally tested Author: Shahid <[email protected]> Closes #21283 from shahidki31/JavaPicExample. * [SPARK-24191][ML] Scala Example code for Power Iteration Clustering ## What changes were proposed in this pull request? Added example code for Power Iteration Clustering in Spark ML examples Author: Shahid <[email protected]> Closes #21248 from shahidki31/sparkCommit. * [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule in ml/__init__.py and add ImageSchema into __all__ ## What changes were proposed in this pull request? This PR attaches submodules to ml's `__init__.py` module. Also, adds `ImageSchema` into `image.py` explicitly. ## How was this patch tested? Before: ```python >>> from pyspark import ml >>> ml.image Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'module' object has no attribute 'image' >>> ml.image.ImageSchema Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'module' object has no attribute 'image' ``` ```python >>> "image" in globals() False >>> from pyspark.ml import * >>> "image" in globals() False >>> image Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'image' is not defined ``` After: ```python >>> from pyspark import ml >>> ml.image <module 'pyspark.ml.image' from '/.../spark/python/pyspark/ml/image.pyc'> >>> ml.image.ImageSchema <pyspark.ml.image._ImageSchema object at 0x10d973b10> ``` ```python >>> "image" in globals() False >>> from pyspark.ml import * >>> "image" in globals() True >>> image <module 'pyspark.ml.image' from #'/.../spark/python/pyspark/ml/image.pyc'> ``` Author: hyukjinkwon <[email protected]> Closes #21483 from HyukjinKwon/SPARK-24454. * [SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s ## What changes were proposed in this pull request? Introducing Python Bindings for PySpark. - [x] Running PySpark Jobs - [x] Increased Default Memory Overhead value - [ ] Dependency Management for virtualenv/conda ## How was this patch tested? This patch was tested with - [x] Unit Tests - [x] Integration tests with [this addition](https://github.com/apache-spark-on-k8s/spark-integration/pull/46) ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run SparkPi with a test secret mounted into the driver and executor pods - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example Run completed in 4 minutes, 28 seconds. Total number of tests run: 11 Suites: completed 2, aborted 0 Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Author: Ilan Filonenko <[email protected]> Author: Ilan Filonenko <[email protected]> Closes #21092 from ifilonenko/master. * [SPARK-17756][PYTHON][STREAMING] Workaround to avoid return type mismatch in PythonTransformFunction ## What changes were proposed in this pull request? This PR proposes to wrap the transformed rdd within `TransformFunction`. `PythonTransformFunction` looks requiring to return `JavaRDD` in `_jrdd`. https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/streaming/util.py#L67 https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/streaming/src/main/scala/org/apache/spark/streaming/api/python/PythonDStream.scala#L43 However, this could be `JavaPairRDD` by some APIs, for example, `zip` in PySpark's RDD API. `_jrdd` could be checked as below: ```python >>> rdd.zip(rdd)._jrdd.getClass().toString() u'class org.apache.spark.api.java.JavaPairRDD' ``` So, here, I wrapped it with `map` so that it ensures returning `JavaRDD`. ```python >>> rdd.zip(rdd).map(lambda x: x)._jrdd.getClass().toString() u'class org.apache.spark.api.java.JavaRDD' ``` I tried to elaborate some failure cases as below: ```python from pyspark.streaming import StreamingContext ssc = StreamingContext(spark.sparkContext, 10) ssc.queueStream([sc.range(10)]) \ .transform(lambda rdd: rdd.cartesian(rdd)) \ .pprint() ssc.start() ``` ```python from pyspark.streaming import StreamingContext ssc = StreamingContext(spark.sparkContext, 10) ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.cartesian(rdd)) ssc.start() ``` ```python from pyspark.streaming import StreamingContext ssc = StreamingContext(spark.sparkContext, 10) ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd)) ssc.start() ``` ```python from pyspark.streaming import StreamingContext ssc = StreamingContext(spark.sparkContext, 10) ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd).union(rdd.zip(rdd))) ssc.start() ``` ```python from pyspark.streaming import StreamingContext ssc = StreamingContext(spark.sparkContext, 10) ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd).coalesce(1)) ssc.start() ``` ## How was this patch tested? Unit tests were added in `python/pyspark/streaming/tests.py` and manually tested. Author: hyukjinkwon <[email protected]> Closes #19498 from HyukjinKwon/SPARK-17756. * [SPARK-23010][K8S] Initial checkin of k8s integration tests. These tests were developed in the https://github.com/apache-spark-on-k8s/spark-integration repo by several contributors. This is a copy of the current state into the main apache spark repo. The only changes from the current spark-integration repo state are: * Move the files from the repo root into resource-managers/kubernetes/integration-tests * Add a reference to these tests in the root README.md * Fix a path reference in dev/dev-run-integration-tests.sh * Add a TODO in include/util.sh ## What changes were proposed in this pull request? Incorporation of Kubernetes integration tests. ## How was this patch tested? This code has its own unit tests, but the main purpose is to provide the integration tests. I tested this on my laptop by running dev/dev-run-integration-tests.sh --spark-tgz ~/spark-2.4.0-SNAPSHOT-bin--.tgz The spark-integration tests have already been running for months in AMPLab, here is an example: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-scheduled-spark-integration-master/ Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Sean Suchter <[email protected]> Author: Sean Suchter <[email protected]> Closes #20697 from ssuchter/ssuchter-k8s-integration-tests. * [SPARK-24412][SQL] Adding docs about automagical type casting in `isin` and `isInCollection` APIs ## What changes were proposed in this pull request? Update documentation for `isInCollection` API to clealy explain the "auto-casting" of elements if their types are different. ## How was this patch tested? No-Op Author: Thiruvasakan Paramasivan <[email protected]> Closes #21519 from trvskn/sql-doc-update. * [SPARK-24468][SQL] Handle negative scale when adjusting precision for decimal operations ## What changes were proposed in this pull request? In SPARK-22036 we introduced the possibility to allow precision loss in arithmetic operations (according to the SQL standard). The implementation was drawn from Hive's one, where Decimals with a negative scale are not allowed in the operations. The PR handles the case when the scale is negative, removing the assertion that it is not. ## How was this patch tested? added UTs Author: Marco Gaido <[email protected]> Closes #21499 from mgaido91/SPARK-24468. * [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor ## What changes were proposed in this pull request? SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the `StopIteration` bug in the worker ## How does this work? The root of the problem is that when an user-supplied function raises a `StopIteration`, pyspark might stop processing data, if this function is used in a for-loop. The solution is to catch `StopIteration`s exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done using the `fail_on_stopiteration` wrapper, in different ways depending on where the function is used: - In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself. - In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, but passing this signature to the worker requires a not so nice hack. ## How was this patch tested? Same tests, plus tests for pandas UDFs Author: edorigatti <[email protected]> Closes #21467 from e-dorigatti/fix_udf_hack. * [SPARK-19826][ML][PYTHON] add spark.ml Python API for PIC ## What changes were proposed in this pull request? add spark.ml Python API for PIC ## How was this patch tested? add doctest Author: Huaxin Gao <[email protected]> Closes #21513 from huaxingao/spark--19826. * [MINOR][CORE] Log committer class used by HadoopMapRedCommitProtocol ## What changes were proposed in this pull request? When HadoopMapRedCommitProtocol is used (e.g., when using saveAsTextFile() or saveAsHadoopFile() with RDDs), it's not easy to determine which output committer class was used, so this PR simply logs the class that was used, similarly to what is done in SQLHadoopMapReduceCommitProtocol. ## How was this patch tested? Built Spark then manually inspected logging when calling saveAsTextFile(): ```scala scala> sc.setLogLevel("INFO") scala> sc.textFile("README.md").saveAsTextFile("/tmp/out") ... 18/05/29 10:06:20 INFO HadoopMapRedCommitProtocol: Using output committer class org.apache.hadoop.mapred.FileOutputCommitter ``` Author: Jonathan Kelly <[email protected]> Closes #21452 from ejono/master. * [SPARK-24520] Double braces in documentations There are double braces in the markdown, which break the link. ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Fokko Driesprong <[email protected]> Closes #21528 from Fokko/patch-1. * [SPARK-24134][DOCS] A missing full-stop in doc "Tuning Spark". ## What changes were proposed in this pull request? In the document [Tuning Spark -> Determining Memory Consumption](https://spark.apache.org/docs/latest/tuning.html#determining-memory-consumption), a full stop was missing in the second paragraph. It's `...use SizeEstimator’s estimate method This is useful for experimenting...`, while there is supposed to be a full stop before `This`. Screenshot showing before change is attached below. <img width="1033" alt="screen shot 2018-05-01 at 5 22 32 pm" src="https://user-images.githubusercontent.com/11539188/39468206-778e3d8a-4d64-11e8-8a92-38464952b54b.png"> ## How was this patch tested? This is a simple change in doc. Only one full stop was added in plain text. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Xiaodong <[email protected]> Closes #21205 from XD-DENG/patch-1. * [SPARK-22144][SQL] ExchangeCoordinator combine the partitions of an 0 sized pre-shuffle to 0 ## What changes were proposed in this pull request? when the length of pre-shuffle's partitions is 0, the length of post-shuffle's partitions should be 0 instead of spark.sql.shuffle.partitions. ## How was this patch tested? ExchangeCoordinator converted a pre-shuffle that partitions is 0 to a post-shuffle that partitions is 0 instead of one that partitions is spark.sql.shuffle.partitions. Author: liutang123 <[email protected]> Closes #19364 from liutang123/SPARK-22144. * [SPARK-23732][DOCS] Fix source links in generated scaladoc. Apply the suggestion on the bug to fix source links. Tested with the 2.3.1 release docs. Author: Marcelo Vanzin <[email protected]> Closes #21521 from vanzin/SPARK-23732. * [SPARK-24502][SQL] flaky test: UnsafeRowSerializerSuite ## What changes were proposed in this pull request? `UnsafeRowSerializerSuite` calls `UnsafeProjection.create` which accesses `SQLConf.get`, while the current active SparkSession may already be stopped, and we may hit exception like this ``` sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped. at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97) at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80) at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93) at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120) at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286) at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42) at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41) at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95) at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94) at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126) at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54) at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157) at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150) at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54) at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49) at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63) at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60) ... ``` ## How was this patch tested? N/A Author: Wenchen Fan <[email protected]> Closes #21518 from cloud-fan/test. * docs: fix typo no => no[t] ## What changes were proposed in this pull request? Fixing a typo. ## How was this patch tested? Visual check of the docs. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Tom Saleeba <[email protected]> Closes #21496 from tomsaleeba/patch-1. * [SPARK-15064][ML] Locale support in StopWordsRemover ## What changes were proposed in this pull request? Add locale support for `StopWordsRemover`. ## How was this patch tested? [Scala|Python] unit tests. Author: Lee Dongjin <[email protected]> Closes #21501 from dongjinleekr/feature/SPARK-15064. * [SPARK-24531][TESTS] Remove version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? Removing version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite as it is not present anymore in the mirrors and this is blocking all the open PRs. ## How was this patch tested? running UTs Author: Marco Gaido <[email protected]> Closes #21540 from mgaido91/SPARK-24531. * [SPARK-24416] Fix configuration specification for killBlacklisted executors ## What changes were proposed in this pull request? spark.blacklist.killBlacklistedExecutors is defined as (Experimental) If set to "true", allow Spark to automatically kill, and attempt to re-create, executors when they are blacklisted. Note that, when an entire node is added to the blacklist, all of the executors on that node will be killed. I presume the killing of blacklisted executors only happens after the stage completes successfully and all tasks have completed or on fetch failures (updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). It is confusing because the definition states that the executor will be attempted to be recreated as soon as it is blacklisted. This is not true while the stage is in progress and an executor is blacklisted, it will not attempt to cleanup until the stage finishes. Author: Sanket Chintapalli <[email protected]> Closes #21475 from redsanket/SPARK-24416. * [SPARK-23931][SQL] Adds arrays_zip function to sparksql Signed-off-by: DylanGuedes <djmgguedesgmail.com> ## What changes were proposed in this pull request? Addition of arrays_zip function to spark sql functions. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Unit tests that checks if the results are correct. Author: DylanGuedes <[email protected]> Closes #21045 from DylanGuedes/SPARK-23931. * [SPARK-24216][SQL] Spark TypedAggregateExpression uses getSimpleName that is not safe in scala ## What changes were proposed in this pull request? When user create a aggregator object in scala and pass the aggregator to Spark Dataset's agg() method, Spark's will initialize TypedAggregateExpression with the nodeName field as aggregator.getClass.getSimpleName. However, getSimpleName is not safe in scala environment, depending on how user creates the aggregator object. For example, if the aggregator class full qualified name is "com.my.company.MyUtils$myAgg$2$", the getSimpleName will throw java.lang.InternalError "Malformed class name". This has been reported in scalatest https://github.com/scalatest/scalatest/pull/1044 and discussed in many scala upstream jiras such as SI-8110, SI-5425. To fix this issue, we follow the solution in https://github.com/scalatest/scalatest/pull/1044 to add safer version of getSimpleName as a util method, and TypedAggregateExpression will invoke this util method rather than getClass.getSimpleName. ## How was this patch tested? added unit test Author: Fangshi Li <[email protected]> Closes #21276 from fangshil/SPARK-24216. * [SPARK-23933][SQL] Add map_from_arrays function ## What changes were proposed in this pull request? The PR adds the SQL function `map_from_arrays`. The behavior of the function is based on Presto's `map`. Since SparkSQL already had a `map` function, we prepared the different name for this behavior. This function returns returns a map from a pair of arrays for keys and values. ## How was this patch tested? Added UTs Author: Kazuaki Ishizaki <[email protected]> Closes #21258 from kiszk/SPARK-23933. * [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failure of kubernetes-integration-tests ## What changes were proposed in this pull request? Fix java checkstyle failure of kubernetes-integration-tests ## How was this patch tested? Checked manually on my local environment. Author: Xingbo Jiang <[email protected]> Closes #21545 from jiangxb1987/k8s-checkstyle. * [SPARK-24506][UI] Add UI filters to tabs added after binding ## What changes were proposed in this pull request? Currently, `spark.ui.filters` are not applied to the handlers added after binding the server. This means that every page which is added after starting the UI will not have the filters configured on it. This can allow unauthorized access to the pages. The PR adds the filters also to the handlers added after the UI starts. ## How was this patch tested? manual tests (without the patch, starting the thriftserver with `--conf spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter --conf spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"` you can access `http://localhost:4040/sqlserver`; with the patch, 401 is the response as for the other pages). Author: Marco Gaido <[email protected]> Closes #21523 from mgaido91/SPARK-24506. * [SPARK-22239][SQL][PYTHON] Enable grouped aggregate pandas UDFs as window functions with unbounded window frames ## What changes were proposed in this pull request? This PR enables using a grouped aggregate pandas UDFs as window functions. The semantics is the same as using SQL aggregation function as window functions. ``` >>> from pyspark.sql.functions import pandas_udf, PandasUDFType >>> from pyspark.sql import Window >>> df = spark.createDataFrame( ... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ... ("id", "v")) >>> pandas_udf("double", PandasUDFType.GROUPED_AGG) ... def mean_udf(v): ... return v.mean() >>> w = Window.partitionBy('id') >>> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show…

…nerJoinSuite ## What changes were proposed in this pull request? Similar with #21446. Looks random string is not quite safe as a directory name. ```scala scala> val prefix = Random.nextString(10); val dir = new File("/tmp", "del_" + prefix + "-" + UUID.randomUUID.toString); dir.mkdirs() prefix: String = 窽텘⒘駖ⵚ駢⡞Ρ닋੎ dir: java.io.File = /tmp/del_窽텘⒘駖ⵚ駢⡞Ρ닋੎-a3f99855-c429-47a0-a108-47bca6905745 res40: Boolean = false // nope, didn't like this one ``` ## How was this patch tested? Unit test was added, and manually. Closes #23405 from HyukjinKwon/SPARK-26496. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit e63243d) Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-25837][CORE] Fix potential slowdown in AppStatusListener when cleaning up stages ## What changes were proposed in this pull request? * Update `AppStatusListener` `cleanupStages` method to remove tasks for those stages in a single pass instead of 1 for each stage. * This fixes an issue where the cleanupStages method would get backed up, causing a backup in the executor in ElementTrackingStore, resulting in stages and jobs not getting cleaned up properly. Tasks seem most susceptible to this as there are a lot of them, however a similar issue could arise in other locations the `KVStore` `view` method is used. A broader fix might involve updates to `KVStoreView` and `InMemoryView` as it appears this interface and implementation can lead to multiple and inefficient traversals of the stored data. ## How was this patch tested? Using existing tests in AppStatusListenerSuite This is my original work and I license the work to the project under the project’s open source license. Closes apache#22883 from patrickbrownsync/cleanup-stages-fix. Authored-by: Patrick Brown <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]> (cherry picked from commit e9d3ca0) Signed-off-by: Marcelo Vanzin <[email protected]> * [SPARK-25933][DOCUMENTATION] Fix pstats.Stats() reference in configuration.md ## What changes were proposed in this pull request? Change ptats.Stats() to pstats.Stats() for `spark.python.profile.dump` in configuration.md. ## How was this patch tested? Doc test Closes apache#22933 from AlexHagerman/doc_fix. Authored-by: Alex Hagerman <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 1a7abf3) Signed-off-by: Sean Owen <[email protected]> * [SPARK-26011][SPARK-SUBMIT] Yarn mode pyspark app without python main resource does not honor "spark.jars.packages" SparkSubmit determines pyspark app by the suffix of primary resource but Livy uses "spark-internal" as the primary resource when calling spark-submit, therefore args.isPython is set to false in SparkSubmit.scala. In Yarn mode, SparkSubmit module is responsible for resolving maven coordinates and adding them to "spark.submit.pyFiles" so that python's system path can be set correctly. The fix is to resolve maven coordinates not only when args.isPython is true, but also when primary resource is spark-internal. Tested the patch with Livy submitting pyspark app, spark-submit, pyspark with or without packages config. Signed-off-by: Shanyu Zhao <shzhaomicrosoft.com> Closes apache#23009 from shanyu/shanyu-26011. Authored-by: Shanyu Zhao <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 9a5fda6) Signed-off-by: Sean Owen <[email protected]> * [SPARK-25934][MESOS] Don't propagate SPARK_CONF_DIR from spark submit ## What changes were proposed in this pull request? Don't propagate SPARK_CONF_DIR to the driver in mesos cluster mode. ## How was this patch tested? I built the 2.3.2 tag with this patch added and deployed a test job to a mesos cluster to confirm that the incorrect SPARK_CONF_DIR was no longer passed from the submit command. Closes apache#22937 from mpmolek/fix-conf-dir. Authored-by: Matt Molek <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 696b75a) Signed-off-by: Sean Owen <[email protected]> * [MINOR][SQL] Fix typo in CTAS plan database string ## What changes were proposed in this pull request? Since [Spark 1.6.0](apache@56d7da1#diff-6f38a103058a6e233b7ad80718452387R96), there was a redundant '}' character in CTAS string plan's database argument string; `default}`. This PR aims to fix it. **BEFORE** ```scala scala> sc.version res1: String = 1.6.0 scala> sql("create table t as select 1").explain == Physical Plan == ExecutedCommand CreateTableAsSelect [Database:default}, TableName: t, InsertIntoHiveTable] +- Project [1 AS _c0#3] +- OneRowRelation$ ``` **AFTER** ```scala scala> sql("create table t as select 1").explain == Physical Plan == Execute CreateHiveTableAsSelectCommand CreateHiveTableAsSelectCommand [Database:default, TableName: t, InsertIntoHiveTable] +- *(1) Project [1 AS 1#4] +- Scan OneRowRelation[] ``` ## How was this patch tested? Manual. Closes apache#23064 from dongjoon-hyun/SPARK-FIX. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: hyukjinkwon <[email protected]> (cherry picked from commit b538c44) Signed-off-by: hyukjinkwon <[email protected]> * [SPARK-26084][SQL] Fixes unresolved AggregateExpression.references exception ## What changes were proposed in this pull request? This PR fixes an exception in `AggregateExpression.references` called on unresolved expressions. It implements the solution proposed in [SPARK-26084](https://issues.apache.org/jira/browse/SPARK-26084), a minor refactoring that removes the unnecessary dependence on `AttributeSet.toSeq`, which requires expression IDs and, therefore, can only execute successfully for resolved expressions. The refactored implementation is both simpler and faster, eliminating the conversion of a `Set` to a `Seq` and back to `Set`. ## How was this patch tested? Added a new test based on the failing case in [SPARK-26084](https://issues.apache.org/jira/browse/SPARK-26084). hvanhovell Closes apache#23075 from ssimeonov/ss_SPARK-26084. Authored-by: Simeon Simeonov <[email protected]> Signed-off-by: Herman van Hovell <[email protected]> (cherry picked from commit db136d3) Signed-off-by: Herman van Hovell <[email protected]> * [SPARK-26109][WEBUI] Duration in the task summary metrics table and the task table are different ## What changes were proposed in this pull request? Task summary table displays the summary of the task table in the stage page. However, the 'Duration' metrics of 'task summary' table and 'task table' are not matching. The reason is because, in the 'task summary' we display 'executorRunTime' as the duration, and in the 'task table' the actual duration of the task. Except duration metrics, all other metrics are properly displaying in the task summary. In Spark2.2, used to show 'executorRunTime' as duration in the 'taskTable'. That is why, in summary metrics also the 'exeuctorRunTime' shows as the duration. So, we need to show 'executorRunTime' as the duration in the tasks table to follow the same behaviour as the previous versions of spark. ## How was this patch tested? Before patch: ![screenshot from 2018-11-19 04-32-06](https://user-images.githubusercontent.com/23054875/48679263-1e4fff80-ebb4-11e8-9ed5-16d892039e01.png) After patch: ![screenshot from 2018-11-19 04-37-39](https://user-images.githubusercontent.com/23054875/48679343-e39a9700-ebb4-11e8-8df9-9dc3a28d4bce.png) Closes apache#23081 from shahidki31/duratinSummary. Authored-by: Shahid <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 540afc2) Signed-off-by: Sean Owen <[email protected]> * [SPARK-26118][BACKPORT-2.3][WEB UI] Introducing spark.ui.requestHeaderSize for setting HTTP requestHeaderSize ## What changes were proposed in this pull request? Introducing spark.ui.requestHeaderSize for configuring Jetty's HTTP requestHeaderSize. This way long authorization field does not lead to HTTP 413. ## How was this patch tested? Manually with curl (which version must be at least 7.55). With the original default value (8k limit): ```bash $ ./sbin/start-history-server.sh starting org.apache.spark.deploy.history.HistoryServer, logging to /Users/attilapiros/github/spark/logs/spark-attilapiros-org.apache.spark.deploy.history.HistoryServer-1-apiros-MBP.lan.out $ echo -n "X-Custom-Header: " > cookie $ printf 'A%.0s' {1..9500} >> cookie $ curl -H cookie http://458apiros-MBP.lan:18080/ <h1>Bad Message 431</h1><pre>reason: Request Header Fields Too Large</pre> $ tail -1 /Users/attilapiros/github/spark/logs/spark-attilapiros-org.apache.spark.deploy.history.HistoryServer-1-apiros-MBP.lan.out 18/11/19 21:24:28 WARN HttpParser: Header is too large 8193>8192 ``` After: ```bash $ echo spark.ui.requestHeaderSize=10000 > history.properties $ ./sbin/start-history-server.sh --properties-file history.properties starting org.apache.spark.deploy.history.HistoryServer, logging to /Users/attilapiros/github/spark/logs/spark-attilapiros-org.apache.spark.deploy.history.HistoryServer-1-apiros-MBP.lan.out $ curl -H cookie http://458apiros-MBP.lan:18080/ <!DOCTYPE html><html> <head>... <link rel="shortcut icon" href="/static/spark-logo-77x50px-hd.png"></link> <title>History Server</title> </head> <body> ... ``` (cherry picked from commit ab61ddb) Closes apache#23114 from attilapiros/julianOffByDays-2.3. Authored-by: “attilapiros” <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> * [SPARK-25786][CORE] If the ByteBuffer.hasArray is false , it will throw UnsupportedOperationException for Kryo `deserialize` for kryo, the type of input parameter is ByteBuffer, if it is not backed by an accessible byte array. it will throw `UnsupportedOperationException` Exception Info: ``` java.lang.UnsupportedOperationException was thrown. java.lang.UnsupportedOperationException at java.nio.ByteBuffer.array(ByteBuffer.java:994) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362) ``` Added a unit test Closes apache#22779 from 10110346/InputStreamKryo. Authored-by: liuxian <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 7f5f7a9) Signed-off-by: Sean Owen <[email protected]> * [SPARK-26137][CORE] Use Java system property "file.separator" inste… … of hard coded "/" in DependencyUtils ## What changes were proposed in this pull request? Use Java system property "file.separator" instead of hard coded "/" in DependencyUtils. ## How was this patch tested? Manual test: Submit Spark application via REST API that reads data from Elasticsearch using spark-elasticsearch library. Without fix application fails with error: 18/11/22 10:36:20 ERROR Version: Multiple ES-Hadoop versions detected in the classpath; please use only one jar:file:/C:/<...>/spark-2.4.0-bin-hadoop2.6/work/driver-20181122103610-0001/myApp-assembly-1.0.jar jar:file:/C:/<...>/myApp-assembly-1.0.jar 18/11/22 10:36:20 ERROR Main: Application [MyApp] failed: java.lang.Error: Multiple ES-Hadoop versions detected in the classpath; please use only one jar:file:/C:/<...>/spark-2.4.0-bin-hadoop2.6/work/driver-20181122103610-0001/myApp-assembly-1.0.jar jar:file:/C:/<...>/myApp-assembly-1.0.jar at org.elasticsearch.hadoop.util.Version.<clinit>(Version.java:73) at org.elasticsearch.hadoop.rest.RestService.findPartitions(RestService.java:214) at org.elasticsearch.spark.rdd.AbstractEsRDD.esPartitions$lzycompute(AbstractEsRDD.scala:73) at org.elasticsearch.spark.rdd.AbstractEsRDD.esPartitions(AbstractEsRDD.scala:72) at org.elasticsearch.spark.rdd.AbstractEsRDD.getPartitions(AbstractEsRDD.scala:44) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.collect(RDD.scala:944) ... at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) With fix application runs successfully. Closes apache#23102 from markpavey/JIRA_SPARK-26137_DependencyUtilsFileSeparatorFix. Authored-by: Mark Pavey <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit ce61bac) Signed-off-by: Sean Owen <[email protected]> * [SPARK-26211][SQL] Fix InSet for binary, and struct and array with null. Currently `InSet` doesn't work properly for binary type, or struct and array type with null value in the set. Because, as for binary type, the `HashSet` doesn't work properly for `Array[Byte]`, and as for struct and array type with null value in the set, the `ordering` will throw a `NPE`. Added a few tests. Closes apache#23176 from ueshin/issues/SPARK-26211/inset. Authored-by: Takuya UESHIN <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit b9b68a6) Signed-off-by: Wenchen Fan <[email protected]> * [SPARK-26201] Fix python broadcast with encryption ## What changes were proposed in this pull request? Python with rpc and disk encryption enabled along with a python broadcast variable and just read the value back on the driver side the job failed with: Traceback (most recent call last): File "broadcast.py", line 37, in <module> words_new.value File "/pyspark.zip/pyspark/broadcast.py", line 137, in value File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path File "pyspark.zip/pyspark/broadcast.py", line 128, in load EOFError: Ran out of input To reproduce use configs: --conf spark.network.crypto.enabled=true --conf spark.io.encryption.enabled=true Code: words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"]) words_new.value print(words_new.value) ## How was this patch tested? words_new = sc.broadcast([“scala”, “java”, “hadoop”, “spark”, “akka”]) textFile = sc.textFile(“README.md”) wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word + words_new.value[1], 1)).reduceByKey(lambda a, b: a+b) count = wordCounts.count() print(count) words_new.value print(words_new.value) Closes apache#23166 from redsanket/SPARK-26201. Authored-by: schintap <[email protected]> Signed-off-by: Thomas Graves <[email protected]> (cherry picked from commit 9b23be2) Signed-off-by: Thomas Graves <[email protected]> * [MINOR][DOC] Correct some document description errors Correct some document description errors. N/A Closes apache#23162 from 10110346/docerror. Authored-by: liuxian <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 60e4239) Signed-off-by: Sean Owen <[email protected]> * [SPARK-26198][SQL] Fix Metadata serialize null values throw NPE How to reproduce this issue: ```scala scala> val meta = new org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json java.lang.NullPointerException at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) ``` This pr fix `NullPointerException` when `Metadata` serialize `null` values. unit tests Closes apache#23164 from wangyum/SPARK-26198. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 676bbb2) Signed-off-by: Sean Owen <[email protected]> * [SPARK-26233][SQL][BACKPORT-2.3] CheckOverflow when encoding a decimal value ## What changes were proposed in this pull request? When we encode a Decimal from external source we don't check for overflow. That method is useful not only in order to enforce that we can represent the correct value in the specified range, but it also changes the underlying data to the right precision/scale. Since in our code generation we assume that a decimal has exactly the same precision and scale of its data type, missing to enforce it can lead to corrupted output/results when there are subsequent transformations. ## How was this patch tested? added UT Closes apache#23233 from mgaido91/SPARK-26233_2.3. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> * [SPARK-26307][SQL] Fix CTAS when INSERT a partitioned table using Hive serde This is a Spark 2.3 regression introduced in apache#20521. We should add the partition info for InsertIntoHiveTable in CreateHiveTableAsSelectCommand. Otherwise, we will hit the following error by running the newly added test case: ``` [info] - CTAS: INSERT a partitioned table using Hive serde *** FAILED *** (829 milliseconds) [info] org.apache.spark.SparkException: Requested partitioning does not match the tab1 table: [info] Requested partitions: [info] Table partitions: part [info] at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:179) [info] at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:107) ``` Added a test case. Closes apache#23255 from gatorsmile/fixCTAS. Authored-by: gatorsmile <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 3bc83de) Signed-off-by: Wenchen Fan <[email protected]> * [SPARK-26327][SQL][BACKPORT-2.3] Bug fix for `FileSourceScanExec` metrics update ## What changes were proposed in this pull request? Backport apache#23277 to branch 2.3 without the metrics renaming. ## How was this patch tested? New test case in `SQLMetricsSuite`. Closes apache#23299 from xuanyuanking/SPARK-26327-2.3. Authored-by: Yuanjian Li <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> * [SPARK-26315][PYSPARK] auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel ## What changes were proposed in this pull request? If the input parameter 'threshold' to the function approxSimilarityJoin is not a float, we would get an exception. The fix is to convert the 'threshold' into a float before calling the java implementation method. ## How was this patch tested? Added a new test case. Without this fix, the test will throw an exception as reported in the JIRA. With the fix, the test passes. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes apache#23313 from jerryjch/SPARK-26315. Authored-by: Jing Chen He <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 860f449) Signed-off-by: Sean Owen <[email protected]> * [SPARK-26352][SQL] join reorder should not change the order of output attributes ## What changes were proposed in this pull request? The optimizer rule `org.apache.spark.sql.catalyst.optimizer.ReorderJoin` performs join reordering on inner joins. This was introduced from SPARK-12032 (apache#10073) in 2015-12. After it had reordered the joins, though, it didn't check whether or not the output attribute order is still the same as before. Thus, it's possible to have a mismatch between the reordered output attributes order vs the schema that a DataFrame thinks it has. The same problem exists in the CBO version of join reordering (`CostBasedJoinReorder`) too. This can be demonstrated with the example: ```scala spark.sql("create table table_a (x int, y int) using parquet") spark.sql("create table table_b (i int, j int) using parquet") spark.sql("create table table_c (a int, b int) using parquet") val df = spark.sql(""" with df1 as (select * from table_a cross join table_b) select * from df1 join table_c on a = x and b = i """) ``` here's what the DataFrame thinks: ``` scala> df.printSchema root |-- x: integer (nullable = true) |-- y: integer (nullable = true) |-- i: integer (nullable = true) |-- j: integer (nullable = true) |-- a: integer (nullable = true) |-- b: integer (nullable = true) ``` here's what the optimized plan thinks, after join reordering: ``` scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- ${a.name}: ${a.dataType.typeName}")) |-- x: integer |-- y: integer |-- a: integer |-- b: integer |-- i: integer |-- j: integer ``` If we exclude the `ReorderJoin` rule (using Spark 2.4's optimizer rule exclusion feature), it's back to normal: ``` scala> spark.conf.set("spark.sql.optimizer.excludedRules", "org.apache.spark.sql.catalyst.optimizer.ReorderJoin") scala> val df = spark.sql("with df1 as (select * from table_a cross join table_b) select * from df1 join table_c on a = x and b = i") df: org.apache.spark.sql.DataFrame = [x: int, y: int ... 4 more fields] scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- ${a.name}: ${a.dataType.typeName}")) |-- x: integer |-- y: integer |-- i: integer |-- j: integer |-- a: integer |-- b: integer ``` Note that this output attribute ordering problem leads to data corruption, and can manifest itself in various symptoms: * Silently corrupting data, if the reordered columns happen to either have matching types or have sufficiently-compatible types (e.g. all fixed length primitive types are considered as "sufficiently compatible" in an `UnsafeRow`), then only the resulting data is going to be wrong but it might not trigger any alarms immediately. Or * Weird Java-level exceptions like `java.lang.NegativeArraySizeException`, or even SIGSEGVs. ## How was this patch tested? Added new unit test in `JoinReorderSuite` and new end-to-end test in `JoinSuite`. Also made `JoinReorderSuite` and `StarJoinReorderSuite` assert more strongly on maintaining output attribute order. Closes apache#23303 from rednaxelafx/fix-join-reorder. Authored-by: Kris Mok <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 56448c6) Signed-off-by: Wenchen Fan <[email protected]> * [SPARK-26352][SQL][FOLLOWUP-2.3] Fix missing sameOutput in branch-2.3 ## What changes were proposed in this pull request? This is the branch-2.3 equivalent of apache#23330. After apache#23303 was merged to branch-2.3/2.4, the builds on those branches were broken due to missing a `LogicalPlan.sameOutput` function which came from apache#22713 only available on master. This PR is to follow-up with the broken 2.3/2.4 branches and make a copy of the new `LogicalPlan.sameOutput` into `ReorderJoin` to make it locally available. ## How was this patch tested? Fix the build of 2.3/2.4. Closes apache#23333 from rednaxelafx/branch-2.3. Authored-by: Kris Mok <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> * [SPARK-26316][SPARK-21052][BRANCH-2.3] Revert hash join metrics in that causes performance degradation ## What changes were proposed in this pull request? Revert spark 21052 in spark 2.3 because of the discussion in [PR23269](apache#23269) ## How was this patch tested? N/A Closes apache#23319 from JkSelf/branch-2.3-revert21052. Authored-by: jiake <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> * [SPARK-26394][CORE] Fix annotation error for Utils.timeStringAsMs ## What changes were proposed in this pull request? Change microseconds to milliseconds in annotation of Utils.timeStringAsMs. Closes apache#23346 from stczwd/stczwd. Authored-by: Jackey Lee <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 428eb2a) Signed-off-by: Sean Owen <[email protected]> * [SPARK-24687][CORE] Avoid job hanging when generate task binary causes fatal error ## What changes were proposed in this pull request? When NoClassDefFoundError thrown,it will cause job hang. `Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: Lcom/xxx/data/recommend/aggregator/queue/QueueName; at java.lang.Class.getDeclaredFields0(Native Method) at java.lang.Class.privateGetDeclaredFields(Class.java:2436) at java.lang.Class.getDeclaredField(Class.java:1946) at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659) at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72) at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480) at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468) at java.security.AccessController.doPrivileged(Native Method) at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468) at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365) at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1212) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1119) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)` It is caused by NoClassDefFoundError will not catch up during task seriazation. `var taskBinary: Broadcast[Array[Byte]] = null try { // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep). // For ResultTask, serialize and broadcast (rdd, func). val taskBinaryBytes: Array[Byte] = stage match { case stage: ShuffleMapStage => JavaUtils.bufferToArray( closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef)) case stage: ResultStage => JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef)) } taskBinary = sc.broadcast(taskBinaryBytes) } catch { // In the case of a failure during serialization, abort the stage. case e: NotSerializableException => abortStage(stage, "Task not serializable: " + e.toString, Some(e)) runningStages -= stage // Abort execution return case NonFatal(e) => abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e)) runningStages -= stage return }` image below shows that stage 33 blocked and never be scheduled. <img width="1273" alt="2018-06-28 4 28 42" src="https://user-images.githubusercontent.com/26762018/42621188-b87becca-85ef-11e8-9a0b-0ddf07504c96.png"> <img width="569" alt="2018-06-28 4 28 49" src="https://user-images.githubusercontent.com/26762018/42621191-b8b260e8-85ef-11e8-9d10-e97a5918baa6.png"> ## How was this patch tested? UT Closes apache#21664 from caneGuy/zhoukang/fix-noclassdeferror. Authored-by: zhoukang <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 7c8f475) Signed-off-by: Sean Owen <[email protected]> * [SPARK-26422][R] Support to disable Hive support in SparkR even for Hadoop versions unsupported by Hive fork ## What changes were proposed in this pull request? Currently, even if I explicitly disable Hive support in SparkR session as below: ```r sparkSession <- sparkR.session("local[4]", "SparkR", Sys.getenv("SPARK_HOME"), enableHiveSupport = FALSE) ``` produces when the Hadoop version is not supported by our Hive fork: ``` java.lang.reflect.InvocationTargetException ... Caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.1.1.3.1.0.0-78 at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174) at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139) at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100) at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368) ... 43 more Error in handleErrors(returnStatus, conn) : java.lang.ExceptionInInitializerError at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:193) at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1116) at org.apache.spark.sql.api.r.SQLUtils$.getOrCreateSparkSession(SQLUtils.scala:52) at org.apache.spark.sql.api.r.SQLUtils.getOrCreateSparkSession(SQLUtils.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ``` The root cause is that: ``` SparkSession.hiveClassesArePresent ``` check if the class is loadable or not to check if that's in classpath but `org.apache.hadoop.hive.conf.HiveConf` has a check for Hadoop version as static logic which is executed right away. This throws an `IllegalArgumentException` and that's not caught: https://github.com/apache/spark/blob/36edbac1c8337a4719f90e4abd58d38738b2e1fb/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L1113-L1121 So, currently, if users have a Hive built-in Spark with unsupported Hadoop version by our fork (namely 3+), there's no way to use SparkR even though it could work. This PR just propose to change the order of bool comparison so that we can don't execute `SparkSession.hiveClassesArePresent` when: 1. `enableHiveSupport` is explicitly disabled 2. `spark.sql.catalogImplementation` is `in-memory` so that we **only** check `SparkSession.hiveClassesArePresent` when Hive support is explicitly enabled by short circuiting. ## How was this patch tested? It's difficult to write a test since we don't run tests against Hadoop 3 yet. See apache#21588. Manually tested. Closes apache#23356 from HyukjinKwon/SPARK-26422. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 305e9b5) Signed-off-by: Hyukjin Kwon <[email protected]> * [SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL as False ## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs Closes apache#23350 from mgaido91/SPARK-26366_2.3. Authored-by: Marco Gaido <[email protected]> Signed-off-by: gatorsmile <[email protected]> * Revert "[SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL as False" This reverts commit a7d50ae. * [SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL as False ## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs Closes apache#23372 from mgaido91/SPARK-26366_2.3_2. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> * [SPARK-26444][WEBUI] Stage color doesn't change with it's status ## What changes were proposed in this pull request? On job page, in event timeline section, stage color doesn't change according to its status. Below are some screenshots. ACTIVE: <img width="550" alt="active" src="https://user-images.githubusercontent.com/12194089/50438844-c763e580-092a-11e9-84f6-6fc30e08d69b.png"> COMPLETE: <img width="516" alt="complete" src="https://user-images.githubusercontent.com/12194089/50438847-ca5ed600-092a-11e9-9d2e-5d79807bc1ce.png"> FAILED: <img width="325" alt="failed" src="https://user-images.githubusercontent.com/12194089/50438852-ccc13000-092a-11e9-9b6b-782b96b283b1.png"> This PR lets stage color change with it's status. The main idea is to make css style class name match the corresponding stage status. ## How was this patch tested? Manually tested locally. ``` // active/complete stage sc.parallelize(1 to 3, 3).map { n => Thread.sleep(10* 1000); n }.count // failed stage sc.parallelize(1 to 3, 3).map { n => Thread.sleep(10* 1000); throw new Exception() }.count ``` Note we need to clear browser cache to let new `timeline-view.css` take effect. Below are screenshots after this PR. ACTIVE: <img width="569" alt="active-after" src="https://user-images.githubusercontent.com/12194089/50439986-08f68f80-092f-11e9-85d9-be1c31aed13b.png"> COMPLETE: <img width="567" alt="complete-after" src="https://user-images.githubusercontent.com/12194089/50439990-0bf18000-092f-11e9-8624-723958906e90.png"> FAILED: <img width="352" alt="failed-after" src="https://user-images.githubusercontent.com/12194089/50439993-101d9d80-092f-11e9-8dfd-3e20536f2fa5.png"> Closes apache#23385 from seancxmao/timeline-stage-color. Authored-by: seancxmao <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 5bef4fe) Signed-off-by: Sean Owen <[email protected]> * [SPARK-26496][SS][TEST] Avoid to use Random.nextString in StreamingInnerJoinSuite ## What changes were proposed in this pull request? Similar with apache#21446. Looks random string is not quite safe as a directory name. ```scala scala> val prefix = Random.nextString(10); val dir = new File("/tmp", "del_" + prefix + "-" + UUID.randomUUID.toString); dir.mkdirs() prefix: String = 窽텘⒘駖ⵚ駢⡞Ρ닋੎ dir: java.io.File = /tmp/del_窽텘⒘駖ⵚ駢⡞Ρ닋੎-a3f99855-c429-47a0-a108-47bca6905745 res40: Boolean = false // nope, didn't like this one ``` ## How was this patch tested? Unit test was added, and manually. Closes apache#23405 from HyukjinKwon/SPARK-26496. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit e63243d) Signed-off-by: Dongjoon Hyun <[email protected]> * [SPARK-25591][PYSPARK][SQL][BRANCH-2.3] Avoid overwriting deserialized accumulator ## What changes were proposed in this pull request? If we use accumulators in more than one UDFs, it is possible to overwrite deserialized accumulators and its values. We should check if an accumulator was deserialized before overwriting it in accumulator registry. ## How was this patch tested? Added test. Closes apache#23432 from viirya/SPARK-25591-2.3. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> * [SPARK-26019][PYSPARK] Allow insecure py4j gateways Spark always creates secure py4j connections between java and python, but it also allows users to pass in their own connection. This restores the ability for users to pass in an _insecure_ connection, though it forces them to set the env variable 'PYSPARK_ALLOW_INSECURE_GATEWAY=1', and still issues a warning. Added test cases verifying the failure without the extra configuration, and verifying things still work with an insecure configuration (in particular, accumulators, as those were broken with an insecure py4j gateway before). For the tests, I added ways to create insecure gateways, but I tried to put in protections to make sure that wouldn't get used incorrectly. Closes apache#23337 from squito/SPARK-26019. Authored-by: Imran Rashid <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 1e99f4e) Signed-off-by: Hyukjin Kwon <[email protected]> * [MINOR][NETWORK][TEST] Fix TransportFrameDecoderSuite to use ByteBuf instead of ByteBuffer ## What changes were proposed in this pull request? `fireChannelRead` expects `io.netty.buffer.ByteBuf`.I checked that this is the only place which misuse `java.nio.ByteBuffer` in `network` module. ## How was this patch tested? Pass the Jenkins with the existing tests. Closes apache#23442 from dongjoon-hyun/SPARK-NETWORK-COMMON. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 27e42c1) Signed-off-by: Dongjoon Hyun <[email protected]> * [SPARK-26078][SQL][BACKPORT-2.3] Dedup self-join attributes on IN subqueries ## What changes were proposed in this pull request? When there is a self-join as result of a IN subquery, the join condition may be invalid, resulting in trivially true predicates and return wrong results. The PR deduplicates the subquery output in order to avoid the issue. ## How was this patch tested? added UT Closes apache#23450 from mgaido91/SPARK-26078_2.3. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> * [SPARK-26545] Fix typo in EqualNullSafe's truth table comment ## What changes were proposed in this pull request? The truth table comment in EqualNullSafe incorrectly marked FALSE results as UNKNOWN. ## How was this patch tested? N/A Closes apache#23461 from rednaxelafx/fix-typo. Authored-by: Kris Mok <[email protected]> Signed-off-by: gatorsmile <[email protected]> (cherry picked from commit 4ab5b5b) Signed-off-by: gatorsmile <[email protected]> * [SPARK-26537][BUILD][BRANCH-2.3] change git-wip-us to gitbox ## What changes were proposed in this pull request? This is a backport of apache#23454 due to apache recently moving from git-wip-us.apache.org to gitbox.apache.org, we need to update the packaging scripts to point to the new repo location. this will also need to be backported to 2.4, 2.3, 2.1, 2.0 and 1.6. ## How was this patch tested? the build system will test this. Closes apache#23472 from dongjoon-hyun/SPARK-26537. Authored-by: shane knapp <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> * [SPARK-25253][PYSPARK][FOLLOWUP] Undefined name: from pyspark.util import _exception_message HyukjinKwon ## What changes were proposed in this pull request? add __from pyspark.util import \_exception_message__ to python/pyspark/java_gateway.py ## How was this patch tested? [flake8](http://flake8.pycqa.org) testing of https://github.com/apache/spark on Python 3.7.0 $ __flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics__ ``` ./python/pyspark/java_gateway.py:172:20: F821 undefined name '_exception_message' emsg = _exception_message(e) ^ 1 F821 undefined name '_exception_message' 1 ``` Please review http://spark.apache.org/contributing.html before opening a pull request. Closes apache#22265 from cclauss/patch-2. Authored-by: cclauss <[email protected]> Signed-off-by: hyukjinkwon <[email protected]> * [MINOR][BUILD] Fix script name in `release-tag.sh` usage message ## What changes were proposed in this pull request? This PR fixes the old script name in `release-tag.sh`. $ ./release-tag.sh --help | head -n1 usage: tag-release.sh ## How was this patch tested? Manual. $ ./release-tag.sh --help | head -n1 usage: release-tag.sh Closes apache#23477 from dongjoon-hyun/SPARK-RELEASE-TAG. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 468d25e) Signed-off-by: Dongjoon Hyun <[email protected]>

…nerJoinSuite ## What changes were proposed in this pull request? Similar with apache#21446. Looks random string is not quite safe as a directory name. ```scala scala> val prefix = Random.nextString(10); val dir = new File("/tmp", "del_" + prefix + "-" + UUID.randomUUID.toString); dir.mkdirs() prefix: String = 窽텘⒘駖ⵚ駢⡞Ρ닋੎ dir: java.io.File = /tmp/del_窽텘⒘駖ⵚ駢⡞Ρ닋੎-a3f99855-c429-47a0-a108-47bca6905745 res40: Boolean = false // nope, didn't like this one ``` ## How was this patch tested? Unit test was added, and manually. Closes apache#23405 from HyukjinKwon/SPARK-26496. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

…nerJoinSuite ## What changes were proposed in this pull request? Similar with apache#21446. Looks random string is not quite safe as a directory name. ```scala scala> val prefix = Random.nextString(10); val dir = new File("/tmp", "del_" + prefix + "-" + UUID.randomUUID.toString); dir.mkdirs() prefix: String = 窽텘⒘駖ⵚ駢⡞Ρ닋੎ dir: java.io.File = /tmp/del_窽텘⒘駖ⵚ駢⡞Ρ닋੎-a3f99855-c429-47a0-a108-47bca6905745 res40: Boolean = false // nope, didn't like this one ``` ## How was this patch tested? Unit test was added, and manually. Closes apache#23405 from HyukjinKwon/SPARK-26496. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit e63243d) Signed-off-by: Dongjoon Hyun <[email protected]>

…nerJoinSuite ## What changes were proposed in this pull request? Similar with apache/spark#21446. Looks random string is not quite safe as a directory name. ```scala scala> val prefix = Random.nextString(10); val dir = new File("/tmp", "del_" + prefix + "-" + UUID.randomUUID.toString); dir.mkdirs() prefix: String = 窽텘⒘駖ⵚ駢⡞Ρ닋੎ dir: java.io.File = /tmp/del_窽텘⒘駖ⵚ駢⡞Ρ닋੎-a3f99855-c429-47a0-a108-47bca6905745 res40: Boolean = false // nope, didn't like this one ``` ## How was this patch tested? Unit test was added, and manually. Closes #23405 from HyukjinKwon/SPARK-26496. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit e63243d) Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 1802124)

1) Use host/domain from dfsCluster 2) Use semaphores to remove timing based flakeyness 3) Ensure spark context is closed 4) Ignore ParquetAvroCompatibilitySuite temporarily 5) Adaptation of SPARK-28247 to 2.3 6) Disable codegen tests - these are fixed in 2.4 and 3.x, but require sql backports to 2.3 7) PR apache#21197 to handle test failures due to leaking spark session 8) PR#20926: Set default Spark session in test-only spark sessions 9) PR apache#20971: Active SparkSession should be set by getOrCreate 10) PR apache#21446: Random.nextString is not safe for directory namePrefix 11) Disabling QueryStageSuite#'adaptive skewed join' 12) SPARK-24318: Fix flakey SortShuffleSuite 13) Fix ForeachSinkSuite test failure 14) Diabling StreamingQuerySuite.'status, lastProgress, and recentProgress' 15) Disabling ResolvedDataSourceSuite.'avro: show deploy guide for loading the external avro module' 16) PR# 23405: Avoid to use Random.nextString in StreamingInnerJoinSuite 17) Adaptation of PR#25849, add LinkedIn repo as default to spark.sql.maven.additionalRemoteRepositories 18) Disabling HiveExternalCatalogVersionsSuite - only 2.4.6 and 3.0.0 are available from apache mirrors 19) Disabling INFER_AND_SAVE test in HiveSchemaInferenceSuite - Address Erik's review comments

dongjoon-hyun added 2 commits May 28, 2018 08:56

[SPARK-19613][SS][TEST] Random.nextString is not safe for directory n…

a46dc70

…amePrefix

fix

1412abd

felixcheung approved these changes May 28, 2018

View reviewed changes

HeartSaVioR approved these changes May 28, 2018

View reviewed changes

HyukjinKwon approved these changes May 29, 2018

View reviewed changes

asfgit closed this in b31b587 May 29, 2018

dongjoon-hyun deleted the SPARK-19613 branch May 29, 2018 04:34

HyukjinKwon mentioned this pull request Dec 29, 2018

[SPARK-26496][SS][TEST] Avoid to use Random.nextString in StreamingInnerJoinSuite #23405

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-19613][SS][TEST] Random.nextString is not safe for directory namePrefix #21446

[SPARK-19613][SS][TEST] Random.nextString is not safe for directory namePrefix #21446

dongjoon-hyun commented May 28, 2018 •

edited

Loading

SparkQA commented May 28, 2018

SparkQA commented May 28, 2018

felixcheung left a comment

HeartSaVioR left a comment

dongjoon-hyun commented May 29, 2018

HyukjinKwon commented May 29, 2018

HyukjinKwon left a comment

HyukjinKwon commented May 29, 2018

dongjoon-hyun commented May 29, 2018

[SPARK-19613][SS][TEST] Random.nextString is not safe for directory namePrefix #21446

[SPARK-19613][SS][TEST] Random.nextString is not safe for directory namePrefix #21446

Conversation

dongjoon-hyun commented May 28, 2018 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

SparkQA commented May 28, 2018

SparkQA commented May 28, 2018

felixcheung left a comment

Choose a reason for hiding this comment

HeartSaVioR left a comment

Choose a reason for hiding this comment

dongjoon-hyun commented May 29, 2018

HyukjinKwon commented May 29, 2018

HyukjinKwon left a comment

Choose a reason for hiding this comment

HyukjinKwon commented May 29, 2018

dongjoon-hyun commented May 29, 2018

dongjoon-hyun commented May 28, 2018 •

edited

Loading