Problem select empty ORC table #13103

Closed
wants to merge 872 commits into from

Conversation

@pprado commented May 13, 2016

I get an error when I select from an empty ORC table.

[pprado@hadoop-m ~]$ beeline -u jdbc:hive2://
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1000.2.4.2.0-258)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.4.2.0-258 by Apache Hive

On beeline => create table my_test (id int, name String) stored as orc;
On beeline => select * from my_test;

16/05/13 18:18:57 [main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
OK
+-------------+---------------+--+
| my_test.id | my_test.name |
+-------------+---------------+--+
+-------------+---------------+--+
No rows selected (1.227 seconds)

Hive is OK!

Now, when I run the same query from PySpark:

Welcome to
SPARK version 1.6.1

Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56)
SparkContext available as sc, HiveContext available as sqlContext.

PySpark => sqlContext.sql("select * from my_test")

16/05/13 18:33:41 INFO ParseDriver: Parsing command: select * from my_test
16/05/13 18:33:41 INFO ParseDriver: Parse Completed
Traceback (most recent call last):
File "", line 1, in
File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in call
File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 53, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'orcFileOperator: path hdfs://hadoop-m.c.sva-0001.internal:8020/apps/hive/warehouse/my_test does not have valid orc files matching the pattern'

When I create a Parquet table instead, everything works fine; there is no problem.

adrian-wang and others added 30 commits December 29, 2015 07:02
…hrow Buffer underflow exception

Since we only need to implement `def skipBytes(n: Int)`,
code in #10213 could be simplified.
davies scwf

Author: Daoyuan Wang <[email protected]>

Closes #10253 from adrian-wang/kryo.

(cherry picked from commit a6d3853)
Signed-off-by: Kousuke Saruta <[email protected]>
Include the following changes:

1. Close `java.sql.Statement`
2. Fix incorrect `asInstanceOf`.
3. Remove unnecessary `synchronized` and `ReentrantLock`.

Author: Shixiong Zhu <[email protected]>

Closes #10440 from zsxwing/findbugs.

(cherry picked from commit 710b411)
Signed-off-by: Shixiong Zhu <[email protected]>
…es in postgresql

If a DataFrame has BYTE types, it throws an exception:
org.postgresql.util.PSQLException: ERROR: type "byte" does not exist

Author: Takeshi YAMAMURO <[email protected]>

Closes #9350 from maropu/FixBugInPostgreJdbc.

(cherry picked from commit 73862a1)
Signed-off-by: Yin Huai <[email protected]>
…umn as value

`ifelse`, `when`, `otherwise` is unable to take `Column` typed S4 object as values.

For example:
```r
ifelse(lit(1) == lit(1), lit(2), lit(3))
ifelse(df$mpg > 0, df$mpg, 0)
```
will both fail with
```r
attempt to replicate an object of type 'environment'
```

The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid attempting to vectorize (i.e. `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency, because `ifelse` in base R is vectorized, but I cannot foresee any scenario where these functions would need to be vectorized in SparkR.

For reference, added test cases which trigger failures:
```r
. Error: when(), otherwise() and ifelse() with column on a DataFrame ----------
error in evaluating the argument 'x' in selecting a method for function 'collect':
  error in evaluating the argument 'col' in selecting a method for function 'select':
  attempt to replicate an object of type 'environment'
Calls: when -> when -> ifelse -> ifelse

1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(actual, expected, ...)
8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))
Error: Test failures
Execution halted
```

Author: Forest Fang <[email protected]>

Closes #10481 from saurfang/spark-12526.

(cherry picked from commit d80cc90)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Current schema inference for local python collections halts as soon as there are no NullTypes. This is different than when we specify a sampling ratio of 1.0 on a distributed collection. This could result in incomplete schema information.

Author: Holden Karau <[email protected]>

Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.

(cherry picked from commit d1ca634)
Signed-off-by: Davies Liu <[email protected]>
…ith an unknown app Id

I got an exception when accessing the below REST API with an unknown application Id.
`http://<server-url>:18080/api/v1/applications/xxx/jobs`
Instead of an exception, I expect an error message "no such app: xxx" which is a similar error message when I access `/api/v1/applications/xxx`
```
org.spark-project.guava.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: no app with key xxx
	at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
	at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
	at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
	at org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:116)
	at org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:226)
	at org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:46)
	at org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
```

Author: Carson Wang <[email protected]>

Closes #10352 from carsonwang/unknownAppFix.

(cherry picked from commit b244297)
Signed-off-by: Marcelo Vanzin <[email protected]>
shivaram

Author: felixcheung <[email protected]>

Closes #10408 from felixcheung/rcodecomment.

(cherry picked from commit c3d5056)
Signed-off-by: Shivaram Venkataraman <[email protected]>
…ame to be called value

Author: Xiu Guo <[email protected]>

Closes #10515 from xguo27/SPARK-12562.

(cherry picked from commit 84f8492)
Signed-off-by: Reynold Xin <[email protected]>
…sible.

This patch updates the ExecutorRunner's terminate path to use the new java 8 API
to terminate processes more forcefully if possible. If the executor is unhealthy,
it would previously ignore the destroy() call. Presumably, the new java API was
added to handle cases like this.

We could update the termination path in the future to use OS specific commands
for older java versions.

Author: Nong Li <[email protected]>

Closes #10438 from nongli/spark-12486-executors.

(cherry picked from commit 8f65939)
Signed-off-by: Andrew Or <[email protected]>
also only allocate required buffer size

Author: Pete Robbins <[email protected]>

Closes #10421 from robbinspg/master.

(cherry picked from commit b504b6a)
Signed-off-by: Davies Liu <[email protected]>

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala
Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection.

In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.

This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).

If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different).

This patch is inspired by a similar patch that I made to the `spark-redshift` library (databricks/spark-redshift#143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).

Author: Josh Rosen <[email protected]>

Closes #10519 from JoshRosen/jdbc-driver-precedence.

(cherry picked from commit 6c83d93)
Signed-off-by: Yin Huai <[email protected]>
This is the related thread: http://search-hadoop.com/m/q3RTtO3ReeJ1iF02&subj=Re+partitioning+json+data+in+spark

Michael suggested fixing the doc.

Please review.

Author: tedyu <[email protected]>

Closes #10499 from ted-yu/master.

(cherry picked from commit 40d0396)
Signed-off-by: Michael Armbrust <[email protected]>
…he row length.

The reader was previously not setting the row length meaning it was wrong if there were variable
length columns. This problem does not manifest usually, since the value in the column is correct and
projecting the row fixes the issue.

Author: Nong Li <[email protected]>

Closes #10576 from nongli/spark-12589.

(cherry picked from commit 34de24a)
Signed-off-by: Yin Huai <[email protected]>

Conflicts:
	sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
checked that the change is in Spark 1.6.0.
shivaram

Author: felixcheung <[email protected]>

Closes #10574 from felixcheung/rwritemodedoc.

(cherry picked from commit 8896ec9)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #10516 from marmbrus/datasetCleanup.

(cherry picked from commit 53beddc)
Signed-off-by: Michael Armbrust <[email protected]>
…termining the number of reducers: aggregate operator

change expected partition sizes

Author: Pete Robbins <[email protected]>

Closes #10599 from robbinspg/branch-1.6.
This patch added Py4jCallbackConnectionCleaner to clean up Py4J's leaked sockets every 30 seconds. This is a workaround until Py4J fixes the leak issue py4j/py4j#187

Author: Shixiong Zhu <[email protected]>

Closes #10579 from zsxwing/SPARK-12617.

(cherry picked from commit 047a31b)
Signed-off-by: Davies Liu <[email protected]>
…erializer is called only once

There is an issue that Py4J's PythonProxyHandler.finalize blocks forever. (py4j/py4j#184)

Py4j will create a PythonProxyHandler in Java for "transformer_serializer" when calling "registerSerializer". If we call "registerSerializer" twice, the second PythonProxyHandler will override the first one, then the first one will be GCed and trigger "PythonProxyHandler.finalize". To avoid that, we should not call "registerSerializer" more than once, so that the "PythonProxyHandler" on the Java side won't be GCed.

Author: Shixiong Zhu <[email protected]>

Closes #10514 from zsxwing/SPARK-12511.

(cherry picked from commit 6cfe341)
Signed-off-by: Davies Liu <[email protected]>
SPARK-12450. Un-persist broadcasted variables in KMeans.

Author: RJ Nowling <[email protected]>

Closes #10415 from rnowling/spark-12450.

(cherry picked from commit 78015a8)
Signed-off-by: Joseph K. Bradley <[email protected]>
Successfully ran kinesis demo on a live, aws hosted kinesis stream against master and 1.6 branches.  For reasons I don't entirely understand it required a manual merge to 1.5 which I did as shown here: BrianLondon@075c22e

The demo ran successfully on the 1.5 branch as well.

According to `mvn dependency:tree` it is still pulling a fairly old version of the aws-java-sdk (1.9.37), but this appears to have fixed the kinesis regression in 1.5.2.

Author: BrianLondon <[email protected]>

Closes #10492 from BrianLondon/remove-only.

(cherry picked from commit ff89975)
Signed-off-by: Sean Owen <[email protected]>
Add ```read.text``` and ```write.text``` for SparkR.
cc sun-rui felixcheung shivaram

Author: Yanbo Liang <[email protected]>

Closes #10348 from yanboliang/spark-12393.

(cherry picked from commit d1fea41)
Signed-off-by: Shivaram Venkataraman <[email protected]>
If the initial model passed to GMM is not empty, it causes a `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to a `list`.

Author: zero323 <[email protected]>

Closes #9986 from zero323/SPARK-12006.

(cherry picked from commit fcd013c)
Signed-off-by: Joseph K. Bradley <[email protected]>
Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext.

Author: Shixiong Zhu <[email protected]>

Closes #10621 from zsxwing/SPARK-12617-2.

(cherry picked from commit 1e6648d)
Signed-off-by: Shixiong Zhu <[email protected]>
…lt root path to gain the streaming batch url.

Author: huangzhaowei <[email protected]>

Closes #10617 from SaintBacchus/SPARK-12672.
…of default root path to gain the streaming batch url."

This reverts commit 8f0ead3. Will merge #10618 instead.
… pyspark

JIRA: https://issues.apache.org/jira/browse/SPARK-12016

We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark.

Author: Liang-Chi Hsieh <[email protected]>

Closes #10100 from viirya/fix-load-py-wordvecmodel.

(cherry picked from commit b51a4cd)
Signed-off-by: Joseph K. Bradley <[email protected]>
Otherwise the URL will fail to proxy to the right one in YARN mode. Here is the screenshot:

![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png)

Author: jerryshao <[email protected]>

Closes #10618 from jerryshao/SPARK-12673.

(cherry picked from commit 174e72c)
Signed-off-by: Shixiong Zhu <[email protected]>
MapPartitionsRDD was keeping a reference to `prev` after a call to `clearDependencies`, which could lead to a memory leak.

Author: Guillaume Poulin <[email protected]>

Closes #10623 from gpoulin/map_partition_deps.

(cherry picked from commit b673852)
Signed-off-by: Reynold Xin <[email protected]>
…not None"

This reverts commit fcd013c.

Author: Yin Huai <[email protected]>

Closes #10632 from yhuai/pythonStyle.

(cherry picked from commit e5cde7a)
Signed-off-by: Yin Huai <[email protected]>
modify 'spark.memory.offHeap.enabled' default value to false

Author: zzcclp <[email protected]>

Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.

(cherry picked from commit 84e77a1)
Signed-off-by: Reynold Xin <[email protected]>
Andrew Or added 2 commits May 11, 2016 13:37
## What changes were proposed in this pull request?

If an executor is still alive even after the scheduler has removed its metadata, we may receive a heartbeat from that executor and tell its block manager to reregister itself. If that happens, the block manager master will know about the executor, but the scheduler will not.

That is a dangerous situation, because when the executor does get disconnected later, the scheduler will not ask the block manager to also remove metadata for that executor. Later, when we try to clean up an RDD or a broadcast variable, we may try to send a message to that executor, triggering an exception.

## How was this patch tested?

Jenkins.

Author: Andrew Or <[email protected]>

Closes #13055 from andrewor14/block-manager-remove.

(cherry picked from commit 40a949a)
Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request?

(This is the branch-1.6 version of #13039)

When we acquire execution memory, we do a lot of things between shrinking the storage memory pool and enlarging the execution memory pool. In particular, we call memoryStore.evictBlocksToFreeSpace, which may do a lot of I/O and can throw exceptions. If an exception is thrown, the pool sizes on that executor will be in a bad state.

This patch minimizes the things we do between the two calls to make the resizing more atomic.

## How was this patch tested?

Jenkins.

Author: Andrew Or <[email protected]>

Closes #13058 from andrewor14/safer-pool-1.6.
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member) commented May 14, 2016

I think it would be good if you follow https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

@pprado (Author) commented May 14, 2016

Why?
On 13 May 2016 at 9:08 PM, "Hyukjin Kwon" [email protected] wrote:

I think it would be good if you follow
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide.



@HyukjinKwon (Member)

Because I guess this is a contribution to Spark and there is a guide for this. It seems there are a lot of things wrong with this PR (e.g. no JIRA).

@pprado (Author) commented May 14, 2016

Sorry, I tried to follow this document:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingBugReports

Jira is not the best tool for bug reporting.

Again, sorry if I did not follow the documentation practices you expected, but I think it is more important to report a very serious bug, one that did not exist in version 1.6.0.

Thanks,
Pedro Prado

2016-05-13 21:30 GMT-03:00 Hyukjin Kwon [email protected]:

Because I guess this is a contribution to Spark and there is a guide for
this. It seems there are a lot of things wrong with this PR (e.g. no JIRA).



@HyukjinKwon (Member)

In https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingBugReports, it seems the guide suggests not creating a PR first but asking on the mailing list first.

Also, it seems this is not a very serious bug according to the guide; it seems the JIRA priority could be minor or major.

And I think it is more important to follow the guide. Otherwise, it would be chaotic to deal with issues and PRs, since Spark is a really active and popular project.

@HyukjinKwon (Member) commented May 14, 2016

I think it would be nicer if this bug report were raised on the mailing list first or in JIRA.

@andrewor14 (Contributor)

Let's close this PR. This patch isn't opened against the correct branch anyway.

dosoft and others added 19 commits May 19, 2016 22:25
Fixed memory leak (HiveConf in the CommandProcessorFactory)

Author: Oleg Danilov <[email protected]>

Closes #12932 from dosoft/SPARK-14261.

(cherry picked from commit e384c7f)
Signed-off-by: Reynold Xin <[email protected]>
…for 1.6)

## What changes were proposed in this pull request?

Backport #13185 to branch 1.6.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <[email protected]>

Closes #13196 from zsxwing/host-string-1.6.
… in generated code (branch-1.6)

## What changes were proposed in this pull request?

This PR introduces a placeholder for comments in generated code; the purpose is the same as #12939, but this approach is much safer.

Generated code to be compiled doesn't include actual comments, but includes placeholders instead.

Placeholders in the generated code are replaced with the actual comments only at the time of logging.

Also, this PR can resolve SPARK-15205.

## How was this patch tested?

Added new test cases.

Author: Kousuke Saruta <[email protected]>

Closes #13230 from sarutak/SPARK-15165-branch-1.6.
## What changes were proposed in this pull request?

To ensure that the deserialization of TaskMetrics uses a ClassLoader that knows about RDDBlockIds. The problem occurs only very rarely since it depends on which thread of the thread pool is used for the heartbeat.

I observe that the code in question has been largely rewritten for v2.0.0 of Spark and the problem no longer manifests. However it would seem reasonable to fix this for those users who need to continue with the 1.6 version for some time yet. Hence I have created a fix for the 1.6 code branch.

## How was this patch tested?

Due to the nature of the problem a reproducible testcase is difficult to produce. This problem was causing our application's nightly integration tests to fail randomly. Since applying the fix the tests have not failed due to this problem, for nearly six weeks now.

Author: Simon Scott <[email protected]>

Closes #13222 from simonjscott/fix-10722.
This patch fixes a few integer overflows in `UnsafeSortDataFormat.copyRange()` and `ShuffleSortDataFormat copyRange()` that seems to be the most likely cause behind a number of `TimSort` contract violation errors seen in Spark 2.0 and Spark 1.6 while sorting large datasets.

Added a test in `ExternalSorterSuite` that instantiates a large array of the form of [150000000, 150000001, 150000002, ...., 300000000, 0, 1, 2, ..., 149999999] that triggers a `copyRange` in `TimSort.mergeLo` or `TimSort.mergeHi`. Note that the input dataset should contain at least 268.43 million rows with a certain data distribution for an overflow to occur.

Author: Sameer Agarwal <[email protected]>

Closes #13336 from sameeragarwal/timsort-bug.

(cherry picked from commit fe6de16)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

Makes `UnsafeSortDataFormat`  and `RecordPointerAndKeyPrefix` public. These are already public in 2.0 and are used in an `ExternalSorterSuite` test (see 0b8bdf7)

## How was this patch tested?

Successfully builds locally

Author: Sameer Agarwal <[email protected]>

Closes #13339 from sameeragarwal/fix-compile.
## What changes were proposed in this pull request?
A local variable in NumberConverter is wrongly shared between threads.
This PR fixes the race condition.

## How was this patch tested?
Manually checked.

Author: Takeshi YAMAMURO <[email protected]>

Closes #13391 from maropu/SPARK-15528.

(cherry picked from commit 95db8a4)
Signed-off-by: Sean Owen <[email protected]>
…tents written if buffer isn't full

1. The class allocated 4x more space than needed, as it was using `Int` to store the `Byte` values.

2. If the CircularBuffer isn't full, toString() currently prints some garbage chars along with the content written, as it tries to print the entire array allocated for the buffer. The fix is to keep track of whether the buffer is full and not print the tail of the buffer if it isn't (suggestion by sameeragarwal over #12194 (comment)).

3. Simplified `toString()`

Added new test case

Author: Tejas Patil <[email protected]>

Closes #13351 from tejasapatil/circular_buffer.

(cherry picked from commit ac38bdc)
Signed-off-by: Sean Owen <[email protected]>
This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user).  The consequence of this is that all JDBC accesses under the described circumstances fail with an `IllegalStateException`. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204

My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user.

This patch was tested manually.  I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed.

Author: Kevin McHale <[email protected]>

Closes #12000 from mchalek/jdbc-driver-registration.
…iles

If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates / does not invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be repeatedly retried, leading to repeated task failures and eventually a total job failure.

In order to fix this problem, the executor with the missing file needs to properly mark the corresponding block as missing so that it stops advertising itself as a cache location for that block.

This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (`in BlockManagerSuite`).

This is a branch-1.6 backport of #13473.

Author: Josh Rosen <[email protected]>

Closes #13479 from JoshRosen/handle-missing-cache-files-branch-1.6.
…ation tokens to be added in current user credential.

## What changes were proposed in this pull request?
The credentials are not added to the credentials of UserGroupInformation.getCurrentUser(). Further, if the client can log in using a keytab, the updateDelegationToken thread is not started on the client.

## How was this patch tested?
ran dev/run-tests

Author: Subroto Sanyal <[email protected]>

Closes #13499 from subrotosanyal/SPARK-15754-save-ugi-from-changing.

(cherry picked from commit 61d729a)
Signed-off-by: Marcelo Vanzin <[email protected]>
…form "EST" is …

## What changes were proposed in this pull request?

Stop using the abbreviated and ambiguous timezone "EST" in a test, since it is machine-local default timezone dependent, and fails in different timezones.  Fixed [SPARK-15723](https://issues.apache.org/jira/browse/SPARK-15723).

## How was this patch tested?

Note that to reproduce this problem in any locale/timezone, you can modify the scalatest-maven-plugin argLine to add a timezone:

    <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="Australia/Sydney"</argLine>

and run

    $ mvn test -DwildcardSuites=org.apache.spark.status.api.v1.SimpleDateParamSuite -Dtest=none

Equally, this will fix it in an affected timezone:

    <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="America/New_York"</argLine>

To test the fix, apply the above change to `pom.xml` to set test TZ to `Australia/Sydney`, and confirm the test now passes.

Author: Brett Randall <[email protected]>

Closes #13462 from javabrett/SPARK-15723-SimpleDateParamSuite.

(cherry picked from commit 4e767d0)
Signed-off-by: Sean Owen <[email protected]>
Some VertexRDD and EdgeRDD are created during the intermediate step of g.connectedComponents() but unnecessarily left cached after the method is done. The fix is to unpersist these RDDs once they are no longer in use.

A test case is added to confirm the fix for the reported bug.

Author: Jason Lee <[email protected]>

Closes #10713 from jasoncl/SPARK-12655.

(cherry picked from commit d0a5c32)
Signed-off-by: Sean Owen <[email protected]>
… empty .m2 cache

This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes this issue by adjusting the regex pattern.

Tested manually with the following reproduction of the bug:

```
rm -rf ~/.m2/repository/org/apache/commons/
./dev/test-dependencies.sh
```

Author: Josh Rosen <[email protected]>

Closes #13568 from JoshRosen/SPARK-12712.

(cherry picked from commit 921fa40)
Signed-off-by: Josh Rosen <[email protected]>
…entral

Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but depends on that fork via an SBT subproject which is cloned from https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This unnecessarily slows down the initial build on fresh machines and is also risky because it could break the build in case that GitHub repository ever changes or is deleted.

In order to address these issues, I have published a pre-built binary of our forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` namespace and have updated Spark's build to use that artifact. This published artifact was built from https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains the contents of ScrapCodes's branch plus an additional patch to configure the build for artifact publication.

/cc srowen ScrapCodes for review.

Author: Josh Rosen <[email protected]>

Closes #13564 from JoshRosen/use-published-fork-of-pom-reader.

(cherry picked from commit f74b777)
Signed-off-by: Josh Rosen <[email protected]>
## What changes were proposed in this pull request?

fixing documentation for the groupby/agg example in python

## How was this patch tested?

the existing example in the documentation does not contain valid syntax (missing parenthesis) and does not use `Column` in the expression for `agg()`

after the fix here's how I tested it:

```
In [1]: from pyspark.sql import Row

In [2]: import pyspark.sql.functions as func

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:records = [{'age': 19, 'department': 1, 'expense': 100},
: {'age': 20, 'department': 1, 'expense': 200},
: {'age': 21, 'department': 2, 'expense': 300},
: {'age': 22, 'department': 2, 'expense': 300},
: {'age': 23, 'department': 3, 'expense': 300}]
:--

In [4]: df = sqlContext.createDataFrame([Row(**d) for d in records])

In [5]: df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense")).show()

+----------+----------+--------+------------+
|department|department|max(age)|sum(expense)|
+----------+----------+--------+------------+
|         1|         1|      20|         300|
|         2|         2|      22|         600|
|         3|         3|      23|         300|
+----------+----------+--------+------------+
```

Author: Mortada Mehyar <[email protected]>

Closes #13587 from mortada/groupby_agg_doc_fix.

(cherry picked from commit 675a737)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

Currently, `AFTAggregator` is not being merged correctly. For example, if there is any single empty partition in the data, this creates an `AFTAggregator` with zero total count which causes the exception below:

```
IllegalArgumentException: u'requirement failed: The number of instances should be greater than 0.0, but got 0.'
```

Please see [AFTSurvivalRegression.scala#L573-L575](https://github.com/apache/spark/blob/6ecedf39b44c9acd58cdddf1a31cf11e8e24428c/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L573-L575) as well.

Just to be clear, the python example `aft_survival_regression.py` seems to use 5 rows. So, if there are more than 5 partitions, it throws the exception above, since some partitions are empty, which results in an incorrectly merged `AFTAggregator`.

Executing `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py` on a machine with more than 5 CPUs fails because it creates tasks with some empty partitions under the default configuration (AFAIK, it sets the parallelism level to the number of CPU cores).

## How was this patch tested?

A unit test in `AFTSurvivalRegressionSuite.scala`, and manually tested by `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py`.

Author: hyukjinkwon <[email protected]>
Author: Hyukjin Kwon <[email protected]>

Closes #13619 from HyukjinKwon/SPARK-15892.

(cherry picked from commit e355460)
Signed-off-by: Joseph K. Bradley <[email protected]>
…n when override sameResult.

## What changes were proposed in this pull request?

This pr is a backport of #13638 for `branch-1.6`.

## How was this patch tested?

Added the same test as #13638 modified for `branch-1.6`.

Author: Takuya UESHIN <[email protected]>

Closes #13668 from ueshin/issues/SPARK-15915_1.6.
@asfgit closed this in 1a33f2e on Jun 15, 2016