Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils #29959

Closed
wants to merge 6 commits into from

Conversation

sunchao
Copy link
Member

@sunchao sunchao commented Oct 6, 2020

What changes were proposed in this pull request?

This PR is a follow-up of #29471 and does the following improvements for HadoopFSUtils:

  1. Removes the extra filterFun from the listing API and combines it with the filter.
  2. Removes SerializableBlockLocation and SerializableFileStatus given that BlockLocation and FileStatus are already serializable.
  3. Hides the isRootLevel flag from the top-level API.

Why are the changes needed?

Main purpose is to simplify the logic within HadoopFSUtils as well as cleanup the API.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests (e.g., FileIndexSuite)

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34085/

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Test build #129479 has finished for PR 29959 at commit 6f7dc79.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34085/

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34090/

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34090/

@SparkQA
Copy link

SparkQA commented Oct 7, 2020

Test build #129484 has finished for PR 29959 at commit 8c8aa81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing to work on this, some initial questions.

if (ignoreLocality) {
fs.listStatus(path)
} else {
val remoteIter = fs.listLocatedStatus(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a chance a FS won't have this implemented? as per the previous code's comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah a FS can choose not to implement it (although all the main ones override this). If not implemented it will fall back to the default impl in FileSystem, which basically calls listStatus and then getFileBlockLocations on each FileStatus received. The behavior is very similar to what this class is doing later on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HDFS and S3A both do this; ABFS merits minor optimisation too. Because they return a remote iterator they can do paged fetch of data

  • HDFS/webHDFS: paged download for better scalability
  • S3A (3.3.1+): async prefetch of next page of data

ABFS should copy the S3A approach; it's listing API is paged too.

Better to rely on the FS to do the work but make clear you expect the maintainers to do so

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @steveloughran , yes I also think it's better to rely on the FileSystem-specific listLocatedStatus impl rather than having the logic here. However, the change seems to break a few assumptions in the test cases so I'll isolate it into another PR.

Comment on lines 217 to 219
// listLocatedStatus will fail as a whole because the default impl calls
// getFileBlockLocations
assert(leafFiles.isEmpty)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to indicate the change needs some work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this test checks the case where a file was deleted after a listStatus call but before a subsequent getFileBlockLocations when locality info is needed. With the new impl, we'd call listLocatedStatus instead which will call getFileBlockLocations internally, and thus the listLocatedStatus call (as a whole) fails with FileNotFoundException.

As explained in the PR description, the behavior will be different when spark.sql.files.ignoreMissingFiles is set, although I think we currently don't give any guarantee when there is missing files during listing, so either is acceptable? anyway, I'm happy to remove this change if there is any concern.

@SparkQA
Copy link

SparkQA commented Oct 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34169/

@SparkQA
Copy link

SparkQA commented Oct 8, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34169/

@SparkQA
Copy link

SparkQA commented Oct 8, 2020

Test build #129563 has finished for PR 29959 at commit 5e299a2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Copy link
Member Author

sunchao commented Oct 8, 2020

Thanks @holdenk for the review. Yes this PR still needs a bit more work. Will update.

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34177/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34177/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Test build #129571 has finished for PR 29959 at commit f582b17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34195/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34195/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Test build #129592 has finished for PR 29959 at commit e10e59f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Oct 9, 2020

Jenkins, add to whitelist

@sunchao
Copy link
Member Author

sunchao commented Oct 9, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34207/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34207/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34208/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34208/

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Test build #129604 has finished for PR 29959 at commit e10e59f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34209/

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34209/

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Test build #129605 has finished for PR 29959 at commit e10e59f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Test build #129606 has finished for PR 29959 at commit 1b4bfbe.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao changed the title [WIP][SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils [SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils Oct 10, 2020
@sunchao
Copy link
Member Author

sunchao commented Nov 16, 2020

@holdenk sure - it's done.

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35775/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35776/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35775/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35776/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Test build #131175 has finished for PR 29959 at commit be1517e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Test build #131176 has finished for PR 29959 at commit cb76047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35841/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35841/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Test build #131237 has finished for PR 29959 at commit e9d399d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Nov 18, 2020

K8s failures are unrelated, this does not change any of the decommissioning logic. I'll work on a follow up to the decommissioning failures.

@asfgit asfgit closed this in 27cd945 Nov 18, 2020
@sunchao
Copy link
Member Author

sunchao commented Nov 18, 2020

Thanks @holdenk for the review & merge!

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Nov 20, 2020

Hi, @holdenk and @sunchao .

Could you check Hadoop 2.7 failure?

[info] - SPARK-24626 parallel file listing in Stats computation *** FAILED *** (2 seconds, 408 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 in stage 21.0 (TID 19) had a not serializable result: org.apache.hadoop.fs.Path
[info] Serialization stack:

@sunchao
Copy link
Member Author

sunchao commented Nov 20, 2020

thanks @dongjoon-hyun , let me take a look.

@sunchao
Copy link
Member Author

sunchao commented Nov 20, 2020

found potential issue and opened #30447

dongjoon-hyun pushed a commit that referenced this pull request Nov 21, 2020
…leFileStatus and SerializableBlockLocation for Hadoop 2.7

### What changes were proposed in this pull request?

Revert the change in #29959 and don't remove `SerializableFileStatus` and `SerializableBlockLocation`.

### Why are the changes needed?

In Hadoop 2.7 `FileStatus` and `BlockLocation` are not serializable, so we still need the two wrapper classes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #30447 from sunchao/SPARK-32381-followup.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
ahshahid added a commit to ahshahid/spark that referenced this pull request Nov 23, 2020
* [SPARK-33045][SQL][FOLLOWUP] Fix build failure with Scala 2.13

### What changes were proposed in this pull request?

Explicitly convert `scala.collection.mutable.Buffer` to `Seq`. In Scala 2.13 `Seq` is an alias of `scala.collection.immutable.Seq` instead of `scala.collection.Seq`.

### Why are the changes needed?

Without the change build with Scala 2.13 fails with the following:
```
[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:1417:41: type mismatch;
[error]  found   : scala.collection.mutable.Buffer[org.apache.spark.unsafe.types.UTF8String]
[error]  required: Seq[org.apache.spark.unsafe.types.UTF8String]
[error]                 case null => LikeAll(e, patterns)
[error]                                         ^
[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:1418:41: type mismatch;
[error]  found   : scala.collection.mutable.Buffer[org.apache.spark.unsafe.types.UTF8String]
[error]  required: Seq[org.apache.spark.unsafe.types.UTF8String]
[error]                 case _ => NotLikeAll(e, patterns)
[error]                                         ^
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes apache#30431 from sunchao/SPARK-33045-followup.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [MINOR] Structured Streaming statistics page indent fix

### What changes were proposed in this pull request?
Structured Streaming statistics page code contains an indentation issue. This PR fixes it.

### Why are the changes needed?
Indent fix.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests.

Closes apache#30434 from gaborgsomogyi/STAT-INDENT-FIX.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [MINOR][DOCS] Document 'without' value for HADOOP_VERSION in pip installation

### What changes were proposed in this pull request?

I believe it's self-descriptive.

### Why are the changes needed?

To document supported features.

### Does this PR introduce _any_ user-facing change?

Yes, the docs are updated. It's master only.

### How was this patch tested?

Manually built the docs via `cd python/docs` and `make clean html`:

![Screen Shot 2020-11-20 at 10 59 07 AM](https://user-images.githubusercontent.com/6477701/99748225-7ad9b280-2b1f-11eb-86fd-165012b1bb7c.png)

Closes apache#30436 from HyukjinKwon/minor-doc-fix.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-32919][SHUFFLE][TEST-MAVEN][TEST-HADOOP2.7] Driver side changes for coordinating push based shuffle by selecting external shuffle services for merging partitions

### What changes were proposed in this pull request?
Driver side changes for coordinating push based shuffle by selecting external shuffle services for merging partitions.

This PR includes changes related to `ShuffleMapStage` preparation which is selection of merger locations and initializing them as part of `ShuffleDependency`.

Currently this code is not used as some of the changes would come subsequently as part of https://issues.apache.org/jira/browse/SPARK-32917 (shuffle blocks push as part of `ShuffleMapTask`), https://issues.apache.org/jira/browse/SPARK-32918 (support for finalize API) and https://issues.apache.org/jira/browse/SPARK-32920 (finalization of push/merge phase). This is why the tests here are also partial, once these above mentioned changes are raised as PR we will have enough tests for DAGScheduler piece of code as well.

### Why are the changes needed?
Added a new API in `SchedulerBackend` to get merger locations for push based shuffle. This is currently implemented for Yarn and other cluster managers can have separate implementations which is why a new API is introduced.

### Does this PR introduce _any_ user-facing change?
Yes, user facing config to enable push based shuffle is introduced

### How was this patch tested?
Added unit tests partially and some of the changes in DAGScheduler depends on future changes, DAGScheduler tests will be added along with those changes.

Lead-authored-by: Venkata krishnan Sowrirajan vsowrirajanlinkedin.com
Co-authored-by: Min Shen mshenlinkedin.com

Closes apache#30164 from venkata91/upstream-SPARK-32919.

Lead-authored-by: Venkata krishnan Sowrirajan <[email protected]>
Co-authored-by: Min Shen <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

* [SPARK-33441][BUILD][FOLLOWUP] Make unused-imports check for SBT specific

### What changes were proposed in this pull request?
Move "unused-imports" check config to `SparkBuild.scala` and make it SBT specific.

### Why are the changes needed?
Make unused-imports check for SBT specific.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes apache#30441 from LuciferYang/SPARK-33441-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-32512][SQL][TESTS][FOLLOWUP] Remove duplicate tests for ALTER TABLE .. PARTITIONS from DataSourceV2SQLSuite

### What changes were proposed in this pull request?
Remove tests from `DataSourceV2SQLSuite` that were copied to `AlterTablePartitionV2SQLSuite` by apache#29339.

### Why are the changes needed?
- To reduce tests execution time
- To improve test maintenance

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified tests:
```
$ build/sbt "test:testOnly *DataSourceV2SQLSuite"
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
```

Closes apache#30444 from MaxGekk/dedup-tests-AlterTablePartitionV2SQLSuite.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-33422][DOC] Fix the correct display of left menu item

### What changes were proposed in this pull request?
Limit the height of the menu area on the left to display vertical scroll bar

### Why are the changes needed?

The bottom menu item cannot be displayed when the left menu tree is long

### Does this PR introduce any user-facing change?

Yes, if the menu item shows more, you'll see it by pulling down the vertical scroll bar

before:
![image](https://user-images.githubusercontent.com/28332082/98805115-16995d80-2452-11eb-933a-3b72c14bea78.png)

after:
![image](https://user-images.githubusercontent.com/28332082/98805418-7e4fa880-2452-11eb-9a9b-8d265078297c.png)

### How was this patch tested?
NA

Closes apache#30335 from liucht-inspur/master.

Authored-by: liucht <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-33468][SQL] ParseUrl  in ANSI mode should fail if input string is not a valid url

### What changes were proposed in this pull request?

With `ParseUrl`, instead of return null we throw exception if input string is not a vaild url.

### Why are the changes needed?

For ANSI mode.

### Does this PR introduce _any_ user-facing change?

Yes, user will get exception if `set spark.sql.ansi.enabled=true`.

### How was this patch tested?

Add test.

Closes apache#30399 from ulysses-you/SPARK-33468.

Lead-authored-by: ulysses <[email protected]>
Co-authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-28704][SQL][TEST] Add back Skiped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+

### What changes were proposed in this pull request?
We skip test HiveExternalCatalogVersionsSuite when testing with JAVA_9 or later because our previous version does not support JAVA_9 or later. We now add it back since we have a version supports JAVA_9 or later.

### Why are the changes needed?

To recover test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Check CI logs.

Closes apache#30428 from AngersZhuuuu/SPARK-28704.

Authored-by: angerszhu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-33466][ML][PYTHON] Imputer support mode(most_frequent) strategy

### What changes were proposed in this pull request?
impl a new strategy `mode`: replace missing using the most frequent value along each column.

### Why are the changes needed?
it is highly scalable, and had been a function in [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) for a long time.

### Does this PR introduce _any_ user-facing change?
Yes, a new strategy is added

### How was this patch tested?
updated testsuites

Closes apache#30397 from zhengruifeng/imputer_max_freq.

Lead-authored-by: Ruifeng Zheng <[email protected]>
Co-authored-by: zhengruifeng <[email protected]>
Signed-off-by: Sean Owen <[email protected]>

* [MINOR][TESTS][DOCS] Use fully-qualified class name in docker integration test

### What changes were proposed in this pull request?
change
```
./build/sbt -Pdocker-integration-tests "testOnly *xxxIntegrationSuite"
```
to
```
./build/sbt -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.xxxIntegrationSuite"
```

### Why are the changes needed?
We only want to start v1 ```xxxIntegrationSuite```, not the newly added```v2.xxxIntegrationSuite```.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually checked

Closes apache#30448 from huaxingao/dockertest.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-33492][SQL] DSv2: Append/Overwrite/ReplaceTable should invalidate cache

### What changes were proposed in this pull request?

This adds changes in the following places:
- logic to also refresh caches referencing the target table in v2 `AppendDataExec`, `OverwriteByExpressionExec`, `OverwritePartitionsDynamicExec`, as well as their v1 fallbacks `AppendDataExecV1` and `OverwriteByExpressionExecV1`.
- logic to invalidate caches referencing the target table in v2 `ReplaceTableAsSelectExec` and its atomic version `AtomicReplaceTableAsSelectExec`. These are only supported in v2 at the moment though.

In addition to the above, in order to test the v1 write fallback behavior, I extended `InMemoryTableWithV1Fallback` to also support batch reads.

### Why are the changes needed?

Currently in DataSource v2 we don't refresh or invalidate caches referencing the target table when the table content is changed by operations such as append, overwrite, or replace table. This is different from DataSource v1, and could potentially cause data correctness issue if the staled caches are queried later.

### Does this PR introduce _any_ user-facing change?

Yes. Now When a data source v2 is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced.

### How was this patch tested?

Added unit tests for the new code path.

Closes apache#30429 from sunchao/SPARK-33492.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-32670][SQL] Group exception messages in Catalyst Analyzer in one file

### What changes were proposed in this pull request?

Group all messages of `AnalysisExcpetions` created and thrown directly in org.apache.spark.sql.catalyst.analysis.Analyzer in one file.
* Create a new object: `org.apache.spark.sql.CatalystErrors` with many exception-creating functions.
* When the `Analyzer` wants to create and throw a new `AnalysisException`, call functions of `CatalystErrors`

### Why are the changes needed?

This is the sample PR that groups exception messages together in several files. It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?

No. Error messages remain unchanged.

### How was this patch tested?

No new tests - pass all original tests to make sure it doesn't break any existing behavior.

### Naming of exception functions

All function names ended with `Error`.
* For specific errors like `groupingIDMismatch` and `groupingColInvalid`, directly use them as name, just like `groupingIDMismatchError` and `groupingColInvalidError`.
* For generic errors like `dataTypeMismatch`,
  * if confident with the context, prefix and condition can be added, like `pivotValDataTypeMismatchError`
  * if not sure about the context, add a `For` suffix of the specific component that this exception is related to, like `dataTypeMismatchForDeserializerError`

Closes apache#29497 from anchovYu/32670.

Lead-authored-by: anchovYu <[email protected]>
Co-authored-by: anchovYu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-33223][SS][FOLLOWUP] Clarify the meaning of "number of rows dropped by watermark" in SS UI page

### What changes were proposed in this pull request?

This PR fixes the representation to clarify the meaning of "number of rows dropped by watermark" in SS UI page.

### Why are the changes needed?

`Aggregated Number Of State Rows Dropped By Watermark` says that the dropped rows are from the state, whereas they're not. We say "evicted from the state" for the case, which is "normal" to emit outputs and reduce memory usage of the state.

The metric actually represents the number of "input" rows dropped by watermark, and the meaning of "input" is relative to the "stateful operator". That's a bit confusing as we normally think "input" as "input from source" whereas it's not.

### Does this PR introduce _any_ user-facing change?

Yes, UI element & tooltip change.

### How was this patch tested?

Only text change in UI, so we know how thing will be changed intuitively.

Closes apache#30439 from HeartSaVioR/SPARK-33223-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>

* [SPARK-33505][SQL][TESTS] Fix adding new partitions by INSERT INTO `InMemoryPartitionTable`

### What changes were proposed in this pull request?
1. Add a hook method to `addPartitionKey()` of `InMemoryTable` which is called per every row.
2. Override `addPartitionKey()` in `InMemoryPartitionTable`, and add partition key every time when new row is inserted to the table.

### Why are the changes needed?
To be able to write unified tests for datasources V1 and V2. Currently, INSERT INTO a V1 table creates partitions but the same doesn't work for the custom catalog `InMemoryPartitionTableCatalog` used in DSv2 tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suite `DataSourceV2SQLSuite`.

Closes apache#30449 from MaxGekk/insert-into-InMemoryPartitionTable.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-32381][CORE][FOLLOWUP][TEST-HADOOP2.7] Don't remove SerializableFileStatus and SerializableBlockLocation for Hadoop 2.7

### What changes were proposed in this pull request?

Revert the change in apache#29959 and don't remove `SerializableFileStatus` and `SerializableBlockLocation`.

### Why are the changes needed?

In Hadoop 2.7 `FileStatus` and `BlockLocation` are not serializable, so we still need the two wrapper classes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes apache#30447 from sunchao/SPARK-32381-followup.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* Revert "[SPARK-28704][SQL][TEST] Add back Skiped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+"

This reverts commit 47326ac.

* [SPARK-33463][SQL] Keep Job Id during incremental collect in Spark Thrift Server

### What changes were proposed in this pull request?

When enabling **spark.sql.thriftServer.incrementalCollect** Job Ids get lost and tracing queries in Spark Thrift Server ends up being too complicated.

### Why are the changes needed?

Because it will make easier tracing Spark Thrift Server queries.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The current tests are enough. No need of more tests.

Closes apache#30390 from gumartinm/master.

Authored-by: Gustavo Martin Morcuende <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-28704][SQL][TEST] Add back Skiped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+

### What changes were proposed in this pull request?
We skip test HiveExternalCatalogVersionsSuite when testing with JAVA_9 or later because our previous version does not support JAVA_9 or later. We now add it back since we have a version supports JAVA_9 or later.

### Why are the changes needed?

To recover test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Check CI logs.

Closes apache#30451 from AngersZhuuuu/SPARK-28704.

Authored-by: angerszhu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

### What changes were proposed in this pull request?

Two new options, _modifiiedBefore_  and _modifiedAfter_, is provided expecting a value in 'YYYY-MM-DDTHH:mm:ss' format.  _PartioningAwareFileIndex_ considers these options during the process of checking for files, just before considering applied _PathFilters_ such as `pathGlobFilter.`  In order to filter file results, a new PathFilter class was derived for this purpose.  General house-keeping around classes extending PathFilter was performed for neatness.  It became apparent support was needed to handle multiple potential path filters.  Logic was introduced for this purpose and the associated tests written.

### Why are the changes needed?

When loading files from a data source, there can often times be thousands of file within a respective file path.  In many cases I've seen, we want to start loading from a folder path and ideally be able to begin loading files having modification dates past a certain point.  This would mean out of thousands of potential files, only the ones with modification dates greater than the specified timestamp would be considered.  This saves a ton of time automatically and reduces significant complexity managing this in code.

### Does this PR introduce _any_ user-facing change?

This PR introduces an option that can be used with batch-based Spark file data sources.  A documentation update was made to reflect an example and usage of the new data source option.

**Example Usages**
_Load all CSV files modified after date:_
`spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()`

_Load all CSV files modified before date:_
`spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()`

_Load all CSV files modified between two dates:_
`spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load()
`

### How was this patch tested?

A handful of unit tests were added to support the positive, negative, and edge case code paths.

It's also live in a handful of our Databricks dev environments.  (quoted from cchighman)

Closes apache#30411 from HeartSaVioR/SPARK-31962.

Lead-authored-by: CC Highman <[email protected]>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>

* [SPARK-33469][SQL] Add current_timezone function

### What changes were proposed in this pull request?

Add a `CurrentTimeZone` function and replace the value at `Optimizer` side.

### Why are the changes needed?

Let user get current timezone easily. Then user can call
```
SELECT current_timezone()
```

Presto: https://prestodb.io/docs/current/functions/datetime.html
SQL Server: https://docs.microsoft.com/en-us/sql/t-sql/functions/current-timezone-transact-sql?view=sql-server-ver15

### Does this PR introduce _any_ user-facing change?

Yes, a new function.

### How was this patch tested?

Add test.

Closes apache#30400 from ulysses-you/SPARK-33469.

Lead-authored-by: ulysses <[email protected]>
Co-authored-by: ulysses-you <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-33512][BUILD] Upgrade test libraries

### What changes were proposed in this pull request?

This PR aims to update the test libraries.
- ScalaTest: 3.2.0 -> 3.2.3
- JUnit: 4.12 -> 4.13.1
- Mockito: 3.1.0 -> 3.4.6
- JMock: 2.8.4 -> 2.12.0
- maven-surefire-plugin: 3.0.0-M3 -> 3.0.0-M5
- scala-maven-plugin: 4.3.0 -> 4.4.0

### Why are the changes needed?

This will make the test frameworks up-to-date for Apache Spark 3.1.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes apache#30456 from dongjoon-hyun/SPARK-33512.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [MINOR][INFRA] Suppress warning in check-license

### What changes were proposed in this pull request?
This PR aims to suppress the warning `File exists` in check-license

### Why are the changes needed?

**BEFORE**
```
% dev/check-license
Attempting to fetch rat
RAT checks passed.

% dev/check-license
mkdir: target: File exists
RAT checks passed.
```

**AFTER**
```
% dev/check-license
Attempting to fetch rat
RAT checks passed.

% dev/check-license
RAT checks passed.
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually do dev/check-license twice.

Closes apache#30460 from williamhyun/checklicense.

Authored-by: William Hyun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-33427][SQL][FOLLOWUP] Put key and value into IdentityHashMap sequantially

### What changes were proposed in this pull request?

This follow-up fixes an issue when inserting key/value pairs into `IdentityHashMap` in `SubExprEvaluationRuntime`.

### Why are the changes needed?

The last commits to apache#30341 follows review comment to use `IdentityHashMap`. Because we leverage `IdentityHashMap` to compare keys in reference, we should not convert expression pairs to Scala map before inserting. Scala map compares keys by equality so we will loss keys with different references.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run benchmark to verify.

Closes apache#30459 from viirya/SPARK-33427-map.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-33143][PYTHON] Add configurable timeout to python server and client

### What changes were proposed in this pull request?
Spark creates local server to serialize several type of data for python. The python code tries to connect to the server, immediately after it's created but there are several system calls in between (this may change in each Spark version):
* getaddrinfo
* socket
* settimeout
* connect

Under some circumstances in heavy user environments these calls can be super slow (more than 15 seconds). These issues must be analyzed one-by-one but since these are system calls the underlying OS and/or DNS servers must be debugged and fixed. This is not trivial task and at the same time data processing must work somehow. In this PR I'm only intended to add a configuration possibility to increase the mentioned timeouts in order to be able to provide temporary workaround. The rootcause analysis is ongoing but I think this can vary in each case.

Because the server part doesn't contain huge amount of log entries to with one can measure time, I've added some.

### Why are the changes needed?
Provide workaround when localhost python server connection timeout appears.

### Does this PR introduce _any_ user-facing change?
Yes, new configuration added.

### How was this patch tested?
Existing unit tests + manual test.
```
#Compile Spark

echo "spark.io.encryption.enabled true" >> conf/spark-defaults.conf
echo "spark.python.authenticate.socketTimeout 10" >> conf/spark-defaults.conf

$ ./bin/pyspark
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
20/11/20 10:17:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/11/20 10:17:03 WARN SparkEnv: I/O encryption enabled without RPC encryption: keys will be visible on the wire.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Python version 3.8.5 (default, Jul 21 2020 10:48:26)
Spark context Web UI available at http://192.168.0.189:4040
Spark context available as 'sc' (master = local[*], app id = local-1605863824276).
SparkSession available as 'spark'.
>>> sc.setLogLevel("TRACE")
>>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
20/11/20 10:17:09 TRACE PythonParallelizeServer: Creating listening socket
20/11/20 10:17:09 TRACE PythonParallelizeServer: Setting timeout to 10 sec
20/11/20 10:17:09 TRACE PythonParallelizeServer: Waiting for connection on port 59726
20/11/20 10:17:09 TRACE PythonParallelizeServer: Connection accepted from address /127.0.0.1:59727
20/11/20 10:17:09 TRACE PythonParallelizeServer: Client authenticated
20/11/20 10:17:09 TRACE PythonParallelizeServer: Closing server
...
20/11/20 10:17:10 TRACE SocketFuncServer: Creating listening socket
20/11/20 10:17:10 TRACE SocketFuncServer: Setting timeout to 10 sec
20/11/20 10:17:10 TRACE SocketFuncServer: Waiting for connection on port 59735
20/11/20 10:17:10 TRACE SocketFuncServer: Connection accepted from address /127.0.0.1:59736
20/11/20 10:17:10 TRACE SocketFuncServer: Client authenticated
20/11/20 10:17:10 TRACE SocketFuncServer: Closing server
[[0], [2], [3], [4], [6]]
>>>
```

Closes apache#30389 from gaborgsomogyi/SPARK-33143.

Lead-authored-by: Gabor Somogyi <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-33510][BUILD] Update SBT to 1.4.4

### What changes were proposed in this pull request?
This PR aims to update SBT from 1.4.2 to 1.4.4.

### Why are the changes needed?

This will bring the latest bug fixes.
- https://github.com/sbt/sbt/releases/tag/v1.4.3
- https://github.com/sbt/sbt/releases/tag/v1.4.4

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes apache#30453 from williamhyun/sbt143.

Authored-by: William Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* Revert "[SPARK-32481][CORE][SQL] Support truncate table to move data to trash"

### What changes were proposed in this pull request?

This reverts commit 065f173, which is not part of any released version. That is, this is an unreleased feature

### Why are the changes needed?

I like the concept of Trash, but I think this PR might just resolve a very specific issue by introducing a mechanism without a proper design doc. This could make the usage more complex.

I think we need to consider the big picture. Trash directory is an important concept. If we decide to introduce it, we should consider all the code paths of Spark SQL that could delete the data, instead of Truncate only. We also need to consider what is the current behavior if the underlying file system does not provide the API `Trash.moveToAppropriateTrash`. Is the exception good? How about the performance when users are using the object store instead of HDFS? Will it impact the GDPR compliance?

In sum, I think we should not merge the PR apache#29552 without the design doc and implementation plan. That is why I reverted it before the code freeze of Spark 3.1

### Does this PR introduce _any_ user-facing change?
Reverted the original commit

### How was this patch tested?
The existing tests.

Closes apache#30463 from gatorsmile/revertSpark-32481.

Authored-by: Xiao Li <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-33515][SQL] Improve exception messages while handling UnresolvedTable

### What changes were proposed in this pull request?

This PR proposes to improve the exception messages while `UnresolvedTable` is handled based on this suggestion: apache#30321 (comment).

Currently, when an identifier is resolved to a view when a table is expected, the following exception message is displayed (e.g., for `COMMENT ON TABLE`):
```
v is a temp view not table.
```
After this PR, the message will be:
```
v is a temp view. 'COMMENT ON TABLE' expects a table.
```

Also, if an identifier is not resolved, the following exception message is currently used:
```
Table not found: t
```
After this PR, the message will be:
```
Table not found for 'COMMENT ON TABLE': t
```

### Why are the changes needed?

To improve the exception message.

### Does this PR introduce _any_ user-facing change?

Yes, the exception message will be changed as described above.

### How was this patch tested?

Updated existing tests.

Closes apache#30461 from imback82/unresolved_table_message.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-33511][SQL] Respect case sensitivity while resolving V2 partition specs

### What changes were proposed in this pull request?
1. Pre-process partition specs in `ResolvePartitionSpec`, and convert partition names according to the partition schema and the SQL config `spark.sql.caseSensitive`. In the PR, I propose to invoke `normalizePartitionSpec` for that. The function is used in DSv1 commands, so, the behavior will be similar to DSv1.
2. Move `normalizePartitionSpec()` from `sql/core/.../datasources/PartitioningUtils` to `sql/catalyst/.../util/PartitioningUtils` to use it in Catalyst's rule `ResolvePartitionSpec`

### Why are the changes needed?
DSv1 commands like `ALTER TABLE .. ADD PARTITION` and `ALTER TABLE .. DROP PARTITION` respect the SQL config `spark.sql.caseSensitive` while resolving partition specs. For example:
```sql
spark-sql> CREATE TABLE tbl1 (id bigint, data string) USING parquet PARTITIONED BY (id);
spark-sql> ALTER TABLE tbl1 ADD PARTITION (ID=1);
spark-sql> SHOW PARTITIONS tbl1;
id=1
```
The same command fails on V2 Table catalog with error:
```
AnalysisException: Partition key ID not exists
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, partition spec resolution works as for DSv1 (without the exception showed above).

### How was this patch tested?
By running `AlterTablePartitionV2SQLSuite`.

Closes apache#30454 from MaxGekk/partition-spec-case-sensitivity.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-33278][SQL][FOLLOWUP] Improve OptimizeWindowFunctions to avoid transfer first to nth_value

### What changes were proposed in this pull request?
apache#30178 provided `OptimizeWindowFunctions` used to transfer `first` to `nth_value`.
If the window frame is `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, `nth_value` has better performance than `first`.
But the `OptimizeWindowFunctions` need to exclude other window frame.

### Why are the changes needed?
 Improve `OptimizeWindowFunctions` to avoid transfer `first` to `nth_value` if the specified window frame isn't `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes apache#30419 from beliefer/SPARK-33278_followup.

Lead-authored-by: gengjiaan <[email protected]>
Co-authored-by: beliefer <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

Co-authored-by: Chao Sun <[email protected]>
Co-authored-by: Gabor Somogyi <[email protected]>
Co-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Venkata krishnan Sowrirajan <[email protected]>
Co-authored-by: Min Shen <[email protected]>
Co-authored-by: yangjie01 <[email protected]>
Co-authored-by: Max Gekk <[email protected]>
Co-authored-by: liucht <[email protected]>
Co-authored-by: ulysses <[email protected]>
Co-authored-by: angerszhu <[email protected]>
Co-authored-by: Ruifeng Zheng <[email protected]>
Co-authored-by: Huaxin Gao <[email protected]>
Co-authored-by: anchovYu <[email protected]>
Co-authored-by: anchovYu <[email protected]>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: Gustavo Martin Morcuende <[email protected]>
Co-authored-by: CC Highman <[email protected]>
Co-authored-by: William Hyun <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Xiao Li <[email protected]>
Co-authored-by: Terry Kim <[email protected]>
Co-authored-by: gengjiaan <[email protected]>
Co-authored-by: beliefer <[email protected]>
val allLeafStatuses = {
val (dirs, topLevelFiles) = filteredStatuses.partition(_.isDirectory)
val (dirs, topLevelFiles) = statuses.partition(_.isDirectory)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sunchao the dirs here may contain hidden directories. We still need to filter them before listing leaf files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gengliangwang you're right. Thanks for catching this! and sorry for introducing this regression.

HyukjinKwon pushed a commit that referenced this pull request Jan 14, 2021
…ition inference

### What changes were proposed in this pull request?

Fix a regression from #29959.

In Spark, the following file paths are considered as hidden paths and they are ignored on file reads:
1. starts with "_" and doesn't contain "="
2. starts with "."

However, after the refactoring PR #29959, the hidden paths are not filtered out on partition inference: https://github.com/apache/spark/pull/29959/files#r556432426

This PR is to fix the bug. To archive the goal, the method `InMemoryFileIndex.shouldFilterOut` is refactored as `HadoopFSUtils.shouldFilterOutPathName`

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a bug for reading file paths with partitions.

### How was this patch tested?

Unit test

Closes #31169 from gengliangwang/fileListingBug.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jan 14, 2021
…ition inference

### What changes were proposed in this pull request?

Fix a regression from #29959.

In Spark, the following file paths are considered as hidden paths and they are ignored on file reads:
1. starts with "_" and doesn't contain "="
2. starts with "."

However, after the refactoring PR #29959, the hidden paths are not filtered out on partition inference: https://github.com/apache/spark/pull/29959/files#r556432426

This PR is to fix the bug. To archive the goal, the method `InMemoryFileIndex.shouldFilterOut` is refactored as `HadoopFSUtils.shouldFilterOutPathName`

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a bug for reading file paths with partitions.

### How was this patch tested?

Unit test

Closes #31169 from gengliangwang/fileListingBug.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 467d758)
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants