Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23268][SQL]Reorganize packages in data source V2 #20435

Closed
wants to merge 8 commits into from

Conversation

gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Jan 30, 2018

What changes were proposed in this pull request?

  1. create a new package for partitioning/distribution related classes.
    As Spark will add new concrete implementations of Distribution in new releases, it is good to
    have a new package for partitioning/distribution related classes.

  2. move streaming related class to package org.apache.spark.sql.sources.v2.reader/writer.streaming, instead of org.apache.spark.sql.sources.v2.streaming.reader/writer.
    So that the there won't be package reader/writer inside package streaming, which is quite confusing.
    Before change:

v2
├── reader
├── streaming
│   ├── reader
│   └── writer
└── writer

After change:

v2
├── reader
│   └── streaming
└── writer
    └── streaming

How was this patch tested?

Unit test.

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86814 has finished for PR 20435 at commit 3dc5622.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86817 has finished for PR 20435 at commit 719cae2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long])

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86820 has finished for PR 20435 at commit 17f8a5e.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86824 has finished for PR 20435 at commit c7c0a1d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long])

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86827 has finished for PR 20435 at commit d272952.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86830 has finished for PR 20435 at commit f451194.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

cc @jose-torres

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86832 has finished for PR 20435 at commit 0b6b59e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

@gatorsmile
Copy link
Member

cc @zsxwing @marmbrus too

@jose-torres
Copy link
Contributor

Streaming part LGTM; I have no particular opinion or context on the distribution stuff.

@gengliangwang
Copy link
Member Author

retest this please

@gatorsmile
Copy link
Member

LGTM to adding the new package of partitioning/distribution.

@zsxwing
Copy link
Member

zsxwing commented Jan 30, 2018

cc @marmbrus

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86839 has finished for PR 20435 at commit 0b6b59e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86838 has finished for PR 20435 at commit 0b6b59e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86840 has finished for PR 20435 at commit 0b6b59e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -20,14 +20,16 @@ package org.apache.spark.sql.kafka010
import org.apache.kafka.common.TopicPartition

import org.apache.spark.sql.execution.streaming.{Offset, SerializedOffset}
import org.apache.spark.sql.sources.v2.streaming.reader.{Offset => OffsetV2, PartitionOffset}
import org.apache.spark.sql.sources.v2.reader.streaming
import org.apache.spark.sql.sources.v2.reader.streaming.PartitionOffset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep it same as before?

import org.apache.spark.sql.sources.v2.reader.streaming.{Offset => OffsetV2, PartitionOffset}

@@ -20,14 +20,16 @@ package org.apache.spark.sql.kafka010
import org.apache.kafka.common.TopicPartition

import org.apache.spark.sql.execution.streaming.{Offset, SerializedOffset}
import org.apache.spark.sql.sources.v2.streaming.reader.{Offset => OffsetV2, PartitionOffset}
import org.apache.spark.sql.sources.v2.reader.streaming
import org.apache.spark.sql.sources.v2.reader.streaming.PartitionOffset

/**
* An [[Offset]] for the [[KafkaSource]]. This one tracks all partitions of subscribed topics and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this Offset be streaming.Offset?

import org.apache.spark.sql.sources.v2.streaming.StreamWriteSupport
import org.apache.spark.sql.sources.v2.streaming.writer.StreamWriter
import org.apache.spark.sql.sources.v2.writer._
import org.apache.spark.sql.sources.v2.writer.{StreamWriteSupport, _}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import org.apache.spark.sql.sources.v2.writer._?


/**
* An [[Offset]] for the [[KafkaSource]]. This one tracks all partitions of subscribed topics and
* their offsets.
*/
private[kafka010]
case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long]) extends OffsetV2 {
case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long])
extends OffsetV2 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary change?

@@ -403,7 +403,7 @@ class MicroBatchExecution(
val current = committedOffsets.get(reader).map(off => reader.deserializeOffset(off.json))
reader.setOffsetRange(
toJava(current),
Optional.of(available.asInstanceOf[OffsetV2]))
Optional.of(available.asInstanceOf[streaming.Offset]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we still use OffsetV2?

@SparkQA
Copy link

SparkQA commented Jan 31, 2018

Test build #86873 has finished for PR 20435 at commit e609060.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 31, 2018

Test build #86879 has finished for PR 20435 at commit 1d90cf1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long]) extends OffsetV2

@cloud-fan
Copy link
Contributor

LGTM, also cc @rdblue

@gatorsmile
Copy link
Member

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Feb 1, 2018
## What changes were proposed in this pull request?
1. create a new package for partitioning/distribution related classes.
    As Spark will add new concrete implementations of `Distribution` in new releases, it is good to
    have a new package for partitioning/distribution related classes.

2. move streaming related class to package `org.apache.spark.sql.sources.v2.reader/writer.streaming`, instead of `org.apache.spark.sql.sources.v2.streaming.reader/writer`.
So that the there won't be package reader/writer inside package streaming, which is quite confusing.
Before change:
```
v2
├── reader
├── streaming
│   ├── reader
│   └── writer
└── writer
```

After change:
```
v2
├── reader
│   └── streaming
└── writer
    └── streaming
```
## How was this patch tested?
Unit test.

Author: Wang Gengliang <[email protected]>

Closes #20435 from gengliangwang/new_pkg.

(cherry picked from commit 56ae326)
Signed-off-by: gatorsmile <[email protected]>
@asfgit asfgit closed this in 56ae326 Feb 1, 2018
asfgit pushed a commit that referenced this pull request Feb 8, 2018
## What changes were proposed in this pull request?

This is a followup of #20435.

While reorganizing the packages for streaming data source v2, the top level stream read/write support interfaces should not be in the reader/writer package, but should be in the `sources.v2` package, to follow the `ReadSupport`, `WriteSupport`, etc.

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes #20509 from cloud-fan/followup.

(cherry picked from commit a75f927)
Signed-off-by: Wenchen Fan <[email protected]>
asfgit pushed a commit that referenced this pull request Feb 8, 2018
## What changes were proposed in this pull request?

This is a followup of #20435.

While reorganizing the packages for streaming data source v2, the top level stream read/write support interfaces should not be in the reader/writer package, but should be in the `sources.v2` package, to follow the `ReadSupport`, `WriteSupport`, etc.

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes #20509 from cloud-fan/followup.
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018
## What changes were proposed in this pull request?

This is a followup of apache#20435.

While reorganizing the packages for streaming data source v2, the top level stream read/write support interfaces should not be in the reader/writer package, but should be in the `sources.v2` package, to follow the `ReadSupport`, `WriteSupport`, etc.

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes apache#20509 from cloud-fan/followup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants