[SPARK-15689][SQL] data source v2 read path #19136
Conversation
 * 2. propagate information upward to Spark, e.g., report statistics, report ordering, etc.
 * Spark first applies all operator push down optimizations which this data source supports. Then
 * Spark collects information this data source provides for further optimizations. Finally Spark
 * issues the scan request and does the actual data reading.
TODO: this is not true now, as we push down operators at the planning phase. We need to do some refactoring and move it to the optimizer phase.
This would be really nice imho.
Test build #81412 has finished for PR 19136 at commit
Force-pushed from 543a40b to a824d44.
Test build #81441 has finished for PR 19136 at commit
retest this please
Test build #81466 has finished for PR 19136 at commit
Thanks for pinging me. I left comments on the older PR, since other discussion was already there. If you'd prefer comments here, just let me know.
Force-pushed from a824d44 to 89cbfb7.
/**
 * A variant of `DataSourceV2` which requires users to provide a schema when reading data. A data
 * source can inherit both `DataSourceV2` and `SchemaRequiredDataSourceV2` if it supports both schema
 * inference and user-specified schemas.
cc @rdblue for the new API of schema inference.
Test build #81490 has finished for PR 19136 at commit
 */
@Experimental
@InterfaceStability.Unstable
public List<ReadTask<UnsafeRow>> createUnsafeRowReadTasks() {
I really like the new API's flexibility to implement the different types of support. Considering UnsafeRow is unstable, would it be possible to move createUnsafeRowReadTasks to a different interface? That would let a data source implement two variants, one with Row and another with UnsafeRow, and make it easily configurable based on the Spark version.
 * Adds one more entry to the options.
 * This should only be called by Spark, not data source implementations.
 */
public void addOption(String key, String value) {
The check added for addOption protects against modifying the options passed to the data source, but a data source can still add new options by accident. I think it might be safer to pass a DataSourceV2Options that is unmodifiable by the data source.
good point, I'll make it immutable.
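As a side note for readers, here is a minimal sketch of the immutable shape being agreed on (the class name and details are illustrative only, not the PR's actual implementation): copy the options once at construction and expose only read accessors, so neither Spark nor the data source can mutate them afterwards.

import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a case-insensitive options holder that cannot be modified after
// construction, which is the behavior being requested for DataSourceV2Options.
public final class ImmutableOptionsSketch {
  private final Map<String, String> map;

  public ImmutableOptionsSketch(Map<String, String> original) {
    Map<String, String> copy = new HashMap<>();
    // normalize keys once so lookups are case-insensitive
    original.forEach((k, v) -> copy.put(k.toLowerCase(Locale.ROOT), v));
    this.map = Collections.unmodifiableMap(copy);
  }

  public String get(String key) {
    return map.get(key.toLowerCase(Locale.ROOT));
  }
}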
Force-pushed from ee5faf1 to 182b89d.
 */
@Experimental
@InterfaceStability.Unstable
public interface UnsafeRowScan extends DataSourceV2Reader {
cc @j-baker for the new unsafe row scan API. Programmatically, the unsafe row scan should be in the base class and the normal row scan in the child class. However, conceptually for a developer, the normal row scan is the basic interface and should be in the base class; the unsafe row scan is more of an add-on and should be in the child class.
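A sketch of how the mix-in reads from the implementer's side, using the interfaces quoted in this thread. The Row-based method name createReadTasks and the package locations are assumptions made for illustration (only DataSourceV2Reader's import is shown verbatim elsewhere in this conversation), so treat the signatures as approximate:

import java.util.Collections;
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
import org.apache.spark.sql.sources.v2.reader.ReadTask;
import org.apache.spark.sql.sources.v2.reader.UnsafeRowScan;
import org.apache.spark.sql.types.StructType;

// UnsafeRowScan extends DataSourceV2Reader (see the excerpt above), so implementing the
// mix-in opts this reader into the UnsafeRow path while keeping the basic Row-based API.
public class MyReader implements UnsafeRowScan {

  @Override
  public StructType readSchema() {
    return new StructType().add("i", "int");
  }

  @Override
  public List<ReadTask<Row>> createReadTasks() {
    // placeholder: the Row-based partitions of this source
    return Collections.emptyList();
  }

  @Override
  public List<ReadTask<UnsafeRow>> createUnsafeRowReadTasks() {
    // placeholder: the UnsafeRow-based partitions Spark will actually use
    return Collections.emptyList();
  }
}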
 * task will always run on these locations. Implementations should make sure that it can
 * be run on any location.
 */
default String[] preferredLocations() {
What format are these strings expected to be in? If Spark will be placing this ReadTask onto an executor that is a preferred location, the format will need to be a documented part of the API.
Are there levels of preference, or only binary? I'm thinking node vs rack vs datacenter for on-prem clusters, or instance vs AZ vs region for cloud clusters.
These have previously only been ip/hostnames. To match the RDD definition I think we would have to continue with that.
This API matches RDD.preferredLocations directly; I'll add more documentation here.
can we have a class Host which represents this? Just makes the API more clear.
hmmm, do you mean create a Host class which only has a string field?
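A hypothetical sketch of that suggestion, just to make the trade-off concrete (this class does not exist in the PR): a wrapper whose only job is to make the expected string format part of the type.

// Hypothetical: a value type that documents "this string is a host name or IP address",
// matching the semantics of RDD.preferredLocations.
public final class Host {
  private final String name;

  public Host(String name) {
    this.name = name;
  }

  public String name() {
    return name;
  }

  @Override
  public String toString() {
    return name;
  }
}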
 * A mix in interface for `DataSourceV2Reader`. Users can implement this interface to report
 * statistics to Spark.
 */
public interface StatisticsSupport {
Some data sources have per-column statistics, like how many bytes a column has or its min/max (e.g. things required for CBO).
Should that be a separate interface from this one?
I'd like to put column stats in a separate interface, because we already separate basic stats and column stats in ANALYZE TABLE.
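Purely for illustration, one possible shape for such a separate mix-in (none of these names exist in this PR; they are made up to show the split between basic and column stats):

import java.util.Map;
import java.util.OptionalLong;

// Hypothetical: column-level statistics kept apart from the basic Statistics interface,
// mirroring how ANALYZE TABLE separates basic stats from column stats.
interface ColumnStatisticsSupport {
  Map<String, ColumnStatisticsSketch> columnStatistics();
}

// Hypothetical per-column estimates; all optional, since a source may know only some of them.
interface ColumnStatisticsSketch {
  OptionalLong distinctCount();
  OptionalLong nullCount();
  OptionalLong maxSizeInBytes();
}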
Test build #81506 has finished for PR 19136 at commit
Test build #81510 has finished for PR 19136 at commit
 *
 * @param schema the full schema of this data source reader. Full schema usually maps to the
 *               physical schema of the underlying storage of this data source reader, e.g.
 *               parquet files, JDBC tables, etc, while this reader may not read data with full
Maybe update the doc here, since JDBC sources and Parquet files probably shouldn't implement this. CSV and JSON are the examples that come to mind for sources that require a schema.
object DataSourceV2Relation {
  def apply(reader: DataSourceV2Reader): DataSourceV2Relation = {
    new DataSourceV2Relation(reader.readSchema().toAttributes, reader)
Is this the right schema? The docs for readSchema say it is the result of pushdown and projection, which doesn't seem appropriate for a Relation. Does relation represent a table that can be filtered and projected, or does it represent a single read? At least in the Hive read path, it's a table.
On the other hand, it is much better to compute stats on a relation that's already filtered.
I left it as a TODO as it needs some refactoring of the optimizer. For now DataSourceV2Relation represents a data source without any optimization: we do these optimizations during planning. This is also a problem for data source v1, and that's why we implement partition pruning as an optimizer rule instead of inside the data source: we need to update the stats.
Are you saying that partition pruning isn't delegated to the data source in this interface?
I was just looking into how the data source should provide partition data, or at least fields that are the same for all rows in a ReadTask. It would be nice to have a way to pass those up instead of materializing them in each UnsafeRow.
In data source V2, we will delegate partition pruning to the data source, although we need to do some refactoring to make it happen.

> I was just looking into how the data source should provide partition data, or at least fields that are the same for all rows in a ReadTask. It would be nice to have a way to pass those up instead of materializing them in each UnsafeRow.

This can be achieved by the columnar reader. Think about a data source having a data column i and a partition column j: the returned columnar batch has 2 column vectors, for i and j. Column vector i is a normal one that contains all the values of column i within this batch, while column vector j is a constant vector that only contains a single value.
I think we should add a way to provide partition values outside of the columnar reader. It wouldn't be too difficult to add a method on ReadTask that returns them, then create a joined row in the scan exec. Otherwise, this requires a lot of wasted memory for a scan.
I think users can write a special ReadTask to do it, but we can't save memory by doing this. When an operator (the scan operator) transfers data to another operator, the data must be UnsafeRows. So even if users return a joined row in ReadTask, Spark needs to convert it to UnsafeRow.
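For concreteness, one hypothetical shape for the suggestion above (nothing like this exists in this PR; all names are made up): the task exposes the values that are constant for all of its rows, and the scan exec joins them on once per task instead of the source materializing them per row.

import org.apache.spark.sql.types.StructType;

// Hypothetical mix-in for a ReadTask: per-task constant (e.g. partition) columns that the
// scan exec could append to every row produced by this task.
interface SupportsConstantColumns {
  // schema of the constant columns for this task
  StructType constantSchema();

  // one value per field of constantSchema(), shared by every row produced by this task
  Object[] constantValues();
}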
case r: CatalystFilterPushDownSupport =>
  r.pushCatalystFilters(filters.toArray)

case r: FilterPushDownSupport =>
Looks like CatalystFilterPushDownSupport and FilterPushDownSupport are exclusive? But we can't prevent users from implementing them both?
Yea, we can't prevent users from implementing them both, and we will pick CatalystFilterPushDownSupport over FilterPushDownSupport. Let me document it.
Can FilterPushDownSupport be an interface which extends CatalystFilterPushDownSupport and provides a default impl of pruning the catalyst filter? Like, this code can just go there as a method:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Stream;
import static java.util.stream.Collectors.toList;

interface FilterPushDownSupport extends CatalystFilterPushDownSupport {

  List<Filter> pushFilters(List<Filter> filters);

  default List<Expression> pushCatalystFilters(List<Expression> filters) {
    Map<Filter, Expression> translatedMap = new HashMap<>();
    List<Expression> nonconvertiblePredicates = new ArrayList<>();

    // Translate each catalyst expression into a public Filter where possible.
    // (In Spark, DataSourceStrategy.translateFilter returns a Scala Option; adapt as needed.)
    for (Expression catalystFilter : filters) {
      Optional<Filter> translatedFilter = DataSourceStrategy.translateFilter(catalystFilter);
      if (translatedFilter.isPresent()) {
        translatedMap.put(translatedFilter.get(), catalystFilter);
      } else {
        nonconvertiblePredicates.add(catalystFilter);
      }
    }

    // Push the translated filters, then map whatever the source could not handle back to
    // the original expressions and return them together with the untranslatable ones.
    List<Filter> unhandledFilters = pushFilters(new ArrayList<>(translatedMap.keySet()));
    return Stream.concat(
        nonconvertiblePredicates.stream(),
        unhandledFilters.stream().map(translatedMap::get))
      .collect(toList());
  }
}
and we can trivially ignore the interface confusion (it's truly confusing if you can implement two interfaces)
like, we might as well not document it if the code can document it
good idea!
By doing so, do we still need to match both CatalystFilterPushDownSupport and FilterPushDownSupport here?
After some attempts, I went back to 2 individual interfaces. The reasons are: a) CatalystFilterPushDownSupport is an unstable interface, and it looks weird to let a stable interface extend an unstable one; b) the logic that converts expressions to public filters belongs to Spark internals, and we may change it in the future, so we should not put that code in a public interface. We would risk breaking compatibility for this interface.
// Match original case of attributes.
// TODO: nested fields pruning
val requiredColumns = (projectSet ++ filterSet).toSeq.map(attrMap)
Do we need to request columns that are only referenced by pushed filters?
It seems reasonable to only request the ones that will be used, or that have residuals after pushing filters.
import java.io.Closeable;

/**
 * A data reader returned by a read task and is responsible for outputting data for an RDD
Nit: an -> a
 * 1. push operators downward to the data source, e.g., column pruning, filter push down, etc.
 * 2. propagate information upward to Spark, e.g., report statistics, report ordering, etc.
 * 3. special scans like columnar scan, unsafe row scan, etc. Note that a data source reader can
 *    at most implement one special scan.
at most implement one -> implement at most one
 * 3. special scans like columnar scan, unsafe row scan, etc. Note that a data source reader can
 *    at most implement one special scan.
 *
 * Spark first applies all operator push down optimizations which this data source supports. Then
push down -> push-down
StructType readSchema();

/**
 * Returns a list of read tasks, each task is responsible for outputting data for one RDD
, each -> . Each
/**
 * Returns a list of read tasks, each task is responsible for outputting data for one RDD
 * partition, which means the number of tasks returned here is same as the number of RDD
, which means -> . That means
/**
 * A mix-in interface for `DataSourceV2Reader`. Users can implement this interface to push down
 * arbitrary expressions as predicates to the data source.
Note that, this is an experimental and unstable interface
/**
 * A mix-in interface for `DataSourceV2Reader`. Users can implement this interface to only read
 * required columns/nested fields during scan.
-> the required
/**
 * Apply column pruning w.r.t. the given requiredSchema.
 *
 * Implementation should try its best to prune unnecessary columns/nested fields, but it's also
the unnecessary
public interface FilterPushDownSupport {

  /**
   * Push down filters, returns unsupported filters.
Pushes down filters, and returns unsupported filters.
 * statistics to Spark.
 */
public interface StatisticsSupport {
  Statistics getStatistics();
Will the returned stats be adjusted by the data sources based on the operator push-down?
It should, but we need some refactoring of the optimizer, see #19136 (comment)
/**
 * Push down filters, returns unsupported filters.
 */
Expression[] pushCatalystFilters(Expression[] filters);
Any chance this could push Java lists? They're just more idiomatic in a Java interface.
Java lists are not friendly to Scala implementations :)
 * An interface to represent statistics for a data source.
 */
public interface Statistics {
  long sizeInBytes();
OptionalLong for sizeInBytes? It's not obvious that sizeInBytes is well defined for e.g. JDBC datasources, but row count can generally be easily estimated from the query plan.
like, I get that it's non-optional at the moment, but it's odd that we have a method that the normal implementor will have to replace with
public long sizeInBytes() {
return Long.MAX_VALUE;
}
and now is a good time to fix it :)
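A sketch of the shape being asked for; a later revision of this PR does import java.util.OptionalLong (see the excerpt further down), and numRows is included here only as the other estimate mentioned above, so treat the exact method set as approximate:

import java.util.OptionalLong;

// Optional estimates instead of sentinel values like Long.MAX_VALUE: a source that cannot
// estimate its size simply returns OptionalLong.empty().
public interface Statistics {
  OptionalLong sizeInBytes();
  OptionalLong numRows();
}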
/**
 * Returns the option value to which the specified key is mapped, case-insensitively,
 * or {@code null} if there is no mapping for the key.
Can we return Optional<String> here? JDK maintainers wish they could return Optional on Map.
 * Returns the option value to which the specified key is mapped, case-insensitively,
 * or {@code defaultValue} if there is no mapping for the key.
 */
public String getOrDefault(String key, String defaultValue) {
If the above returns Optional, you probably don't need this method.
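A small call-site sketch of why the Optional shape removes the need for getOrDefault; a plain Map stands in for the options class so the snippet stays self-contained, and the option name and default are made up:

import java.util.Map;
import java.util.Optional;

class OptionsCallSiteSketch {
  // With an Optional-returning accessor, getOrDefault collapses into orElse at the call site.
  static int batchSize(Map<String, String> options) {
    return Optional.ofNullable(options.get("batchSize"))  // hypothetical option name
        .map(Integer::parseInt)
        .orElse(1024);                                     // hypothetical default
  }
}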
Force-pushed from 1e86d5c to abcc606.
Test build #81717 has finished for PR 19136 at commit
Test build #81722 has finished for PR 19136 at commit
Test build #81723 has finished for PR 19136 at commit
import java.util.OptionalLong;

/**
 * An interface to represent statistics for a data source.
link back to SupportsReportStatistics
Test build #81773 has finished for PR 19136 at commit
any more comments? is it ready to go?
 * constructor.
 *
 * Note that this is an empty interface, data source implementations should mix-in at least one of
 * the plug-in interfaces like `ReadSupport`. Otherwise it's just a dummy data source which is
use an actual link ...
import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;

/**
 * A mix-in interface for `DataSourceV2`. Users can implement this interface to provide data reading
Users -> data source implementers
Actually a better one is "Data sources can implement"
 * source can implement both `ReadSupport` and `ReadSupportWithSchema` if it supports both schema
 * inference and user-specified schema.
 */
public interface ReadSupportWithSchema {
I still find ReadSupport vs ReadSupportWithSchema pretty confusing. But let's address that separately.
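For readers comparing the two, the difference boils down to who supplies the schema. A sketch of the pair as discussed in this thread, with signatures reconstructed from the surrounding excerpts and the options class's package assumed, so treat them as approximate:

import org.apache.spark.sql.sources.v2.DataSourceV2Options;       // package assumed
import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;
import org.apache.spark.sql.types.StructType;

// Schema is inferred by the source itself.
interface ReadSupport {
  DataSourceV2Reader createReader(DataSourceV2Options options);
}

// Schema is supplied by the user, e.g. for CSV/JSON sources that cannot infer one cheaply.
interface ReadSupportWithSchema {
  DataSourceV2Reader createReader(StructType schema, DataSourceV2Options options);
}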
/**
 * An interface to represent statistics for a data source, which is returned by
 * `SupportsReportStatistics`.
also use @link
class DataSourceRDD(
    sc: SparkContext,
    @transient private val generators: java.util.List[ReadTask[UnsafeRow]])
Why is this called generators?
LGTM. Still some feedback that can be addressed later. We should also document all the APIs as Evolving.
Test build #81809 has finished for PR 19136 at commit
thank you all for the review, merging to master!
As we discussed in apache#19136 (comment), we should push down operators to the data source before planning, so that the data source can report statistics more accurately. This PR also includes some cleanup for the read path. Existing tests. Author: Wenchen Fan <[email protected]> Closes apache#19424 from cloud-fan/follow.
What changes were proposed in this pull request?
This PR adds the infrastructure for data source v2 and implements the features Spark already has in data source v1, i.e. column pruning, filter push down, catalyst expression filter push down, InternalRow scan, schema inference, and data size reporting. The write path is excluded to avoid making this PR grow too big, and will be added in a follow-up PR.
How was this patch tested?
new tests
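To tie the thread together, here is a minimal sketch of what a trivial source could look like under this read path. It reuses the names quoted in this conversation (DataSourceV2, ReadSupport, DataSourceV2Options, DataSourceV2Reader, ReadTask, DataReader); createReadTasks and the exact package locations are assumptions, so treat this as an approximation of the API described in the thread rather than the merged version.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.sources.v2.DataSourceV2;
import org.apache.spark.sql.sources.v2.DataSourceV2Options;
import org.apache.spark.sql.sources.v2.ReadSupport;
import org.apache.spark.sql.sources.v2.reader.DataReader;
import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;
import org.apache.spark.sql.sources.v2.reader.ReadTask;
import org.apache.spark.sql.types.StructType;

// A toy source that produces two partitions of (i, j) pairs and ignores all options.
public class SimpleRangeSource implements DataSourceV2, ReadSupport {

  @Override
  public DataSourceV2Reader createReader(DataSourceV2Options options) {
    return new Reader();
  }

  static class Reader implements DataSourceV2Reader {
    @Override
    public StructType readSchema() {
      return new StructType().add("i", "int").add("j", "int");
    }

    @Override
    public List<ReadTask<Row>> createReadTasks() {
      // two RDD partitions: [0, 5) and [5, 10)
      return Arrays.<ReadTask<Row>>asList(new RangeTask(0, 5), new RangeTask(5, 10));
    }
  }

  // One object plays both roles for brevity: the serializable task description and the
  // per-partition reader created from it on the executor.
  static class RangeTask implements ReadTask<Row>, DataReader<Row> {
    private final int start;
    private final int end;
    private int current;

    RangeTask(int start, int end) {
      this.start = start;
      this.end = end;
      this.current = start - 1;
    }

    @Override
    public DataReader<Row> createReader() {
      // each executor gets a fresh reader over [start, end)
      return new RangeTask(start, end);
    }

    @Override
    public boolean next() {
      current += 1;
      return current < end;
    }

    @Override
    public Row get() {
      return RowFactory.create(current, -current);
    }

    @Override
    public void close() { }
  }
}

Registered by its full class name, a source like this would be loaded with something along the lines of spark.read.format("com.example.SimpleRangeSource").load(), assuming the DataFrameReader wiring added in this PR; the package and class names here are made up.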