Add InputSource and InputFormat interfaces #8823

jihoonson · 2019-11-05T06:40:07Z

Description

This is the first PR for #8812. It adds the new interfaces proposed there, along with a few implementations: LocalInputSource and HttpInputSource for InputSource, and CsvInputFormat and JsonInputFormat for InputFormat. Their JSON formats are:

"inputSource": {
  "type" : "local",
  "baseDir" : "/path/to/dir",
  "filter" : "your filter"
}

"inputSource": {
  "type" : "http",
  "uris" : ["http://example.com/uri1", "http://example.com/uri2"],
  "httpAuthenticationUsername": "username",
  "httpAuthenticationPassword": "password provider"
}

"inputFormat": {
  "type": "csv",
  "columns": [ "col1", "col2", "col3" ],
  "listDelimiter": "|",
  "findColumnsFromHeader" : true,
  "skipHeaderRows" : 3
}

"inputFormat": {
  "type": "json",
  "flattenSpec": {
    // your flatten spec
  }
}

Note that both inputSource and inputFormat are in ioConfig as below:

"ioConfig": {
  "type" : "index" or "index_parallel",
  "inputSource" : {
    // your input source
  },
  "inputFormat": {
    // your input format
  }
}

These are currently supported only by native batch indexing tasks. The sampler doesn't support them yet.

The old firehose and parser parameters should still work, but you cannot mix them. Only the combinations of firehose + parser or inputSource + inputFormat are allowed.
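The rule above can be sketched as a small validation routine. This is only an illustration: the class `SpecCombinationSketch` and its `validate` method are hypothetical, and plain `Object` parameters stand in for the real firehose, parser, inputSource, and inputFormat types. The actual checks in this PR live in the task and ingestion spec classes and may differ in detail.

```java
// Hypothetical sketch of the "no mixing" rule; not the code added by this PR.
final class SpecCombinationSketch
{
  private SpecCombinationSketch()
  {
  }

  /**
   * Returns "legacy" for firehose + parser and "new" for inputSource + inputFormat.
   * Throws IllegalArgumentException for any mixed or incomplete combination.
   * All parameters are placeholders for the real spec objects (assumption).
   */
  static String validate(Object firehose, Object parser, Object inputSource, Object inputFormat)
  {
    final boolean usesLegacy = firehose != null || parser != null;
    final boolean usesNew = inputSource != null || inputFormat != null;
    if (usesLegacy && usesNew) {
      throw new IllegalArgumentException("Cannot mix firehose/parser with inputSource/inputFormat");
    }
    if (firehose != null && parser != null) {
      return "legacy";
    }
    if (inputSource != null && inputFormat != null) {
      return "new";
    }
    throw new IllegalArgumentException("Need either firehose + parser or inputSource + inputFormat");
  }
}
```

Calling `validate` with a firehose and an inputFormat, for example, falls into the first branch and is rejected, which mirrors the error messages discussed in the review comments further down.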

Documentation will be added after more inputSources and inputFormats are implemented in follow-up PRs.

This PR has:

- been self-reviewed.
  - using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- added documentation for new or modified features or behaviors.
- added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- added or updated version, license, or notice information in licenses.yaml
- added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- added unit tests or modified existing tests to cover new code paths.
- added integration tests.
- been tested in a test Druid cluster.

vogievetsky · 2019-11-05T07:36:16Z

This is amazing.

lgtm-com · 2019-11-06T01:23:19Z

This pull request fixes 1 alert when merging d451582 into 3b602da - view on LGTM.com

fixed alerts:

1 for Dereferenced variable may be null

fjy · 2019-11-06T01:35:03Z

One of the most exciting PRs on Druid ingestion in a while. Glad we got it out.

lgtm-com · 2019-11-06T20:38:35Z

This pull request fixes 1 alert when merging b7c8b87 into 5c0fc0a - view on LGTM.com

fixed alerts:

1 for Dereferenced variable may be null

lgtm-com · 2019-11-06T23:48:29Z

This pull request fixes 1 alert when merging e942a21 into 517c146 - view on LGTM.com

fixed alerts:

1 for Dereferenced variable may be null

ccaominh · 2019-11-05T20:30:05Z

core/src/main/java/org/apache/druid/data/input/FirehoseFactoryToInputSourceAdaptor.java

+    if (firehoseFactory.isSplittable()) {
+      return ((FiniteFirehoseFactory) firehoseFactory).getSplits(splitHintSpec);
+    } else {
+      throw new UnsupportedOperationException();


Is supporting unsplittable Firehoses future work?

No, only a splittable firehose can create splits.
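The behaviour discussed in this thread can be illustrated with a stripped-down adaptor. The interfaces `SketchFirehoseFactory` and `SketchFiniteFirehoseFactory` and the class `SketchAdaptor` are hypothetical stand-ins for the Druid types in the diff, and splits are modelled as plain strings; only the shape of the `isSplittable()` check follows the snippet above.

```java
import java.util.stream.Stream;

// Hypothetical stand-in for FirehoseFactory: not splittable unless overridden.
interface SketchFirehoseFactory
{
  default boolean isSplittable()
  {
    return false;
  }
}

// Hypothetical stand-in for FiniteFirehoseFactory: always splittable.
interface SketchFiniteFirehoseFactory extends SketchFirehoseFactory
{
  @Override
  default boolean isSplittable()
  {
    return true;
  }

  Stream<String> getSplits();
}

final class SketchAdaptor
{
  private final SketchFirehoseFactory firehoseFactory;

  SketchAdaptor(SketchFirehoseFactory firehoseFactory)
  {
    this.firehoseFactory = firehoseFactory;
  }

  // Delegates to the wrapped factory when it can create splits, otherwise fails fast.
  Stream<String> createSplits()
  {
    if (firehoseFactory.isSplittable()) {
      return ((SketchFiniteFirehoseFactory) firehoseFactory).getSplits();
    }
    throw new UnsupportedOperationException("firehose is not splittable");
  }
}
```

In this sketch a non-splittable factory is still a valid input source for sequential reading; it just cannot be divided among parallel subtasks.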

ccaominh · 2019-11-06T21:37:21Z

core/src/test/java/org/apache/druid/data/input/FirehoseFactoryToInputSourceAdaptorTest.java

+    }
+  }
+
+  private static class TestCsvParseSpec extends CSVParseSpec


Suggestion: Rename the class to something like UnimplementedInputFormatCsvParseSpec. Currently, looking at just the body of testUnimplementedInputFormat, it's not apparent where the unimplemented input format is coming from.

👍 renamed.

ccaominh · 2019-11-06T21:50:22Z

core/src/main/java/org/apache/druid/data/input/impl/TimestampSpec.java

@@ -57,10 +58,10 @@

  @JsonCreator
  public TimestampSpec(
-      @JsonProperty("column") String timestampColumn,
-      @JsonProperty("format") String format,
+      @JsonProperty("column") @Nullable String timestampColumn,


Thanks for adding!

ccaominh · 2019-11-06T21:56:39Z

indexing-hadoop/src/main/java/org/apache/druid/indexer/HadoopDruidIndexerConfig.java

    Preconditions.checkNotNull(schema.getDataSchema().getParser().getParseSpec(), "parseSpec");
-    Preconditions.checkNotNull(schema.getDataSchema().getParser().getParseSpec().getTimestampSpec(), "timestampSpec");
+    Preconditions.checkNotNull(schema.getDataSchema().getNonNullTimestampSpec(), "timestampSpec");


Checking this one for null seems redundant

ccaominh · 2019-11-06T22:00:41Z

indexing-service/src/main/java/org/apache/druid/indexing/common/TaskToolbox.java

@@ -294,7 +294,7 @@ public IndexMergerV9 getIndexMergerV9()
    return indexMergerV9;
  }

-  public File getFirehoseTemporaryDir()
+  public File getIndexingTmpDir()
  {
    return new File(taskWorkDir, "firehose");


Perhaps rename the temporary directory as well

Renamed to indexing-tmp.

ccaominh · 2019-11-08T01:00:41Z

indexing-service/src/main/java/org/apache/druid/indexing/common/task/IndexTask.java

+          ImmutableList.of(new Property<>("firehose", firehoseFactory), new Property<>("inputSource", inputSource))
+      );
+      if (firehoseFactory != null && inputFormat != null) {
+        throw new IAE("Cannot use firehose and inputFormat together. Try use inputSource instead of firehose.");


Typo: Try use inputSource -> Try using inputSource

ccaominh · 2019-11-08T01:05:47Z

...in/java/org/apache/druid/indexing/common/task/batch/parallel/ParallelIndexIngestionSpec.java

+    );
+    if (dataSchema.getParserMap() != null && ioConfig.getInputSource() != null) {
+      if (!(ioConfig.getInputSource() instanceof FirehoseFactoryToInputSourceAdaptor)) {
+        throw new IAE("Cannot use parser and inputSource together. Try use inputFormat instead of parser.");


Typo: Try use inputFormat -> Try using inputFormat

ccaominh · 2019-11-08T01:31:13Z

indexing-service/src/test/java/org/apache/druid/indexing/common/task/IndexTaskTest.java

+        new Object[]{LockGranularity.TIME_CHUNK, false},
+        new Object[]{LockGranularity.TIME_CHUNK, true},
+        new Object[]{LockGranularity.SEGMENT, false},
+        new Object[]{LockGranularity.SEGMENT, true}


This is a relatively slow test (~15 seconds per parameterized run), so all the permutations may be overkill. Perhaps remove (SEGMENT, false), which will still give coverage of both lock granularities and both with/without the input format API.
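The coverage argument in this comment can be checked mechanically. The sketch below is hypothetical: `ParameterCoverageSketch` and its local `Granularity` enum stand in for the test class and `LockGranularity`, and the boolean is assumed to be the "use input format API" flag.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: verifies that a reduced parameter set still exercises
// every lock granularity and both values of the boolean flag.
final class ParameterCoverageSketch
{
  enum Granularity
  {
    TIME_CHUNK, SEGMENT
  }

  // The permutations from the diff with (SEGMENT, false) removed, as suggested.
  static List<Object[]> reducedParameters()
  {
    return Arrays.asList(
        new Object[]{Granularity.TIME_CHUNK, false},
        new Object[]{Granularity.TIME_CHUNK, true},
        new Object[]{Granularity.SEGMENT, true}
    );
  }

  static boolean coversEachValue(List<Object[]> parameters)
  {
    final Set<Object> granularities = new HashSet<>();
    final Set<Object> flags = new HashSet<>();
    for (Object[] parameter : parameters) {
      granularities.add(parameter[0]);
      flags.add(parameter[1]);
    }
    return granularities.size() == Granularity.values().length && flags.size() == 2;
  }
}
```

The reduced set drops one pairing of values but keeps each individual value covered, which is the trade-off the comment proposes for a slow test.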

ccaominh · 2019-11-08T01:38:08Z

...ava/org/apache/druid/indexing/common/task/batch/parallel/MultiPhaseParallelIndexingTest.java

-        new Object[]{LockGranularity.SEGMENT}
+        new Object[]{LockGranularity.TIME_CHUNK, false},
+        new Object[]{LockGranularity.TIME_CHUNK, true},
+        new Object[]{LockGranularity.SEGMENT, false},


Similar comment to IndexTaskTest about skipping this permutation

ccaominh · 2019-11-08T01:40:43Z

...va/org/apache/druid/indexing/common/task/batch/parallel/SinglePhaseParallelIndexingTest.java

-        new Object[]{LockGranularity.SEGMENT}
+        new Object[]{LockGranularity.TIME_CHUNK, false},
+        new Object[]{LockGranularity.TIME_CHUNK, true},
+        new Object[]{LockGranularity.SEGMENT, false},


Similar comment to IndexTaskTest about skipping this permutation

lgtm-com · 2019-11-09T02:58:55Z

This pull request fixes 1 alert when merging 546d957 into 0e8c3f7 - view on LGTM.com

fixed alerts:

1 for Dereferenced variable may be null

lgtm-com · 2019-11-09T09:07:23Z

This pull request fixes 1 alert when merging ea2c8f9 into 75ea0d5 - view on LGTM.com

fixed alerts:

1 for Dereferenced variable may be null

lgtm-com · 2019-11-11T07:19:07Z

This pull request fixes 1 alert when merging 218b392 into e9e1625 - view on LGTM.com

fixed alerts:

1 for Dereferenced variable may be null

clintropolis

overall lgtm 🤘

clintropolis · 2019-11-14T19:18:22Z

core/src/main/java/org/apache/druid/data/input/InputEntity.java

+   */
+  default CleanableFile fetch(File temporaryDirectory, byte[] fetchBuffer) throws IOException
+  {
+    final File tempFile = File.createTempFile("druid-object-source", ".tmp", temporaryDirectory);


should druid-object-source be druid-input-entity since this class had its name changed? (same with the LOG.debug message just below this line, as well as some of the javadoc)
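For context, a default `fetch` of this kind typically copies the entity's bytes into a temp file through the supplied buffer. The sketch below is an assumption-laden illustration, not the method from the PR: `FetchSketch` is a hypothetical class, it takes the `InputStream` directly instead of opening it from an entity, it returns a plain `File` rather than a `CleanableFile`, and it uses the prefix proposed in this comment.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch of fetching a stream into a temp file with a reusable buffer.
final class FetchSketch
{
  private FetchSketch()
  {
  }

  /**
   * Copies everything from {@code in} into a new temp file under {@code temporaryDirectory}.
   * {@code fetchBuffer} must have a non-zero length, otherwise the copy loop cannot make progress.
   */
  static File fetch(InputStream in, File temporaryDirectory, byte[] fetchBuffer) throws IOException
  {
    // Prefix follows the rename suggested in the review comment (assumption).
    final File tempFile = File.createTempFile("druid-input-entity", ".tmp", temporaryDirectory);
    try (OutputStream out = new FileOutputStream(tempFile)) {
      int read;
      while ((read = in.read(fetchBuffer)) != -1) {
        out.write(fetchBuffer, 0, read);
      }
    }
    return tempFile;
  }
}
```

Passing the buffer in lets a caller reuse one allocation across many fetches; the caller is also responsible for deleting the returned file in this simplified version.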

clintropolis · 2019-11-14T19:19:25Z

core/src/main/java/org/apache/druid/data/input/InputEntity.java

+    File file();
+  }
+
+  T getObject();


getEntity?

Removed this and added getUri() instead.

clintropolis · 2019-11-14T20:02:18Z

core/src/main/java/org/apache/druid/data/input/impl/CsvReader.java

+  {
+    final Map<String, Object> zipped = parseLine(line);
+    return Collections.singletonList(
+        MapInputRowParser.parse(


Can you add a version of MapInputRowParser.parse that takes an InputRowSchema, since providing timestamp and dimension specs from it seems like it will be common?

clintropolis · 2019-11-14T20:15:18Z

processing/src/main/java/org/apache/druid/segment/transform/Transformer.java

+    if (transforms.isEmpty()) {
+      transformedRow = row;
+    } else {
+      transformedRow = InputRowListPlusJson.of(new TransformedInputRow(row.getInputRow(), transforms), row.getRaw());


should this be transforming the list of InputRow to TransformedInputRow?

👍 fixed.
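The point raised here, applying the transforms to every row in the list rather than to a single row, can be sketched generically. `TransformListSketch` is hypothetical and applies transforms eagerly as `UnaryOperator`s; the real code wraps rows in `TransformedInputRow`, which this sketch does not attempt to reproduce.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch: apply every transform, in order, to each row of a list.
final class TransformListSketch
{
  private TransformListSketch()
  {
  }

  static <R> List<R> transformAll(List<R> rows, List<UnaryOperator<R>> transforms)
  {
    // Nothing to do: return the input unchanged, like the transforms.isEmpty() branch in the diff.
    if (rows == null || transforms.isEmpty()) {
      return rows;
    }
    return rows.stream()
               .map(row -> {
                 R result = row;
                 for (UnaryOperator<R> transform : transforms) {
                   result = transform.apply(result);
                 }
                 return result;
               })
               .collect(Collectors.toList());
  }
}
```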

clintropolis · 2019-11-14T20:21:38Z

core/src/main/java/org/apache/druid/guice/annotations/UnstableApi.java

+import java.lang.annotation.Target;
+
+/**
+ * Signifies that the annotated entity is an unstable API for extension authors. Unstable APIs may change at any time


this should also maybe indicate that someday it will likely become either an @ExtensionPoint or @PublicApi?

👍 added.

clintropolis

lgtm 👍

I'm slightly hesitant since I feel that this will be a moderately disruptive change that further fractures the state of indexing with regard to differences between specs, but I think these new interfaces are nicer going forward, so it's worth the pain of migrating stuff to this model and fully replacing firehoses.

lgtm-com · 2019-11-14T21:53:19Z

This pull request fixes 1 alert when merging ce88049 into ce4ee42 - view on LGTM.com

fixed alerts:

1 for Dereferenced variable may be null

ccaominh · 2019-11-15T00:47:40Z

core/src/main/java/org/apache/druid/data/input/Firehose.java

@@ -74,13 +74,13 @@
   *
   * @return an InputRowPlusRaw which may contain any of: an InputRow, the raw data, or a ParseException


javadoc for @return needs to be updated

This method is only for sampler and will be removed in the follow-up pr.

ccaominh · 2019-11-15T00:53:48Z

core/src/main/java/org/apache/druid/data/input/InputRowListPlusJson.java

  {
-    return new InputRowPlusRaw(null, raw, parseException);
+    return (inputRows == null || inputRows.isEmpty()) && raw == null && rawJson == null && parseException == null;


Should this also check if rawJson.isEmpty()?

This class is also used only by sampler and will be cleaned up in the follow-up pr.

ccaominh

LGTM 👍

jon-wei

LGTM

jihoonson · 2019-11-15T17:22:31Z

@ccaominh @clintropolis @jon-wei thanks for the review!

* Refactor parallel indexing perfect rollup partitioning

  Refactoring to make it easier to later add range partitioning for perfect rollup parallel indexing. This is accomplished by adding several new base classes (e.g., PerfectRollupWorkerTask) and new classes for encapsulating logic that needs to be changed for different partitioning strategies (e.g., IndexTaskInputRowIteratorBuilder).

  The code is functionally equivalent to before except for the following small behavior changes:

  1) PartialSegmentMergeTask: Previously, this task had a priority of DEFAULT_TASK_PRIORITY. It now has a priority of DEFAULT_BATCH_INDEX_TASK_PRIORITY (via the new PerfectRollupWorkerTask base class), since it is a batch index task.

  2) ParallelIndexPhaseRunner: A decorator was added to subTaskSpecIterator to ensure the subtasks are generated with unique ids. Previously, only tests (i.e., MultiPhaseParallelIndexingTest) would have this decorator, but this behavior is desired for non-test code as well.

* Fix forbidden apis and pmd warnings
* Fix analyze dependencies warnings
* Fix IndexTask json and add IT diags
* Fix parallel index supervisor<->worker serde
* Fix TeamCity inspection errors/warnings
* Fix TeamCity inspection errors/warnings again
* Integrate changes with those from #8823
* Address review comments
* Address more review comments
* Fix forbidden apis
* Address more review comments

The FiniteFirehoseFactory and InputRowParser classes were deprecated in 0.17.0 (#8823) in favor of InputSource and InputFormat. This PR removes the FiniteFirehoseFactory and all its implementations, along with classes used solely by them, like Fetcher (used by PrefetchableTextFilesFirehoseFactory). It refactors classes, including tests, that use FiniteFirehoseFactory to use InputSource instead. Removing InputRowParser may not be as trivial, because many classes that aren't deprecated depend on it (with no alternatives), like EventReceiverFirehoseFactory. Hence FirehoseFactory, EventReceiverFirehoseFactory, and Firehose are marked deprecated.

Add InputSource and InputFormat interfaces

095bb32

jihoonson added the Area - Batch Ingestion label Nov 5, 2019

jihoonson added 7 commits November 4, 2019 23:38

revert orc dependency

f52f967

fix dimension exclusions and failing unit tests

93ab23f

fix tests

e0b80cb

fix test

b4f041e

fix test

d349db5

fix firehose and inputSource for parallel indexing task

f308f13

Merge branch 'master' of github.com:apache/incubator-druid into input…

d451582

…-source-format

vogievetsky mentioned this pull request Nov 6, 2019

Web console: support new ingest spec format #8828

Merged

fix tc

b7c8b87

gianm mentioned this pull request Nov 6, 2019

support for array expressions in TransformSpec with ExpressionTransform #8744

Merged

4 tasks

fix tc: remove unused method

e942a21

ccaominh reviewed Nov 8, 2019

jihoonson added 3 commits November 7, 2019 18:17

Formattable

08d7872

add needsFormat(); renamed to ObjectSource; pass metricsName for reader

c70af75

address comments

546d957

jihoonson added 3 commits November 8, 2019 23:41

fix closing resource

7bb5d5f

fix checkstyle

6dba81a

fix tests

ea2c8f9

jihoonson added 2 commits November 10, 2019 12:01

remove verify from csv

1ea7758

Revert "remove verify from csv"

218b392

This reverts commit 1ea7758.

rename source -> entity

540759b

clintropolis reviewed Nov 14, 2019

address comments

ce88049

clintropolis approved these changes Nov 14, 2019

ccaominh reviewed Nov 15, 2019

ccaominh approved these changes Nov 15, 2019

jon-wei approved these changes Nov 15, 2019

jihoonson added the Design Review label Nov 15, 2019

jihoonson merged commit 1611792 into apache:master Nov 15, 2019

clintropolis mentioned this pull request Nov 15, 2019

refactor InputFormat and InputEntityReader implementations #8875

Merged

ccaominh added a commit to ccaominh/druid that referenced this pull request Nov 15, 2019

Integrate changes with those from apache#8823

3ea47c6

clintropolis mentioned this pull request Nov 16, 2019

add parquet support to native batch #8883

Merged

5 tasks

jihoonson mentioned this pull request Nov 18, 2019

Support inputFormat and inputSource for sampler #8901

Merged

10 tasks

This was referenced Nov 19, 2019

S3 input source #8903

Merged

add google cloud storage InputSource for native batch #8907

Merged

jon-wei added this to the 0.17.0 milestone Dec 17, 2019

jihoonson mentioned this pull request Feb 11, 2020

Decoupling FirehoseFactory and InputRowParser #5584

Closed

FrankChen021 mentioned this pull request Aug 10, 2020

Kafka ingestion fails to parse multiple-line messages in 0.19 #10259

Closed

tejaswini-imply mentioned this pull request Aug 2, 2022

Removes FiniteFirehoseFactory and its implementations #12852

Merged

9 tasks
