remove Firehose and FirehoseFactory
changes:
* Removed `Firehose` and `FirehoseFactory` along with the remaining implementations, which were mostly unused after apache#16602
* Moved `IngestSegmentFirehose`, which was still used internally by Hadoop ingestion, to `DatasourceRecordReader.SegmentReader`
* Renamed `SQLFirehoseFactoryDatabaseConnector` to `SQLInputSourceDatabaseConnector`, with similar renames for its subclasses
* Moved anything remaining in a 'firehose' package to a more appropriate package
* Cleaned up firehose-related documentation
clintropolis committed Jul 19, 2024
1 parent 35b8764 commit 79a7246
Showing 154 changed files with 438 additions and 3,633 deletions.
2 changes: 1 addition & 1 deletion docs/configuration/extensions.md
@@ -80,7 +80,7 @@ All of these community extensions can be downloaded using [pull-deps](../operati
|aliyun-oss-extensions|Aliyun OSS deep storage |[link](../development/extensions-contrib/aliyun-oss-extensions.md)|
|ambari-metrics-emitter|Ambari Metrics Emitter |[link](../development/extensions-contrib/ambari-metrics-emitter.md)|
|druid-cassandra-storage|Apache Cassandra deep storage.|[link](../development/extensions-contrib/cassandra.md)|
|druid-cloudfiles-extensions|Rackspace Cloudfiles deep storage and firehose.|[link](../development/extensions-contrib/cloudfiles.md)|
|druid-cloudfiles-extensions|Rackspace Cloudfiles deep storage.|[link](../development/extensions-contrib/cloudfiles.md)|
|druid-compressed-bigdecimal|Compressed Big Decimal Type | [link](../development/extensions-contrib/compressed-big-decimal.md)|
|druid-ddsketch|Support for DDSketch approximate quantiles based on [DDSketch](https://github.com/datadog/sketches-java) | [link](../development/extensions-contrib/ddsketch-quantiles.md)|
|druid-deltalake-extensions|Support for ingesting Delta Lake tables.|[link](../development/extensions-contrib/delta-lake.md)|
5 changes: 2 additions & 3 deletions docs/configuration/index.md
@@ -395,7 +395,6 @@ Metric monitoring is an essential part of Druid operations. The following monito
|`org.apache.druid.java.util.metrics.CgroupCpuSetMonitor`|Reports CPU core/HT and memory node allocations as per the `cpuset` cgroup.|
|`org.apache.druid.java.util.metrics.CgroupDiskMonitor`|Reports disk statistics as per the blkio cgroup.|
|`org.apache.druid.java.util.metrics.CgroupMemoryMonitor`|Reports memory statistics as per the memory cgroup.|
|`org.apache.druid.server.metrics.EventReceiverFirehoseMonitor`|Reports how many events have been queued in the EventReceiverFirehose.|
|`org.apache.druid.server.metrics.HistoricalMetricsMonitor`|Reports statistics on Historical services. Available only on Historical services.|
|`org.apache.druid.server.metrics.SegmentStatsMonitor` | **EXPERIMENTAL** Reports statistics about segments on Historical services. Available only on Historical services. Not to be used when lazy loading is configured.|
|`org.apache.druid.server.metrics.QueryCountStatsMonitor`|Reports how many queries have been successful/failed/interrupted.|
@@ -607,7 +606,7 @@ the [HDFS input source](../ingestion/input-sources.md#hdfs-input-source).

|Property|Possible values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.ingestion.hdfs.allowedProtocols`|List of protocols|Allowed protocols for the HDFS input source and HDFS firehose.|`["hdfs"]`|
|`druid.ingestion.hdfs.allowedProtocols`|List of protocols|Allowed protocols for the HDFS input source.|`["hdfs"]`|

#### HTTP input source

@@ -616,7 +615,7 @@ the [HTTP input source](../ingestion/input-sources.md#http-input-source).

|Property|Possible values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.ingestion.http.allowedProtocols`|List of protocols|Allowed protocols for the HTTP input source and HTTP firehose.|`["http", "https"]`|
|`druid.ingestion.http.allowedProtocols`|List of protocols|Allowed protocols for the HTTP input source.|`["http", "https"]`|

### External data access security configuration

56 changes: 0 additions & 56 deletions docs/development/extensions-contrib/cloudfiles.md
@@ -40,59 +40,3 @@ To use this Apache Druid extension, [include](../../configuration/extensions.md#
|`druid.cloudfiles.apiKey`||Rackspace Cloud API key.|Must be set.|
|`druid.cloudfiles.provider`|rackspace-cloudfiles-us,rackspace-cloudfiles-uk|Name of the provider depending on the region.|Must be set.|
|`druid.cloudfiles.useServiceNet`|true,false|Whether to use the internal service net.|true|

## Firehose

<a name="firehose"></a>

#### StaticCloudFilesFirehose

This firehose ingests events from Rackspace's Cloud Files, similar to the StaticAzureBlobStoreFirehose.

Data is newline-delimited, with one JSON object per line, and is parsed according to the `InputRowParser` configuration.

The storage account is shared with the one used for Rackspace's Cloud Files deep storage functionality, but blobs can be in a different region and container.

As with the Azure blobstore, the file is assumed to be gzipped if its name ends in `.gz`.

This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md).
Since each split represents an object in this firehose, each worker task of `index_parallel` will read an object.

Sample spec:

```json
"firehose" : {
"type" : "static-cloudfiles",
"blobs": [
{
"region": "DFW"
"container": "container",
"path": "/path/to/your/file.json"
},
{
"region": "ORD"
"container": "anothercontainer",
"path": "/another/path.json"
}
]
}
```
This firehose provides caching and prefetching features. During ingestion with `IndexTask`, the firehose can be read twice if intervals or shardSpecs are not specified, and caching is useful in that case. Prefetching is preferred when direct scanning of objects is slow. The relevant properties are listed below; a combined configuration sketch follows the table.

|property|description|default|required?|
|--------|-----------|-------|---------|
|type|This should be `static-cloudfiles`.|N/A|yes|
|blobs|JSON array of Cloud Files blobs.|N/A|yes|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|no|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|no|
|fetchTimeout|Timeout for fetching a Cloud Files object.|60000|no|
|maxFetchRetry|Maximum retry for fetching a Cloud Files object.|3|no|
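
For reference, a sketch of how these cache and prefetch properties fit into the (now removed) firehose spec shown earlier, using the default values from the table above; the region, container, and path are placeholders:

```json
"firehose" : {
  "type" : "static-cloudfiles",
  "blobs": [
    {
      "region": "DFW",
      "container": "container",
      "path": "/path/to/your/file.json"
    }
  ],
  "maxCacheCapacityBytes": 1073741824,
  "maxFetchCapacityBytes": 1073741824,
  "fetchTimeout": 60000,
  "maxFetchRetry": 3
}
```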

Cloud Files Blobs:

|property|description|default|required?|
|--------|-----------|-------|---------|
|container|Name of the Cloud Files container.|N/A|yes|
|path|The path where data is located.|N/A|yes|
2 changes: 1 addition & 1 deletion docs/development/extensions-core/postgresql.md
@@ -87,7 +87,7 @@ In most cases, the configuration options map directly to the [postgres JDBC conn
| `druid.metadata.postgres.ssl.sslPasswordCallback` | The classname of the SSL password provider. | none | no |
| `druid.metadata.postgres.dbTableSchema` | druid meta table schema | `public` | no |

### PostgreSQL Firehose
### PostgreSQL InputSource

The PostgreSQL extension provides an implementation of an [SQL input source](../../ingestion/input-sources.md) which can be used to ingest data into Druid from a PostgreSQL database.
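
As a hedged sketch of what such a spec can look like (the connection URI, credentials, table, and query are placeholders, not values from this repository), the SQL input source is typically referenced from a task `ioConfig`, presumably reaching the PostgreSQL subclass of the renamed `SQLInputSourceDatabaseConnector` mentioned in this commit:

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "sql",
    "database": {
      "type": "postgresql",
      "connectorConfig": {
        "connectURI": "jdbc:postgresql://host:5432/druid",
        "user": "druid",
        "password": "diurd"
      }
    },
    "sqls": ["SELECT * FROM wikipedia_edits WHERE timestamp BETWEEN '2016-01-01 00:00:00' AND '2016-01-02 00:00:00'"]
  }
}
```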

13 changes: 11 additions & 2 deletions docs/development/overview.md
@@ -53,8 +53,17 @@ Most of the coordination logic for (real-time) ingestion is in the Druid indexin

## Real-time Ingestion

Druid loads data through `FirehoseFactory.java` classes. Firehoses often wrap other firehoses: similar to the design of the query runners, each firehose adds a layer of logic. The persist and hand-off logic is in `RealtimePlumber.java`.
Druid streaming tasks are based on the 'seekable stream' classes such as `SeekableStreamSupervisor.java`,
`SeekableStreamIndexTask.java`, and `SeekableStreamIndexTaskRunner.java`. Data processing happens through
`StreamAppenderator.java`, and the persist and hand-off logic is in `StreamAppenderatorDriver.java`.
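
For orientation only, here is a minimal sketch of a supervisor spec that exercises this code path through the Kafka indexing service; the datasource name, topic, and broker address are hypothetical:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "example-stream",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": [] },
      "granularitySpec": { "segmentGranularity": "hour" }
    },
    "ioConfig": {
      "topic": "example-topic",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "taskCount": 1
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

The supervisor creates and monitors the stream index tasks, which hand rows to the appenderator for indexing, persisting, and hand-off.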

## Native Batch Ingestion

The main task types for Druid native batch ingestion are based on `AbstractBatchTask.java` and `AbstractBatchSubtask.java`.
Parallel processing uses `ParallelIndexSupervisorTask.java`, which spawns subtasks to perform various operations such
as data analysis and partitioning, depending on the task specification. Segment generation happens in
`SinglePhaseSubTask.java`, `PartialHashSegmentGenerateTask.java`, or `PartialRangeSegmentGenerateTask.java` through
`BatchAppenderator`, and the persist and hand-off logic is in `BatchAppenderatorDriver.java`.
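
As a rough sketch (the datasource, input paths, and tuning values are placeholders), a parallel batch task spec that drives `ParallelIndexSupervisorTask` looks something like:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "example-batch",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": [] },
      "granularitySpec": { "segmentGranularity": "day" }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/tmp/example", "filter": "*.json" },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": { "type": "index_parallel", "maxNumConcurrentSubTasks": 4 }
  }
}
```

With `maxNumConcurrentSubTasks` greater than 1, the supervisor task spawns subtasks for each phase; with a value of 1, it falls back to sequential, single-task ingestion.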

## Hadoop-based Batch Ingestion

