Releases: apache/druid
druid-0.20.0
Apache Druid 0.20.0 contains around 160 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 36 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.
# New Features
# Ingestion
# Combining InputSource
A new combining InputSource has been added, allowing the user to combine multiple input sources during ingestion. Please see https://druid.apache.org/docs/0.20.0/ingestion/native-batch.html#combining-input-source for more details.
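As an illustrative sketch (the paths, URI, and datasource details are hypothetical; field names follow the 0.20.0 docs), an ingestion `ioConfig` might combine a local-file and an HTTP input source:

```json
{
  "ioConfig": {
    "type": "index_parallel",
    "inputSource": {
      "type": "combining",
      "delegates": [
        { "type": "local", "baseDir": "/tmp/data", "filter": "*.json" },
        { "type": "http", "uris": ["http://example.com/more-data.json"] }
      ]
    },
    "inputFormat": { "type": "json" }
  }
}
```

Each delegate is a complete input source of its own, and all delegates must produce data readable by the single `inputFormat`.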
# Automatically determine numShards for parallel ingestion hash partitioning
When hash partitioning is used in parallel batch ingestion, it is no longer necessary to specify `numShards` in the partition spec. Druid can now automatically determine the number of shards by scanning the data in a new ingestion phase that determines the cardinality of the partitioning key.
# Subtask file count limits for parallel batch ingestion
The size-based `splitHintSpec` now supports a new `maxNumFiles` parameter, which limits how many files can be assigned to individual subtasks in parallel batch ingestion.
The segment-based `splitHintSpec` used for reingesting data from existing Druid segments also has a new `maxNumSegments` parameter, which functions similarly.
Please see https://druid.apache.org/docs/0.20.0/ingestion/native-batch.html#split-hint-spec for more details.
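For illustration, a sketch of a size-based split hint spec in a parallel task `tuningConfig` (the values are arbitrary; the `maxSize` type name follows the 0.20.0 docs):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "splitHintSpec": {
      "type": "maxSize",
      "maxSplitSize": 1073741824,
      "maxNumFiles": 1000
    }
  }
}
```

A subtask stops accumulating input once it hits either the byte limit or the file-count limit, whichever comes first.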
# Task slot usage metrics
New task slot usage metrics have been added. Please see the entries for the `taskSlot` metrics at https://druid.apache.org/docs/0.20.0/operations/metrics.html#indexing-service for more details.
# Compaction
# Support for all partitioning schemes for auto-compaction
A partitioning spec can now be defined for auto-compaction, allowing users to repartition their data at compaction time. Please see the documentation for the new `partitionsSpec` property in the compaction `tuningConfig` for more details:
https://druid.apache.org/docs/0.20.0/configuration/index.html#compaction-tuningconfig
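For example, an auto-compaction config might repartition a datasource with hashed partitioning (a sketch; the datasource name, shard count, and dimension are illustrative):

```json
{
  "dataSource": "wikipedia",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "hashed",
      "numShards": 10,
      "partitionDimensions": ["channel"]
    }
  }
}
```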
# Auto-compaction status API
A new coordinator API which shows the status of auto-compaction for a datasource has been added. The new API shows whether auto-compaction is enabled for a datasource, and a summary of how far compaction has progressed.
The web console has also been updated to show this information:
https://user-images.githubusercontent.com/177816/94326243-9d07e780-ff57-11ea-9f80-256fa08580f0.png
Please see https://druid.apache.org/docs/latest/operations/api-reference.html#compaction-status for details on the new API, and https://druid.apache.org/docs/latest/operations/metrics.html#coordination for information on new related compaction metrics.
# Querying
# Query segment pruning with hash partitioning
Druid now supports query-time segment pruning (excluding certain segments as read candidates for a query) for hash-partitioned segments. This optimization applies when all of the `partitionDimensions` specified in the hash partition spec at ingestion time are present in the filter set of a query, and the filters in the query filter on discrete values of the `partitionDimensions` (e.g., selector filters). Segment pruning with hash partitioning is not supported with non-discrete filters such as bound filters.
Existing users will need to reingest their existing segments to take advantage of this new feature, as segment pruning requires a `partitionFunction` to be stored together with the segments, which does not exist in segments created by older versions of Druid. It is not necessary to specify the `partitionFunction` explicitly, as the default is the same partition function that was used in prior versions of Druid.
Note that segments created with the default `partitionDimensions` value (partition by all dimensions plus the time column) cannot be pruned in this manner; the segments need to be created with an explicit `partitionDimensions`.
# Vectorization
To enable vectorization features, please set the `druid.query.default.context.vectorizeVirtualColumns` property to `true` or set the `vectorize` property in the query context. Please see https://druid.apache.org/docs/0.20.0/querying/query-context.html#vectorization-parameters for more information.
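For example, vectorization can be requested per query through the query context (a sketch; the datasource and interval are illustrative, and per the docs `vectorize` accepts `false`, `true`, and `force`):

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": ["2020-01-01/2020-01-02"],
  "granularity": "hour",
  "aggregations": [{ "type": "count", "name": "rows" }],
  "context": {
    "vectorize": "force",
    "vectorizeVirtualColumns": true
  }
}
```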
# Vectorization support for expression virtual columns
Expression virtual columns now have vectorization support (depending on the expressions being used), which can result in a 3-5x performance improvement in some cases.
Please see https://druid.apache.org/docs/0.20.0/misc/math-expr.html#vectorization-support for details on the specific expressions that support vectorization.
# More vectorization support for aggregators
Vectorization support has been added for several aggregation types: numeric min/max aggregators, variance aggregators, ANY aggregators, and aggregators from the `druid-histogram` extension.
#10260 - numeric min/max
#10304 - histogram
#10338 - ANY
#10390 - variance
We've observed about a 1.3x to 1.8x performance improvement in some cases with vectorization enabled for the min, max, and ANY aggregators, and about a 1.04x to 1.07x improvement with the histogram aggregator.
# `offset` parameter for GroupBy and Scan queries
It is now possible to set an `offset` parameter for GroupBy and Scan queries, which tells Druid to skip a number of rows when returning results. Please see https://druid.apache.org/docs/0.20.0/querying/limitspec.html and https://druid.apache.org/docs/0.20.0/querying/scan-query.html for details.
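A minimal Scan query using the new parameter might look like this (the datasource, interval, and columns are illustrative); `offset` skips the first 100 rows and `limit` caps the page size:

```json
{
  "queryType": "scan",
  "dataSource": "wikipedia",
  "intervals": ["2020-01-01/2020-01-02"],
  "columns": ["__time", "channel"],
  "offset": 100,
  "limit": 50
}
```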
# `OFFSET` clause for SQL queries
Druid SQL queries now support an `OFFSET` clause. Please see https://druid.apache.org/docs/0.20.0/querying/sql.html#offset for details.
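For example (the table and column names are illustrative), `OFFSET` pairs with `LIMIT` for simple result paging:

```sql
-- Skip the first 20 rows and return the next 10
SELECT channel, COUNT(*) AS cnt
FROM wikipedia
GROUP BY channel
ORDER BY cnt DESC
LIMIT 10
OFFSET 20
```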
# Substring search operators
Druid has added new substring search operators in its expression language and for SQL queries.
Please see the documentation for the `CONTAINS_STRING` and `ICONTAINS_STRING` string functions for Druid SQL (https://druid.apache.org/docs/0.20.0/querying/sql.html#string-functions) and the documentation for `contains_string` and `icontains_string` for the Druid expression language (https://druid.apache.org/docs/0.20.0/misc/math-expr.html#string-functions).
We've observed about a 2.5x performance improvement in some cases by using these functions instead of `STRPOS`.
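A sketch of the SQL form (table and column names are illustrative); `CONTAINS_STRING` is case-sensitive while `ICONTAINS_STRING` is not:

```sql
SELECT page
FROM wikipedia
WHERE CONTAINS_STRING(comment, 'spam')
   OR ICONTAINS_STRING(page, 'druid')
```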
# UNION ALL operator for SQL queries
Druid SQL queries now support the `UNION ALL` operator, which fuses the results of multiple queries together. Please see https://druid.apache.org/docs/0.20.0/querying/sql.html#union-all for details on what query shapes are supported by this operator.
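As an illustrative sketch (table names are hypothetical), a top-level union of two queries:

```sql
SELECT COUNT(*) FROM wikipedia_2019
UNION ALL
SELECT COUNT(*) FROM wikipedia_2020
```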
# Cluster-wide default query context settings
It is now possible to set cluster-wide default query context properties by adding a configuration of the form `druid.query.override.default.context.*`, with `*` replaced by the property name.
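For instance, in `common.runtime.properties` (the keys after the prefix are ordinary query context properties; the values here are illustrative):

```properties
# Default all queries to a 30-second timeout unless the query context overrides it
druid.query.override.default.context.timeout=30000
# Default priority for queries that do not set one
druid.query.override.default.context.priority=0
```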
# Other features
# Improved retention rules UI
The retention rules UI in the web console has been improved. It now provides suggestions and basic validation in the period dropdown, shows the cluster default rules, and makes editing the default rules more accessible.
# Redis cache extension enhancements
The Redis cache extension now supports Redis Cluster, selecting which database is used, connecting to password-protected servers, and period-style configurations for the `exp...
druid-0.19.0
Apache Druid 0.19.0 contains around 200 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 51 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.
# New Features
# GroupBy and Timeseries vectorized query engines enabled by default
Vectorized query engines for GroupBy and Timeseries queries were introduced in Druid 0.16 as an opt-in feature. Since then, we have extensively tested these engines and feel that the time has come for these improvements to find a wider audience. Note that not all of the query engine is vectorized at this time, but this change means that any query which is eligible to be vectorized will be. If you encounter any problems, this feature may still be disabled by setting `druid.query.vectorize` to `false`.
# Druid native batch support for Apache Avro Object Container Files
New in Druid 0.19.0, native batch indexing now supports Apache Avro Object Container Format encoded files, allowing batch ingestion of Avro data without needing an external Hadoop cluster. Check out the docs for more details.
# Updated Druid native batch support for SQL databases
A new `SqlInputSource` has been added in Druid 0.19.0 to work with the new native batch ingestion specifications first introduced in Druid 0.17, deprecating the `SqlFirehose`. Like the `SqlFirehose`, it currently supports MySQL and PostgreSQL, using the drivers from those extensions. This is a relatively low-level ingestion task, and the operator must take care to manually ensure that the correct data is ingested, either by specially crafting queries to ensure no duplicate data is ingested for appends, or by ensuring that the entire set of data is queried to be replaced when overwriting. See the docs for more operational details.
# Apache Ranger based authorization
A new extension in Druid 0.19.0 adds an Authorizer which implements access control for Druid, backed by Apache Ranger. Please see [the extension documentation](https://druid.apache.org/docs/0.19.0/development/extensions-core/druid-ranger-security.html) and Authentication and Authorization for more information on the basic facilities this extension provides.
# Alibaba Object Storage Service support
A new 'contrib' extension has been added for Alibaba Cloud Object Storage Service (OSS) to provide both deep storage and usage as a batch ingestion input source. Since this is a 'contrib' extension, it will not be packaged by default in the binary distribution; please see community extensions for more details on how to use it in your cluster.
# Ingestion worker autoscaling for Google Compute Engine
Another 'contrib' extension new in 0.19.0 adds support for ingestion worker autoscaling on Google Compute Engine, which allows a Druid Overlord to provision or terminate worker instances (MiddleManagers or Indexers) whenever there are pending tasks or idle workers. Unlike the Amazon Web Services ingestion autoscaling extension, which provisions and terminates instances directly without using an Auto Scaling Group, the GCE autoscaler uses Managed Instance Groups to more closely align with how operators are likely to provision their clusters in GCE. Like other 'contrib' extensions, it will not be packaged by default in the binary distribution; please see community extensions for more details on how to use it in your cluster.
# REGEXP_LIKE
A new `REGEXP_LIKE` function has been added to Druid SQL and native expressions, which behaves similarly to `LIKE`, except it uses regular expressions for the pattern.
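A sketch in Druid SQL (table and column names are illustrative); the second argument is a regular expression rather than a `LIKE` pattern:

```sql
SELECT page
FROM wikipedia
WHERE REGEXP_LIKE(comment, '^fix(ed|es)? .*typo')
```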
# Web console lookup management improvements
The Druid 0.19 web console also includes some useful improvements to the lookup table management interface. Creating and editing lookups is now done with a form that accepts user input, rather than a raw text editor for entering the JSON spec.
Additionally, clicking the magnifying glass icon next to a lookup will now allow displaying the first 5000 values of that lookup.
# New Coordinator per datasource 'loadstatus' API
A new coordinator API makes it easier to determine if the latest published segments are available for querying. This is similar to the existing coordinator 'loadstatus' API, but it is datasource-specific, may specify an interval, and can optionally live-refresh the metadata store snapshot to get the latest up-to-date information. Note that operators should still exercise caution when using this API to query large numbers of segments, especially if forcing a metadata refresh, as it can potentially be a 'heavy' call on large clusters.
# Native batch append support for range and hash partitioning
Part bug fix, part new feature: Druid native batch (once again) supports appending new data to existing time chunks when those time chunks were partitioned with 'hash' or 'range' partitioning algorithms. Note that the appended segments currently only support 'dynamic' partitioning, and that when rolling back to older versions, these appended segments will not be recognized by Druid after the downgrade. In order to roll back to a previous version, these appended segments should be compacted with the rest of the time chunk in order to have a homogeneous partitioning scheme.
# Bug fixes
Druid 0.19.0 contains 65 bug fixes; you can see the complete list here.
# Fix for batch ingested 'dynamic' partitioned segments not becoming queryable atomically
Druid 0.19.0 fixes an important query correctness issue, where 'dynamic' partitioned segments produced by a batch ingestion task were not tracking the overall number of partitions. This had the implication that when these segments came online, they did not do so as a complete set, but rather as individual segments, meaning that there would be periods of swapping where results could be queried from an incomplete partition set within a time chunk.
# Fix to allow 'hash' and 'range' partitioned segments with empty buckets to now be queryable
Prior to 0.19.0, Druid had a bug when using hash or ranged partitioning where, if data skew was such that any of the buckets were 'empty' after ingesting, the partitions would never be recognized as 'complete' and so would never become queryable. Druid 0.19.0 fixes this issue by adjusting the schema of the partitioning spec. These changes to the JSON format should be backwards compatible; however, rolling back to a previous version will again make these segments no longer queryable.
# Incorrect balancer behavior
A bug in Druid versions prior to 0.19.0 allowed for (incorrect) coordinator operation in the event that `druid.server.maxSize` was not set. This bug would allow segments to load, and effectively randomly balance them in the cluster (regardless of which balancer strategy was actually configured), if all historicals did not have this value set. This bug has been fixed, but as a result `druid.server.maxSize` must be set to the sum of the segment cache location sizes for historicals, or else they will not load segments.
# Upgrading to Druid 0.19.0
Please be aware of the f...
druid-0.18.1
Apache Druid 0.18.1 is a bug fix release that fixes a streaming ingestion failure with Avro, an ingestion performance issue, an upgrade issue with HLLSketch, and so on. The complete list of bug fixes can be found at https://github.com/apache/druid/pulls?q=is%3Apr+milestone%3A0.18.1+label%3ABug+is%3Aclosed.
# Bug fixes
- #9823 rolls back the new Kinesis lag metrics, as they can stall the Kinesis supervisor indefinitely with a large number of shards.
- #9734 fixes the streaming ingestion failure issue when you use a data format other than CSV or JSON.
- #9812 fixes filtering on boolean values during transformation.
- #9723 fixes slow ingestion performance due to frequent flushes on local file system.
- #9751 reverts the version of `datasketches-java` from 1.2.0 to 1.1.0 to work around an upgrade failure with HLLSketch.
- #9698 fixes a bug in inline subquery with multi-valued dimension.
- #9761 fixes a bug in `CloseableIterator` which could potentially lead to resource leaks in the data loader.
# Known issues
Incorrect result of nested groupBy query on Join of subqueries
A nested groupBy query can result in an incorrect result when it is on top of a Join of subqueries and the inner and the outer groupBys have different filters. See #9866 for more details.
# Credits
Thanks to everyone who contributed to this release!
@clintropolis
@gianm
@jihoonson
@maytasm
@suneet-s
@viongpanzi
@whutjs
druid-0.18.0
Apache Druid 0.18.0 contains over 200 new features, performance enhancements, bug fixes, and major documentation improvements from 42 contributors. Check out the complete list of changes and everything tagged to the milestone.
# New Features
# Join support
Join is a key operation in data analytics. Prior to 0.18.0, Druid supported some join-related features, such as Lookups or semi-joins in SQL. However, the use cases for those features were pretty limited and, for other join use cases, users had to denormalize their datasources when they ingest data instead of joining them at query time, which could result in exploding data volume and long ingestion time.
Druid 0.18.0 supports real joins for the first time in its history. Druid supports INNER, LEFT, and CROSS joins for now. For native queries, the `join` datasource has been newly introduced to represent a join of two datasources. Currently, only left-deep joins are allowed. That means only a `table` or another `join` datasource is allowed as the left datasource. For the right datasource, `lookup`, `inline`, or `query` datasources are allowed. Note that joins of Druid datasources are not supported yet; there should be only one `table` datasource in the same join query.
Druid SQL also supports joins. Under the covers, SQL join queries are translated into one or several native queries that include join datasources. See Query translation for more details of SQL translation and best practices to write efficient queries.
When a join query is issued, the Broker first evaluates all datasources except for the base datasource, which is the only `table` datasource in the query. The evaluation can include executing subqueries for `query` datasources. Once the Broker evaluates all non-base datasources, it replaces them with `inline` datasources and sends the rewritten query to data nodes (see the "Query inlining in Brokers" section below for more details). Data nodes use a hash join to process join queries. They build a hash table for each non-primary leaf datasource unless it already exists. Note that only the `lookup` datasource currently has a pre-built hash table. See Query execution for more details about join query execution.
Joins can affect the performance of your queries. In general, any query including joins can be slower than an equivalent query against a denormalized datasource. The `LOOKUP` function may perform better than joins with lookup datasources. See Join performance for more details about join query performance and future plans for performance improvement.
# Query inlining in Brokers
Druid is now able to execute a nested query by inlining subqueries. Any type of subquery can be on top of any type of another, such as in the following example:
```
            topN
              |
      (join datasource)
         /         \
(table datasource)  groupBy
```
To execute this query, the Broker first evaluates the leaf groupBy subquery; it sends the subquery to data nodes and collects the result. The collected result is materialized in the Broker memory. Once the Broker collects all results for the groupBy query, it rewrites the topN query by replacing the leaf groupBy with an inline datasource which has the result of the groupBy query. Finally, the rewritten query is sent to data nodes to execute the topN query.
# Query laning and prioritization
When you run multiple queries of heterogeneous workloads at a time, you may sometimes want to control the resource commitment for a query based on its priority. For example, you may want to limit the resources assigned to less important queries, so that important queries can be executed in time without being disrupted by less important ones.
Query laning allows you to control capacity utilization for heterogeneous query workloads. With laning, the broker examines and classifies a query for the purpose of assigning it to a 'lane'. Lanes have capacity limits, enforced by the Broker, that can be used to ensure sufficient resources are available for other lanes or for interactive queries (with no lane), or to limit overall throughput for queries within the lane.
Automatic query prioritization determines the query priority based on the configured strategy. The threshold-based prioritization strategy has been added; it automatically lowers the priority of queries that cross any of a configurable set of thresholds, such as how far in the past the data is, how large of an interval a query covers, or the number of segments taking part in a query.
See Query prioritization and laning for more details.
# New dimension in query metrics
Since a native query containing subqueries can be executed part-by-part, a new `subQueryId` has been introduced. Each subquery has a different `subQueryId` but the same `queryId`. The `subQueryId` is available as a new dimension in query metrics.
# New configuration
A new `druid.server.http.maxSubqueryRows` configuration controls the maximum number of rows materialized in the Broker memory.
Please see Query execution for more details.
# SQL grouping sets
GROUPING SETS is now supported, allowing you to combine multiple GROUP BY clauses into one GROUP BY clause. This GROUPING SETS clause is internally translated into a groupBy query with `subtotalsSpec`. The LIMIT clause is now applied after `subtotalsSpec`, rather than being applied to each grouping set.
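For example (table and column names are illustrative), one query can aggregate by channel, by page, and overall:

```sql
SELECT channel, page, COUNT(*) AS cnt
FROM wikipedia
GROUP BY GROUPING SETS ( (channel), (page), () )
```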
# SQL Dynamic parameters
Druid now supports dynamic parameters for SQL. To use dynamic parameters, replace any literal in the query with a question mark (`?`) character. These question marks represent the places where the parameters will be bound at execution time. See SQL dynamic parameters for more details.
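Via the SQL HTTP API, parameters are supplied alongside the query as typed values. A hedged sketch of the POST body (the datasource and values are illustrative):

```json
{
  "query": "SELECT COUNT(*) FROM wikipedia WHERE channel = ? AND __time >= ?",
  "parameters": [
    { "type": "VARCHAR", "value": "#en.wikipedia" },
    { "type": "TIMESTAMP", "value": "2020-01-01 00:00:00" }
  ]
}
```

Parameters are bound to the question marks in order of appearance.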
# Important Changes
# `applyLimitPushDownToSegments` is disabled by default
`applyLimitPushDownToSegments` was added in 0.17.0 to push down limit evaluation to queryable nodes, limiting results during segment scan for groupBy v2. This can lead to performance degradation, as reported in #9689, if many segments are involved in query processing. This is because "limit push down to segment scan" initializes an aggregation buffer per segment, the overhead of which is not negligible. Enable this configuration only if your query involves a relatively small number of segments per historical or realtime task.
# Roaring bitmaps as default
Druid supports two bitmap types: Roaring and CONCISE. Since Roaring bitmaps provide a better out-of-the-box experience (faster query speed in general), the default bitmap type has been switched to Roaring. See Segment compression for more details about bitmaps.
# Complex metrics behavior change at ingestion time when SQL-compatible null handling is disabled (default mode)
When SQL-compatible null handling is disabled, the behavior of complex metric aggregation at ingestion time has now changed to be consistent with that at query time. The complex metrics are aggregated to the default 0 values for nulls instead of skipping them during ingestion.
# Array expression syntax change
Druid expressions now support typed constructors for creating arrays. Arrays can be defined with an explicit type. For example, `<LONG>[1, 2, null]` creates an array of `LONG` type containing `1`, `2`, and `null`. Note that you can still create an array without an explicit type. For example, `[1, 2, null]` is still valid syntax for creating an equivalent array; in this case, Druid will infer the type of the array from its elements. This new syntax applies to empty arrays as well: `<STRING>[]`, `<DOUBLE>[]`, and `<LONG>[]` will create empty arrays of `STRING`, `DOUBLE`, and `LONG` type, respectively.
# Enabling pending segments cleanup by default
The `pendingSegments` table in the metadata store is used to create unique new segment IDs for appending tasks such as Kafka/Kinesis indexing tasks or batch tasks in appending mode. Automatic pending segments cleanup was introduced in 0.12.0, but has been disabled by default prior to 0.18....
druid-0.17.1
Apache Druid 0.17.1 is a security bug fix release that addresses the following CVE for LDAP authentication:
- [CVE-2020-1958]: Apache Druid LDAP injection vulnerability (https://lists.apache.org/thread.html/r9d437371793b410f8a8e18f556d52d4bb68e18c537962f6a97f4945e%40%3Cdev.druid.apache.org%3E)
druid-0.17.0
Apache Druid 0.17.0 contains over 250 new features, performance enhancements, bug fixes, and major documentation improvements from 52 contributors. Check out the complete list of changes and everything tagged to the milestone.
Highlights
Batch ingestion improvements
Druid 0.17.0 includes a significant update to the native batch ingestion system. This update adds the internal framework to support non-text binary formats, with initial support for ORC and Parquet. Additionally, native batch tasks can now read data from HDFS.
This rework changes how the ingestion source and data format are specified in the ingestion task. To use the new features, please refer to the documentation on InputSources and InputFormats.
Please see the following documentation for details:
https://druid.apache.org/docs/0.17.0/ingestion/data-formats.html#input-format
https://druid.apache.org/docs/0.17.0/ingestion/native-batch.html#input-sources
https://druid.apache.org/docs/0.17.0/ingestion/native-batch.html#partitionsspec
Single dimension range partitioning for parallel native batch ingestion
The parallel index task now supports the `single_dim` type partitions spec, which allows for range-based partitioning on a single dimension.
Please see https://druid.apache.org/docs/0.17.0/ingestion/native-batch.html for details.
Compaction changes
Parallel index task split hints
The parallel indexing task now has a new configuration, `splitHintSpec`, in the `tuningConfig` to allow operators to provide hints to control the amount of data that each first-phase subtask reads. There is currently one split hint spec type, `SegmentsSplitHintSpec`, used for re-ingesting Druid segments.
Parallel auto-compaction
Auto-compaction can now use the parallel indexing task, allowing for greater compaction throughput.
To control the level of parallelism, the auto-compaction `tuningConfig` has new parameters, `maxNumConcurrentSubTasks` and `splitHintSpec`.
Please see https://druid.apache.org/docs/0.17.0/configuration/index.html#compaction-dynamic-configuration for details.
Stateful auto-compaction
Auto-compaction now uses the partitionSpec to track changes made by previous compaction tasks, allowing the coordinator to reduce redundant compaction operations.
Please see #8489 for details.
If you have auto-compaction enabled, please see the information under "Stateful auto-compaction changes" in the "Upgrading to Druid 0.17.0" section before upgrading.
Parallel query merging on brokers
The Druid broker can now opportunistically merge query results in parallel using multiple threads.
Please see `druid.processing.merge.useParallelMergePool` in the Broker section of the configuration reference for details on how to configure this new feature.
Parallel merging is enabled by default (controlled by the `druid.processing.merge.useParallelMergePool` property), and most users should not have to change any of the advanced configuration properties described in the configuration reference.
Additionally, merge parallelism can be controlled on a per-query basis using the query context. Information about the new query context parameters can be found at https://druid.apache.org/docs/0.17.0/querying/query-context.html.
SQL-compatible null handling
In 0.17.0, we have added official documentation for Druid's SQL-compatible null handling mode.
Please see https://druid.apache.org/docs/0.17.0/configuration/index.html#sql-compatible-null-handling and https://druid.apache.org/docs/0.17.0/design/segments.html#sql-compatible-null-handling for details.
Several bugs that existed in this previously undocumented mode have been fixed, particularly around null handling in numeric columns. We recommend that users begin to consider transitioning their clusters to this new mode after upgrading to 0.17.0.
The full list of null handling bugs fixed in 0.17.0 can be found at https://github.com/apache/druid/issues?utf8=%E2%9C%93&q=label%3A%22Area+-+Null+Handling%22+milestone%3A0.17.0+
LDAP extension
Druid now supports LDAP authentication. Authorization using LDAP groups is also supported by mapping LDAP groups to Druid roles.
- LDAP authentication is handled by specifying an LDAP-type credentials validator.
- Authorization using LDAP is handled by specifying an LDAP-type role provider, and defining LDAP group->Druid role mappings within Druid.
LDAP integration requires the `druid-basic-security` core extension. Please see https://druid.apache.org/docs/0.17.0/development/extensions-core/druid-basic-security.html for details.
As this is the first release with LDAP support, and there are a large variety of LDAP ecosystems, some LDAP use cases and features may not be supported yet. Please file an issue if you need enhancements to this new functionality.
Dropwizard emitter
A new Dropwizard metrics emitter has been added as a contrib extension.
The currently supported Dropwizard metrics types are counter, gauge, meter, timer and histogram. These metrics can be emitted using either a Console or JMX reporter.
Please see https://druid.apache.org/docs/0.17.0/design/extensions-contrib/dropwizard.html for details.
Self-discovery resource
A new pair of endpoints has been added to all Druid services. They return information about whether the Druid service has received confirmation from the central service discovery mechanism (currently ZooKeeper) that the service has been added to the cluster. These endpoints can be useful as health/readiness checks.
The new endpoints are:
/status/selfDiscovered/status
/status/selfDiscovered
Please see the Druid API reference for details.
Supervisors system table
Task supervisors (e.g., Kafka or Kinesis supervisors) are now recorded in the system tables in a new `sys.supervisors` table.
Please see https://druid.apache.org/docs/0.17.0/querying/sql.html#supervisors-table for details.
Fast historical start with lazy loading
A new boolean configuration property for historicals, `druid.segmentCache.lazyLoadOnStart`, has been added.
This new property allows historicals to defer loading of a segment until the first time that segment is queried, which can significantly decrease historical startup times for clusters with a large number of segments.
Please see the configuration reference for details.
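As a sketch of the relevant historical runtime properties (the cache path and size are illustrative):

```properties
# Defer loading each cached segment until the first time it is queried
druid.segmentCache.lazyLoadOnStart=true
druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":10737418240}]
```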
Historical segment cache distribution change
A new historical property, `druid.segmentCache.locationSelectorStrategy`, has been added.
If there are multiple segment storage locations specified in `druid.segmentCache.locations`, the new `locationSelectorStrategy` property allows the user to specify the strategy used to fill the locations. Currently supported options are `roundRobin` and `leastBytesUsed`.
Please see the configuration reference for details.
New readiness endpoints
A new Broker endpoint has been added: `/druid/broker/v1/readiness`.
A new Historical endpoint has been added: `/druid/historical/v1/readiness`.
These endpoints are similar to the existing `/druid/broker/v1/loadstatus` and `/druid/historical/v1/loadstatus` endpoints.
They differ in that they do not require authentication/authorization checks, and instead of a JSON body they return only a 200 success or 503 HTTP response code.
Support task assignment based on MiddleManager categories
It is now possible to define a "category" name property for each MiddleManager. New worker select strategies that are category-aware have been added, allowing the user to control how tasks are assigned to MiddleManagers based on the configured categories.
Please see the documentation for `druid.worker.category` in the configuration reference, and the following links, for more details:
https://druid.apache.org/docs/0.17.0/configuration/index.html#Equal-Distribution-With-Category-Spec
https://druid.apache.org/docs/0.17.0/configuration/index.html#Fill-Capacity-With-Category-Spec
https://druid.apache.org/docs/0.17.0/configuration/index.html#WorkerCategorySpec
Security vulnerability updates
A large number of dependencies have been updated to newer versions to address security vulnerabilities.
Please see the PRs below for details:
Upgrading to Druid 0.17.0
Select native query has been replaced
The deprecated Select native query type has been removed in 0.17.0.
If you have native queries that use Select, you need to modify them to use Scan instead. See the Scan query documentation (https://druid.apache.org/docs/0.17.0/querying/scan-query.html) for syntax and output format details.
For Druid SQL queries that use Select, no...
druid-0.16.1-incubating
Apache Druid 0.16.1-incubating is a bug fix and user experience improvement release that fixes a rolling upgrade issue, improves the startup scripts, and updates licensing information.
Bug Fixes
#8682 implement FiniteFirehoseFactory in InlineFirehose
#8905 Retrying with a backward compatible task type on unknown task type error in parallel indexing
User Experience Improvements
#8792 Use bundled ZooKeeper in tutorials.
#8794 Startup scripts: verify Java 8 (exactly), improve port/java verification messages.
#8942 Improve verify-default-ports to check both INADDR_ANY and 127.0.0.1.
#8798 Fix verify script.
Licensing Update
#8944 Add license for tutorial wiki data
#8968 Add licenses.yaml entry for Wikipedia sample data
Other
#8419 Bump Apache Thrift to 0.10.0
Updating from 0.16.0-incubating and earlier
PR #8905 fixes an issue with rolling upgrades when updating from earlier versions.
Credits
Thanks to everyone who contributed to this release!
@aditya-r-m
@clintropolis
@Fokko
@gianm
@jihoonson
@jon-wei
Apache Druid (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
druid-0.16.0-incubating
Apache Druid 0.16.0-incubating contains over 350 new features, performance enhancements, bug fixes, and major documentation improvements from 50 contributors. Check out the complete list of changes and everything tagged to the milestone.
Highlights
# Performance
# 'Vectorized' query processing
An experimental 'vectorized' query execution engine is new in 0.16.0, which can provide a speed increase in the range of 1.3-3x for timeseries and group by v2 queries. It operates on the principle of batching operations on rows instead of processing a single row at a time, e.g. iterating bitmaps in batches instead of per row, reading column values in batches, filtering in batches, aggregating values in batches, and so on. This results in significantly fewer method calls, better memory locality, and increased cache efficiency.
This is an experimental feature, but we view it as the path forward for Druid query processing and are excited for feedback as we continue to improve and fill out missing features in upcoming releases.
- Only timeseries and groupBy have vectorized engines.
- GroupBy doesn't handle multi-value dimensions or granularity other than "all" yet.
- Vector cursors cannot handle virtual columns or descending order.
- Expressions are not supported anywhere: not as inputs to aggregators, in virtual columns, or in filters.
- Only some aggregators have vectorized implementations: "count", "doubleSum", "floatSum", "longSum", "hyperUnique", and "filtered".
- Only some filters have vectorized matchers: "selector", "bound", "in", "like", "regex", "search", "and", "or", and "not".
- Dimension specs other than "default" don't work yet (no extraction functions or filtered dimension specs).
The feature can be enabled by setting `"vectorize": true` in your query context (the default is `false`). This works both for Druid SQL and for native queries. When set to `true`, vectorization will be used if possible; otherwise, Druid will fall back to its non-vectorized query engine. You can also set it to `"force"`, which will return an error if the query cannot be fully vectorized. This is helpful for confirming that vectorization is indeed being used.
You can control the block size during execution by setting the `vectorSize` query context parameter (the default is `1000`).
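For example, a native timeseries query could opt in through its context like this (the datasource, interval, and aggregator are placeholders, not from this release):

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": ["2016-06-27/2016-06-28"],
  "granularity": "all",
  "aggregations": [{ "type": "longSum", "name": "edits", "fieldName": "count" }],
  "context": { "vectorize": "force", "vectorSize": 512 }
}
```

With `"vectorize": "force"`, this query fails fast if any part of it cannot be vectorized.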
# GroupBy array-based result rows
groupBy v2 queries now use an array-based representation of result rows, rather than the map-based representation used by prior versions of Druid. This provides faster generation and processing of result sets. Out of the box this change is invisible and backwards-compatible; you will not have to change any configuration to reap the benefits of this more efficient format, and it will have no impact on cached results. Internally this format will always be utilized automatically by the broker in the queries that it issues to historicals. By default the results will be translated back to the existing 'map' based format at the broker before sending them back to the client.
However, if you would like to avoid the overhead of this translation and get even faster results, `resultAsArray` may be set on the query context to pass the new array-based result row format directly through. The schema is as follows, in order:
- Timestamp (optional; only if granularity != ALL)
- Dimensions (in order)
- Aggregators (in order)
- Post-aggregators (optional; in order, if present)
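As a sketch (the datasource and column names are hypothetical), a groupBy query with the flag set in its context:

```json
{
  "queryType": "groupBy",
  "dataSource": "wikipedia",
  "dimensions": ["channel"],
  "aggregations": [{ "type": "longSum", "name": "edits", "fieldName": "count" }],
  "granularity": "day",
  "intervals": ["2016-06-27/2016-06-28"],
  "context": { "resultAsArray": true }
}
```

would return each result row as a positional array following the schema above, e.g. `[timestampMillis, "#en.wikipedia", 1234]`, rather than a map keyed by column name.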
# Additional performance enhancements
The complete set of pull requests tagged as performance enhancements for 0.16 can be found here.
# "Minor" compaction
Users of the Kafka indexing service and compaction who receive a trickle of late data can find a huge improvement in the form of a new concept called 'minor' compaction. Enabled by internal changes to how data segments are versioned, minor compaction is based on the idea of 'segment'-based locking at indexing time instead of the current Druid locking behavior (now referred to as 'time chunk' locking). Segment locking, as you might expect, allows only the segments which are being compacted to be locked, while still allowing new 'appending' indexing tasks (like Kafka indexing tasks) to continue to run and create new segments simultaneously. This is a big deal if you get a lot of late data, because the current behavior results in compaction tasks starving as higher-priority realtime tasks hog the locks, preventing compaction tasks from optimizing datasource segment sizes and reducing overall performance.
To enable segment locking, you will need to set `forceTimeChunkLock` to `false` in the task context, or set `druid.indexer.tasklock.forceTimeChunkLock=false` in the Overlord configuration. However, beware: due to the changes in segment versioning, there is no built-in rollback path after enabling this feature, so once you upgrade to 0.16 you cannot downgrade to an older version of Druid. Because of this, we highly recommend confirming that Druid 0.16 is stable in your cluster before enabling this feature.
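For example, a single task can opt into segment locking through its context (a sketch; the rest of the task spec is omitted):

```json
{
  "type": "index_parallel",
  "context": {
    "forceTimeChunkLock": false
  }
}
```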
It has a humble name, but the changes of minor compaction run deep, and it is not possible to adequately describe the mechanisms that drive this in these release notes, so check out the proposal and PR for more details.
# Druid "indexer" process
The new Indexer process is an alternative to the MiddleManager + Peon task execution system. Instead of forking a separate JVM process per-task, the Indexer runs tasks as separate threads within a single JVM process. The Indexer is designed to be easier to configure and deploy compared to the MiddleManager + Peon system and to better enable resource sharing across tasks.
The advantage of the Indexer is that it allows query processing resources, lookups, cached authentication/authorization information, and much more to be shared between all running indexing task threads, giving each individual task access to a larger pool of resources and requiring far fewer redundant actions than the Peon model of execution, where each task is isolated in its own process.
Using the Indexer does come with one downside: the loss of process isolation provided by Peon processes means that a single task can potentially affect all running indexing tasks on that Indexer. The `druid.worker.globalIngestionHeapLimitBytes` and `druid.worker.numConcurrentMerges` configurations are meant to help minimize this. Additionally, task logs for Indexer processes will be inline with the Indexer process log, and not persisted to deep storage.
You can start using the Indexer by supplying `server indexer` as the command-line argument to `org.apache.druid.cli.Main` when starting the service. To use the Indexer in place of a MiddleManager and Peon, you should be able to adapt values from the existing configuration, lifting `druid.indexer.fork.property.` configurations directly to the Indexer, and sizing heap and direct memory based on the Peon sizes multiplied by the number of task slots (unlike a MiddleManager, the Indexer does not accept the configurations `druid.indexer.runner.javaOpts` or `druid.indexer.runner.javaOptsArray`). See the Indexer documentation for details.
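A hypothetical sizing sketch for an Indexer replacing a MiddleManager that ran four Peons (all values below are assumptions chosen to illustrate the arithmetic, not recommendations):

```properties
# Four task slots, matching the old MiddleManager capacity
druid.worker.capacity=4
# Guardrails replacing per-process isolation (illustrative values)
druid.worker.globalIngestionHeapLimitBytes=3000000000
druid.worker.numConcurrentMerges=2
# In jvm.config: size heap and direct memory at roughly
# (former Peon heap/direct) x 4, rather than per-task values
```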
# Native parallel batch indexing with shuffle
In 0.16.0, Druid's `index_parallel` native parallel batch indexing task now supports 'perfect' rollup with the implementation of a two-stage shuffle process.
Tasks in stage 1 perform a secondary partitioning of rows on top of the standard time-based partitioning of segment granularity, creating an intermediary data segment for each partition. Stage 2 tasks are each assigned a set of the partitionings created during stage 1, and will collect and combine the intermediary data segments which belong to that partitioning, allowing them to achieve complete rollup when building the final segments. At this time, only hash-based partitioning is supported.
This can be enabled by setting `forceGuaranteedRollup` to `true` in the `tuningConfig`; `numShards` in `partitionsSpec` and `intervals` in `granularitySpec` must also be set.
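A sketch of the relevant spec fragments (shown side by side for brevity; in a full ingestion spec `granularitySpec` sits inside `dataSchema`, and the interval and shard count here are placeholders — consult the native batch documentation for the exact schema):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "forceGuaranteedRollup": true,
    "partitionsSpec": { "type": "hashed", "numShards": 8 }
  },
  "granularitySpec": {
    "segmentGranularity": "day",
    "intervals": ["2019-01-01/2019-02-01"],
    "rollup": true
  }
}
```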
The Druid MiddleManager (or the new Indexer) processes have a new responsibility for these indexing tasks, serving the intermediary partition segments output by stage 1 to the stage 2 tasks, so depending on configuration and cluster size, the MiddleManager JVM configuration might need to be adjusted to increase heap allocation and HTTP threads. These numbers are expected to scale with cluster size, as all MiddleManager or Indexer processes involved in a shuffle will need the ability to communicate with each other, but we do not expect the footprint to be significantly larger than it is currently. Optimistically, we suggest trying with your existing configurations, and bumping up heap and HTTP thread count only if issues are encountered.
#...
druid-0.15.1-incubating
Apache Druid 0.15.1-incubating is a bug fix release that includes important fixes for Apache Zookeeper based segment loading, the 'druid-datasketches' extension, and much more.
Bug Fixes
Coordinator
#8137 coordinator throwing exception trying to load segments (fixed by #8140)
Middlemanager
#7886 Middlemanager fails startup due to corrupt task files (fixed by #7917)
#8085 fix forking task runner task shutdown to be more graceful
Queries
#7777 timestamp_ceil function is either wrong or misleading (fixed by #7823)
#7820 subtotalsSpec and filtering returns no results (fixed by #7827)
#8013 Fix ExpressionVirtualColumn capabilities; fix groupBy's improper uses of StorageAdapter#getColumnCapabilities.
API
#6786 apache-druid-0.13.0-incubating router /druid/router/v1/brokers (fixed by #8026)
#8044 SupervisorManager: Add authorization checks to bulk endpoints.
Metrics Emitters
#8204 HttpPostEmitter throw Class cast exception when using emitAndReturnBatch (fixed by #8205)
Extensions
Datasketches
#7666 sketches-core-0.13.4
#8055 force native order when wrapping ByteBuffer
Kinesis Indexing Service
#7830 Kinesis: Fix getPartitionIds, should be checking isHasMoreShards.
Moving Average Query
#7999 Druid moving average query results in circular reference error (fixed by #8192)
Documentation Fixes
#8002 Improve pull-deps reference in extensions page.
#8003 Add missing reference to Materialized-View extension.
#8079 Fix documentation formatting
#8087 fix references to bin/supervise in tutorial docs
Updating from 0.15.0-incubating and earlier
Due to issue #8137, when updating from specifically 0.15.0-incubating to 0.15.1-incubating, it is recommended to update the Coordinator before the Historical servers to prevent segment unavailability during an upgrade (this is typically reversed). Upgrading from any version older than 0.15.0-incubating does not have these concerns and can be done normally.
Known Issues
Building Docker images is currently broken and will be fixed in the next release, see #8054 which is fixed by #8237 for more details.
Credits
Thanks to everyone who contributed to this release!
@AlexanderSaydakov
@ArtyomyuS
@ccl0326
@clintropolis
@gianm
@himanshug
@jihoonson
@legoscia
@leventov
@pjain1
@yurmix
@xueyumusic
druid-0.15.0-incubating
Apache Druid 0.15.0-incubating contains over 250 new features, performance/stability/documentation improvements, and bug fixes from 39 contributors. Major new features and improvements include:
- New Data Loader UI
- Support transactional Kafka topic
- New Moving Average query
- Time ordering for Scan query
- New Moments Sketch aggregator
- SQL enhancements
- Light lookup module for routers
- Core ORC extension
- Core GCP extension
- Document improvements
The full list of changes is here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+is%3Aclosed+milestone%3A0.15.0
Documentation for this release is at: http://druid.apache.org/docs/0.15.0-incubating/
Highlights
New Data Loader UI (Batch indexing part)
Druid has a new Data Loader UI which is integrated with the Druid Console. The new Data Loader UI shows sampled data so that you can easily verify an ingestion spec, and it generates the final ingestion spec automatically. Users can now easily issue batch indexing tasks instead of writing a JSON spec by hand.
Added by @vogievetsky and @dclim in #7572 and #7531, respectively.
Support Kafka Transactional Topics
The Kafka indexing service now supports Kafka Transactional Topics.
Please note that only Kafka 0.11.0 or later versions are supported after this change.
Added by @surekhasaharan in #6496.
New Moving Average Query
A new query type was introduced to compute moving averages.
Please see http://druid.apache.org/docs/0.15.0-incubating/development/extensions-contrib/moving-average-query.html for more details.
Time Ordering for Scan Query
The Scan query type now supports time ordering. Please see http://druid.apache.org/docs/0.15.0-incubating/querying/scan-query.html#time-ordering for more details.
Added by @justinborromeo in #7133.
New Moments Sketch Aggregator
The Moments Sketch is a new sketch type for approximate quantile computation. Please see http://druid.apache.org/docs/0.15.0-incubating/development/extensions-contrib/momentsketch-quantiles.html for more details.
SQL enhancements
The Druid community has been striving to enhance SQL support, and it is no longer experimental.
New SQL functions
- LPAD and RPAD functions were added by @xueyumusic in #7388.
- DEGREES and RADIANS functions were added by @xueyumusic in #7336.
- STRING_FORMAT function was added by @gianm in #7327.
- PARSE_LONG function was added by @gianm in #7326.
- ROUND function was added by @gianm in #7224.
- Trigonometric functions were added by @xueyumusic in #7182.
Autocomplete in Druid Console
Druid Console now supports autocomplete for SQL.
Time-ordered scan support for SQL
Druid SQL now supports time-ordered Scan queries.
Added by @justinborromeo in #7373.
Lookups view added to the web console
You can now configure your lookups from the web console directly.
Misc web console improvements
"NoSQL" mode : #7493 [@shuqi7]
The web console now has a backup mode that allows it to function as best as it can if DruidSQL is disabled or unavailable.
Added compaction configuration dialog : #7242 [@shuqi7]
You can now configure the auto compaction settings for a data source from the Datasource view.
Auto wrap query with limit : #7449 [@vogievetsky]
The console query view will now (by default) wrap DruidSQL queries with a `SELECT * FROM (...) LIMIT 1000`, allowing you to enter queries like `SELECT * FROM your_table` without worrying about the impact on the cluster. You can still send 'raw' queries by selecting the option from the `...` menu.
SQL explain query : #7402 [@shuqi7]
You can now click on the `...` menu in the query view to get an explanation of the DruidSQL query.
Surface `is_overshadowed` as a column in the segments table : #7555 , #7425 [@shuqi7][@surekhasaharan]
The `is_overshadowed` column indicates whether a segment is overshadowed by any published segments. It can be useful for seeing which segments should be loaded by Historicals. Please see http://druid.apache.org/docs/0.15.0-incubating/querying/sql.html for more details.
Improved status UI for actions on tasks, supervisors, and datasources : #7528 [@shuqi7]
This PR condenses the actions list into a tidy menu and lets you see the detailed status for supervisors and tasks. New actions for datasources around loading and dropping data by interval have also been added.
Light Lookup Module for Routers
A light lookup module was introduced for Routers, which now need only a minimal amount of memory. Please see http://druid.apache.org/docs/0.15.0-incubating/operations/basic-cluster-tuning.html#router for basic memory tuning.
Added by @clintropolis in #7222.
Core ORC extension
ORC extension is now promoted to a core extension. Please read the below 'Updating from 0.14.0-incubating and earlier' section if you are using the ORC extension in an earlier version of Druid.
Added by @clintropolis in #7138.
Core GCP extension
GCP extension is now promoted to a core extension. Please read the below 'Updating from 0.14.0-incubating and earlier' section if you are using the GCP extension in an earlier version of Druid.
Added by @drcrallen in #6953.
Document Improvements
Single-machine deployment example configurations and scripts
Several configurations and scripts were added for easy single machine setup. Please see http://druid.apache.org/docs/0.15.0-incubating/operations/single-server.html for details.
Tool for migrating from local deep storage/Derby metadata
A new tool was added for easy migration from single machine to a cluster environment. Please see http://druid.apache.org/docs/0.15.0-incubating/operations/deep-storage-migration.html for details.
Document for basic tuning guide
A basic cluster tuning guide was added. Please see http://druid.apache.org/docs/0.15.0-incubating/operations/basic-cluster-tuning.html for details.
Security Improvement
The Druid system table now requires only mandatory permissions instead of read permission on the whole `sys` database. Please see http://druid.apache.org/docs/0.15.0-incubating/development/extensions-core/druid-basic-security.html for details.
Deprecated/removed
Drop support for automatic segment merge
Automatic segment merging by the Coordinator is no longer supported. Please use auto-compaction instead.
Added by @jihoonson in #6883.
Drop support for the `insert-segment-to-db` tool
In Druid 0.14.x or earlier, Druid stores segment metadata (a `descriptor.json` file) in deep storage in addition to the metadata store. This behavior has changed in 0.15.0, which no longer stores the segment metadata file in deep storage. As a result, the `insert-segment-to-db` tool is no longer supported, since it works based on `descriptor.json` files in deep storage. Please see http://druid.apache.org/docs/0.15.0-incubating/operations/insert-segment-db.html for details.
Please note that the kill task will fail if you're using HDFS as deep storage and the `descriptor.json` file is missing in 0.14.x or earlier versions.
Added by @jihoonson in #6911.
Removed "useFallback" configuration for SQL
This option was removed since it generates unscalable query plans and doesn't work with some SQL functions.