Motivation
Druid’s ingestion system is very versatile; however, the ingestion specs are currently complicated and require a meticulous reading of the documentation to craft. Some of the complexity comes from the fact that the specs evolved organically over the lifetime of the project, supporting more and more features over time. Another difficulty is that there is no way to know whether an ingestion spec will work as intended without submitting it to Druid and seeing what happens.
A well-made, step-by-step GUI with a helpful wizard would not only make it easier to get data loaded into Druid but could also serve as an educational tool for the features of the ingestion system. There would be less frustration and fewer failed tasks if users could get iterative feedback at every step of the process.
Proposed changes
Web interface
The proposed change is to build on top of the Druid web console to create a GUI spec editor / data loader wizard.
The specific ‘design direction’ is like that of TurboTax (or similar software). The ingestion spec will always be available for viewing / editing, but the user is expected to interact with it through a series of steps arranged in a logical flow, each building on the previous one. The web console (GUI) change will be accompanied by a Druid-based “sampler” that accepts a partial spec and previews the Druid data structure that would be generated from it. The UI would make repeated calls to the sampler module, showing the user a progressively more refined preview as they progress through the steps (just like TurboTax gives a refund preview).
At a high level the data loader would guide the user through the following steps:
- parser - configure the parser (json, csv, tsv, regex, custom)
- timestamp - configure the timestampSpec
- Transform and configure schema:
  - transform - configure the transforms
  - filter - configure the ingest time filter (if any)
  - dimensions + metrics - configure the dimensions and metrics
- Parameters:
  - partition - partitioning options like segment granularity, max segment size, secondary partitioning, and other options
  - tuning - define the tuning config
- Full spec - see the full spec (like seeing the full tax return in TurboTax)
Here are some potential designs for some of the steps:
A user selects the input source (configuring the ioConfig) and gets a preview of the raw data in that source:
A best effort is then made to choose a parser. The user can override the parser as needed:
Once the parser is chosen the timestamp parsing can be configured and previewed:
The transforms apply on the parsed data:
Then the user can preview the schema and add dimensions and metrics (and turn rollup on and off) as needed:
And so on through all the steps.
At any point the user can jump to see / directly edit the full spec:
And then, when they are satisfied, they click “Submit”, safe in the knowledge that everything will work.
Data Sampler
The data loader will be powered by a new sampler implementation that will be added to Druid’s core codebase. Some of the primary classes/interfaces that will be added/modified:
SamplerSpec
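The SamplerSpec interface itself is not spelled out in this proposal. Based on how it is used below (IndexTaskSamplerSpec implements it and the endpoint returns a SamplerResponse), a minimal sketch might look like the following; the single sample() method is an assumption:
// Hypothetical sketch of the SamplerSpec interface, inferred from its usage below.
public interface SamplerSpec
{
  SamplerResponse sample();
}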
SamplerResource
Adds an endpoint on the overlord at POST /druid/indexer/v1/sampler which receives an object in the request body that implements the SamplerSpec interface.
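For illustration only (the exact resource shape is not dictated by this proposal), the endpoint could be a standard JAX-RS resource on the overlord that simply delegates to the submitted spec:
// Illustrative sketch of the proposed resource; the class and method names here are assumptions.
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/druid/indexer/v1/sampler")
public class SamplerResource
{
  @POST
  @Consumes(MediaType.APPLICATION_JSON)
  @Produces(MediaType.APPLICATION_JSON)
  public Response post(final SamplerSpec samplerSpec)
  {
    // The request body is deserialized into a SamplerSpec implementation
    // (e.g. the IndexTaskSamplerSpec below) and sampling is delegated to it.
    return Response.ok(samplerSpec.sample()).build();
  }
}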
IndexTaskSamplerSpec
This is an example of a sampler spec that can be used with any input source compatible with a native indexing task (i.e. Firehose-based). Note that it takes a complete IndexIngestionSpec, meaning that in addition to supporting arbitrary firehoses, it can also apply arbitrary parseSpecs, transformSpecs, aggregators, and query granularity (the latter two are for previewing the effects of rollup).
public class IndexTaskSamplerSpec implements SamplerSpec
{
  @JsonCreator
  public IndexTaskSamplerSpec(
      @JsonProperty("spec") final IndexTask.IndexIngestionSpec ingestionSpec,
      @JsonProperty("samplerConfig") final SamplerConfig samplerConfig,
      @JacksonInject FirehoseSampler firehoseSampler
  )
  {
    // constructor body elided in this excerpt; it stores the spec, config, and injected sampler
  }
}
Non-Firehose-based ingestion methods (e.g. Kafka and Kinesis indexing) would add additional SamplerSpec implementations.
SamplerConfig
public class SamplerConfig
{
  private static final int DEFAULT_NUM_ROWS = 200;
  private static final boolean DEFAULT_SKIP_CACHE = false;

  private final Integer numRows;   // number of rows to sample (defaults to DEFAULT_NUM_ROWS)
  private final String cacheKey;   // key of previously cached raw data to reuse, if any
  private final boolean skipCache; // if true, bypass the raw-data cache
}
SamplerResponse
public class SamplerResponse
{
  private final String cacheKey;       // key under which the raw data was cached, for reuse in subsequent requests
  private final Integer numRowsRead;   // rows read from the input source
  private final Integer numRowsIndexed; // rows successfully parsed and indexed
  private final List<SamplerResponseRow> data;
}
SamplerResponseRow
public class SamplerResponseRow
{
  private final String raw;                 // the raw, unparsed row
  private final Map<String, Object> parsed; // the row after parsing and indexing
  private final Boolean unparseable;        // true if the row could not be parsed
  private final String error;               // error message, if any
}
Note that in addition to providing the parsed row after it has been run through an IncrementalIndex, the response also includes the raw row, which is helpful in the initial stages of the data loader for confirming that you are reading the intended source data and applying the correct parseSpec and timestampSpec.
Firehose
The Firehose interface needs to be modified to allow implementations to return the raw row (where applicable) in addition to the InputRow. A default method that does not return the raw row will handle implementations that do not support this:
/**
 * Returns an InputRowPlusRaw object containing the InputRow plus the raw, unparsed data corresponding to the next row
 * available. Used in the sampler to provide the caller with information to assist in configuring a parse spec. If a
 * ParseException is thrown by the parser, it should be caught and returned in the InputRowPlusRaw so we will be able
 * to provide information on the raw row which failed to be parsed. Should only be called if hasMore returns true.
 *
 * @return an InputRowPlusRaw which may contain any of: an InputRow, the raw data, or a ParseException
 */
default InputRowPlusRaw nextRowWithRaw()
{
  try {
    return InputRowPlusRaw.of(nextRow(), null);
  }
  catch (ParseException e) {
    return InputRowPlusRaw.of(null, e);
  }
}
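InputRowPlusRaw is referenced above but not defined in this proposal. A minimal sketch of the holder, with field types and factory overloads assumed from the usage above, could be:
// Hypothetical sketch of the InputRowPlusRaw holder used by nextRowWithRaw().
// Field types (e.g. byte[] for the raw data) and the two of() overloads are assumptions.
public class InputRowPlusRaw
{
  private final InputRow inputRow;             // parsed row, null if parsing failed
  private final byte[] raw;                    // raw bytes of the row, null if not available
  private final ParseException parseException; // set when the parser threw

  private InputRowPlusRaw(InputRow inputRow, byte[] raw, ParseException parseException)
  {
    this.inputRow = inputRow;
    this.raw = raw;
    this.parseException = parseException;
  }

  public static InputRowPlusRaw of(InputRow inputRow, byte[] raw)
  {
    return new InputRowPlusRaw(inputRow, raw, null);
  }

  public static InputRowPlusRaw of(byte[] raw, ParseException parseException)
  {
    return new InputRowPlusRaw(null, raw, parseException);
  }
}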
FirehoseFactory
The FirehoseFactory interface may need to be modified to add a connection method suitable for a sampler. This means indicating to the sampler that we only need a limited amount of data and should not pre-cache an entire directory of files (PrefetchableTextFilesFirehoseFactory, I'm looking at you!). We may be able to skip this and instead document that API consumers should use appropriate Firehose configurations for the sampler.
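If such a method were added, it could be a default method on FirehoseFactory that falls back to the normal connect. The sketch below is illustrative only: the method name connectForSampler() and the simplified connect() signature are assumptions, and the proposal leaves open whether this change is needed at all.
// Illustrative sketch only; names and signatures here are assumptions.
import java.io.File;
import java.io.IOException;

public interface FirehoseFactory<T extends InputRowParser>
{
  Firehose connect(T parser, File temporaryDirectory) throws IOException;

  // For sampling we only need a handful of rows, so implementations that normally
  // prefetch or cache entire inputs could override this to avoid that work.
  default Firehose connectForSampler(T parser, File temporaryDirectory) throws IOException
  {
    return connect(parser, temporaryDirectory);
  }
}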
Other considerations
A desirable design goal for the sampler is that it should return "processed" results on the same set of raw data every time, and the data should maintain a consistent ordering whenever possible (ideally the order in which it was read by the Firehose). This will make for a better user experience as the user goes through the different pages of the data loader (raw data -> parsed -> timestamp column identified -> transformed -> filtered -> column data types applied). The proposal is to add a temporary internal column as a 'metric' that we can use to sort the results we read back out of the IncrementalIndex. Especially for streaming-based sources, we would also want to cache the raw data so that we can continually feed the same raw data into the parser and the user can see the effects of their changes. We will use org.apache.druid.client.cache.Cache, so whichever cache implementation is bound via druid.cache.type (default: Caffeine) will be used.
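To make the ordering idea concrete, here is a small, self-contained sketch; the internal column name is hypothetical, and the point is only that rows read back from the IncrementalIndex can be restored to the order in which the Firehose produced them:
// Self-contained illustration of the ordering idea; the column name is hypothetical.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class SamplerOrderingSketch
{
  // Temporary internal 'metric' column carrying a monotonically increasing row number.
  static final String ORDERING_COLUMN = "__internal_sampler_order";

  static List<Map<String, Object>> sortByOriginalOrder(List<Map<String, Object>> indexedRows)
  {
    List<Map<String, Object>> sorted = new ArrayList<>(indexedRows);
    sorted.sort(Comparator.comparingLong((Map<String, Object> row) -> ((Number) row.get(ORDERING_COLUMN)).longValue()));
    return sorted;
  }
}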
Rationale
This would greatly reduce the complexity of onboarding data into Druid.
A potential alternative to this would be to simply invest effort into simplifying the ingestion spec API.
Operational impact
None
Test plan
The data loader will be tested and improved through user testing.
Future work
Assuming the proposed project is successful, it would be beneficial to refocus the existing data loading documentation around the data loading flow. Furthermore, if the logical model of the data loader flow above resonates with users, it could motivate changes to the ingestion spec API itself.