
[Proposal] Data loader GUI #7502

Closed
vogievetsky opened this issue Apr 18, 2019 · 2 comments
vogievetsky commented Apr 18, 2019

Motivation

Druid’s ingestion system is very versatile, but the ingestion specs are currently complicated and require a meticulous examination of the documentation to craft. Some of this complexity exists because the specs evolved naturally over the lifetime of the project, supporting more and more features over time. Another difficulty is that there is no way to know whether an ingestion spec will work as intended without submitting it to Druid and seeing what happens.

A well-made, step-by-step GUI with a helpful wizard would not only make it easier to load data into Druid but would also serve as an educational tool for the features of the ingestion system. There would be less frustration and fewer failed tasks if users could get iterative feedback at every step of the process.

Proposed changes

Web interface

The proposed change is to build on top of the Druid web console to create a GUI spec editor / data loader wizard.

The specific ‘design direction’ is like that of TurboTax (or similar software). The ingestion spec will always be available for viewing and editing, but the user is expected to interact with it through a series of steps arranged in a logical flow, each building on the one before. The web console (GUI) change will be accompanied by a Druid-based “sampler” that accepts a partial spec and previews the Druid data structure it would generate. The UI would make repeated calls to the sampler, showing the user a progressively more refined preview as they progress through the steps (just as TurboTax shows a running refund preview).

At a high level, the data loader would guide the user through the following steps:

  • Connect and parse raw data
    • input config - configure the input location (Kafka, S3, local, etc.)
    • parser - configure the parser (JSON, CSV, TSV, regex, custom)
    • timestamp - configure the timestampSpec
  • Transform and configure schema
    • transform - configure the transforms
    • filter - configure the ingest time filter (if any)
    • dimensions + metrics - configure the dimensions and metrics
  • Parameters
    • partition - partitioning options like segment granularity, max segment size, secondary partitioning, and other options
    • tuning - define the tuning config
  • Full spec - see the full spec (like seeing the full tax return in TurboTax)

Here are some potential designs for some of the steps:

A user selects the input source (configuring the ioConfig) and gets a preview of the raw data in that source:

(screenshot)

A best effort is then made to choose a parser. The user can override the parser as needed:

(screenshot)

Once the parser is chosen the timestamp parsing can be configured and previewed:

(screenshot)

The transforms are applied to the parsed data:

(screenshot)

Then the user can preview the schema and add dimensions and metrics (and turn rollup on and off) as needed:

(screenshot)

And so on through all the steps.

At any point the user can jump to see / directly edit the full spec:

(screenshot)

And once they are satisfied, they click “Submit”, safe in the knowledge that everything will work.

Data Sampler

The data loader will be powered by a new sampler implementation that will be added to Druid’s core codebase. Some of the primary classes/interfaces that will be added/modified:

SamplerSpec

```java
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
public interface SamplerSpec
{
  SamplerResponse sample();
}
```

SamplerResource

Adds an endpoint on the Overlord at POST /druid/indexer/v1/sampler that accepts a request body implementing the SamplerSpec interface.
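
A minimal sketch of what this resource could look like, assuming a standard JAX-RS endpoint; the exact class layout here is an assumption of this sketch, not part of the proposal:

```java
// Illustrative sketch only: a JAX-RS resource that hands the deserialized SamplerSpec
// straight to its sample() method and returns the SamplerResponse as JSON.
@Path("/druid/indexer/v1/sampler")
public class SamplerResource
{
  @POST
  @Consumes(MediaType.APPLICATION_JSON)
  @Produces(MediaType.APPLICATION_JSON)
  public Response post(final SamplerSpec sampler)
  {
    // Jackson picks the concrete SamplerSpec implementation from the "type" property
    // (see the @JsonTypeInfo annotation on SamplerSpec above).
    return Response.ok(sampler.sample()).build();
  }
}
```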

IndexTaskSamplerSpec

This is an example of a sampler spec that can be used with any input source compatible with a native indexing task (i.e., Firehose-based). Note that it takes a complete IndexIngestionSpec, meaning that in addition to supporting arbitrary firehoses, it can also apply arbitrary parseSpecs, transformSpecs, and aggregators, and apply query granularity (the latter two are useful for previewing the effects of rollup).

```java
public class IndexTaskSamplerSpec implements SamplerSpec
{
  @JsonCreator
  public IndexTaskSamplerSpec(
      @JsonProperty("spec") final IndexTask.IndexIngestionSpec ingestionSpec,
      @JsonProperty("samplerConfig") final SamplerConfig samplerConfig,
      @JacksonInject FirehoseSampler firehoseSampler
  )
  {
    // ... (field assignments and the sample() implementation are elided in this proposal)
  }
}
```

Non-Firehose-based ingestion methods (e.g., Kafka and Kinesis indexing) would add additional SamplerSpec implementations.
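
For illustration only, a streaming implementation might look roughly like the following; the class name, the "kafka" type name, and the reuse of the supervisor spec for connection information are assumptions of this sketch:

```java
// Hypothetical sketch; KafkaSamplerSpec, the "kafka" type name, and the reuse of the
// supervisor spec here are illustrative assumptions, not part of this proposal.
@JsonTypeName("kafka")
public class KafkaSamplerSpec implements SamplerSpec
{
  private final KafkaSupervisorSpec supervisorSpec;
  private final SamplerConfig samplerConfig;
  private final FirehoseSampler firehoseSampler;

  @JsonCreator
  public KafkaSamplerSpec(
      @JsonProperty("spec") final KafkaSupervisorSpec supervisorSpec,
      @JsonProperty("samplerConfig") final SamplerConfig samplerConfig,
      @JacksonInject FirehoseSampler firehoseSampler
  )
  {
    this.supervisorSpec = supervisorSpec;
    this.samplerConfig = samplerConfig;
    this.firehoseSampler = firehoseSampler;
  }

  @Override
  public SamplerResponse sample()
  {
    // Would read a bounded number of records from the topic described by the supervisor
    // spec and run them through the same FirehoseSampler used by IndexTaskSamplerSpec.
    throw new UnsupportedOperationException("sketch only");
  }
}
```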

SamplerConfig

```java
public class SamplerConfig
{
  private static final int DEFAULT_NUM_ROWS = 200;
  private static final boolean DEFAULT_SKIP_CACHE = false;

  private final Integer numRows;   // maximum number of rows to sample (DEFAULT_NUM_ROWS when unset)
  private final String cacheKey;   // identifies previously cached raw data to re-sample
  private final boolean skipCache; // if true, bypass the cache and re-read from the source
}
```

SamplerResponse

```java
public class SamplerResponse
{
  private final String cacheKey;               // pass back in SamplerConfig to re-use the same cached raw data
  private final Integer numRowsRead;           // number of raw rows read from the source
  private final Integer numRowsIndexed;        // number of rows successfully parsed and indexed
  private final List<SamplerResponseRow> data; // per-row raw and parsed preview
}
```

SamplerResponseRow

```java
public class SamplerResponseRow
{
  private final String raw;                 // the raw, unparsed row as read from the source
  private final Map<String, Object> parsed; // the row after parsing and being run through the IncrementalIndex
  private final Boolean unparseable;        // true if the row could not be parsed
  private final String error;               // the parse error message, if any
}
```

Note that in addition to providing each row after it has been parsed and run through an IncrementalIndex, the response also includes the raw row. This is helpful in the initial stages of the data loader for confirming that you are reading the intended source data and applying the correct parseSpec and timestampSpec.

Firehose

The Firehose interface needs to be modified to allow implementations to return the raw row (where applicable) in addition to the InputRow. A default method that does not return the raw row will handle implementations that do not support this:

```java
  /**
   * Returns an InputRowPlusRaw object containing the InputRow plus the raw, unparsed data corresponding to the next row
   * available. Used in the sampler to provide the caller with information to assist in configuring a parse spec. If a
   * ParseException is thrown by the parser, it should be caught and returned in the InputRowPlusRaw so we will be able
   * to provide information on the raw row which failed to be parsed. Should only be called if hasMore returns true.
   *
   * @return an InputRowPlusRaw which may contain any of: an InputRow, the raw data, or a ParseException
   */
  default InputRowPlusRaw nextRowWithRaw()
  {
    try {
      return InputRowPlusRaw.of(nextRow(), null);
    }
    catch (ParseException e) {
      return InputRowPlusRaw.of(null, e);
    }
  }
```
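
For context, here is a minimal sketch of the InputRowPlusRaw holder referenced above, assuming it simply bundles the parsed InputRow, the raw unparsed data (when available), and any ParseException; the exact shape and factory methods are assumptions of this sketch:

```java
// Minimal sketch of a possible InputRowPlusRaw holder; the field types and factory
// methods below are assumptions made for illustration.
public class InputRowPlusRaw
{
  @Nullable private final InputRow inputRow;
  @Nullable private final byte[] raw;
  @Nullable private final ParseException parseException;

  private InputRowPlusRaw(
      @Nullable InputRow inputRow,
      @Nullable byte[] raw,
      @Nullable ParseException parseException
  )
  {
    this.inputRow = inputRow;
    this.raw = raw;
    this.parseException = parseException;
  }

  public static InputRowPlusRaw of(@Nullable InputRow inputRow, @Nullable byte[] raw)
  {
    return new InputRowPlusRaw(inputRow, raw, null);
  }

  public static InputRowPlusRaw of(@Nullable byte[] raw, @Nullable ParseException parseException)
  {
    return new InputRowPlusRaw(null, raw, parseException);
  }
}
```
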
FirehoseFactory

The FirehoseFactory interface may need to be modified to add a connection method suitable for a sampler, i.e. one that indicates to the implementation that we only need a limited amount of data and that it should not pre-cache an entire directory of files (PrefetchableTextFilesFirehoseFactory, I'm looking at you!). We may be able to skip this and instead document that API consumers should use appropriate Firehose configurations for the sampler:

```java
  default Firehose connectForSampler(T parser, @Nullable File temporaryDirectory) throws IOException, ParseException
  {
    return connect(parser, temporaryDirectory);
  }
```

Other considerations

A desirable design goal for the sampler is that it should return "processed" results for the same set of raw data every time, and the data should maintain a consistent ordering whenever possible (ideally the order in which it was read by the Firehose). This makes for a better user experience as the user moves through the pages of the data loader (raw data -> parsed -> timestamp column identified -> transformed -> filtered -> column data types applied). The proposal is to add a temporary internal column as a 'metric' that we can use to sort the results read back out of the IncrementalIndex.

Especially for streaming-based sources, we would also want to cache the raw data so that we can repeatedly feed the same raw data into the parser and let the user see the effects of their changes. We will use org.apache.druid.client.cache.Cache, i.e. whatever cache implementation is bound via druid.cache.type (Caffeine by default).
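
As a rough illustration of the caching side of this, something along the following lines could store and re-read the raw rows under the cacheKey returned in SamplerResponse; the namespace, key derivation, and serialization below are assumptions of this sketch:

```java
// Hedged sketch of raw-row caching using org.apache.druid.client.cache.Cache and
// org.apache.druid.java.util.common.StringUtils; the namespace, key derivation, and
// serialization are illustrative assumptions.
private static final String SAMPLER_CACHE_NAMESPACE = "sampler";

private Cache.NamedKey samplerKey(String cacheKey)
{
  return new Cache.NamedKey(SAMPLER_CACHE_NAMESPACE, StringUtils.toUtf8(cacheKey));
}

// Store the raw rows so subsequent sampler calls with the same cacheKey see identical data.
private void cacheRawRows(Cache cache, String cacheKey, List<byte[]> rawRows) throws IOException
{
  ByteArrayOutputStream baos = new ByteArrayOutputStream();
  try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
    oos.writeObject(new ArrayList<>(rawRows));
  }
  cache.put(samplerKey(cacheKey), baos.toByteArray());
}

// Read back the previously cached raw rows, or return null if nothing is cached.
@SuppressWarnings("unchecked")
private List<byte[]> readCachedRawRows(Cache cache, String cacheKey) throws IOException, ClassNotFoundException
{
  byte[] cached = cache.get(samplerKey(cacheKey));
  if (cached == null) {
    return null;
  }
  try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(cached))) {
    return (List<byte[]>) ois.readObject();
  }
}
```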

Rationale

This would greatly reduce the complexity of onboarding data into Druid.

A potential alternative to this would be to simply invest effort into simplifying the ingestion spec API.

Operational impact

None

Test plan

The data loader will be tested and improved through user testing.

Future work

Assuming the proposed project is successful, it would be beneficial to refocus the existing data loading documentation on the data loading flow. Furthermore, if the logical model of the data loader flow above resonates with users, it could motivate changes to the ingestion spec API itself.

@vogievetsky (Contributor, Author) commented:

The batch parts of this proposal have been merged and are working nicely together. Now we need to add the streaming components also.

@vogievetsky (Contributor, Author) commented:

This shipped in 0.15.0 🎆
