
[Proposal] Data loader GUI #7502

Closed
vogievetsky opened this issue Apr 18, 2019 · 2 comments
vogievetsky commented Apr 18, 2019

Motivation

Druid’s ingestion system is very versatile, but the ingestion specs are currently complicated and require a meticulous examination of the documentation to craft. Some of this complexity exists because the specs evolved naturally over the lifetime of the project, supporting more and more features over time. Another difficulty is that there is no way to know whether an ingestion spec will work as intended without submitting it to Druid and seeing what happens.

A well-made, step-by-step GUI with a helpful wizard would not only make it easier to load data into Druid but would also serve as an educational tool for the features of the ingestion system. There would be less frustration and fewer failed tasks if users could get iterative feedback at every step of the process.

Proposed changes

Web interface

The proposed change is to build on top of the Druid web console to create a GUI spec editor / data loader wizard.

The specific ‘design direction’ is like that of TurboTax (or similar software). The ingestion spec will always be available for viewing and editing, but the user is expected to interact with it through a series of steps arranged in a logical flow, each building on the one before. The web console (GUI) change will be accompanied by a Druid-based “sampler” that accepts a partial spec and previews the Druid data structure it would generate. The UI would make repeated calls to the sampler, showing the user a progressively more refined preview as they progress through the steps (just as TurboTax shows a running refund preview).

At a high level, the data loader would guide the user through the following steps:

  • Connect and parse raw data
    • input config - configure the input location (Kafka, S3, local, etc.)
    • parser - configure the parser (JSON, CSV, TSV, regex, custom)
    • timestamp - configure the timestampSpec
  • Transform and configure schema
    • transform - configure the transforms
    • filter - configure the ingest time filter (if any)
    • dimensions + metrics - configure the dimensions and metrics
  • Parameters
    • partition - partitioning options like segment granularity, max segment size, secondary partitioning, and other options
    • tuning - define the tuning config
  • Full spec - see the full spec (like seeing the full tax return in TurboTax)

Here are some potential designs for some of the steps:

A user selects the input source (configuring the ioConfig) and gets a preview of the raw data in that source:

(screenshot)

A best effort is then made to choose a parser. The user can override the parser as needed:

(screenshot)

Once the parser is chosen the timestamp parsing can be configured and previewed:

(screenshot)

The transforms are applied to the parsed data:

(screenshot)

Then the user can preview the schema and add dimensions and metrics (and turn rollup on and off) as needed:

(screenshot)

And so on through all the steps.

At any point the user can jump to see / directly edit the full spec:

(screenshot)

And once they are satisfied, they click “Submit”, safe in the knowledge that everything will work.

Data Sampler

The data loader will be powered by a new sampler implementation that will be added to Druid’s core codebase. Some of the primary classes/interfaces that will be added/modified:

SamplerSpec

```java
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
public interface SamplerSpec
{
  SamplerResponse sample();
}
```

SamplerResource

Adds an endpoint on the Overlord at POST /druid/indexer/v1/sampler that accepts a request body implementing the SamplerSpec interface.
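
A minimal sketch of what this resource could look like, assuming a standard JAX-RS endpoint; the exact class layout here is an assumption of this sketch, not part of the proposal:

```java
// Illustrative sketch only: a JAX-RS resource that hands the deserialized SamplerSpec
// straight to its sample() method and returns the SamplerResponse as JSON.
@Path("/druid/indexer/v1/sampler")
public class SamplerResource
{
  @POST
  @Consumes(MediaType.APPLICATION_JSON)
  @Produces(MediaType.APPLICATION_JSON)
  public Response post(final SamplerSpec sampler)
  {
    // Jackson picks the concrete SamplerSpec implementation from the "type" property
    // (see the @JsonTypeInfo annotation on SamplerSpec above).
    return Response.ok(sampler.sample()).build();
  }
}
```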

IndexTaskSamplerSpec

This is an example of a sampler spec that can be used with any input source compatible with a native indexing task (i.e., Firehose-based). Note that it takes a complete IndexIngestionSpec, meaning that in addition to supporting arbitrary firehoses, it can also apply arbitrary parseSpecs, transformSpecs, and aggregators, and apply query granularity (the latter two are useful for previewing the effects of rollup).

```java
public class IndexTaskSamplerSpec implements SamplerSpec
{
  @JsonCreator
  public IndexTaskSamplerSpec(
      @JsonProperty("spec") final IndexTask.IndexIngestionSpec ingestionSpec,
      @JsonProperty("samplerConfig") final SamplerConfig samplerConfig,
      @JacksonInject FirehoseSampler firehoseSampler
  )
  {
    // ... (field assignments and the sample() implementation are elided in this proposal)
  }
}
```

Non-Firehose-based ingestion methods (e.g., Kafka and Kinesis indexing) would add additional SamplerSpec implementations.
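
For illustration only, a streaming implementation might look roughly like the following; the class name, the "kafka" type name, and the reuse of the supervisor spec for connection information are assumptions of this sketch:

```java
// Hypothetical sketch; KafkaSamplerSpec, the "kafka" type name, and the reuse of the
// supervisor spec here are illustrative assumptions, not part of this proposal.
@JsonTypeName("kafka")
public class KafkaSamplerSpec implements SamplerSpec
{
  private final KafkaSupervisorSpec supervisorSpec;
  private final SamplerConfig samplerConfig;
  private final FirehoseSampler firehoseSampler;

  @JsonCreator
  public KafkaSamplerSpec(
      @JsonProperty("spec") final KafkaSupervisorSpec supervisorSpec,
      @JsonProperty("samplerConfig") final SamplerConfig samplerConfig,
      @JacksonInject FirehoseSampler firehoseSampler
  )
  {
    this.supervisorSpec = supervisorSpec;
    this.samplerConfig = samplerConfig;
    this.firehoseSampler = firehoseSampler;
  }

  @Override
  public SamplerResponse sample()
  {
    // Would read a bounded number of records from the topic described by the supervisor
    // spec and run them through the same FirehoseSampler used by IndexTaskSamplerSpec.
    throw new UnsupportedOperationException("sketch only");
  }
}
```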

SamplerConfig

```java
public class SamplerConfig
{
  private static final int DEFAULT_NUM_ROWS = 200;
  private static final boolean DEFAULT_SKIP_CACHE = false;

  private final Integer numRows;   // maximum number of rows to sample (DEFAULT_NUM_ROWS when unset)
  private final String cacheKey;   // identifies previously cached raw data to re-sample
  private final boolean skipCache; // if true, bypass the cache and re-read from the source
}
```

SamplerResponse

```java
public class SamplerResponse
{
  private final String cacheKey;               // pass back in SamplerConfig to re-use the same cached raw data
  private final Integer numRowsRead;           // number of raw rows read from the source
  private final Integer numRowsIndexed;        // number of rows successfully parsed and indexed
  private final List<SamplerResponseRow> data; // per-row raw and parsed preview
}
```

SamplerResponseRow

```java
public class SamplerResponseRow
{
  private final String raw;                 // the raw, unparsed row as read from the source
  private final Map<String, Object> parsed; // the row after parsing and being run through the IncrementalIndex
  private final Boolean unparseable;        // true if the row could not be parsed
  private final String error;               // the parse error message, if any
}
```

Note that in addition to providing each row after it has been parsed and run through an IncrementalIndex, the response also includes the raw row. This is helpful in the initial stages of the data loader for confirming that you are reading the intended source data and applying the correct parseSpec and timestampSpec.

Firehose

The Firehose interface needs to be modified to allow implementations to return the raw row (where applicable) in addition to the InputRow. A default method that does not return the raw row will handle implementations that do not support this:

```java
  /**
   * Returns an InputRowPlusRaw object containing the InputRow plus the raw, unparsed data corresponding to the next row
   * available. Used in the sampler to provide the caller with information to assist in configuring a parse spec. If a
   * ParseException is thrown by the parser, it should be caught and returned in the InputRowPlusRaw so we will be able
   * to provide information on the raw row which failed to be parsed. Should only be called if hasMore returns true.
   *
   * @return an InputRowPlusRaw which may contain any of: an InputRow, the raw data, or a ParseException
   */
  default InputRowPlusRaw nextRowWithRaw()
  {
    try {
      return InputRowPlusRaw.of(nextRow(), null);
    }
    catch (ParseException e) {
      return InputRowPlusRaw.of(null, e);
    }
  }
```
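
For context, here is a minimal sketch of the InputRowPlusRaw holder referenced above, assuming it simply bundles the parsed InputRow, the raw unparsed data (when available), and any ParseException; the exact shape and factory methods are assumptions of this sketch:

```java
// Minimal sketch of a possible InputRowPlusRaw holder; the field types and factory
// methods below are assumptions made for illustration.
public class InputRowPlusRaw
{
  @Nullable private final InputRow inputRow;
  @Nullable private final byte[] raw;
  @Nullable private final ParseException parseException;

  private InputRowPlusRaw(
      @Nullable InputRow inputRow,
      @Nullable byte[] raw,
      @Nullable ParseException parseException
  )
  {
    this.inputRow = inputRow;
    this.raw = raw;
    this.parseException = parseException;
  }

  public static InputRowPlusRaw of(@Nullable InputRow inputRow, @Nullable byte[] raw)
  {
    return new InputRowPlusRaw(inputRow, raw, null);
  }

  public static InputRowPlusRaw of(@Nullable byte[] raw, @Nullable ParseException parseException)
  {
    return new InputRowPlusRaw(null, raw, parseException);
  }
}
```
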
FirehoseFactory

The FirehoseFactory interface may need to be modified to add a connection method suitable for a sampler, i.e. one that indicates to the implementation that we only need a limited amount of data and that it should not pre-cache an entire directory of files (PrefetchableTextFilesFirehoseFactory, I'm looking at you!). We may be able to skip this and instead document that API consumers should use appropriate Firehose configurations for the sampler:

```java
  default Firehose connectForSampler(T parser, @Nullable File temporaryDirectory) throws IOException, ParseException
  {
    return connect(parser, temporaryDirectory);
  }
```

Other considerations

A desirable design goal for the sampler is that it should return "processed" results for the same set of raw data every time, and the data should maintain a consistent ordering whenever possible (ideally the order in which it was read by the Firehose). This makes for a better user experience as the user moves through the pages of the data loader (raw data -> parsed -> timestamp column identified -> transformed -> filtered -> column data types applied). The proposal is to add a temporary internal column as a 'metric' that we can use to sort the results read back out of the IncrementalIndex.

Especially for streaming-based sources, we would also want to cache the raw data so that we can repeatedly feed the same raw data into the parser and let the user see the effects of their changes. We will use org.apache.druid.client.cache.Cache, i.e. whatever cache implementation is bound via druid.cache.type (Caffeine by default).
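
As a rough illustration of the caching side of this, something along the following lines could store and re-read the raw rows under the cacheKey returned in SamplerResponse; the namespace, key derivation, and serialization below are assumptions of this sketch:

```java
// Hedged sketch of raw-row caching using org.apache.druid.client.cache.Cache and
// org.apache.druid.java.util.common.StringUtils; the namespace, key derivation, and
// serialization are illustrative assumptions.
private static final String SAMPLER_CACHE_NAMESPACE = "sampler";

private Cache.NamedKey samplerKey(String cacheKey)
{
  return new Cache.NamedKey(SAMPLER_CACHE_NAMESPACE, StringUtils.toUtf8(cacheKey));
}

// Store the raw rows so subsequent sampler calls with the same cacheKey see identical data.
private void cacheRawRows(Cache cache, String cacheKey, List<byte[]> rawRows) throws IOException
{
  ByteArrayOutputStream baos = new ByteArrayOutputStream();
  try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
    oos.writeObject(new ArrayList<>(rawRows));
  }
  cache.put(samplerKey(cacheKey), baos.toByteArray());
}

// Read back the previously cached raw rows, or return null if nothing is cached.
@SuppressWarnings("unchecked")
private List<byte[]> readCachedRawRows(Cache cache, String cacheKey) throws IOException, ClassNotFoundException
{
  byte[] cached = cache.get(samplerKey(cacheKey));
  if (cached == null) {
    return null;
  }
  try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(cached))) {
    return (List<byte[]>) ois.readObject();
  }
}
```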

Rationale

This would greatly reduce the complexity of onboarding data into Druid.

A potential alternative to this would be to simply invest effort into simplifying the ingestion spec API.

Operational impact

None

Test plan

The data loader will be tested and improved through user testing.

Future work

Assuming the proposed project is successful, it would be beneficial to refocus the existing data loading documentation on the data loading flow. Furthermore, if the logical model of the data loader flow above resonates with users, it could motivate changes to the ingestion spec API itself.

@vogievetsky (Contributor, Author) commented:

The batch parts of this proposal have been merged and are working nicely together. Now we need to add the streaming components also.

@vogievetsky (Contributor, Author) commented:

This shipped in 0.15.0 🎆
