Replies: 7 comments 7 replies
-
#784 - Here is the PR that I am working on to remove `LogMessage`.
-
The main motivation is that we need to support a user schema and respect that schema. Today the preprocessor writes out `Trace.Span` objects; that schema information gets lost in `LogMessage`, since that's an intermediary object. Then we use that object to construct a `LuceneDocument`. So why have multiple translations? Just one format.
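To make the "multiple translations" concern concrete, here is a minimal sketch of the two-hop pipeline. The class and method names are simplified stand-ins, not the actual KalDB classes; the point is that each hop is a separate place where per-field information can be dropped.

```java
import java.util.HashMap;
import java.util.Map;

public class TranslationSketch {
    // Stand-in for the flat key/value tags carried on a Trace.Span.
    record Span(Map<String, String> tags) {}

    // Intermediary object: by this point any per-field type info is gone.
    record LogMessage(Map<String, Object> source) {}

    // Hop 1: Trace.Span -> LogMessage (every tag value becomes a String).
    static LogMessage toLogMessage(Span span) {
        return new LogMessage(new HashMap<>(span.tags()));
    }

    // Hop 2: LogMessage -> "Lucene document" (types re-derived from values,
    // because the intermediary never carried them).
    static Map<String, String> toLuceneDocument(LogMessage msg) {
        Map<String, String> doc = new HashMap<>();
        msg.source().forEach((k, v) -> doc.put(k, String.valueOf(v)));
        return doc;
    }

    public static void main(String[] args) {
        Span span = new Span(Map.of("duration_ms", "42"));
        // Two translations where one would do; each must be tested separately.
        System.out.println(toLuceneDocument(toLogMessage(span)).get("duration_ms"));
    }
}
```

Collapsing to one format means only one conversion has to be specified and tested.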
-
Some points on why removing `LogMessage` is a good idea: we have lots of tests and code around handling nested fields (i.e., maps, but not lists, etc.), and we support them in `LogMessage` today. However, the preprocessor writes out `Trace.Span` objects that the indexer reads and converts to a log message (`SpanFormatter#toLogMessage`). `Trace.Span` doesn't support nested fields today, so we've never used the feature and never tested how `SpanFormatter#toLogMessage` converts nested data. Having two data structures and translating between them means we need to add tests at both layers, and also test that the code translating between the two works correctly.
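As a sketch of why nesting is awkward in flat span tags: a nested object has to be encoded away, for instance with dotted keys (one common convention; this is hypothetical illustration code, not KalDB's actual flattening logic).

```java
import java.util.HashMap;
import java.util.Map;

public class NestedFieldSketch {
    // Flatten a nested LogMessage-style source map into Trace.Span-style
    // flat string tags, using dotted keys for nested entries.
    static Map<String, String> flattenToTags(String prefix, Map<String, ?> src) {
        Map<String, String> tags = new HashMap<>();
        for (Map.Entry<String, ?> e : src.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map<?, ?> m) {
                @SuppressWarnings("unchecked")
                Map<String, ?> nested = (Map<String, ?>) m;
                tags.putAll(flattenToTags(key, nested));
            } else {
                tags.put(key, String.valueOf(e.getValue()));
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // A LogMessage can represent nesting directly...
        Map<String, Object> logMessage =
            Map.of("http", Map.of("status", 200, "method", "GET"));
        // ...but flat tags cannot: after flattening, a nested
        // {"http": {"status": ...}} and a literal key "http.status"
        // are indistinguishable, so the round trip is lossy.
        System.out.println(flattenToTags("", logMessage));
    }
}
```

This lossiness is exactly the kind of conversion behavior that would need dedicated tests if both data structures stay.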
-
Thanks for the comments and the discussion. I sense we may be talking about different problems here. If so, are these the problems we are trying to solve, or are there others as well?
I updated the description describing the current state and the proposed solution for schemas. Once you confirm the other problems, we can refine the solution. PTAL.
-
@mansu I'd love to drill down on these two points in the implementation part.
We want users to be able to define: this is a "keyword" field; this is a "text" field; this is a float / half float / int / long / binary field. Essentially all the field types that OpenSearch supports. So today, like you also mentioned in the description, we use the … Now I am not sure how "The indexer needs no changes in this model since all of the schema enforcement is done in the pre-processor before a message is indexed." can happen. How can the indexer take a `Trace.Span`, convert it to a `LogMessage`, and still preserve the info that field "my_custom_message_field" is a text field and not a string field? The conversion happens in `SpanFormatter#toLogMessage`, and we lose all the info about text vs string, float vs half float vs double, etc.
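The type-erasure problem can be sketched in a few lines (hypothetical field names; assumes the conversion boxes tag values into plain Java objects, as the paragraph above describes): once a value sits in a generic map, "text" vs "keyword" and "half_float" vs "float" collapse to the same Java types.

```java
import java.util.Map;

public class TypeErasureSketch {
    // After a SpanFormatter#toLogMessage-style conversion, the only type
    // information left is the Java runtime type of each boxed value.
    static String indexedJavaType(Object value) {
        return value.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        // The user's schema said: my_custom_message_field is "text",
        // request_time is "half_float".
        Map<String, Object> logMessage =
            Map.of("my_custom_message_field", "hello", "request_time", 0.25f);

        // But the indexer only sees a String and a Float; it cannot
        // recover "text" vs "keyword", or "half_float" vs "float".
        System.out.println(indexedJavaType(logMessage.get("my_custom_message_field"))); // String
        System.out.println(indexedJavaType(logMessage.get("request_time")));            // Float
    }
}
```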
-
@vthacker I am a bit worried that we are making such a large change without thinking this design through. For example, the current implementation adds the schema to every message indexed into KalDB. That can potentially double the size of each message written to the input Kafka topic and add a lot of overhead to parse out these fields on the indexer. It may double the CPU processing and reduce indexer throughput. A simpler way to build this is as follows:

This approach is much more efficient, since you are adding far less metadata to every message indexed by the indexer. Further, it is a much smaller code change on both the indexer and the pre-processor. I think if we put our heads together, further optimizations are possible.

We also need to think about schema evolution and how to handle it. What happens if the schema changes? When will the new schema go into effect?

Thinking more deeply about this option, it seems that we should keep only the `LogMessage` interface and get rid of `Trace.Span` on the indexer. If the dataset tag on a message can carry the schema, we no longer need to encode the type on the message, since it will be in the schema. The schema can contain all the info about how to index a specific field. This will reduce message size and the amount of data written to the indexer's Kafka topic. It will also make the indexer more efficient, since it has to process a much smaller data set. Further, since `LogMessage` is JSON, we can also get additional features like nested fields for free in the future. The current spans can be a schema on the dataset as well.

I would advise pausing on the implementation before we proceed further, since it's not very efficient.
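A minimal sketch of the "schema lives on the dataset" alternative (all names hypothetical; the schema store and dataset name are invented for illustration): the message carries only its dataset tag plus raw values, and the indexer resolves field types from the dataset's registered schema instead of per-message metadata.

```java
import java.util.Map;

public class DatasetSchemaSketch {
    // Per-dataset schema, registered once (e.g. in a metadata store),
    // rather than repeated on every message in the Kafka topic.
    static final Map<String, Map<String, String>> SCHEMAS = Map.of(
        "payments", Map.of("amount", "float", "memo", "text"));

    // Resolve a field's index type from the dataset schema; fall back to
    // dynamic mapping when the schema doesn't name the field.
    static String fieldType(String dataset, String field) {
        return SCHEMAS.getOrDefault(dataset, Map.of())
                      .getOrDefault(field, "dynamic");
    }

    public static void main(String[] args) {
        System.out.println(fieldType("payments", "memo"));   // text
        System.out.println(fieldType("payments", "other"));  // dynamic
    }
}
```

The design trade-off in the comment above: the lookup adds one indirection on the indexer, but each message stays small because type information is stored once per dataset instead of once per field per message.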
-
Do you actually believe this to be true? Let's assume we add another enum (int) to each tag. Even then it won't double, will it? We have 3 fields already in the tag field.
Subsequent PRs can remove `LogMessage`. Anyway, I'll play around with your design idea as well and try to see if that works better. Stay tuned on #778.
Thanks for the feedback on the design! I think we have enough info for now. As always, PRs welcome if you also want to work on this :)
-
WIP - still updating.
**Current Ingestion path**
Currently, the ingestion path is as follows. The pre-processor ingests an event from Kafka or via the `_bulk` API ingestion point. These messages are parsed, and we write out a `Trace.Span` object to the Kafka topic that the indexer consumes from.

The indexer ingests data from Kafka and builds Lucene indexes out of it. The indexer can ingest data in multiple formats today, but two formats are the most valuable: the `LogMessage` format and the `Trace.Span` format. The `LogMessage` class can take on any generic JSON message; such a message can contain nested fields, lists, and other such objects. `Trace.Span`, on the other hand, is a subset of a log message that contains a structured log message. The indexer can consume both message types and index them into Lucene today.

**Current schema design**
Currently, the schema works as follows. When a message is indexed, the schema is updated using a mechanism similar to dynamic field mapping in ES, with one exception: in ES, when a field type conflict occurs, the message is discarded. In KalDB, we resolve the field conflict using a field conflict resolution policy. All of this code is implemented in `SchemaAwareLogDocumentBuilderImpl`.
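As a rough sketch of what such a conflict resolution policy can look like (simplified, hypothetical logic in the spirit of `SchemaAwareLogDocumentBuilderImpl`, not its actual code): instead of discarding a message whose field type conflicts with the already-mapped type, coerce the conflicting value.

```java
import java.util.HashMap;
import java.util.Map;

public class ConflictPolicySketch {
    // field name -> first-seen type (a toy two-type "schema").
    static final Map<String, String> schema = new HashMap<>();

    // Hypothetical policy: first type wins; on conflict, index the value
    // as a string rather than dropping the whole message (which is what
    // ES dynamic mapping would do).
    static Object resolve(String field, Object value) {
        String incoming = (value instanceof Number) ? "long" : "text";
        String existing = schema.putIfAbsent(field, incoming);
        if (existing == null || existing.equals(incoming)) {
            return value;               // types agree; index as-is
        }
        return String.valueOf(value);   // conflict: fall back to string
    }

    public static void main(String[] args) {
        System.out.println(resolve("latency", 12));    // registers "long", kept as a number
        System.out.println(resolve("latency", "n/a")); // conflicts, indexed as a string
    }
}
```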
**Implementing user provided Schema**
Why:
Currently, we dynamically create the fields for a message in the indexer. While this works great, we need user-provided schemas to ensure that user-defined types are enforced. The user-provided schemas serve a few specific needs:
Questions:
Implementation
Assuming we need schemas, the simplest implementation of schemas is as follows: … `dataset` object. … `Trace.Span` or `LogMessage` object.

Relevant Links