Replies: 7 comments 7 replies
-
#784 - Here is the PR that I am working on to remove `LogMessage`.
-
The main motivation is that we need to support a user schema and respect that schema. Today the preprocessor writes out `Trace.Span` objects; that schema information gets lost in `LogMessage`, since that's an intermediary object. Then we use that object to construct a `LuceneDocument`. So why have multiple translations? Just one format.
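To make the "multiple translations" concern concrete, here is a minimal sketch of the two-hop pipeline. The class and method names are simplified stand-ins, not the actual KalDB classes; the point is that each hop is a separate place where per-field information can be dropped.

```java
import java.util.HashMap;
import java.util.Map;

public class TranslationSketch {
    // Stand-in for the flat key/value tags carried on a Trace.Span.
    record Span(Map<String, String> tags) {}

    // Intermediary object: by this point any per-field type info is gone.
    record LogMessage(Map<String, Object> source) {}

    // Hop 1: Trace.Span -> LogMessage (every tag value becomes a String).
    static LogMessage toLogMessage(Span span) {
        return new LogMessage(new HashMap<>(span.tags()));
    }

    // Hop 2: LogMessage -> "Lucene document" (types re-derived from values,
    // because the intermediary never carried them).
    static Map<String, String> toLuceneDocument(LogMessage msg) {
        Map<String, String> doc = new HashMap<>();
        msg.source().forEach((k, v) -> doc.put(k, String.valueOf(v)));
        return doc;
    }

    public static void main(String[] args) {
        Span span = new Span(Map.of("duration_ms", "42"));
        // Two translations where one would do; each must be tested separately.
        System.out.println(toLuceneDocument(toLogMessage(span)).get("duration_ms"));
    }
}
```

Collapsing to one format means only one conversion has to be specified and tested.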
-
Some points on why removing `LogMessage` is a good idea: we have lots of tests and code around handling nested fields (i.e., maps, but not lists, etc.), and we support them in `LogMessage` today. However, the preprocessor writes out `Trace.Span` objects that the indexer reads and converts to a log message (`SpanFormatter#toLogMessage`). `Trace.Span` doesn't support nested fields today, so we've never used the feature and never tested how `SpanFormatter#toLogMessage` converts nested data. Having two data structures and translating between them means we need to add tests at both layers, and also test that the code translating between the two works correctly.
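As a sketch of why nesting is awkward in flat span tags: a nested object has to be encoded away, for instance with dotted keys (one common convention; this is hypothetical illustration code, not KalDB's actual flattening logic).

```java
import java.util.HashMap;
import java.util.Map;

public class NestedFieldSketch {
    // Flatten a nested LogMessage-style source map into Trace.Span-style
    // flat string tags, using dotted keys for nested entries.
    static Map<String, String> flattenToTags(String prefix, Map<String, ?> src) {
        Map<String, String> tags = new HashMap<>();
        for (Map.Entry<String, ?> e : src.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map<?, ?> m) {
                @SuppressWarnings("unchecked")
                Map<String, ?> nested = (Map<String, ?>) m;
                tags.putAll(flattenToTags(key, nested));
            } else {
                tags.put(key, String.valueOf(e.getValue()));
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // A LogMessage can represent nesting directly...
        Map<String, Object> logMessage =
            Map.of("http", Map.of("status", 200, "method", "GET"));
        // ...but flat tags cannot: after flattening, a nested
        // {"http": {"status": ...}} and a literal key "http.status"
        // are indistinguishable, so the round trip is lossy.
        System.out.println(flattenToTags("", logMessage));
    }
}
```

This lossiness is exactly the kind of conversion behavior that would need dedicated tests if both data structures stay.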
-
Thanks for the comments and the discussion. I sense we may be talking about different problems here. If so, are these the problems we are trying to solve, or are there others as well?
I updated the description describing the current state and the proposed solution for schemas. Once you confirm the other problems, we can refine the solution. PTAL.
-
@mansu I'd love to drill down on these two points in the implementation part.
We want users to be able to define: this is a "keyword" field; this is a "text" field; this is a float / half float / int / long / binary field. Essentially all the field types that OpenSearch supports. So today, like you also mentioned in the description, we use the … Now I am not sure how "The indexer needs no changes in this model since all of the schema enforcement is done in the pre-processor before a message is indexed." can happen. How can the indexer take a `Trace.Span`, convert it to a `LogMessage`, and still preserve the info that field "my_custom_message_field" is a text field and not a string field? The conversion happens in `SpanFormatter#toLogMessage`, and we lose all the info about text vs string, float vs half float vs double, etc.
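The type-erasure problem can be sketched in a few lines (hypothetical field names; assumes the conversion boxes tag values into plain Java objects, as the paragraph above describes): once a value sits in a generic map, "text" vs "keyword" and "half_float" vs "float" collapse to the same Java types.

```java
import java.util.Map;

public class TypeErasureSketch {
    // After a SpanFormatter#toLogMessage-style conversion, the only type
    // information left is the Java runtime type of each boxed value.
    static String indexedJavaType(Object value) {
        return value.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        // The user's schema said: my_custom_message_field is "text",
        // request_time is "half_float".
        Map<String, Object> logMessage =
            Map.of("my_custom_message_field", "hello", "request_time", 0.25f);

        // But the indexer only sees a String and a Float; it cannot
        // recover "text" vs "keyword", or "half_float" vs "float".
        System.out.println(indexedJavaType(logMessage.get("my_custom_message_field"))); // String
        System.out.println(indexedJavaType(logMessage.get("request_time")));            // Float
    }
}
```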
-
@vthacker I am a bit worried that we are making such a large change without thinking this design through. For example, the current implementation adds the schema to every message indexed into KalDB. That can potentially double the size of each message written to the input Kafka topic and add a lot of overhead to parse out these fields on the indexer. It may double the CPU processing and reduce indexer throughput. A simpler way to build this is as follows:

This approach is much more efficient, since you are adding far less metadata to every message indexed by the indexer. Further, it is a much smaller code change on both the indexer and the pre-processor. I think if we put our heads together, further optimizations are possible.

We also need to think about schema evolution and how to handle it. What happens if the schema changes? When will the new schema go into effect?

Thinking more deeply about this option, it seems that we should keep only the `LogMessage` interface and get rid of `Trace.Span` on the indexer. If the dataset tag on a message can carry the schema, we no longer need to encode the type on the message, since it will be in the schema. The schema can contain all the info about how to index a specific field. This will reduce message size and the amount of data written to the indexer's Kafka topic. It will also make the indexer more efficient, since it has to process a much smaller data set. Further, since `LogMessage` is JSON, we can also get additional features like nested fields for free in the future. The current spans can be a schema on the dataset as well.

I would advise pausing on the implementation before we proceed further, since it's not very efficient.
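A minimal sketch of the "schema lives on the dataset" alternative (all names hypothetical; the schema store and dataset name are invented for illustration): the message carries only its dataset tag plus raw values, and the indexer resolves field types from the dataset's registered schema instead of per-message metadata.

```java
import java.util.Map;

public class DatasetSchemaSketch {
    // Per-dataset schema, registered once (e.g. in a metadata store),
    // rather than repeated on every message in the Kafka topic.
    static final Map<String, Map<String, String>> SCHEMAS = Map.of(
        "payments", Map.of("amount", "float", "memo", "text"));

    // Resolve a field's index type from the dataset schema; fall back to
    // dynamic mapping when the schema doesn't name the field.
    static String fieldType(String dataset, String field) {
        return SCHEMAS.getOrDefault(dataset, Map.of())
                      .getOrDefault(field, "dynamic");
    }

    public static void main(String[] args) {
        System.out.println(fieldType("payments", "memo"));   // text
        System.out.println(fieldType("payments", "other"));  // dynamic
    }
}
```

The design trade-off in the comment above: the lookup adds one indirection on the indexer, but each message stays small because type information is stored once per dataset instead of once per field per message.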
-
Do you actually believe this to be true? Let's assume we add another enum (int) to each tag. Even then it won't double, will it? We have 3 fields already in the tag field.
Subsequent PRs can remove `LogMessage`. Anyway, I'll play around with your design idea as well and try to see if that works better. Stay tuned on #778.
Thanks for the feedback on the design! I think we have enough info for now. As always, PRs welcome if you also want to work on this :)
-
WIP - still updating.
**Current Ingestion path**
Currently, the ingestion path is as follows. The pre-processor ingests an event from Kafka or via the `_bulk` API ingestion point. These messages are parsed, and we write out a `Trace.Span` object to the Kafka topic that the indexer consumes from.

The indexer ingests data from Kafka and builds Lucene indexes out of it. The indexer can ingest data in multiple formats today, but two formats are the most valuable: the `LogMessage` format and the `Trace.Span` format. The `LogMessage` class can take on any generic JSON message; such a message can contain nested fields, lists, and other such objects. `Trace.Span`, on the other hand, is a subset of a log message that contains a structured log message. The indexer can consume both message types and index them into Lucene today.

**Current schema design**
Currently, the schema works as follows. When a message is indexed, the schema is updated using a mechanism similar to dynamic field mapping in ES, with one exception: in ES, when a field type conflict occurs, the message is discarded. In KalDB, we resolve the field conflict using a field conflict resolution policy. All of this code is implemented in `SchemaAwareLogDocumentBuilderImpl`.
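As a rough sketch of what such a conflict resolution policy can look like (simplified, hypothetical logic in the spirit of `SchemaAwareLogDocumentBuilderImpl`, not its actual code): instead of discarding a message whose field type conflicts with the already-mapped type, coerce the conflicting value.

```java
import java.util.HashMap;
import java.util.Map;

public class ConflictPolicySketch {
    // field name -> first-seen type (a toy two-type "schema").
    static final Map<String, String> schema = new HashMap<>();

    // Hypothetical policy: first type wins; on conflict, index the value
    // as a string rather than dropping the whole message (which is what
    // ES dynamic mapping would do).
    static Object resolve(String field, Object value) {
        String incoming = (value instanceof Number) ? "long" : "text";
        String existing = schema.putIfAbsent(field, incoming);
        if (existing == null || existing.equals(incoming)) {
            return value;               // types agree; index as-is
        }
        return String.valueOf(value);   // conflict: fall back to string
    }

    public static void main(String[] args) {
        System.out.println(resolve("latency", 12));    // registers "long", kept as a number
        System.out.println(resolve("latency", "n/a")); // conflicts, indexed as a string
    }
}
```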
**Implementing user provided Schema**
Why:
Currently, we dynamically create the fields for a message in the indexer. While this works great, we need user-provided schemas to ensure that user-defined types are enforced. The user-provided schemas serve a few specific needs:
Questions:
Implementation
Assuming we need schemas, the simplest implementation of schemas is as follows: … `dataset` object. … `Trace.Span` or `LogMessage` object.

Relevant Links