# K8s filelog collection: put the log record in the Body as a string #2851
@pmm-sumo @djaglowski this is the bug that I mentioned when we were looking at k8s collection. Can one of you take this? |
@sumo-drosiek would you be able to validate open-telemetry/opentelemetry-collector#2873? |
@djaglowski what do you think, should we be good with just using open-telemetry/opentelemetry-collector#2873 or perhaps it would be better to make a deeper change in the source code? |
@pmm-sumo There are probably some things we can do to improve the usability of the parsing operations, but I think you are maybe referring to whether or not we should need the final cleanup step.

**Detailed version**

Note that this regex operation is capturing multiple values:

```yaml
- type: regex_parser
  id: parser-containerd
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
...
```

At this point, the entry looks something like this (showing only the parts relevant to this example):

```json
{
  "record": {
    "time": "...",
    "stream": "...",
    "log": "The rest of the log line",
  },
  "attributes": { },
  "timestamp": "default",
  ...
```
```json
}
```

Then some additional processing is done based on the other values:

```yaml
...
timestamp:
  parse_from: time # parses "time" to timestamp
...
- type: metadata
  attributes:
    stream: 'EXPR($.stream)' # moves "stream" to attributes
```

Now we have:

```json
{
  "record": {
    "log": "The rest of the log line",
  },
  "attributes": {
    "stream": "stderr",
  },
  "timestamp": "2021-03-25T...",
  ...
}
```

Now we can clean up the record:
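The cleanup operation itself is not shown above. A minimal sketch of what it could look like, assuming stanza's `restructure` operator and its `move` op, with `$` denoting the record root (the exact operator used here is an assumption, not quoted from the thread):

```yaml
- type: restructure
  ops:
    - move:
        from: log
        to: $  # assumed: moving to the root replaces the whole record with the value of "log"
```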
This leaves us with:

```json
{
  "record": "The rest of the log line",
  "attributes": {
    "stream": "stderr",
  },
  "timestamp": "2021-03-25T...",
  ...
}
```

There are a few minor things that I would like to improve here though, such as this and this, but I'm not sure what specifically we should do to improve this parsing pipeline. |
@djaglowski a slightly different approach that could possibly work is for the regex operator to extract values directly into "attributes", and also to have an option that specifies which extracted value becomes the new value of the "record" (or "Body" in OTel terminology). For example:

```yaml
type: regex_parser
id: parser-containerd
regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
set_body_from: log # this refers to an extracted field name in the line above
```

Once this is applied, the data will look like this:

```json
{
  "record": "The rest of the log line",
  "attributes": {
    "time": "...",
    "stream": "...",
  },
  "timestamp": "default",
  ...
}
```

I don't know how feasible this is, just an idea. |
I think when you consider a wider set of log parsing cases, you end up needing the ability to parse into an arbitrary object that you can then manipulate as needed. You might need to parse further. You might keep or discard any field at all. This isn't really a regex-specific problem either - it's a general question of how parsing operations should behave. To me, the primary action you are taking with any parsing operation is to add structure to the log entry. Once the values are individually accessible, you can parse them further or move them into the appropriate special fields (resource, attributes, timestamp, severity).

All that said, there are many scenarios such as this that are much more constrained. Taking inspiration from @tigrannajaryan's suggestion, I think there is potentially a way in which we can support cases such as this one, while still allowing arbitrary field manipulations if needed. We already support a concept of "nested parsers" for timestamps and severities:

```yaml
- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp: # this is a timestamp_parser embedded within the regex_parser
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity: # this is a severity_parser embedded within the regex_parser
    parse_from: sev
```

What is really happening here is essentially the same thing as if you specified:

```yaml
- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
- type: timestamp_parser # separate operator
  parse_from: time
  layout: '%Y-%m-%dT%H:%M:%S.%LZ'
- type: severity_parser # separate operator
  parse_from: sev
```

So working with the same idea in mind, would it be helpful to embed additional operators into all parsers? The `metadata` operator's `attributes` and `resource` designations could be embedded in the same way:

```yaml
- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity:
    parse_from: sev
  attributes:
    - stream
  resource:
    - uuid
```

Tigran's suggested functionality could then be supported by also designating which extracted field becomes the record:

```yaml
- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity:
    parse_from: sev
  attributes:
    - stream
  resource:
    - uuid
  record: log
```

If instead, the `log` field required further parsing, the `record` designation could be deferred to a chained parser:

```yaml
- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity:
    parse_from: sev
  attributes:
    - stream
  resource:
    - uuid
- type: regex_parser
  id: more_parsing
  parse_from: log
  regex: '^(?P<some_id>[^ ^Z]+Z) (?P<the_rest>.*)$'
  resource:
    - some_id
  record: the_rest
```

There are some nuances here as well, such as how ordering is applied, but these seem manageable. Thoughts on this? |
I concur with @tigrannajaryan's note on using the Body as a plain string here.

There is the technical part which, as described above, requires having a common framework for making all the transformations, based on moving fields present in the entry.

My understanding (which might be overly simplified) is that we are in principle dealing with three types of log sources:

- Unstructured (plain-text lines)
- Semi-structured (e.g., a known line format that can be parsed with a regex)
- Structured (e.g., JSON)
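For concreteness, a minimal sketch of the Structured case, assuming stanza's `file_input` and `json_parser` operators (the path and field names are hypothetical):

```yaml
- type: file_input
  include:
    - /var/log/app/*.json   # hypothetical JSON-lines source
- type: json_parser
  # a line such as {"level":"info","msg":"EPS: 98"} becomes a map in the
  # record/Body, with each key individually addressable for later moves
  # to attributes, resource, timestamp, etc.
```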
I think that in principle, only the Structured case has strong arguments to keep the map in the body (as this was the original model). Even then, the distinction between what belongs in the body and what belongs in attributes is not always clear-cut.

Additionally, I think there are fields that should always be moved to attributes. I think I understand the reasons behind it, but I also believe that eventually the operators should be dealing with the OTel data model directly. |
@pmm-sumo You make some really good points here. Differentiating between the structured and other types of log sources is very useful. Following what you've laid out, I think there are some additional design points to clarify. |
Thank you @djaglowski. One quick note:

> Technically we can use a collection of key values, where the latter are of AnyValue type

You are right, I forgot it became part of the model. I was trying to illustrate that there are some attributes that should always be present on the record. |
I think we may be talking about OTel vs "stanza" Entry format. My note on this point was basically just acknowledging that we need to update the old stanza format to support AnyValue. |
Related discussion here: #14718 |
**Describe the bug**

#2266 (comment)

The log record is wrapped in JSON with the key `log`. It should be just a plain string.

**Steps to reproduce**

Read files using the filelog receiver.
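The report does not include a config. A minimal sketch that exercises this path, with a hypothetical file path and an illustrative operator, assuming Docker-style JSON log files:

```yaml
receivers:
  filelog:
    include:
      - /var/log/containers/*.log   # hypothetical path to container log files
    operators:
      - type: json_parser           # parses lines like {"log":"EPS: 98\n","stream":"stdout"}
# After parsing, the Body is still a map containing the "log" key,
# rather than the plain string "EPS: 98".
```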
**What did you expect to see?**

`"EPS: 98"`

**What did you see instead?**

`{"log":"EPS: 98\n"}`

**What version did you use?**

0.22.0