Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl] Support json flatten operations #29283

Closed
danelson opened this issue Nov 15, 2023 · 23 comments · Fixed by #30455
Closed

[pkg/ottl] Support json flatten operations #29283

danelson opened this issue Nov 15, 2023 · 23 comments · Fixed by #30455
Assignees
Labels
enhancement New feature or request pkg/ottl priority:p2 Medium

Comments

@danelson
Copy link
Contributor

danelson commented Nov 15, 2023

Component(s)

  • pkg/ottl
  • processor/transform

Is your feature request related to a problem? Please describe.

Original discussion in https://cloud-native.slack.com/archives/C01N6P7KR6W/p1699999733202679

Some backends do not support json slices/maps. It would be nice if there were functions to support flattening the data. Consider the following

{
  "name": "test",
  "address": {
    "street": "first",
    "house": 1234
  },
  "occupants": ["user 1", "user 2"]
}

Possible outputs

  1. Flatten to stringified json

    {
        "name": "test",
        "address": "{\"street\":\"first\", \"house\": 1234}",
        "occupants": "[\"user 1\", \"user 2\"]"
    }
  2. Flatten to modified attribute names

    {
        "name": "test",
        "address.street": "first",
        "address.house": 1234,
        "occupants.0": "user 1",
        "occupants.1": "user 2"
    }

Initial questions:

  • Can/should style 2 above support multiple styles (dot, underscore, path, etc.)?
  • What if the backend supports slices but not maps? Can we only apply this to 1 or the other?
  • Should there be a limit on depth?

Describe the solution you'd like

A function such as FlattenJSON, FlattenMap, FlattenSlice

@danelson danelson added enhancement New feature or request needs triage New item requiring triage labels Nov 15, 2023
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@TylerHelmuth
Copy link
Member

I definitely prefer format 2. I hadn't thought about flattening slices like that. Is the normal? I expected slices to be left alone but your suggestion makes sense.

@bryan-aguilar
Copy link
Contributor

Is there any standard here when it comes to flattening? Follow up question, would we want to support both types of flattening?

@ImDevinC
Copy link

I think the second solution is probably better, that's how Cloudwatch would parse these logs. The first one only exists because I wrote a simple processor for this on my own and that was the easier solution for me at the time.

@danelson
Copy link
Contributor Author

danelson commented Nov 15, 2023

I also prefer option 2.

I hadn't thought about flattening slices like that. Is the normal?

I am not sure if it is desirable. A long array would cause an explosion of attributes which may cause other issues.

That being said, some backends only support arrays of primitive types so

Works

{
  "items": ["value1", "value2"]
}

Does not work

{
  "items":
  [
    {
      "key": "value1"
    },
    {
      "key": "value2"
    }
  ]
}

It would be nice to have some options to handle this.

@bryan-aguilar
Copy link
Contributor

bryan-aguilar commented Nov 15, 2023

A long array would cause an explosion of attributes which may cause other issues.

I think this is a good point but I think this is an inherent risk when flattening data like this.

With option 2 wouldn't this

{
  "items":
  [
    {
      "key": "value1"
    },
    {
      "key": "value2"
    }
  ]
}

become

{
  items.0.key: value1
  items.1.key: value2
}

?

@TylerHelmuth
Copy link
Member

@bryan-aguilar I feel like option 1 is more of a stringify solution that flattening. I'd rather see that accomplished via set[attributes["address"], String(attributes["address"])).

I agree with your interpretation of how the slice would be flattened.

@TylerHelmuth TylerHelmuth added priority:p2 Medium and removed needs triage New item requiring triage labels Nov 15, 2023
@danelson
Copy link
Contributor Author

danelson commented Nov 15, 2023

I think this is a good point but I think this is an inherent risk when flattening data like this.

If the desire is to prevent backends from rejecting data then this might trading 1 problem for another. For instance my backend doesn't supported json maps or more than 255 attributes. If I had (psuedo json)

{
  "key": "value",
  "slice": [ <300 items in here> ]
}

then my backend would drop "slice" but allow the rest of the payload. But if we turned this into

{
  "key": "value",
  "slice.0": "<item0>",
  "slice.1": "<item1>",
  ...
  "slice.299": "<item299"
}

Now my backend would drop the entire payload.

Could we support some type of depth parameter for nested levels and some kind of limit parameter for slice length to help guard against this?

I think you can use limit but I don't know if you could be targeted in your approach

@bryan-aguilar
Copy link
Contributor

bryan-aguilar commented Nov 15, 2023

@danelson this is the same issue with objects though right? You could have an object with 350 nested fields that would flatten to 350 separate fields.

{
  "name": "test",
  "bigobj": {
    "field1": 1,
    "field2": 2,
    ....,
    "field350":  350
  }
}

@ImDevinC
Copy link

ImDevinC commented Nov 15, 2023

@danelson wouldn't the existing truncate_all be another solution for removing those excess attributes and not dropping everything? So you could flattenJSON and then truncate_all to remove excess that would be dropped?

@danelson
Copy link
Contributor Author

danelson commented Nov 15, 2023

this is the same issue with objects though right?

Yes, you are correct. I guess I wasn't thinking about that since it hasn't been an issue for us.

wouldn't the existing truncate_all be another solution for removing those excess attributes and not dropping everything? So you could flattenJSON and then truncate_all to remove excess that would be dropped?

I think you mean limit? I would like to give precedence to attributes that do not originate in maps/slices. I don't think that is possible. Since I don't know the names of the incoming attributes I have to apply flatten and then limit which will not work. My admittedly somewhat janky approach (because it isn't deterministic in ensuring the attribute limit is not breached) would be to support a limit option in the flatten function so I choose up to n items.

@danelson
Copy link
Contributor Author

danelson commented Nov 16, 2023

I realize my use case may not be standard. Just thought it would help the discussion.

I would be happy if there was support for the following which I think solves a lot of generic use cases

  • Flattening maps only
  • Flattening maps and slices
  • A depth option for maps so that deeply nested structures can be ignored

@bryan-aguilar
Copy link
Contributor

@danelson I wonder if your use case should be considered a complete separate concern. Should flatten function only worry about flattening. You should have the ability to size down your slices/maps before or after the flatten operation takes place.

Does the functionality to trim down the json object before or after flattening already?

@TylerHelmuth
Copy link
Member

TylerHelmuth commented Nov 16, 2023

A depth option for maps so that deeply nested structures can be ignored

It is best to have these types of checks in the Where clause instead of the function itself. To enable this use case we could have a Depth converter that takes a map and returns its depth.

@danelson
Copy link
Contributor Author

It is best to have these types of checks in the Where clause instead of the function itself.

This makes sense to me.

This may veering off track now, but is there a way to write transforms that apply this to subsets of data?

If I want to flatten a specific map I think we are saying that this proposal would allow for something like

flatten(attributes["maybe_deep_map"]) where Depth(attributes["maybe_deep_map"]) < 10

But would it be able flatten all maps that less than some depth when I don't know the name?

flatten(attributes) where Depth(attributes) < 10

@TylerHelmuth
Copy link
Member

But would it be able flatten all maps that less than some depth when I don't know the name?

At the moment you would need to know the name.

@ImDevinC
Copy link

@TylerHelmuth just to clarify, we're saying that flatten(attributes) wouldn't work? That was the original problem I was trying to solve for, as dynamic attributes make it very difficult to know the name.

@danelson
Copy link
Contributor Author

I think we are saying that flatten(attributes) would flatten everything. I was looking for more control which we think would be best provided via other checks and would be outside the scope of this issue.

@evan-bradley
Copy link
Contributor

I think if we have a high degree of confidence that most users will want to limit map depth or list length when flattening a map of arbitrary keys, adding optional parameters would be an okay move here since it would simplify the overall UX for this functionality. If we can't do that, I think we could use #29289 to solve this case. Some combination of the below statements/conditions could be used to limit depth or remove keys that won't pass validation.

# NOTE: needs #29289 to work
for key, _ in attributes:
  flatten(attributes[key]) where IsMap(attributes[key]) and Depth(attributes[key]) < 10
  flatten(attributes[key]) where IsList(attributes[key]) and Len(attributes[key]) < 256
  delete_key(attributes, key) where IsMap(attributes[key]) and Depth(attributes[key]) >= 10
  delete_key(attributes, key) where IsList(attributes[key]) and Len(attributes[key]) >= 256

@puckpuck
Copy link

There should be an ability to prefix all the keys in the resulting flattened map.

{
  "name": "test",
  "address": {
    "street": "first",
    "house": 1234
  },
  "occupants": ["user 1", "user 2"]
}

results in something like this:

{
    "app.name": "test",
    "app.address.street": "first",
    "app.address.house": 1234,
    "app.occupants.0": "user 1",
    "app.occupants.1": "user 2"
}

This keeps my keys neatly packaged in a namespace so as to not be confused with other attributes.

@TylerHelmuth
Copy link
Member

TylerHelmuth commented Dec 14, 2023

I can see that being an optional parameter. It gets more complicated if you want to namespace the nested maps. I think in that case the best solution is to call flatten multiple times with the different prefixes. I don't want to support namespacing the sub maps.

@DaveWK
Copy link

DaveWK commented Dec 29, 2023

I'd just like to chime in that I am looking for this as well.

My particular use case is that I have JSON-encoded logs that are being scraped by OTEL using the journald receiver and want to collapse all the log elements that currently are shipped under Body.XXX to root-level elements, and then also re-parse the Body.MESSAGE element into sub-elements (since it's a json-formatted log) -- since there are arbitrary json elements within the MESSAGE log the current static mapping solution is not maintainable

@TylerHelmuth
Copy link
Member

I've created #30455 to add this function. All interested parties please review.

TylerHelmuth added a commit that referenced this issue Jan 17, 2024
**Description:**
Adds a `flatten` function that allows flattening maps.

I went with an editor instead of a converter, but I'm open to debate.
Using an editor means that a user can do `flatten(body) where
IsMap(body)` instead of `set(body, Flatten(body)) where IsMap(body).
When using ParseJson you have to do:

```
- merge_maps(cache, ParseJSON(body), "upsert")
- flatten(cache)
```

instead of `merge_maps(cache, Flatten(ParseJSON(body)), "upsert")`.
Ultimately I went with an editor for similar reasons that `merge_maps`
is an editor: chaining too many functions together is messy and updating
maps is very fast with pdata.

The function supports 2 optional parameters, `prefix` and `depth`. Use
`prefix` to add a "namespace" to the values that are being flattened.
Use `depth` to prevent trying to flatten maps that are too deep. See the
function doc for examples.

**Link to tracking Issue:** <Issue number if applicable>

Closes
#29283

**Testing:** <Describe what testing was performed and which tests were
added.>
Added new unit and e2e tests.  Please scrutinize.

**Documentation:** <Describe the documentation added.>
Added function doc.

---------

Co-authored-by: Evan Bradley <[email protected]>
mfyuce pushed a commit to mfyuce/opentelemetry-collector-contrib that referenced this issue Jan 18, 2024
**Description:**
Adds a `flatten` function that allows flattening maps.

I went with an editor instead of a converter, but I'm open to debate.
Using an editor means that a user can do `flatten(body) where
IsMap(body)` instead of `set(body, Flatten(body)) where IsMap(body).
When using ParseJson you have to do:

```
- merge_maps(cache, ParseJSON(body), "upsert")
- flatten(cache)
```

instead of `merge_maps(cache, Flatten(ParseJSON(body)), "upsert")`.
Ultimately I went with an editor for similar reasons that `merge_maps`
is an editor: chaining too many functions together is messy and updating
maps is very fast with pdata.

The function supports 2 optional parameters, `prefix` and `depth`. Use
`prefix` to add a "namespace" to the values that are being flattened.
Use `depth` to prevent trying to flatten maps that are too deep. See the
function doc for examples.

**Link to tracking Issue:** <Issue number if applicable>

Closes
open-telemetry#29283

**Testing:** <Describe what testing was performed and which tests were
added.>
Added new unit and e2e tests.  Please scrutinize.

**Documentation:** <Describe the documentation added.>
Added function doc.

---------

Co-authored-by: Evan Bradley <[email protected]>
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this issue Feb 1, 2024
**Description:**
Adds a `flatten` function that allows flattening maps.

I went with an editor instead of a converter, but I'm open to debate.
Using an editor means that a user can do `flatten(body) where
IsMap(body)` instead of `set(body, Flatten(body)) where IsMap(body).
When using ParseJson you have to do:

```
- merge_maps(cache, ParseJSON(body), "upsert")
- flatten(cache)
```

instead of `merge_maps(cache, Flatten(ParseJSON(body)), "upsert")`.
Ultimately I went with an editor for similar reasons that `merge_maps`
is an editor: chaining too many functions together is messy and updating
maps is very fast with pdata.

The function supports 2 optional parameters, `prefix` and `depth`. Use
`prefix` to add a "namespace" to the values that are being flattened.
Use `depth` to prevent trying to flatten maps that are too deep. See the
function doc for examples.

**Link to tracking Issue:** <Issue number if applicable>

Closes
open-telemetry#29283

**Testing:** <Describe what testing was performed and which tests were
added.>
Added new unit and e2e tests.  Please scrutinize.

**Documentation:** <Describe the documentation added.>
Added function doc.

---------

Co-authored-by: Evan Bradley <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pkg/ottl priority:p2 Medium
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants