[pkg/ottl] Add ParseSimplifiedXML Converter (#35421)
This adds a converter called `ParseSimplifiedXML`. This serves as the
final step described in
#35281,
which will allow users to parse any arbitrary XML document into a
user-friendly result, by first transforming the document in place with
other functions (e.g. #35328 and #35364) and then calling this function.

---------

Co-authored-by: Evan Bradley <[email protected]>
djaglowski and evan-bradley authored Oct 15, 2024
1 parent 41f6b0a commit d4e17be
Showing 6 changed files with 576 additions and 0 deletions.
27 changes: 27 additions & 0 deletions .chloggen/ottl-parse-simple-xml.yaml
@@ -0,0 +1,27 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: enhancement

# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
component: pkg/ottl

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Add ParseSimplifiedXML Converter

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
issues: [35421]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext:

# If your change doesn't affect end users or the exported elements of any package,
# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: []
9 changes: 9 additions & 0 deletions pkg/ottl/e2e/e2e_test.go
@@ -663,6 +663,15 @@ func Test_e2e_converters(t *testing.T) {
				tCtx.GetLogRecord().Attributes().PutStr("test", "k1=v1 k2=\"v2=v3\"")
			},
		},
		{
			statement: `set(attributes["test"], ParseSimplifiedXML("<Log><id>1</id><Message>This is a log message!</Message></Log>"))`,
			want: func(tCtx ottllog.TransformContext) {
				attr := tCtx.GetLogRecord().Attributes().PutEmptyMap("test")
				log := attr.PutEmptyMap("Log")
				log.PutStr("id", "1")
				log.PutStr("Message", "This is a log message!")
			},
		},
		{
			statement: `set(attributes["test"], ParseXML("<Log id=\"1\"><Message>This is a log message!</Message></Log>"))`,
			want: func(tCtx ottllog.TransformContext) {
127 changes: 127 additions & 0 deletions pkg/ottl/ottlfuncs/README.md
@@ -449,6 +449,7 @@ Available Converters:
- [ParseCSV](#parsecsv)
- [ParseJSON](#parsejson)
- [ParseKeyValue](#parsekeyvalue)
- [ParseSimplifiedXML](#parsesimplifiedxml)
- [ParseXML](#parsexml)
- [RemoveXML](#removexml)
- [Seconds](#seconds)
@@ -1335,6 +1336,132 @@ Examples:
- `ParseKeyValue("k1!v1_k2!v2_k3!v3", "!", "_")`
- `ParseKeyValue(attributes["pairs"])`

### ParseSimplifiedXML

`ParseSimplifiedXML(target)`

The `ParseSimplifiedXML` Converter returns a `pcommon.Map` struct that is the result of parsing the target string without preservation of attributes or extraneous text content.

The goal of this Converter is to produce a more user-friendly representation of XML data than the `ParseXML` Converter.
This Converter should be preferred over `ParseXML` when minor semantic details (e.g. order of elements) are not critically important, when subsequent processing or querying of the result is expected, or when human-readability is a concern.

This Converter disregards certain aspects of XML, specifically attributes and extraneous text content, in order to produce
a direct representation of XML data. Users are encouraged to simplify their XML documents prior to using `ParseSimplifiedXML`.

See other functions which may be useful for preparing XML documents:

- `ConvertAttributesToElementsXML`
- `ConvertTextToElementsXML`
- `RemoveXML`
- `InsertXML`
- `GetXML`

#### Formal Definitions

A "Simplified XML" document contains no attributes and no extraneous text content.

An element has "extraneous text content" when it contains both text and element content. For example:

```xml
<foo>
  bar <!-- extraneous text content -->
  <hello>world</hello> <!-- element content -->
</foo>
```
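
The definition above can be checked mechanically. The following is a minimal sketch using only Go's standard `encoding/xml` package, not the `xmlquery`-based code the Converter uses. The names `frame` and `hasExtraneousText` are hypothetical, and treating whitespace-only text as "no text" is an assumption of this sketch.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// frame records what has been seen directly inside one open element.
// (Hypothetical helper type, used only for this illustration.)
type frame struct{ sawText, sawElement bool }

// hasExtraneousText reports whether any element in doc directly contains
// both non-whitespace text and at least one child element.
func hasExtraneousText(doc string) (bool, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var stack []*frame
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return false, nil
		}
		if err != nil {
			return false, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// A new child element: mark it on the parent, then open a frame.
			if n := len(stack); n > 0 {
				stack[n-1].sawElement = true
				if stack[n-1].sawText {
					return true, nil
				}
			}
			stack = append(stack, &frame{})
		case xml.EndElement:
			stack = stack[:len(stack)-1]
		case xml.CharData:
			// Text and CDATA both arrive as CharData; ignore pure whitespace.
			if n := len(stack); n > 0 && strings.TrimSpace(string(t)) != "" {
				stack[n-1].sawText = true
				if stack[n-1].sawElement {
					return true, nil
				}
			}
		}
	}
}

func main() {
	mixed, _ := hasExtraneousText(`<foo>bar <!-- note --><hello>world</hello></foo>`)
	clean, _ := hasExtraneousText(`<foo> <hello>world</hello> </foo>`)
	fmt.Println(mixed, clean) // prints "true false"
}
```

A check like this could be used to decide whether a document should first be passed through `ConvertTextToElementsXML` so that no text is lost during parsing.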

#### Parsing logic

1. Declaration elements, attributes, comments, and extraneous text content are ignored.
2. Elements which contain a value are converted into key/value pairs.
e.g. `<foo>bar</foo>` becomes `"foo": "bar"`
3. Elements which contain child elements are converted into a key/value pair where the value is a map.
e.g. `<foo> <bar>baz</bar> </foo>` becomes `"foo": { "bar": "baz" }`
4. Sibling elements that share the same tag will be combined into a slice.
e.g. `<a> <b>1</b> <c>2</c> <c>3</c> </a>` becomes `"a": { "b": "1", "c": [ "2", "3" ] }`.
5. Empty elements are dropped, but they can determine whether a value should be a slice or map.
e.g. `<a> <b>1</b> <b/> </a>` becomes `"a": { "b": [ "1" ] }` instead of `"a": { "b": "1" }`

#### Examples

Parse a Simplified XML document from the body:

```xml
<event>
  <id>1</id>
  <user>jane</user>
  <details>
    <time>2021-10-01T12:00:00Z</time>
    <description>Something happened</description>
    <cause>unknown</cause>
  </details>
</event>
```

```json
{
  "event": {
    "id": "1",
    "user": "jane",
    "details": {
      "time": "2021-10-01T12:00:00Z",
      "description": "Something happened",
      "cause": "unknown"
    }
  }
}
```

Parse a Simplified XML document with unique child elements:

```xml
<x>
  <y>1</y>
  <z>2</z>
</x>
```

```json
{
  "x": {
    "y": "1",
    "z": "2"
  }
}
```

Parse a Simplified XML document with multiple elements of the same tag:

```xml
<a>
  <b>1</b>
  <b>2</b>
</a>
```

```json
{
  "a": {
    "b": ["1", "2"]
  }
}
```

Parse a Simplified XML document with a CDATA section:

```xml
<a>
  <b>1</b>
  <b><![CDATA[2]]></b>
</a>
```

```json
{
  "a": {
    "b": ["1", "2"]
  }
}
```

### ParseXML

134 changes: 134 additions & 0 deletions pkg/ottl/ottlfuncs/func_parse_simplified_xml.go
@@ -0,0 +1,134 @@
// Copyright The OpenTelemetry Authors
// SPDX-License-Identifier: Apache-2.0

package ottlfuncs // import "github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl/ottlfuncs"

import (
	"context"
	"fmt"

	"github.com/antchfx/xmlquery"
	"go.opentelemetry.io/collector/pdata/pcommon"

	"github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl"
)

type ParseSimplifiedXMLArguments[K any] struct {
	Target ottl.StringGetter[K]
}

func NewParseSimplifiedXMLFactory[K any]() ottl.Factory[K] {
	return ottl.NewFactory("ParseSimplifiedXML", &ParseSimplifiedXMLArguments[K]{}, createParseSimplifiedXMLFunction[K])
}

func createParseSimplifiedXMLFunction[K any](_ ottl.FunctionContext, oArgs ottl.Arguments) (ottl.ExprFunc[K], error) {
	args, ok := oArgs.(*ParseSimplifiedXMLArguments[K])

	if !ok {
		return nil, fmt.Errorf("ParseSimplifiedXML args must be of type *ParseSimplifiedXMLArguments[K]")
	}

	return parseSimplifiedXML(args.Target), nil
}

// The `ParseSimplifiedXML` Converter returns a `pcommon.Map` struct that is the result of parsing the target
// string without preservation of attributes or extraneous text content.
func parseSimplifiedXML[K any](target ottl.StringGetter[K]) ottl.ExprFunc[K] {
	return func(ctx context.Context, tCtx K) (any, error) {
		var doc *xmlquery.Node
		if targetVal, err := target.Get(ctx, tCtx); err != nil {
			return nil, err
		} else if doc, err = parseNodesXML(targetVal); err != nil {
			return nil, err
		}

		docMap := pcommon.NewMap()
		parseElement(doc, &docMap)
		return docMap, nil
	}
}

func parseElement(parent *xmlquery.Node, parentMap *pcommon.Map) {
	// Count the number of each element tag so we know whether it will be a member of a slice or not
	childTags := make(map[string]int)
	for child := parent.FirstChild; child != nil; child = child.NextSibling {
		if child.Type != xmlquery.ElementNode {
			continue
		}
		childTags[child.Data]++
	}
	if len(childTags) == 0 {
		return
	}

	// Convert the children, now knowing whether they will be a member of a slice or not
	for child := parent.FirstChild; child != nil; child = child.NextSibling {
		if child.Type != xmlquery.ElementNode || child.FirstChild == nil {
			continue
		}

		leafValue := leafValueFromElement(child)

		// Slice of the same element
		if childTags[child.Data] > 1 {
			// Get or create the slice of children
			var childrenSlice pcommon.Slice
			childrenValue, ok := parentMap.Get(child.Data)
			if ok {
				childrenSlice = childrenValue.Slice()
			} else {
				childrenSlice = parentMap.PutEmptySlice(child.Data)
			}

			// Add the child's text content to the slice
			if leafValue != "" {
				childrenSlice.AppendEmpty().SetStr(leafValue)
				continue
			}

			// Parse the child to make sure there's something to add
			childMap := pcommon.NewMap()
			parseElement(child, &childMap)
			if childMap.Len() == 0 {
				continue
			}

			sliceValue := childrenSlice.AppendEmpty()
			sliceMap := sliceValue.SetEmptyMap()
			childMap.CopyTo(sliceMap)
			continue
		}

		if leafValue != "" {
			parentMap.PutStr(child.Data, leafValue)
			continue
		}

		// Child will be a map
		childMap := pcommon.NewMap()
		parseElement(child, &childMap)
		if childMap.Len() == 0 {
			continue
		}

		childMap.CopyTo(parentMap.PutEmptyMap(child.Data))
	}
}

func leafValueFromElement(node *xmlquery.Node) string {
	// First check if there are any child elements. If there are, ignore any extraneous text.
	for child := node.FirstChild; child != nil; child = child.NextSibling {
		if child.Type == xmlquery.ElementNode {
			return ""
		}
	}

	// No child elements, so return the first text or CDATA content
	for child := node.FirstChild; child != nil; child = child.NextSibling {
		switch child.Type {
		case xmlquery.TextNode, xmlquery.CharDataNode:
			return child.Data
		}
	}
	return ""
}