Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl] Parse uri string to url.* SemConv attributes #32433

Closed
michalpristas opened this issue Apr 16, 2024 · 11 comments
Closed

[pkg/ottl] Parse uri string to url.* SemConv attributes #32433

michalpristas opened this issue Apr 16, 2024 · 11 comments
Labels
discussion needed Community discussion needed pkg/ottl

Comments

@michalpristas
Copy link
Contributor

michalpristas commented Apr 16, 2024

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

parsing URI possible only using complex RegEx

Describe the solution you'd like

Intended converter parses a Uniform Resource Identifier (URI) string and extracts its components into URL SemConv attributes.

Example

input:

"sample": "http://myusername:[email protected]:80/foo.gif?key1=val1&key2=val2#fragment"

Result:

{
  "url": {
    "path" : "/foo.gif",
    "fragment" : "fragment",
    "extension" : "gif",
    "password" : "mypassword",
    "original" : "http://myusername:[email protected]:80/foo.gif?key1=val1&key2=val2#fragment",
    "scheme" : "http",
    "port" : 80,
    "user_info" : "myusername:mypassword",
    "domain" : "www.example.com",
    "query" : "key1=val1&key2=val2",
    "username" : "myusername"
  }
}

Describe alternatives you've considered

No response

Additional context

No response

@michalpristas michalpristas added enhancement New feature or request needs triage New item requiring triage labels Apr 16, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@michalpristas michalpristas changed the title [pkg/ottl] Add option to convert uri string into uri object/map [pkg/ottl] Parse uri string to url.* SemConv attributes Apr 16, 2024
@TylerHelmuth
Copy link
Member

@michalpristas I suspect this is doable via regular expression, and any converter would be doing regex internally. Can this be accomplished with ExtractPatterns?

@TylerHelmuth TylerHelmuth added discussion needed Community discussion needed and removed enhancement New feature or request needs triage New item requiring triage labels Apr 17, 2024
@michalpristas
Copy link
Contributor Author

removed my previous comment, as it was for other issue 🤦

it is surely possible to do so. but this would mean we require every user to use and compile regex on their own.
my goal here is ease of use, this one is pretty frequent in our pipelines and i see using ExtractPattern with regex as very error prone. having something working and tested (using UT) will surely benefit our users

@evan-bradley
Copy link
Contributor

I would be okay adding this. We can use the url.Parse stdlib function instead of having regular expressions either in a statement or in our code. In general I think the fewer regexes we and our users have, the better.

@TylerHelmuth
Copy link
Member

TylerHelmuth commented May 16, 2024

I would be okay adding this. We can use the url.Parse stdlib function instead of having regular expressions either in a statement or in our code. In general I think the fewer regexes we and our users have, the better.

I see the value this provides to our end-users, but we don't have a good way right now to decide when a new function should be accepted vs an existing function should be used. We really need some criteria for how new functions get added. @michalpristas you've been really active in OTTL recently providing a lot of good feedback and ideas. Any willingness to try and right up those rules? It would really help with several of the function additions you've requested.

@michalpristas
Copy link
Contributor Author

michalpristas commented May 17, 2024

Missing functionality is a strong yes e.g Sorting array, Append, missing converters for SHA512 or MD5.

For other cases I try to assume non technical user, so yes for me is when new function significantly improves UX or provides a more performant way of achieving same result. (Also a bit opinionated, if I have to write regex that is more than 10 characters I don't want to do that)

Example of improved UX is probably #32593
This issue here I think provides better UX and does not use regex so maybe even perf is better (I say maybe because don't have measurement)

If UX is same but assumes less technical user (better naming, terminology...) I'd say it's a maybe. This should be evaluated per use-case and there should be only a few requests/issues like that.

When UX is worse or performance is troublesome, this should be a strong no and there should be no issues like that.

Other way I can think about that is that this should be treated like a code. We should aim for readability first and performance second (meaning performant solution can be achieved by using more low level pieces)

With readability same things as with code should be expected, few important that come to mind

  • short, meaningful and descriptive names
  • naming consistent across functions
  • avoiding deep nesting fn1(target, fn2(fn3(fn4...)))) for frequent use cases
  • single responsibility & DRY

@TylerHelmuth
Copy link
Member

@michalpristas can you open a PR with these suggestions in ottlfunc README at the bottom so we can continue the discussion and codify the rules.

@michalpristas
Copy link
Contributor Author

michalpristas commented May 20, 2024

I opened a pr. I will be on pto for the next 10 days though so my response time may be slightly unresponsive

@djaglowski
Copy link
Member

A couple thoughts:

  1. We have a stanza URI parser already. We've been able to share code between OTTL functions and stanza parsers in some other cases so I would like to consider what it looks like to do this here as well.
  2. The stanza parser may have been written before the semantic conventions. If there's a mismatch, I support migrating that code to match semantic conventions with a feature gate to allow users a chance to prepare.
  3. The result proposed in this issue is structured but the semantic conventions are flat. I believe elastic considers these equivalent but OTel does not. (I may be wrong but this is my understanding.) Therefore I believe the correct result would be:
{
  "url.path" : "/foo.gif",
  "url.fragment" : "fragment",
  "url.extension" : "gif",
  "url.password" : "mypassword",
  "url.original" : "http://myusername:[email protected]:80/foo.gif?key1=val1&key2=val2#fragment",
  "url.scheme" : "http",
  "url.port" : 80,
  "url.user_info" : "myusername:mypassword",
  "url.domain" : "www.example.com",
  "url.query" : "key1=val1&key2=val2",
  "url.username" : "myusername"
}

@cdanis
Copy link

cdanis commented Jun 12, 2024

I've been writing a lot of OTTL lately and I found myself wanting exactly this functionality as a user convenience.

Fully-compliant URL parsing is definitely non-trivial, and IMO worth implementing centrally.

evan-bradley added a commit that referenced this issue Jun 14, 2024
**Description:** 
As discussed
[here](#32433 (comment))
we should agree on some guidelines when it comes to creating proposals
for new functions and converters

cc @TylerHelmuth 

**Link to tracking Issue:** -

**Testing:** -

**Documentation:** -

---------

Co-authored-by: Evan Bradley <[email protected]>
Co-authored-by: Tyler Helmuth <[email protected]>
Co-authored-by: Daniel Jaglowski <[email protected]>
djaglowski pushed a commit that referenced this issue Jun 17, 2024
**Description:** 
This ottl converter converts uri string into SemConv attributes.
Instead of using regex it uses `uri.Parse` providing nicer shorthand.

**Link to tracking Issue:** #32433

**Testing:** unit test added

**Documentation:** readme updated with:

### Uri

`Uri(uri_string)`

Parses a Uniform Resource Identifier (URI) string and extracts its
components as an object.
This URI object includes properties for the URI’s domain, path,
fragment, port, query, scheme, user info, username, and password.

`original`, `domain`, `scheme` and `path` are always present, other are
present only if they have corresponding values.

`uri_string` is a `string`.

- `Uri("http://www.example.com")`

results in 
```
  "original": "http://www.example.com",
  "scheme":   "http",
  "domain":   "www.example.com",
  "path":     "",
```

-
`Uri("http://myusername:mypassword@www.example.com:80/foo.gif?key1=val1&key2=val2#fragment")`

results in 
```
  "path":      "/foo.gif",
  "fragment":  "fragment",
  "extension": "gif",
  "password":  "mypassword",
  "original":  "http://myusername:mypassword@www.example.com:80/foo.gif?key1=val1&key2=val2#fragment",
  "scheme":    "http",
  "port":      80,
  "user_info": "myusername:mypassword",
  "domain":    "www.example.com",
  "query":     "key1=val1&key2=val2",
  "username":  "myusername",
```

---------

Co-authored-by: Tiffany Hrabusa <[email protected]>
Co-authored-by: Evan Bradley <[email protected]>
@michalpristas
Copy link
Contributor Author

Closing. Addressed in #32906

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion needed Community discussion needed pkg/ottl
Projects
None yet
Development

No branches or pull requests

5 participants