URI parts ingest processor #65150

Merged 9 commits on Nov 20, 2020

Contributor

@danhermann danhermann commented Nov 17, 2020

Adds a new uri_parts processor that decomposes a URI into its constituent parts. E.g.:

POST _ingest/pipeline/_simulate?verbose
{
  "pipeline": {
    "processors": [
      {
        "uri_parts": {
          "field": "uri_field",
          "target_field": "url",
          "keep_original": true,
          "remove_if_successful": true
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "uri_field": "http://user:pw@www.google.com:80/blarg.gif#ref"
      }
    }
  ]
}

results in:

"processor_type" : "uri_parts",
"status" : "success",
"doc" : {
  "_index" : "_index",
  "_id" : "_id",
  "_source" : {
    "url" : {
      "path" : "/blarg.gif",
      "fragment" : "ref",
      "extension" : "gif",
      "password" : "pw",
      "original" : "http://user:pw@www.google.com:80/blarg.gif#ref",
      "scheme" : "http",
      "port" : 80,
      "user_info" : "user:pw",
      "domain" : "www.google.com",
      "username" : "user"
    }
  }
}

The processor relies on the java.net.URI class to parse the URI and attempts to map the parts into ECS fields. Some ECS fields are not part of the URI spec, so see the table below for how those are handled:

| URL Parts Processor | ECS | java.net.URI | Comments |
|---|---|---|---|
| domain | url.domain | getHost() | |
| extension | url.extension | | This is not part of the URI spec and is manually parsed out of the path element on a best-effort basis if a . exists in the path |
| fragment | url.fragment | getFragment() | |
| | url.full | | |
| original | url.original | | The processor includes an option to retain the original URL |
| password | url.password | | The URI spec defines an "authority" field but does not define either username or password, though they are commonly presented with the username:password convention. The username and password fields are parsed out of the user_info field on a best-effort basis if a : exists. |
| path | url.path | getPath() | |
| port | url.port | getPort() | |
| query | url.query | getQuery() | |
| | url.registered_domain | | |
| scheme | url.scheme | getScheme() | |
| | url.top_level_domain | | |
| username | url.username | | See comment on password above |
| user_info | | getUserInfo() | Corresponds to the "authority" field of the URI spec without the domain |
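The mapping in the table can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `UriPartsSketch`, the `parse` and `putIfPresent` helpers, and the exact best-effort rules for `extension`, `username`, and `password` are assumptions made for the example. Only the `java.net.URI` getters are real API.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the field mapping described in the table; not the PR's implementation.
final class UriPartsSketch {

    static Map<String, Object> parse(String value) {
        URI uri;
        try {
            uri = new URI(value);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("unable to parse URI [" + value + "]", e);
        }
        Map<String, Object> parts = new HashMap<>();
        parts.put("original", value);
        putIfPresent(parts, "scheme", uri.getScheme());
        putIfPresent(parts, "domain", uri.getHost());
        putIfPresent(parts, "query", uri.getQuery());
        putIfPresent(parts, "fragment", uri.getFragment());
        if (uri.getPort() != -1) { // getPort() returns -1 when no port is given
            parts.put("port", uri.getPort());
        }
        String path = uri.getPath();
        if (path != null) {
            parts.put("path", path);
            // Assumed best-effort rule: everything after the last '.' in the path.
            int dot = path.lastIndexOf('.');
            if (dot >= 0 && dot < path.length() - 1) {
                parts.put("extension", path.substring(dot + 1));
            }
        }
        String userInfo = uri.getUserInfo();
        if (userInfo != null) {
            parts.put("user_info", userInfo);
            // Assumed best-effort rule: split on the first ':' if one exists.
            int colon = userInfo.indexOf(':');
            if (colon >= 0) {
                parts.put("username", userInfo.substring(0, colon));
                parts.put("password", userInfo.substring(colon + 1));
            }
        }
        return parts;
    }

    private static void putIfPresent(Map<String, Object> parts, String key, String value) {
        if (value != null) {
            parts.put(key, value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("http://user:pw@www.google.com:80/blarg.gif#ref"));
    }
}
```

The sketch leaves out `url.full`, `url.registered_domain`, and `url.top_level_domain`, which have no `java.net.URI` accessor in the table.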

Also introduces a new module for ingest processors.

Closes #57481

@danhermann danhermann added the :Data Management/Ingest Node, v8.0.0, and v7.11.0 labels Nov 17, 2020
@elasticmachine elasticmachine added the Team:Data Management label Nov 17, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@danhermann
Contributor Author

@leehinman and @andrewkroh, I would be interested in your feedback (and/or others, as appropriate) on whether this provides the functionality you described in #57481.

Member

@andrewkroh andrewkroh left a comment

Nice, this will make writing pipelines simpler for us.


URL url;
try {
url = new URL(value);
Member

Are there limitations on which schemes this can parse? Would these parse?

  • ftp://ftp.is.co.za/rfc/rfc1808.txt
  • ldap://[2001:db8::7]/c=GB?objectClass?one
  • telnet://192.0.2.16:80/

Perhaps using java.net.URI would be more forgiving and not require a URLStreamHandler to be loaded.

Member

@andrewkroh andrewkroh Nov 18, 2020

Similarly, which parts of the URL are required for parsing to work?

Contributor Author

@andrewkroh, thanks for looking it over and commenting. I switched over to java.net.URI which does support more schemes including all three of the examples you list above.

Right now, no parts of a URI are required beyond what java.net.URI needs to construct an instance. Is that what you would prefer?
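As a rough check of the three example URIs from the review comment, the sketch below parses each one with `java.net.URI` and prints a few components. The class name `UriSchemesSketch` and the `describe` helper are made up for this illustration. `URI.create` is used for brevity and throws `IllegalArgumentException` on a syntax error.

```java
import java.net.URI;

// Hypothetical demo: java.net.URI parses these schemes without needing a URLStreamHandler.
final class UriSchemesSketch {

    // Summarizes scheme, host, port, and path; the port is -1 when the URI does not specify one.
    static String describe(String value) {
        URI uri = URI.create(value);
        return uri.getScheme() + " | " + uri.getHost() + " | " + uri.getPort() + " | " + uri.getPath();
    }

    public static void main(String[] args) {
        String[] examples = {
            "ftp://ftp.is.co.za/rfc/rfc1808.txt",
            "ldap://[2001:db8::7]/c=GB?objectClass?one",
            "telnet://192.0.2.16:80/"
        };
        for (String example : examples) {
            System.out.println(describe(example));
        }
    }
}
```

Since nothing beyond a syntactically valid URI is required, components that are missing simply come back as `null` (or `-1` for the port), and a caller can choose to skip them.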

throw new IllegalArgumentException("unable to parse URL [" + value + "]");
}
var urlParts = new HashMap<String, Object>();
urlParts.put("domain", url.getHost());
Member

ECS isn't clear on what's correct here, but what does getHost return for bracketed IPv6 addresses?

@elastic/ecs Should url.domain include the brackets that are required when using IPv6 addresses in URLs? https://www.ietf.org/rfc/rfc2732.txt

Member

Per this source, getHost() will return the IPv6 address enclosed in the brackets.

Since brackets are required with a literal IPv6 address, url.domain should include the brackets. We can improve the description of url.domain in the ECS docs to clarify.

Member

@andrewkroh andrewkroh left a comment

LGTM. 👍

@webmat

webmat commented Nov 18, 2020

This is great @danhermann 👍

Note that I think it's fine if this processor doesn't populate the domain breakdown fields url.registered_domain and url.top_level_domain. Populating those properly, accounting for all the edge-case effective TLDs, requires lookups in a database like Mozilla's. I think that should be a distinct processor (like the one in Beats) and is therefore out of scope here.

@andrewkroh
Member

There's a request for a registered domain processor (#57476). Perhaps if that gets implemented this one could internally use it.

@danhermann
Contributor Author

There's a request for a registered domain processor (#57476). Perhaps if that gets implemented this one could internally use it.

I can add that on this processor as an option once the registered domain processor is completed.

@danhermann
Contributor Author

danhermann commented Nov 18, 2020

Last functional question on this one -- should the processor be renamed to uri_parts since it's parsing schemes that aren't supported by the URL spec?

Edit: Though I do see that the ECS field names are all prefixed with url rather than uri. 😦

@leehinman

This is awesome. Thank You.

@webmat

webmat commented Nov 18, 2020

should the processor be renamed to uri_parts

I don't really mind either way, but I think it'll indeed be clearer to users that they can use it on all sorts of URIs if it's named uri_parts. So I'd be in favor of that :-)

@danhermann danhermann changed the title from "URL parts ingest processor" to "URI parts ingest processor" Nov 18, 2020
@danhermann
Contributor Author

danhermann commented Nov 18, 2020

@elasticmachine run elasticsearch-ci/2

@danhermann
Contributor Author

cc: @elastic/es-ui in case Kibana auto-complete needs to be updated with this new processor.

Contributor

@andreidan andreidan left a comment

LGTM, thanks for adding this processor @danhermann

Left a minor comment

if (userInfo.contains(":")) {
int colonIndex = userInfo.indexOf(":");
uriParts.put("username", userInfo.substring(0, colonIndex));
uriParts.put("password", colonIndex < userInfo.length() ? userInfo.substring(colonIndex + 1) : "");
Contributor

would this fail with IndexOutOfBounds for http://user:@www.google.com:80/blarg.gif#ref ? (no password)
Shall we add a test for this?

Contributor Author

In that case, the password is set to an empty string. I'll add another test case to make that clear.
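To illustrate why a trailing colon does not throw, here is a small standalone sketch of the split. The class name `UserInfoSplitSketch` and the `split` helper are hypothetical, and returning `null` when there is no colon is an assumption made for the example. The key point is that `String.substring(beginIndex)` accepts a `beginIndex` equal to the string's length and returns an empty string.

```java
// Hypothetical sketch of splitting user info on the first ':'; not the PR's code.
final class UserInfoSplitSketch {

    // Returns {username, password}, or null when userInfo contains no ':' (assumed behavior).
    static String[] split(String userInfo) {
        int colonIndex = userInfo.indexOf(':');
        if (colonIndex < 0) {
            return null;
        }
        // substring(colonIndex + 1) is valid even when the colon is the last character:
        // beginIndex == length() yields "" rather than an exception.
        return new String[] { userInfo.substring(0, colonIndex), userInfo.substring(colonIndex + 1) };
    }

    public static void main(String[] args) {
        String[] parts = split("user:");
        System.out.println("username=[" + parts[0] + "] password=[" + parts[1] + "]");
    }
}
```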

Contributor Author

Thanks for the review, @andreidan!

@alisonelizabeth
Contributor

cc: @elastic/es-ui in case Kibana auto-complete needs to be updated with this new processor.

Thanks for the heads up @danhermann! I've opened elastic/kibana#83915 to add support for this in the UI.

Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >feature Team:Data Management Meta label for data/management team v7.11.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Add URL parts to url decode processor
9 participants