Remove src/dst.hostname, rename url.hostname to url.domain. #175

Merged
merged 7 commits into from
Nov 7, 2018

webmat
Contributor

@webmat webmat commented Nov 7, 2018

domain is now the place to store the host address under source, destination and url.

Also in this PR:

  • Remove ambiguous mention in url description talking about reuse. Since we
    haven't officially made url reusable yet, I took it out for now. People
    are free to reuse everything everywhere, as it's just them defining new fields
    around the official ECS fields. But having the spec make this statement in a
    different manner than for the official reusable objects is needlessly confusing.
  • Tweak the url example a bit.
  • Bring source description in line with destination's.

Closes #166, and addresses part of #84.

@webmat webmat requested review from ruflin and MikePaquette November 7, 2018 16:16
@webmat webmat self-assigned this Nov 7, 2018
@webmat
Contributor Author

webmat commented Nov 7, 2018

If you folks want me to separate the hostname work from the various cleanups included here, I can do that as well. The cleanups seemed straightforward, though.

@webmat webmat mentioned this pull request Nov 7, 2018
26 tasks
@webmat webmat force-pushed the domain-not-hostname branch from bd73a5b to 893defb Compare November 7, 2018 17:50
@webmat webmat merged commit bf3271e into elastic:master Nov 7, 2018
@webmat webmat deleted the domain-not-hostname branch November 7, 2018 19:45
| <a name="url.scheme"></a>url.scheme | Scheme of the request, such as "https".<br/>Note: The `:` is not part of the scheme. | extended | keyword | `https` |
| <a name="url.hostname"></a>url.hostname | Hostname of the request, such as "elastic.co".<br/>In some cases a URL may refer to an IP and/or port directly, without a domain name. In this case, the IP address would go to the `hostname` field. | extended | keyword | `elastic.co` |
| <a name="url.domain"></a>url.domain | Domain of the request, such as "www.elastic.co".<br/>In some cases a URL may refer to an IP and/or port directly, without a domain name. In this case, the IP address would go to the `domain` field. | extended | keyword | `www.elastic.co` |
Member

Is this still the case? Should IP addresses be stored in the domain field?
Sources that don't differentiate names from IPs are quite frequent when parsing logs, and having IPs stored in a field called domain can be a bit confusing.

Contributor

@jsoriano The url.domain field should generally not be populated with an IP address. However, when an event or log indicates that an IP address was used in a URL as the host/hostname, such as in http://15.73.192.108/index.html, then the string 15.73.192.108 would be entered into the url.domain field. This is useful, for example, for a security analyst to know that a user or system is attempting to bypass the DNS infrastructure to access a web resource. In some cases, this could be an indicator of malware activity. So searching through all values of url.domain for values that have the form of an IP address can be a useful analysis.
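
For illustration, here is a minimal Python sketch of that analysis. It is not part of ECS or of this PR; the helper names `url_domain` and `looks_like_ip` and the sample URL list are assumptions made for the example. It extracts the host portion of each URL, as it would be stored in url.domain, and keeps the URLs whose host parses as an IP address.

```python
import ipaddress
from urllib.parse import urlsplit


def url_domain(url: str) -> str:
    """Return the host part of a URL, as it would be stored in url.domain."""
    # urlsplit().hostname drops the port and lowercases the host.
    return urlsplit(url).hostname or ""


def looks_like_ip(value: str) -> bool:
    """True if the value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


# Hypothetical sample data: one URL with a domain name, one with an IP literal.
urls = [
    "https://www.elastic.co/products",
    "http://15.73.192.108/index.html",
]

# URLs that were requested by IP address, bypassing DNS.
ip_literal_urls = [u for u in urls if looks_like_ip(url_domain(u))]
```

With the sample data above, only the second URL ends up in `ip_literal_urls`.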

Contributor Author

@webmat webmat Nov 9, 2018

@jsoriano We were in a tough spot: we couldn't actually call it host, which is what this concept of "name or IP" is generally called. In ECS, host is a field set that can be reused in multiple places.

Just to be clear, the concept of "hostname" is often misused in exactly the same way. The definition of a "host" is "IP address or hostname" (check out the RFC survey I summarized in #166).

An IP address doesn't belong in a "hostname" either. Using "hostname" this way is a mistake that's more common, though, so it doesn't stick out quite as much when people see it.

Member

Thanks for the explanations. I see the problem and the reasoning, but using domain to store IPs still sounds a bit weird to me. In any case, I don't have a better proposal if we cannot use host, but I'm afraid this could also be a bit confusing for users.
Maybe after moving some Beats modules to ECS and playing a bit more with it, I'll see it in a different way :)

Will it be the same for source and destination? Maybe we should add some explanation about this there as well.

Contributor Author

@jsoriano Under source and destination, we actually have both domain and ip. The transition I started for the Traefik module in elastic/beats#9005 outlines a difference in approach.

Most of the modules dealt with this indirectly: they named a keyword field something like remote_ip and attempted to do GeoIP on it. In other words, the duality of this value being either an IP or a hostname was not addressed. Hopefully Ingest Node's GeoIP just fails gracefully when it encounters a hostname (I haven't checked). So I think in most of the modules right now, the lie is the other way around ;-) The remote_ip field can contain either an IP or a hostname, and the field is a keyword to support this, which ultimately means you can't do CIDR searches on it, and so on.
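
To illustrate why the type matters, here is a small Python analogy (not an Elasticsearch query; the function names and the 10.0.0.0/12 network are assumptions made for the example). A value parsed as an IP address can be tested against a CIDR range, while a plain string only allows text comparisons such as prefix matching, which cannot express most ranges correctly.

```python
import ipaddress

# Hypothetical network of interest: covers 10.0.0.0 through 10.15.255.255.
network = ipaddress.ip_network("10.0.0.0/12")


def in_network(value: str) -> bool:
    """CIDR match on a parsed IP; non-IP values such as hostnames never match."""
    try:
        return ipaddress.ip_address(value) in network
    except ValueError:
        return False


def prefix_match(value: str) -> bool:
    """Naive text match, roughly what is possible on an untyped string."""
    return value.startswith("10.1")
```

Here `in_network("10.16.0.1")` is false because the address lies outside the /12, while `prefix_match("10.16.0.1")` is true, and `prefix_match("10.2.0.1")` is false even though that address is inside the range.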

The approach I took in the PR linked above is to add a tiny bit of logic to the module in this conversion. The ambiguous value is stored in a custom field, then via Grok I try to grab an IP out of the field, and store that in source.ip. If that grok fails, the fallback is to grab whatever's there and store it in source.domain. This happens here
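
As a rough sketch of that logic in Python (the actual module uses Grok in an ingest pipeline, so this is only an approximation; the function name `split_host` is hypothetical, and the custom field name is taken from the Traefik field mentioned later in this thread):

```python
import ipaddress


def split_host(raw: str) -> dict:
    """Map an ambiguous host value onto source.ip or source.domain."""
    # The ambiguous original value is kept untouched in a custom field.
    fields = {"traefik.access.remote_ip": raw}
    try:
        ipaddress.ip_address(raw)  # raises ValueError if raw is not an IP
        fields["source.ip"] = raw
    except ValueError:
        # Fallback: whatever is there goes to source.domain.
        fields["source.domain"] = raw
    return fields
```

Calling `split_host("10.0.0.1")` fills source.ip, while `split_host("proxy.example.com")` fills source.domain; in both cases the custom field keeps the original value.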

Member

I see, but having hosts in two fields can be a source of problems:

  • Queries and aggregations across both fields will be more complicated, and they can be a common use case (for example, a visualization that shows the top N hosts, including IPs and domains).
  • Every module is obliged to implement logic like this and to use more complicated queries to comply with ECS, even if the author doesn't care much whether a host is an IP or a domain. Alternatively, they can use a custom field, but I think this is going to be a common use case, so it would be nice to have a field for it covered by ECS.

We could have a field for "ambiguous" hosts in addition to domain and ip, so all cases can be covered. Modules that try to parse the host as domain or ip could still keep the original in the ambiguous field, while modules that don't care can use the less-featured ambiguous keyword field.
Even if it is ambiguous and not fully compliant with standards, this field can still be common and useful.
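
To make the "top N hosts" concern concrete, here is a minimal in-memory Python sketch (not an Elasticsearch aggregation; the sample events and the `host_of` helper are assumptions made for the example). With the host split across two fields, the values have to be coalesced before counting; with a single ambiguous field, the counting step alone would be enough.

```python
from collections import Counter

# Hypothetical events: each one has either source.ip or source.domain.
events = [
    {"source.ip": "10.0.0.1"},
    {"source.domain": "proxy.example.com"},
    {"source.ip": "10.0.0.1"},
    {"source.ip": "10.0.0.1"},
    {"source.domain": "proxy.example.com"},
    {"source.ip": "10.0.0.2"},
]


def host_of(event: dict) -> str:
    """Coalesce the two fields back into one host value."""
    return event.get("source.ip") or event.get("source.domain") or ""


# Top 2 hosts, counting IPs and domains together.
top_hosts = Counter(host_of(e) for e in events).most_common(2)
```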

Contributor Author

Yes, I agree that having a field for the ambiguous information would be useful.

It used to be hostname, which was an incorrect name, as demonstrated in #166 (see the RFC survey in the issue body).

And we're back to square one: the "correct" name according to the RFCs is "host", which is defined as either an IP or a hostname. And we don't want to use "host" because it will conflict with the current host field set, if people want to enrich their stream by nesting known host information there.

My best name for the ambiguous field at this time is host_address. But we haven't gotten together yet to plan the next steps towards ECS Beta 2.

Note also that the simplest way to avoid this problem now is to store the ambiguous "host" in a custom field. To get back to my Traefik module example, I'm leaving the incorrectly named traefik.access.remote_ip field there, with the exact same meaning as before.
