Remove src/dst.hostname, rename url.hostname to url.domain. #175
If you folks want me to separate the hostname work from the various cleanups included here, I can do that as well. The cleanups seemed straightforward, though.
Removing this part of the description for now, until we decide otherwise.
bd73a5b to 893defb
| <a name="url.scheme"></a>url.scheme | Scheme of the request, such as "https".<br/>Note: The `:` is not part of the scheme. | extended | keyword | `https` | |
| <a name="url.hostname"></a>url.hostname | Hostname of the request, such as "elastic.co".<br/>In some cases a URL may refer to an IP and/or port directly, without a domain name. In this case, the IP address would go to the `hostname` field. | extended | keyword | `elastic.co` | |
| <a name="url.domain"></a>url.domain | Domain of the request, such as "www.elastic.co".<br/>In some cases a URL may refer to an IP and/or port directly, without a domain name. In this case, the IP address would go to the `domain` field. | extended | keyword | `www.elastic.co` | |
Is this still the case? Should IP addresses be stored in the `domain` field?
I think that having sources that don't differentiate names from IPs is quite frequent when parsing logs, so having IPs stored in a field called `domain` can be a bit confusing.
@jsoriano The `url.domain` field should generally not be populated with an IP address. However, when an event or log indicates that an IP address was used in a URL as the host/hostname, such as in `http://15.73.192.108/index.html`, then the string `15.73.192.108` would be entered into the `url.domain` field. This is useful, for example, for a security analyst to know that a user or system is attempting to bypass the DNS infrastructure to access a web resource. In some cases, this could be an indicator of malware activity. So searching through all values of `url.domain` and looking for values that have the form of an IP address can be a useful analysis.
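As a rough illustration of that kind of analysis, here is a minimal Python sketch that filters a list of `url.domain` values down to the ones that parse as IP addresses. The helper name `looks_like_ip` and the sample values are assumptions made up for this example, and stripping square brackets for IPv6 literals is also an assumption about how such values might be stored.

```python
import ipaddress


def looks_like_ip(value: str) -> bool:
    """Return True if the value parses as an IPv4 or IPv6 address."""
    try:
        # Assumption: IPv6 literals may arrive in URL form, e.g. "[2001:db8::1]".
        ipaddress.ip_address(value.strip("[]"))
        return True
    except ValueError:
        return False


# Hypothetical url.domain values collected from events.
domains = ["www.elastic.co", "15.73.192.108", "elastic.co", "[2001:db8::1]"]

# Keep only the values that have the form of an IP address.
ip_like = [d for d in domains if looks_like_ip(d)]
print(ip_like)
```

In a real deployment this filtering would more likely be expressed as a query against the index; the sketch only shows the classification step.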
@jsoriano We were in a tough spot and couldn't actually call it `host`, as this concept of "name or IP" is generally called. In ECS, `host` is a field set that can be reused in multiple places.
Just to be clear, the concept of "hostname" is often misused in exactly the same way. The definition of a "host" is "IP address or hostname" (check out the RFC survey I summarized in #166).
An IP address doesn't belong in a "hostname" either. Using "hostname" this way is a mistake that's more common, though, so it doesn't stick out quite as much when people see it.
Thanks for the explanations. I see the problem and the reasoning, but using `domain` to store IPs still sounds a bit weird to me. In any case, I don't have a better proposal if we cannot use `host`, but I'm afraid this can also be a bit confusing for users.
Maybe after moving some beat modules to ECS and playing a bit more with it I see it in a different way :)
Will it be the same for source and destination? Maybe we should add some explanations about this also there.
@jsoriano Under source and destination, we actually have both `domain` and `ip`. The transition I started for the Traefik module in elastic/beats#9005 outlines a difference in approach.
Most of the modules dealt with this indirectly: they named a `keyword` field something like `remote_ip` and attempted to do GeoIP on it. In other words, the duality of this value being either an IP or a hostname was not addressed. Hopefully IN's GeoIP just fails gracefully when it encounters a hostname (I haven't checked). So I think in most of the modules right now, the lie is the other way around ;-) The `remote_ip` field can contain either an IP or a hostname, and the field is `keyword` to support this, which ultimately means you can't do CIDR searches on it and so on.
The approach I took in the PR linked above is to add a tiny bit of logic to the module in this conversion. The ambiguous value is stored in a custom field, then via Grok I try to grab an IP out of the field and store that in `source.ip`. If that grok fails, the fallback is to grab whatever's there and store it in `source.domain`. This happens here
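The fallback logic described above can be sketched roughly as follows. Note that the linked PR does this with Grok inside the module; this sketch swaps in Python's `ipaddress` module purely to illustrate the "try IP first, otherwise domain" decision, and the function name `split_host` is hypothetical.

```python
import ipaddress


def split_host(raw: str) -> dict:
    """Map an ambiguous host value to source.ip or source.domain."""
    try:
        # If the value parses as an IP address, it belongs in source.ip.
        ipaddress.ip_address(raw)
        return {"source.ip": raw}
    except ValueError:
        # Fallback: store whatever is there in source.domain.
        return {"source.domain": raw}


print(split_host("10.0.0.1"))
print(split_host("example.org"))
```

Keeping the IP in a dedicated field is what would make IP-specific operations such as CIDR searches possible, assuming that field is mapped as an IP type.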
I see, but having hosts in two fields can be a source of problems:
- Queries on aggregations of both fields will be more complicated, and they can be a common use case (for example to do a visualization that shows top N hosts, including ips and domains).
- Every module is obligated to implement a logic like this one and use more complicated queries to comply with ECS, even if the author doesn't care so much if a host is an IP or a domain. Or they can use a custom field, but I think this is going to be a common use case, so it'd be nice to be a field covered by ECS.
We could have a field for "ambiguous" hosts in addition to `domain` and `ip`, so all cases can be covered. Modules that try to parse the host as `domain` or `ip` could still keep the original in the ambiguous field; modules that don't care can use the less-featured ambiguous keyword field.
Even if ambiguous and not fully compliant with standards this field can still be common and useful.
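To make the aggregation concern concrete, here is a small Python sketch of a "top N hosts" count when the host lives in either `source.ip` or `source.domain`. The event data and the function name `top_hosts` are invented for illustration; the point is only that the two fields have to be coalesced per event before counting.

```python
from collections import Counter

# Hypothetical events: each one carries either source.ip or source.domain.
events = [
    {"source.ip": "10.0.0.1"},
    {"source.domain": "elastic.co"},
    {"source.ip": "10.0.0.1"},
    {"source.domain": "elastic.co"},
    {"source.domain": "elastic.co"},
    {"source.ip": "10.0.0.2"},
]


def top_hosts(events, n):
    """Count hosts by coalescing source.ip and source.domain per event."""
    counts = Counter(
        e.get("source.ip") or e.get("source.domain") for e in events
    )
    # Drop events that carried neither field.
    counts.pop(None, None)
    return counts.most_common(n)


print(top_hosts(events, 2))
```

With a single ambiguous keyword field holding the original value, the coalescing step would not be needed and the count could run over one field.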
Yes, I agree that having a field for the ambiguous information would be useful.
It used to be `hostname`, which was an incorrect name, as demonstrated in #166 (see the RFC survey in the issue body).
And we're back to square one: the "correct" name according to the RFCs is "host", which is defined as either an IP or a hostname. And we don't want to use "host" because it will conflict with the current `host` field set, if people want to enrich their stream by nesting known host information there.
My best name for the ambiguous field for now is `host_address`. But we haven't gotten together yet to plan the next steps towards ECS Beta 2.
Note also that the simplest way to avoid this problem now is to store the ambiguous "host" in a custom field. To get back to my Traefik module example, I'm leaving the incorrectly named `traefik.access.remote_ip` field there, with the exact same meaning as before.
`domain` is now the place to store the host address under `source`, `destination` and `url`.
Also in this PR:
- `url` description talking about reuse. Since we haven't officially made `url` reusable yet, I took it out for now. People are free to reuse everything everywhere, as it's just them defining new fields around the official ECS fields. But the spec making this statement about it in a different manner than the official reusable objects is needlessly confusing.
- `source` description in line with `destination`'s
Closes #166, and addresses part of #84.