Doing geoip on more than one address... #9
Comments
One example of this is the Logstash Netflow module which puts GeoIP information in the Also, the filebeat-apache-access use case in this repository mentions this scenario. |
Hrmm... That's not bad... but you'd want the part after the underscore to match the top level object it was related to like It's also very convenient when coding when you can take a JSON, parse it into a dictionary/map and build a class out of one of the spots in the tree. +$0.02 |
Agreed! My previous comment was mostly meant to support making the case for multiple |
Oh totes, no worries mate! Just thinking it through |
The idea of the geoip fields is that they are reusable at every level. The same applies for example also for We should update the |
Oh, I see, that does help. Your suggestion is where I think I was going anyway! Thanks Are all the documented structures stackable like that? |
You can definitely stack all the structures together, but I don't think it makes sense for all prefixes. |
If we were to think of ECS as stackable structures, |
Yes, and it's one of the reasons I'm struggling with the two. Should it be |
Hypothetically, if ECS was an elasticsearch plugin that added these fields as data type (like Instead of...
you could say
And define source.host as ecs-host, etc. |
Interesting idea, it's like defining mini templates for types. |
I wouldn't use the Logstash Netflow Module as an example. It represents old thinking on how to best solve these problems. |
@robcowart Can you share some thoughts on how you would change the approach today? |
I think it makes the most sense for the I understand that |
@megadevx Agree on what you proposed for geoip, see #9 (comment) Assuming we would put source / destination under one structure, any ideas on naming? |
@ruflin Just as a suggestion, they could both be put underneath the event structure. As the |
Even in cases where geoip always makes sense only for the same address (remote/client in the case of web server logs), saving geolocation data in So I actually think that the official stance should be "you should save your geoip under the appropriate structure (source or destination)". This approach will work for all situations. We can give leeway and say "if your use case only ever has one of the two structures where geoip is appropriate, you can save the geoip data at the top level". This approach can be interesting because it offers a shorthand, in a way. Field names will be shorter, helping make visualizations more compact. |
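To make the "save your geoip under the appropriate structure" option concrete, here is a minimal Python sketch of an event enriched per endpoint. The `lookup_geo` helper, its canned data, and the `country_iso_code` key are hypothetical stand-ins for a real GeoIP lookup; only the idea of nesting the result under `source` and `destination` comes from this discussion.

```python
# Hypothetical stand-in for a GeoIP database; the values are illustrative only.
FAKE_GEO_DB = {
    "203.0.113.7": {"country_iso_code": "DE"},
    "198.51.100.20": {"country_iso_code": "US"},
}


def lookup_geo(ip):
    """Return geo data for an IP, or None if the address is unknown."""
    return FAKE_GEO_DB.get(ip)


def enrich(event):
    """Attach geo data under each endpoint that has a resolvable IP."""
    enriched = dict(event)
    for side in ("source", "destination"):
        endpoint = event.get(side)
        if not endpoint:
            continue
        geo = lookup_geo(endpoint.get("ip"))
        # Copy the endpoint so the input event is not mutated.
        enriched[side] = dict(endpoint, geoip=geo) if geo else dict(endpoint)
    return enriched


event = {
    "source": {"ip": "203.0.113.7"},
    "destination": {"ip": "198.51.100.20"},
}
enriched = enrich(event)
```

With this layout the same enrichment code handles events that carry one endpoint or both, which is the "works for all situations" property argued for above.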
@webmat That would be true about the visualizations being more compact. Still, I see the geolocation data being underneath the corresponding structure ( |
Why even include the term "geoip"? This makes an assumption about the source of the data filling this field. While it could come from a GeoIP DB like Maxmind, it could just as easily come from an inventory system or other source (like a Logstash What is also missing here is the fact that the source and destination are actually attributes of a higher level object... a network connection. In our solution - which is already proven by our dozens of out-of-the-box integrations, upon which we provide a single fully-integrated Kibana, ML and Alerting experience across all data sources, and which is used successfully in production at dozens of customers - these fields are named Since nesting was discussed in this thread I will say that |
I like the idea of dropping @MikePaquette What is your take on attaching source and destination to @robcowart We are on the same page that ECS should be for any data and not only ECS. Obviously it was inspired by it but goes further. The heavy nesting often happens with metrics as they are very specific but I don't expect many metrics to end up in ECS. An example you might like is that |
Please allow me to comment here, and if necessary, we can fork further discussion into multiple issues: GeoIP - there are really two questions here:
Source, Destination, and Connections: Too.many.dots and nesting
|
I am curious, how many integrations of different data sources have been developed and tested to validate the assumptions on which ECS, in its current form, is based? In this case "tested" would mean that a single dashboard or ML job can work seamlessly across all data normalized to the model. |
@robcowart can't say for sure, but certainly a small number at this point. Our goal is to grow that number with the help of contributors like yourself, making appropriate changes along the way, so that we can bring this to a v1.0 release. |
I think this is your challenge at the moment. Until you take at least 15-18 different data sources and try to make them all work together seamlessly across multiple use-cases, you can't even begin to understand whether the schema is headed down the right path. I am certain that if this is done, you will feel differently about many of the things discussed above. I say this because nearly 24 months ago, when I was near the stage Elastic is now, I was making some of the same assumptions, and these assumptions turned out to be bad ideas. Believe me it really stinks when you realize from integration #15 that to do it right you now have to go back and refactor integrations #1-14. Been there... done that. But that is how to create a schema that will let users do some incredibly useful things. |
ECS is open source because I don't think something like ECS should be developed behind closed doors, as our users have the most interesting use cases. The feedback and contributions from the community, and having exactly the discussions we have in this thread, are crucial for the success of ECS. I strongly believe this is what will make ECS work in many different ways, as we will not have just a few implementations relying on it. @robcowart I really appreciate all your feedback on ECS and the parts you shared from your schema. I hope that long term the two schemas converge, or at least that we can use the upcoming alias types to point one to the other. Let's get this thread back to the initial discussion: What should we do if we have multiple geoip fields? |
I'm actually backtracking on the idea that we need to have multiple geoip fields. A common security analytics use case which flows from and to a local infrastructure would be better suited with a single geoip field, because it makes a lot of sense to display both ingress and egress traffic using the same visualisations. Differentiation in directionality can then be made by filtering the visualisation on some other field. |
The flow direction is determined by the initiator of the connection, AFAIK. So I would expect both pools of locations not to be closely related. In most cases my public destinations should essentially be where my service providers are*, while my public sources would be the clients of my systems. * Exception: cases where my system is offering webhooks. So for this reason, I actually think it's important to maintain both pools independently. But I do see also why being able to filter for "all Russian traffic" can be interesting too. Having two fields only makes it slightly more annoying, though, not impossible. Your point about using common visualizations for both is a good one, however. @robcowart, does your approach of determining the server & client side (going beyond src/dest) help with sharing visualizations? I would guess the same problem exists there. |
@webmat You explained what I was showing. If for example you need a single donut visualization to show all locations regardless of which end of the connection, use the Basically src and dst are just pulled from the packet and protocol headers. With unidirectional flows src and dst will be for that direction. Bidirectional records can be less consistent, especially where packets are sampled as opposed to looking at every packet. In a bi-directional sampled record src and dst will usually be from the headers of the first packet observed, which might not be the initiating packet. Regarding client/server, most of our dashboards are based on client/server not src/dst, although we provide both options where it makes sense. So there are client and server fields as well: I don't really care for having to create all of this duplicate data. Potentially a flag field could be used such as |
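One way to picture the duplicated "either end of the connection" field described above is the following Python sketch. The field names `geo_country` and `all_countries` are made up for illustration; they are not the names used in Rob's schema, which this thread does not preserve.

```python
def add_combined_countries(event):
    """Copy the country of each endpoint into one shared list field.

    A single visualization can then aggregate on the combined field,
    while the per-endpoint fields stay available for filtering.
    """
    countries = []
    for side in ("source", "destination"):
        country = event.get(side, {}).get("geo_country")
        # Skip missing values and avoid listing the same country twice.
        if country and country not in countries:
            countries.append(country)
    return dict(event, all_countries=countries)


flow = {
    "source": {"geo_country": "CA"},
    "destination": {"geo_country": "FR"},
}
combined = add_combined_countries(flow)
```

The cost is the one already noted in the comment: the same value is stored more than once per event.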
The more I think about what @robcowart is doing and having everything under connection instead of network, the more I like it. Connection is in a lot of ways more fitting than network, for example also here: https://github.com/elastic/ecs/pull/43/files#diff-644c2fc5e13de920c8ffaa74ec66aac4R30 Connection could also be a good place for destination and source, and potentially client and server if we need them. Let's assume we have a
So the same geo object would be used in all places. The same could be applied to the
This means basically where the I used abbreviations here to be more similar to what Rob proposed even though ECS currently recommends not to do it (other discussion). |
Inspired by this thread I opened an issue to potentially rename |
This discussion raises the issue that we need to agree upon the conceptual schema we are using for ECS, including defining our entity classes. In various issues in this repo, we've casually referred to the concepts of "top level" field sets and "re-usable" field sets, but we have not yet specified formal relationship assertions for these various object types, which are going to be vital if we want ECS to be a useful and long-lasting data model. The reason I bring this up here is because I've been assuming (and documented in early pre-repo drafts of ECS) that the I will create a new issue in this repo, and propose a conceptual schema that I think will help us gain clarity in these individual discussions. In that issue, we can continue the discussion of whether we add a new |
There have recently been several discussions around source, destination and connection, especially in elastic#9. My conclusion is that source and destination normally belong to a connection, and that we are actually missing a connection prefix. Also, some information from network, like `forward_ip`, belongs more to a connection.

An additional change I made to source and destination is that they both now contain a host prefix. All the fields in source and destination also exist in `host`, so the host prefix can be reused here too. This makes ECS very predictable: every time `host.*` shows up, it will contain the same fields. Source and destination could also contain additional data like the location; see elastic#50 for more details.

The connection fields now look as follows:

```
| Field | Description | Type | Multi Field | Example |
|---|---|---|---|---|
| <a name="connection.destination.host.ip"></a>`connection.destination.host.ip` | IP address of the destination.<br/>Can be one or multiple IPv4 or IPv6 addresses. | ip | | |
| <a name="connection.destination.host.name"></a>`connection.destination.host.name` | Hostname of the destination. | keyword | | |
| <a name="connection.destination.host.port"></a>`connection.destination.host.port` | Port of the destination. | long | | |
| <a name="connection.destination.host.mac"></a>`connection.destination.host.mac` | MAC address of the destination. | keyword | | |
| <a name="connection.destination.host.domain"></a>`connection.destination.host.domain` | Destination domain. | keyword | | |
| <a name="connection.destination.host.subdomain"></a>`connection.destination.host.subdomain` | Destination subdomain. | keyword | | |
| <a name="connection.source.host.ip"></a>`connection.source.host.ip` | IP address of the source.<br/>Can be one or multiple IPv4 or IPv6 addresses. | ip | | |
| <a name="connection.source.host.name"></a>`connection.source.host.name` | Hostname of the source. | keyword | | |
| <a name="connection.source.host.port"></a>`connection.source.host.port` | Port of the source. | long | | |
| <a name="connection.source.host.mac"></a>`connection.source.host.mac` | MAC address of the source. | keyword | | |
| <a name="connection.source.host.domain"></a>`connection.source.host.domain` | Source domain. | keyword | | |
| <a name="connection.source.host.subdomain"></a>`connection.source.host.subdomain` | Source subdomain. | keyword | | |
| <a name="connection.direction"></a>`connection.direction` | Direction of the network traffic.<br/>Recommended values are:<br/> * inbound<br/> * outbound<br/> * unknown | keyword | | `inbound` |
| <a name="connection.forwarded_ip"></a>`connection.forwarded_ip` | Host IP address when the source IP address is the proxy. | ip | | `192.1.1.2` |
```

I opened a PR to discuss this instead of an issue, as it will allow us to discuss the high-level parts as comments but also the details directly in the code.
I just opened #51 to continue the discussion around a connection prefix. @MikePaquette I hope that fits well into your proposal. |
The original intent here has been solved. If anyone feels like the discussion should continue, please open a new issue 😄 |
Here's another topic for discussion!
We have events with more than one IP and want to do geoip on all of them. Instead of being a "top level" field, what about
source.geoip
and such? Attach the geoip block to the thing it belongs to.