Doing geoip on more than one address... #9

Closed
ave19 opened this issue May 30, 2018 · 35 comments

Comments

@ave19

ave19 commented May 30, 2018

Here's another topic for discussion!

We have events with more than one IP and want to do geoip on all of them. Instead of being a "top level" field, what about source.geoip and such? Attach the geoip block to the thing it belongs to.

@praseodym
Contributor

praseodym commented May 30, 2018

One example of this is the Logstash Netflow module which puts GeoIP information in the geoip_src and geoip_dst groups for source and destination addresses, respectively.

Also, the filebeat-apache-access use case in this repository mentions this scenario.

@ave19
Author

ave19 commented May 30, 2018

Hrmm... That's not bad... but you'd want the part after the underscore to match the top level object it was related to like geoip_source. But I think a geoip lookup based on source.ip should be in source.geoip to keep things together in the structure. Since you already have source as an object type and ip under (inside?) it, you've got room for all the other things you want to attach to that source point like source.whois and source.fqdn etc.

It's also very convenient when coding when you can take a JSON, parse it into a dictionary/map and build a class out of one of the spots in the tree. { "source": { ... } } becomes a Source class object and everything you need to know about Source is below that spot in the structure. If you put it over in geoip_src you have to know where data about your source might be stashed around the structure. It may not be obvious they're related.

+$0.02
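
The "build a class out of one spot in the tree" idea can be sketched in Python roughly as follows. This is only an illustration: the `Source` and `GeoInfo` classes, the sub-field names under `geoip`, and the sample values are all made up here and are not taken from ECS.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeoInfo:
    # Hypothetical sub-fields; the real geoip field set may differ.
    city_name: Optional[str] = None
    country_iso_code: Optional[str] = None


@dataclass
class Source:
    ip: str
    geoip: Optional[GeoInfo] = None

    @classmethod
    def from_dict(cls, node: dict) -> "Source":
        # Everything needed to build the object lives below this one node.
        geo = node.get("geoip")
        return cls(ip=node["ip"], geoip=GeoInfo(**geo) if geo else None)


# Made-up sample event using a documentation IP address.
raw = (
    '{"source": {"ip": "203.0.113.7", '
    '"geoip": {"city_name": "Berlin", "country_iso_code": "DE"}}}'
)
event = json.loads(raw)
source = Source.from_dict(event["source"])
```

With `geoip_src` at the top level, `from_dict` would need the whole event and some knowledge of which sibling keys belong to the source.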

@praseodym
Contributor

Agreed!

My previous comment was mostly meant to support making the case for multiple geoip groups in a single event. I wasn’t trying to imply that using geoip_src is a good thing, just that the Logstash Netflow module is a use case that should be taken into account. Sorry for any confusion!

@ave19
Author

ave19 commented May 30, 2018

Oh totes, no worries mate! Just thinking it through

@ruflin
Contributor

ruflin commented May 31, 2018

The idea of the geoip fields is that they are reusable at every level. The same applies, for example, to the user or url fields. The description of the url fields mentions how they could be used: https://github.com/elastic/ecs#url

We should update the geoip description to make it clear that these are "reusable". So the structure I would suggest for your use case would be source.geoip.* and destination.geoip.*. Does that make sense?
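
A minimal sketch of what the suggested `source.geoip.*` / `destination.geoip.*` layout could look like at enrichment time. The lookup table, the `lookup` and `enrich` helpers, and the sub-field names are assumptions for illustration; a real pipeline would query a GeoIP database or some other location source.

```python
from typing import Optional

# Stand-in lookup table with made-up data for documentation IP addresses.
FAKE_GEO_DB = {
    "203.0.113.7": {"city_name": "Berlin", "country_iso_code": "DE"},
    "198.51.100.9": {"city_name": "Toronto", "country_iso_code": "CA"},
}


def lookup(ip: str) -> Optional[dict]:
    """Hypothetical lookup; returns None when the address is unknown."""
    return FAKE_GEO_DB.get(ip)


def enrich(event: dict, prefixes=("source", "destination")) -> dict:
    """Attach a geoip block next to each ip it was derived from."""
    for prefix in prefixes:
        node = event.get(prefix)
        if not node or "ip" not in node:
            continue
        geo = lookup(node["ip"])
        if geo is not None:
            node["geoip"] = dict(geo)  # copy so events do not share state
    return event


event = enrich({
    "source": {"ip": "203.0.113.7"},
    "destination": {"ip": "198.51.100.9"},
})
```

The same `geoip` block shape is reused under both prefixes, so code that reads one can read the other.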

@ave19
Author

ave19 commented May 31, 2018

Oh, I see, that does help. Your suggestion is where I think I was going anyway! Thanks

Are all the documented structures stackable like that?

@ruflin
Contributor

ruflin commented Jun 1, 2018

You can definitely stack all the structures together, but it does not make sense for all prefixes, I think.

@praseodym
Contributor

If we were to think of ECS as stackable structures, source and destination are definitely the same structure.

@ruflin
Contributor

ruflin commented Jun 1, 2018

Yes, and it's one of the reasons I'm struggling with the two. Should it be source.host.* and destination.host.* for example?

@ave19
Author

ave19 commented Jun 1, 2018

Hypothetically, if ECS were an Elasticsearch plugin that added these fields as data types (so that host and geoip become ecs-host and ecs-geoip data types), then you could define them inside a structure wherever they occur. Would all the sub-fields come along? Have your cake and eat it too?

Instead of...

"network": {
  "properties": {
    "outbound": {
      "properties": {
        "bytes": {
          "type": "double"
        }
      }
    },
    "protocol": {
      "type": "keyword"
    }
  }
...etc...

you could say

{
  "network": {
    "type": "ecs-network"
  }
}

And define source.host as ecs-host, etc.
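
The `ecs-host` / `ecs-network` types above are hypothetical. One way to approximate the idea without a plugin is to expand such placeholders on the client side before sending the mapping. The sketch below assumes a small, made-up registry of field sets; the field lists are illustrative and not the actual ECS definitions.

```python
import copy

# Hypothetical "mini templates"; field lists are illustrative only.
FIELD_SETS = {
    "ecs-host": {
        "properties": {"ip": {"type": "ip"}, "name": {"type": "keyword"}}
    },
    "ecs-geoip": {
        "properties": {
            "city_name": {"type": "keyword"},
            "location": {"type": "geo_point"},
        }
    },
}


def expand(mapping: dict) -> dict:
    """Recursively replace {"type": "ecs-..."} placeholders with full definitions."""
    out = {}
    for name, spec in mapping.items():
        if isinstance(spec, dict) and spec.get("type") in FIELD_SETS:
            # Deep copy so each location gets its own independent definition.
            out[name] = copy.deepcopy(FIELD_SETS[spec["type"]])
        elif isinstance(spec, dict):
            out[name] = expand(spec)
        else:
            out[name] = spec
    return out


compact = {
    "source": {
        "properties": {
            "host": {"type": "ecs-host"},
            "geoip": {"type": "ecs-geoip"},
        }
    },
    "destination": {"properties": {"host": {"type": "ecs-host"}}},
}
full = expand(compact)
```

After expansion, `source.host` and `destination.host` carry identical sub-fields, which is the "all the sub-fields come along" behaviour asked about.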

@ruflin
Contributor

ruflin commented Jun 4, 2018

Interesting idea, it's like defining mini templates for types.

@robcowart

I wouldn't use the Logstash Netflow Module as an example. It represents old thinking on how to best solve these problems.

@ruflin
Contributor

ruflin commented Jun 12, 2018

@robcowart Can you share some thoughts on how you would change the approach today?

@devinbfergy

I think it makes the most sense for the geoip to be under its corresponding information. I am for the source.geoip and destination.geoip.

I understand that source and destination are the same structure, but they also differ in that one is moving towards the other. Thus, they could both be placed under a higher-level structure that has source and/or destination underneath it.

@ruflin
Contributor

ruflin commented Jun 29, 2018

@megadevx Agree on what you proposed for geoip, see #9 (comment)

Assuming we would put source / destination under one structure, any ideas on naming?

@devinbfergy

@ruflin Just as a suggestion, they could both be put underneath the event structure, as the source and destination are aspects of the event that describe where it came from and where it was going. Again, I'm not totally sure this will always make the most sense.

@webmat
Contributor

webmat commented Jun 29, 2018

Even in cases where geoip always makes sense only for the same address (remote/client in the case of web server logs), saving geolocation data in source.geoip still makes sense.

So I actually think that the official stance should be "you should save your geoip under the appropriate structure (source or destination)". This approach will work for all situations.

We can give leeway and say "if your use case only ever has one of the two structures where geoip is appropriate, you can save the geoip data at the top level". This approach can be interesting because it offers a shorthand, in a way. Field names will be shorter, helping make visualizations more compact.

@devinbfergy

@webmat That would be true about the visualizations being more compact. Still, I see the geolocation data being underneath the corresponding structure (source/destination) with the ip used to do that lookup as more sensible as opposed to being top level. I think that if someone wants the geolocation data to be top level then that should be something they could choose to do, but won't be standard within ECS.

@robcowart

Why even include the term "geoip"? This makes an assumption about the source of the data filling this field. While it could come from a GeoIP DB like Maxmind, it could just as easily come from an inventory system or other source (like a Logstash translate filter). Don't forget GeoIP DBs only include public IP addresses. However there are plenty of users interested in populating such fields for private IP addresses as well. For this reason, I would recommend something as simple as: source.city, source.country, etc.

What is also missing here is the fact that the source and destination are actually attributes of a higher level object... a network connection. In our solution - which is already proven by our dozens of out-of-the-box integrations, upon which we provide a single fully-integrated Kibana, ML and Alerting experience across all data sources, and which is used successfully in production at dozens of customers - these fields are named conn.src_city, conn.src_country, etc. This same method is also used in ElastiFlow, albeit with a simplified schema only for flow data. ElastiFlow is deployed successfully by 1000+ users, so I feel pretty confident that this is the way to go.

Since nesting was discussed in this thread I will say that i.would.be.wary.of.the.overuse.of.nested.fields, which I see in Beats and ECS. It feels like some kind of containment model is being forced onto the data when there is nothing in the stack that can actually take advantage of it. Consider this... as a user, would you rather work with a field called system.cpu.total.norm.pct or simply cpu.util? In our experience, which includes solutions other than just Elastic, the preference would clearly be for the latter. In my opinion ECS should not be leaning on the ideas from Beats; rather, ECS should be designed to apply to ANY data source, and Beats should be modified to support it. While there could be potential use-cases for some degree of nesting, features would have to be added in other components to take advantage of it (if this is planned, let's hear about it). Even then the nesting is arguably still excessive in some places.

@ruflin
Contributor

ruflin commented Jul 2, 2018

I like the idea of dropping geoip as I agree it's too specific and the fields can come from different tools, not only GeoIP. One thing I would like to keep is grouping them together so they can be reused. Something like location.city or geo.city?

@MikePaquette What is your take on attaching source and destination to connection? Do we have any use cases where source / destination would not be related to a connection?

@robcowart We are on the same page that ECS should be for any data and not only Beats. Obviously it was inspired by Beats but goes further. The heavy nesting often happens with metrics as they are very specific, but I don't expect many metrics to end up in ECS. An example you might like is that process.id is on the top level: https://github.com/elastic/ecs/blob/master/schemas/process.yml In Beats it is under system, but it was definitely too hidden there and system was not generic enough. Beats will migrate to ECS in the long run, but parts will be breaking changes and parts we will be able to cover with the upcoming field alias feature.

@ave19
Author

ave19 commented Jul 2, 2018

@robcowart

  • First, you're right, source and dest apply to packets. Most of our tools capture sessions (aka flows) between endpoints.
  • Second, wow. Send me a note, I'd like to give you my .mil email so we can chat about your thoughts.

@MikePaquette
Contributor

Please allow me to comment here, and if necessary, we can fork further discussion into multiple issues:

GeoIP - there are really two questions here:

  1. In which ECS fields do we put IP-based location information when it does not come from GeoIP? @robcowart makes a good point - we should have fields for location-based information under our existing source and destination field sets. I would be +1 for source.city, source.country, and destination.city, destination.country to be added to ECS. These fields would be populated whether the log data is enriched with Geo IP or some other source of IP-based location info, and would allow us to develop location-based analytics regardless of which enrichment source was used.
  2. Where do we put actual GeoIP information if we do have it? - I think @ruflin is correct in #9 (comment) saying that the geoip field set is meant to be “reusable” and can be positioned in its entirety “under” the source or destination field sets as source.geoip.* and destination.geoip.*. Note that in the case where actual Geo IP information is available, the city and country information would be copied from the original Geo IP fields to the appropriate new source and/or destination city and country fields described above.

Source, Destination, and Connections:
While I agree that source and destination literally apply to network packets, and that indeed many common data sources will capture flows or connections, in ECS, we want to make available the source.* and destination.* fields for analytics even when they come from other or multiple types of events, such as host-based events or network device-based events that are not connections. So we propose not including flow. or conn. explicitly, but rather making them implicit - thus “promoting” source. and destination. as top level field sets. So I would not be in favor of attaching source and destination to connection for ECS.

Too.many.dots and nesting
@robcowart I hear you loud and clear on this one. (Interestingly, the first draft of ECS had zero dots in common field names - all underscores.) However, we decided to take advantage of Elasticsearch objects to benefit storage and convenience, and introduced dots. I personally reconcile the "how many dots are used?" issue by thinking of ECS as having two classes of field names:

  1. The common fields that users are likely to use in analytics (searches, dashboards, alerts, ML jobs, reports, etc.). Ideally we limit these to just one dot, e.g., source.city, source.country, with a limited number of common field sets, so they will be easy for users/analysts to remember.
  2. Fields that are created by extending an existing, potentially multi-level, nested structure under one of the ECS field sets. These can have any number of dots, but would still map under an ECS field set. e.g., source.geoip.city_name. This allows implementers/users/analysts who are familiar with a nested structure to still access it for detailed information that may not be contained in the common fields above.
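
The "copy from the detailed GeoIP fields to the common fields" step described above could be sketched like this. The `promote_geo` helper and the `PROMOTIONS` table are hypothetical; `city_name` appears in this thread, while `country_name` is only a guess at its sibling field.

```python
# Assumed mapping from detailed GeoIP sub-fields to the proposed common fields.
# `city_name` appears in the thread; `country_name` is an assumption.
PROMOTIONS = {"city_name": "city", "country_name": "country"}


def promote_geo(event: dict, prefixes=("source", "destination")) -> dict:
    """Copy selected geoip sub-fields up to one-dot common fields."""
    for prefix in prefixes:
        node = event.get(prefix, {})
        geoip = node.get("geoip", {})
        for detailed, common in PROMOTIONS.items():
            # Do not overwrite a value another enrichment source already set.
            if detailed in geoip and common not in node:
                node[common] = geoip[detailed]
    return event


# Made-up sample: the source was enriched via GeoIP, the destination is a
# private address whose location came from some other system.
event = promote_geo({
    "source": {
        "ip": "203.0.113.7",
        "geoip": {"city_name": "Berlin", "country_name": "Germany"},
    },
    "destination": {"ip": "10.0.0.5", "city": "Lab 2"},
})
```

Analytics can then rely on `source.city` and `destination.city` regardless of where the value came from, while the detailed block stays available under `source.geoip.*`.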

@robcowart

I am curious, how many integrations of different data sources have been developed and tested to validate the assumptions on which ECS, in its current form, is based? In this case "tested" would mean that a single dashboard or ML job can work seamlessly across all data normalized to the model.

@MikePaquette
Contributor

@robcowart can't say for sure, but certainly a small number at this point. Our goal is to grow that number with the help of contributors like yourself, making appropriate changes along the way, so that we can bring this to a v1.0 release.

@robcowart

I think this is your challenge at the moment. Until you take at least 15-18 different data sources and try to make them all work together seamlessly across multiple use-cases, you can't even begin to understand whether the schema is headed down the right path. I am certain that if this is done, you will feel differently about many of the things discussed above. I say this because nearly 24 months ago, when I was near the stage Elastic is now, I was making some of the same assumptions, and these assumptions turned out to be bad ideas. Believe me it really stinks when you realize from integration #15 that to do it right you now have to go back and refactor integrations #1-14. Been there... done that. But that is how to create a schema that will let users do some incredibly useful things.

@ruflin
Contributor

ruflin commented Jul 3, 2018

The reason ECS is open source is that I don't think something like ECS should be developed behind closed doors, as our users have the most interesting use cases. The feedback and contributions from the community, and having exactly the discussions we have in this thread, are crucial for the success of ECS. I strongly believe this is what makes ECS work in many different ways, as we do not have only a few implementations relying on it. @robcowart I really appreciate all your feedback on ECS and the parts you shared from your schema. I hope that long term the two schemas potentially converge, or at least that we can use the upcoming alias types to point one to the other.

Let's get this thread back to the initial discussion: What should we do if we have multiple geoip fields?

@praseodym
Contributor

I'm actually backtracking on the idea that we need to have multiple geoip fields. A common security analytics use case, with flows from and to a local infrastructure, would be better suited to a single geoip field, because it makes a lot of sense to display both ingress and egress traffic using the same visualisations. Differentiation in directionality can then be made by filtering the visualisation on some other field.

@robcowart

We found that you need both options to handle multiple use-cases. It looks like this:

[Screenshot: screen shot 2018-07-03 at 20 09 07]

@webmat
Contributor

webmat commented Jul 3, 2018

The flow direction is determined by the initiator of the connection, AFAIK. So I would expect both pools of locations not to be closely related.

In most cases my public destinations should essentially be where my service providers are*, while my public sources would be the clients of my systems.

* Exception: cases like my system is offering webhooks.

So for this reason, I actually think it's important to maintain both pools independently. But I do see also why being able to filter for "all Russian traffic" can be interesting too. Having two fields only makes it slightly more annoying, though, not impossible. source.country_code:RU or destination.country_code:RU.

Your point about using common visualizations for both is a good one, however.

@robcowart, does your approach determining the server & client side (going beyond src/dest) help with sharing visualizations? I would guess the same problem exists there.
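
The two-field filter mentioned above (`source.country_code:RU or destination.country_code:RU`) can also be written as an Elasticsearch query body. A sketch in Python, assuming both fields exist and are mapped as `keyword`; the `either_end_country` helper is hypothetical.

```python
import json


def either_end_country(code: str) -> dict:
    """Build a search body matching events where either end is in `code`."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"term": {"source.country_code": code}},
                    {"term": {"destination.country_code": code}},
                ],
                # At least one of the two clauses has to match.
                "minimum_should_match": 1,
            }
        }
    }


body = either_end_country("RU")
print(json.dumps(body, indent=2))
```

Keeping the two pools separate costs one extra clause here, which matches the "slightly more annoying, not impossible" assessment.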

@robcowart

@webmat You explained what I was showing. If for example you need a single donut visualization to show all locations regardless of which end of the connection, use the conn.city field. However if you want to create a sankey diagram with Vega to show traffic from source cities to destination cities, they need to be separate fields.

Basically src and dst are just pulled from the packet and protocol headers. With unidirectional flows src and dst will be for that direction. Bidirectional records can be less consistent, especially where packets are sampled as opposed to looking at every packet. In a bi-directional sampled record src and dst will usually be from the headers of the first packet observed, which might not be the initiating packet.

Regarding client/server, most of our dashboards are based on client/server not src/dst, although we provide both options where it makes sense. So there are client and server fields as well:

[Screenshot: screen shot 2018-07-03 at 20 31 35]

I don't really care for having to create all of this duplicate data. Potentially a flag field could be used such as conn.server_end which would be either src or dst. Unfortunately this doesn't work in Kibana. While a scripted field could be used to populate the client/server fields, not all visualizations support scripted fields. There is also the issue of how all of the scripted fields would impact query performance.
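
To illustrate the trade-off, here is a rough sketch of deriving the client/server copies at ingest time from a `conn.server_end` flag. The nested `conn.src` / `conn.dst` layout and the `add_client_server` helper are assumptions made for this example and are not the actual schema described above.

```python
def add_client_server(event: dict) -> dict:
    """Copy src/dst into client/server based on the conn.server_end flag."""
    conn = event["conn"]
    server_end = conn.get("server_end")
    if server_end not in ("src", "dst"):
        return event  # direction unknown, leave the event untouched
    client_end = "dst" if server_end == "src" else "src"
    # Duplicating the data is what lets every visualization use it directly.
    conn["server"] = dict(conn[server_end])
    conn["client"] = dict(conn[client_end])
    return event


# Made-up flow record using documentation IP addresses.
event = add_client_server({
    "conn": {
        "server_end": "dst",
        "src": {"ip": "203.0.113.7", "port": 51544},
        "dst": {"ip": "198.51.100.9", "port": 443},
    }
})
```

This trades index size for query-time simplicity, which is the duplication being objected to.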

@ruflin
Contributor

ruflin commented Jul 12, 2018

The more I think about what @robcowart is doing and having everything under connection instead of network, the more I like it. Connection is in a lot of ways more fitting than network, for example also here: https://github.com/elastic/ecs/pull/43/files#diff-644c2fc5e13de920c8ffaa74ec66aac4R30

Connection could also be a good place for destination and source and potentially client and server if we need it.

Let's assume we have a geo (instead of geoip) object. The example above from Rob would then look as follows:

geo.city: All cities existing in one field if needed
conn.src.geo.city
conn.dest.geo.city
conn.server.geo.city
...

So the same geo object would be used in all places. The same could be applied to the host object:

host.ip: All IPs in one field
conn.src.host.ip
conn.dest.host.ip
conn.server.host.ip
...

This means basically where the host object is used, it looks the same. The same for any other object that is reused.

I used abbreviations here to be more similar to what Rob proposed even though ECS currently recommends not to do it (other discussion).
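
A sketch of how the "all values in one field" rollups (`geo.city`, `host.ip`) could be filled from the nested `conn.*` objects. The `rollup` helper and the sample event are hypothetical, and the abbreviated names follow the example above.

```python
def rollup(event: dict, ends=("src", "dest", "server"), path=("geo", "city")) -> dict:
    """Collect one nested value from every connection end into a top-level list."""
    values = []
    for end in ends:
        node = event.get("conn", {}).get(end, {})
        for key in path:
            node = node.get(key, {}) if isinstance(node, dict) else {}
        # Skip missing values and avoid duplicates while keeping order.
        if node and node not in values:
            values.append(node)
    # Write the rollup at the same path at the top level, e.g. geo.city.
    target = event
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = values
    return event


# Made-up sample event.
event = {
    "conn": {
        "src": {"geo": {"city": "Berlin"}, "host": {"ip": "203.0.113.7"}},
        "dest": {"geo": {"city": "Toronto"}, "host": {"ip": "198.51.100.9"}},
        "server": {"geo": {"city": "Toronto"}, "host": {"ip": "198.51.100.9"}},
    }
}
rollup(event)
rollup(event, path=("host", "ip"))
```

Because the `geo` and `host` objects look the same wherever they appear, one helper covers both rollups.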

@ruflin
Contributor

ruflin commented Jul 17, 2018

Inspired by this thread I opened an issue to potentially rename geoip: #50 Any feedback welcome.

@MikePaquette
Contributor

This discussion raises the issue that we need to agree upon the conceptual schema we are using for ECS, including defining our entity classes. In various issues in this repo, we've casually referred to the concepts of "top level" field sets and "re-usable" field sets, but we have not yet specified formal relationship assertions for these various object types, which are going to be vital if we want ECS to be a useful and long-lasting data model.

The reason I bring this up here is because I've been assuming (and documented in early pre-repo drafts of ECS) that the host.*, network.*, source.*, and destination.* objects are all "top level" objects/field sets, but the examples above, e.g., conn.dest.host.ip, break those assumptions in several places.

I will create a new issue in this repo, and propose a conceptual schema that I think will help us gain clarity in these individual discussions. In that issue, we can continue the discussion of whether we add a new connection.* object, and if so, which entity class it falls into within the ECS conceptual schema. I'll tag this issue once the new issue is created - trying to get this in before PTO, but if not, next week!

ruflin added a commit to ruflin/ecs that referenced this issue Jul 17, 2018
There have been several discussions around source, destination and connection recently, especially in elastic#9. The conclusion from my side is that source and destination normally belong to a connection and we are actually missing a connection prefix. Also some information from network like `forward_ip` belongs more to a connection.

An additional change I made to source and destination is that they both now contain a host prefix. All the fields in source and destination also exist in `host`. The host prefix can be reused here too. This makes ECS very predictable in that every time `host.*` shows up it will contain the same fields. Also source and destination could contain additional data like the location, see elastic#50 for more details.

The connection fields now look as follows:

```
| Field  | Description  | Type  | Multi Field  | Example  |
|---|---|---|---|---|
| <a name="connection.destination.host.ip"></a>`connection.destination.host.ip`  | IP address of the destination.<br/>Can be one or multiple IPv4 or IPv6 addresses.  | ip  |   |   |
| <a name="connection.destination.host.name"></a>`connection.destination.host.name`  | Hostname of the destination.  | keyword  |   |   |
| <a name="connection.destination.host.port"></a>`connection.destination.host.port`  | Port of the destination.  | long  |   |   |
| <a name="connection.destination.host.mac"></a>`connection.destination.host.mac`  | MAC address of the destination.  | keyword  |   |   |
| <a name="connection.destination.host.domain"></a>`connection.destination.host.domain`  | Destination domain.  | keyword  |   |   |
| <a name="connection.destination.host.subdomain"></a>`connection.destination.host.subdomain`  | Destination subdomain.  | keyword  |   |   |
| <a name="connection.source.host.ip"></a>`connection.source.host.ip`  | IP address of the source.<br/>Can be one or multiple IPv4 or IPv6 addresses.  | ip  |   |   |
| <a name="connection.source.host.name"></a>`connection.source.host.name`  | Hostname of the source.  | keyword  |   |   |
| <a name="connection.source.host.port"></a>`connection.source.host.port`  | Port of the source.  | long  |   |   |
| <a name="connection.source.host.mac"></a>`connection.source.host.mac`  | MAC address of the source.  | keyword  |   |   |
| <a name="connection.source.host.domain"></a>`connection.source.host.domain`  | Source domain.  | keyword  |   |   |
| <a name="connection.source.host.subdomain"></a>`connection.source.host.subdomain`  | Source subdomain.  | keyword  |   |   |
| <a name="connection.direction"></a>`connection.direction`  | Direction of the network traffic.<br/>Recommended values are:<br/>  * inbound<br/>  * outbound<br/>  * unknown  | keyword  |   | `inbound`  |
| <a name="connection.forwarded_ip"></a>`connection.forwarded_ip`  | Host IP address when the source IP address is the proxy.  | ip  |   | `192.1.1.2`  |
```
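
As a rough illustration of the table above, the sketch below checks a `connection` object against a few of the listed types. The `check_connection` helper is hypothetical, and treating the recommended `direction` values as the only accepted ones is an assumption made for the example.

```python
import ipaddress

# Recommended values from the table; treated as a closed set here (assumption).
ALLOWED_DIRECTIONS = {"inbound", "outbound", "unknown"}


def check_connection(conn: dict) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if conn.get("direction", "unknown") not in ALLOWED_DIRECTIONS:
        problems.append("direction is not one of the recommended values")
    for end in ("source", "destination"):
        host = conn.get(end, {}).get("host", {})
        ips = host.get("ip", [])
        # The table allows one or multiple addresses, so accept both shapes.
        for ip in [ips] if isinstance(ips, str) else ips:
            try:
                ipaddress.ip_address(ip)
            except ValueError:
                problems.append(f"{end}.host.ip is not a valid IP: {ip!r}")
        port = host.get("port")
        if port is not None and not 0 <= port <= 65535:
            problems.append(f"{end}.host.port out of range: {port}")
    return problems


# Made-up sample connections.
good = {
    "direction": "inbound",
    "source": {"host": {"ip": "203.0.113.7", "port": 51544}},
    "destination": {"host": {"ip": ["198.51.100.9", "2001:db8::1"], "port": 443}},
}
bad = {
    "direction": "sideways",
    "source": {"host": {"ip": "not-an-ip", "port": 70000}},
}
```

Here `check_connection(good)` returns an empty list, while `check_connection(bad)` reports the direction, the address, and the port.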

I opened a PR to discuss this instead of an issue as it will allow us to discuss the high level parts as comments but also details directly in the code.
@ruflin
Contributor

ruflin commented Jul 17, 2018

I just opened #51 to continue the discussion around a connection prefix.

@MikePaquette I hope that fits well into your proposal.

@ebeahan
Member

ebeahan commented Aug 4, 2020

The original intent here has been solved. If anyone feels like the discussion should continue, please open a new issue 😄

@ebeahan ebeahan closed this as completed Aug 4, 2020