Added http request and response content-type field #554

Open
wants to merge 1 commit into
base: main


mbudge
Contributor

@mbudge mbudge commented Sep 11, 2019

The content-type field can be useful for identifying what type of data is being transferred over HTTP. The value isn't always accurate, as a web server can set the content-type to any value.

For example, security analysts might want to look at HTTP requests to uncommon domains with the following content-type:
application/x-www-form-urlencoded

Generic malware sometimes sets the content-type to application/x-www-form-urlencoded. In other cases, the request accepts text but the proxy detects an executable being returned.

@webmat
Contributor

webmat commented Sep 23, 2019

Yes, we want to add support (or at least guidance) for HTTP headers in ECS.

Rather than adding content_type directly under request. and response., I think it would make sense to define a place where all headers go, such as http.request.headers.* and http.response.headers.*.

Elastic APM already does this (field defs, sample event). APM nests the captured headers with their original capitalization. This approach is beautiful in its simplicity, but it conflicts with ECS' principle of using only lowercase letters.

There are a few factors at play in thinking about how we want to approach this.

Capitalization

  • Insisting that HTTP headers be lowercased and use underscores (e.g. "Content-Type" becoming "content_type") will introduce unnecessary difficulties in mapping these header fields to ECS.
  • Related to this, trying to enforce any name change on headers may create situations where one legit header overrides another. It's a bit contrived, but an HTTP request/response could very well have both a custom "content_type" header as well as the standard "Content-Type" header...

Which headers to support

  • As a schema, ECS could simply provide guidance on how and where to properly store header fields, as there are too many possibilities to list them all.
    • This would ensure users are able to capture any header they care about.
    • There may of course be additional guidance on which ones users/integrations should try to capture, to establish a baseline of what's most useful / commonly seen.
  • Or perhaps we don't care about having support for arbitrary headers, and we'd prefer to define a specific list of a few supported headers in ECS?
    • If that's the case, then the problem of headers overwriting one another goes away.

What do you think @ruflin @MikePaquette?

Personally, I would be inclined to take the pragmatic route and make an exception on capitalization for HTTP headers. This makes sense to me because of the arbitrary and unknowable number of headers that can be important to people, and the simplicity of implementation. I would actually go with what APM does pretty much as is.

cc @simitt @graphaelli

@MikePaquette
Contributor

@webmat I agree with the approach APM has taken, and think the exception to the capitalization guideline is warranted in protocol-specific "extended" fields like this.

👍

@ruflin
Member

ruflin commented Sep 24, 2019

I'm also good with this approach, especially as we don't define these fields in ECS but only provide the "box" for it.

@simitt
Contributor

simitt commented Sep 24, 2019

Actually the APM server canonicalizes the headers. In case multiple headers are sent up with the same header name but different casing, the values would end up in the same header field.

E.g., an event with the following headers: { "response": { "headers": { "content-type": "foo", "Content-Type": "bar", "MyHeader": "abc", "myheader": "xyz" }}}

is stored in ES as

{ 
  "http" : {
    "response" : {
      "headers" : {
        "Myheader" : [
          "abc",
          "xyz"
        ],
        "Content-Type" : [
          "foo",
          "bar"
        ]
      }
    }
  }
}
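
The merge shown above can be approximated with a short Python sketch. It is a simplified imitation of Go-style header canonicalization (first letter and each letter after a hyphen uppercased, the rest lowercased), not APM Server's actual code; the function names `canonical_header_name` and `canonicalize` are made up for illustration.

```python
def canonical_header_name(name: str) -> str:
    """Approximate the canonical form: 'content-type' -> 'Content-Type'."""
    return "-".join(part[:1].upper() + part[1:].lower() for part in name.split("-"))


def canonicalize(headers: dict) -> dict:
    """Merge headers whose names differ only by case into one list of values."""
    merged = {}
    for name, value in headers.items():
        merged.setdefault(canonical_header_name(name), []).append(value)
    return merged


# The example event from the comment above.
event_headers = {
    "content-type": "foo",
    "Content-Type": "bar",
    "MyHeader": "abc",
    "myheader": "xyz",
}
stored = canonicalize(event_headers)
print(stored)
```

Values keep their arrival order within each merged key, so nothing is overwritten, but the original spelling of each header name is lost.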

@webmat
Contributor

webmat commented Sep 24, 2019

@simitt Thanks for adding this bit of context. This may affect the approach we take.

I'm curious, do agents and servers typically respect non-canonical headers, when they correspond to a canonical header?

In other words, if someone specifies "content-type" without the caps, would a web server respect this as if the agent had specified "Content-Type"?

Actually the underlying question is: is this the reason to canonicalize the headers? If not, why do so?

@dcode
Contributor

dcode commented Sep 24, 2019

@webmat per HTTP/1.1 specification on header fields (RFC 7230):

consists of a case-insensitive field name followed
by a colon (":"), optional leading whitespace, the field value, and
optional trailing whitespace.

Specifically from a security analyst perspective, seeing differences in casing is informative. Some malware will actually transform header fields with random capitalization to evade detection from simple text-matching signatures. Additionally, some malware will send two copies of the same header key with a different value for detection evasion. A naïve logging solution will overwrite the first value with the contents of the second value, while the application will likely follow the branch of the first value.

From an APM perspective, I could see canonicalizing the headers as useful, since how your software responds to a given request should be irrelevant to the case in which it is formatted.

All that said, having a standard "box" in which to place client and server headers makes the most sense to me. Allow for a list of key-value pairs that are indexed and searchable, but needn't be defined by the schema, in the same way that HTTP no longer defines what specific headers must be sent. Additionally, this "box" should allow for a list of non-unique key-value pairs, or keys without values. Could also just be a list of strings 🤷‍♂
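
A minimal Python sketch of the difference between a naive key-value log and such a "box" of non-unique pairs. The header values are hypothetical, and `repeated_headers` is an illustrative helper rather than part of any existing tool.

```python
from collections import defaultdict

# Hypothetical headers as they appeared on the wire: order, casing,
# duplicates, and an empty value are all preserved in a list of pairs.
raw_headers = [
    ("Host", "example.com"),
    ("cOnTeNt-TyPe", "text/plain"),
    ("Content-Type", "application/octet-stream"),
    ("X-Flag", ""),
]

# A naive logger keyed on the lowercased name: the last value silently wins.
naive = {name.lower(): value for name, value in raw_headers}


def repeated_headers(pairs):
    """Group by case-insensitive name and keep names seen more than once."""
    groups = defaultdict(list)
    for name, value in pairs:
        groups[name.lower()].append((name, value))
    return {key: seen for key, seen in groups.items() if len(seen) > 1}


print(naive["content-type"])
print(repeated_headers(raw_headers))
```

With the list of pairs, an analyst can still see both spellings and both values of the repeated header; the naive dictionary only retains the second one.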

One other item to note is on content-type. Zeek, in particular, records orig_mime_types and resp_mime_types, which capture the detected MIME types of one or more files that are transferred by the originator and responder. This is distinct from what appears in the request or response headers, which could also be logged by a couple of common Zeek scripts circulating in the community.

The Zeek detected mime types could be stored in a file attribute, which follows the Zeek logic since that's also in file.mime_type. However, detecting a difference in the declared content type versus the actual file contents would not be directly possible with a query if it was nested in a list of headers, I think.
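
To make the comparison concrete, here is a small Python sketch of the check an analyst would want to run. It assumes the declared content type and the detected MIME type are both available as dedicated values on the event; the key names `declared_content_type` and `detected_mime_type` and the sample values are hypothetical, not ECS or Zeek field names.

```python
def declared_media_type(content_type_header: str) -> str:
    """Drop parameters such as '; charset=utf-8' and normalize case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def mime_mismatch(event: dict) -> bool:
    """True when the declared type differs from the detected one."""
    declared = declared_media_type(event["declared_content_type"])
    detected = event["detected_mime_type"].strip().lower()
    return declared != detected


suspicious = {
    "declared_content_type": "text/plain; charset=utf-8",
    "detected_mime_type": "application/x-dosexec",
}
benign = {
    "declared_content_type": "Text/HTML; charset=utf-8",
    "detected_mime_type": "text/html",
}
print(mime_mismatch(suspicious), mime_mismatch(benign))
```

The check is only this simple when both values live in known, single-valued fields, which is the argument for giving content-type a dedicated field.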

In general, I think http.request.headers and http.response.headers make sense, but content-type might be a special case like host that warrants a dedicated field.

@simitt
Contributor

simitt commented Sep 25, 2019

@webmat the main reason was to make it easier to search for the headers (if users make them searchable). It was also a side effect of using the standard Go http.Header functionality, which was used to collect headers (server side) and ensure none of the values are overwritten.

@ruflin
Member

ruflin commented Sep 25, 2019

I wonder if we need to cover both cases with the same field. Could we follow APM here but also allow an option to store the raw header blob, which could then be looked at if there are odd things around capitalization etc.?

@graphaelli
Member

the main reason was to make it easier to search for the headers

Expanding a bit more on that: HTTP headers are case-insensitive, while ES object keys are case-sensitive. APM users expect them to be stored in the same field, so here we are.

Specifically from a security analyst perspective, seeing differences in casing is informative.

Makes sense. As has been stated a few times, this can be left to the implementer. For APM, we'll canonicalize; for some logs, that may not be appropriate if they're a main source of security data.

Note that intermediate proxies and the like are free to change the case of headers, eg linkerd does this.

Back to the original question, there are some headers that are so useful that extracting the information into dedicated fields is handy, #232 has some examples of using existing ECS outside of http fields for this.

@webmat
Contributor

webmat commented Sep 25, 2019

intermediate proxies and the like are free to change the case of headers, eg linkerd does this

I didn't realize that. Thanks for the added context. Found this informative discussion thanks to this input. This discussion led me to the HTTP/2 spec, which states that headers must be lowercased (see section 8.1.2)...

es object keys are case-sensitive

I think this is partially true. Not sure if this is an edge case, actually. A quick experiment on 7.3.1 shows that key names are case insensitive for aggs, but not for searches 🤔

PUT cap-diff/_doc/1
{ "method": "get" }
PUT cap-diff/_doc/2
{ "METHOD": "get" }

GET cap-diff/_search?q=method.keyword:*
# 1 hit

GET cap-diff/_search
{ "aggs" : { "methods" : { "terms" : { "field" : "method.keyword" } } } }
# 2 hits

The resulting mapping contains both the method and METHOD fields.

I think ECS could take the following stance:

  • Headers are recorded under http.request.headers.* and http.response.headers.*
  • Encourage, but do not require, implementations to lowercase the header names. This is the direction the world is going :-)
  • If a header is passed multiple times, an array of its values is stored under the key name

This still leaves the question of how / whether to index these values. By default, neither headers section should be indexed as a whole. Not only because of the performance hit, but also because it opens an attack vector, where any header passed by an agent becomes an entry in the mapping.

Given the point about mappings, is it possible to not index headers.*, but selectively override and allow only the most useful headers (e.g. headers.content-type) to be indexed as keyword?
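
One possible shape for this, sketched in Python as a plain dictionary: Elasticsearch's `dynamic` mapping parameter can be set to `false` on an object, so that unknown sub-fields stay in `_source` but are neither indexed nor added to the mapping, while explicitly listed properties are still indexed. The helper name `headers_mapping` and the choice of `content-type` as the only indexed header are assumptions for illustration, and this has not been tried against 7.3.1.

```python
import json


def headers_mapping(indexed_headers):
    """Object mapping that ignores unknown headers but indexes a chosen few."""
    return {
        "type": "object",
        "dynamic": False,  # unknown headers are kept in _source only
        "properties": {name: {"type": "keyword"} for name in indexed_headers},
    }


# Hypothetical allowlist of headers worth indexing.
useful = ["content-type"]

mapping = {
    "mappings": {
        "properties": {
            "http": {
                "properties": {
                    "request": {"properties": {"headers": headers_mapping(useful)}},
                    "response": {"properties": {"headers": headers_mapping(useful)}},
                }
            }
        }
    }
}

# Body that could be sent when creating an index.
body = json.dumps(mapping, indent=2)
print(body)
```

Because unlisted headers never reach the mapping, an agent sending arbitrary header names cannot grow it.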

@webmat
Contributor

webmat commented Sep 25, 2019

Or perhaps the simple answer is as @ruflin describes, and have a single raw text field for "all the headers", then a curated place for people to extract their most useful headers.
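
A rough Python sketch of that idea: keep the header block untouched as one raw string, and copy only a curated set of headers into a structured place with lowercased names and arrays of values. The `CURATED` allowlist and the output keys `raw` and `extracted` are hypothetical names, not proposed ECS fields.

```python
# Hypothetical allowlist of headers considered worth extracting.
CURATED = {"content-type", "user-agent"}


def split_headers(raw_block: str) -> dict:
    """Keep the raw text as-is and extract curated headers by lowercase name."""
    extracted = {}
    for line in raw_block.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line
        key = name.strip().lower()
        if key in CURATED:
            extracted.setdefault(key, []).append(value.strip())
    return {"raw": raw_block, "extracted": extracted}


raw = (
    "Host: example.com\r\n"
    "Content-Type: text/html\r\n"
    "content-type: application/json\r\n"
    "X-Custom: 1"
)
result = split_headers(raw)
print(result["extracted"])
```

Original casing and duplicates remain visible in the raw string, while the extracted part stays small and predictable.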

This wouldn't directly solve @dcode's point about "what the headers say" vs. "what the payload contains". But I think this one can be solved by custom fields for now. I don't think it's a feature/capability that's widespread enough to warrant support directly in ECS.

@neu5ron

neu5ron commented Sep 27, 2019

Plus 1 to content type header as mime_type, as @dcode has mentioned. It would be great if mime_type were a nested field, as I believe the MIME type will be useful in additional schemas such as file, http, and smtp, as well as for data sources like AV/sandbox, or any time magic headers come into play. That would then allow searching across all MIME types with *mime_type:$value

@webmat
Contributor

webmat commented Sep 30, 2019

@neu5ron Makes sense. Noted for later, as this isn't specifically about HTTP headers.

@coudenysj

Any news on the approach that will be used?

@github-actions

This PR is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale Stale issues and pull requests label Feb 24, 2022
Labels
discuss stale Stale issues and pull requests
10 participants