Adds a search field set to represent search events #729

joshdevins · 2020-01-24T10:58:57Z

Search events (query, page, click) share a common set of fields that are
useful to collect for creating metrics. Search metrics can be
calculated with transforms based on this schema. We expect applications
to emit these events, not Elasticsearch itself (for now).

This change is part of a larger project to collect search events and
calculate metrics in a standard way.

Closes: #666

joshdevins · 2020-01-24T10:59:50Z

I'm not sure where (or if) to make these changes, but I would also like to add search as a valid event.category paired with event.actions (or event.type): query, page and click. See inline description of the field set for more details.

joshdevins · 2020-02-10T10:53:20Z

Ping @webmat — can you review/comment or suggest another reviewer?

webmat · 2020-02-17T20:36:34Z

Hey @joshdevins, thanks for opening #666 and this PR. I love the idea of adding support for capturing search analytics in ECS.

I'll start with some high level comments, then suggest a few next steps.

High level comments

Actually, let's start at the highest level possible. How about a bit of philosophy 🙃 When you say "The alternative is to not use ECS at all"; I'd say the design of ECS is specifically to allow you to use the parts of ECS that already make sense, and doing anything else that's not supported yet via custom fields.

As an example, if using service.* and/or host.* makes sense for your use case, but search.* isn't defined in ECS yet, "using ECS" looks like this:

- populating host.* and service.* as ECS describes
- capturing all details of your search analytics in custom fields

Of course, when custom fields capture information that is broadly applicable beyond your project, you're more than welcome to propose adding them to ECS. As you're doing here 👍

When I say "capturing in custom fields", of course there's a risk of conflicting with a future version of ECS that may add the same fields. The next release of ECS lays out a few approaches to adding custom fields while reducing the risk of future conflicts. You can check it out here: Custom Fields in ECS.

So concretely, you could apply some of the thoughts in there and have your search analytics at projectname.search.* for now. This way when we settle on the approach to use in ECS, you'll have to migrate your fields; but you'll be able to do so with "legacy" fields and ECS fields at the same time in your events, which will make the transition smoother.

If your project follows the above recommendations, you're free to proceed right away with implementation. And we're free to take the time to discuss the right way to add this to ECS, without rushing.
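
As a rough illustration of the custom-field approach described above, here is a minimal Python sketch of an event that fills host.* and service.* as ECS describes and keeps the search details under a project-specific namespace. The function name, the `projectname` namespace, the sample values and the "search.query" action are assumptions made for this example; none of them is prescribed by ECS.

```python
import json


def build_search_event(query_id: str, query_value: str,
                       namespace: str = "projectname") -> dict:
    """Build an event mixing standard ECS fields with namespaced custom fields.

    The namespace and all sample values are hypothetical.
    """
    return {
        "@timestamp": "2020-02-17T20:36:34Z",
        # Standard ECS field sets, populated as ECS describes.
        "host": {"name": "web-01"},
        "service": {"name": "storefront"},
        # event.action has no prescribed values, so it can carry the event kind.
        "event": {"action": "search.query"},
        # Custom fields live under a project namespace to avoid clashing
        # with a future top-level search.* field set.
        namespace: {
            "search": {"query": {"id": query_id, "value": query_value}},
        },
    }


event = build_search_event("q-123", "red shoes")
print(json.dumps(event, indent=2))
```

Because the custom fields sit under their own namespace, a later migration could write both the namespaced fields and any future ECS fields into the same event during the transition.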

Adding this to ECS

Which takes us to how I would see adding search analytics in ECS :-)

- I like the search.* field set. I think it's a fundamental type of activity we can observe and analyze. The field set makes sense to me.
  - For now I wouldn't define an event category/type for it (or for clicks) just yet. Let's wait until we've fleshed this out and used it more.
  - If you need a way to filter for these events, you can use a mix of event.module, event.dataset, and event.action (e.g. "search.query", "search.results"). None of these fields have prescribed values, which frees you up to categorize your events with the content of these fields.
- Clicks are also a fundamental type of activity we may want to analyze. As such, I wouldn't nest it under search.*. I'd rather think about it as a full-fledged field set.
  - Doing it this way will let us design a field set that's useable in many kinds of user behaviour analysis. Heat maps (click coordinates), search analysis, email engagement, etc.
- Continuing on behavioural analysis, I would also hold off on adding anything about AB testing. Just like clicks, this is a fundamental concept we may want to define independently.
- Whenever an event has a duration it should be captured at event.duration, so I wouldn't add search.results.took to ECS.
- There's been a desire to add more support for metrics in ECS (Schema for metrics #474). Either via many specific fields, or general recommendations on "how to do it right". But we haven't been able to dedicate resources to take this to the finish line. So I'd say metrics that are very specific to a field set are 👍. But we don't have a general recommendation on custom metrics yet.

Next steps

My recommendations for a "search" field set:

- search.query.id
- search.query.value
- search.results.page: notice I'm suggesting under .results, was there a reason for it to be under .query initially?
- search.results.size
- search.results.total
- search.results.ids
- Please add a mention of tracking search durations at event.duration in the field set description
- Just like for field sets, fields also support a short and a description. Any time the field description gets a bit long, please add a short as well. It's used in places like the CSV, may be used in tooltips & so on. Check out everything that's supported in field definitions at schemas/README.md
- We recently added a way to mark fields that are expected to be arrays. Please add the following to the .results.ids field:

normalize:
  - array

I do see a need for a "click" field set eventually. But I think I'd wait until we have time to look at this kind of analytics in general in the context of ECS, to formulate a clearer plan. So my recommendation for clicks is custom fields all the way. They will inform our thinking, when we get around to that in ECS.

Conclusion

Please feel free to respond to any of my suggestions above. You have more context on the need than I do.

So clarifications or further discussion are welcome :-)

joshdevins · 2020-02-19T15:51:45Z

High level comments

Actually, let's start at the highest level possible. How about a bit of philosophy

Always appreciated! I'm new to the ECS landscape.

When you say "The alternative is to not use ECS at all"; I'd say the design of ECS is specifically to allow you to use the parts of ECS that already make sense, and doing anything else that's not supported yet via custom fields.

This makes a lot of sense to me and I'm happy to keep things as custom fields until we have a clear way forward to inclusion in ECS (as with clicks).

The part I struggle a bit with is understanding how best to advise or implement the mapping then. Do solutions provide a mapping file that is based on ECS but includes the custom fields they want? Or do they rely on ECS mappings and dynamic mappings for the custom fields?

Adding this to ECS

For now I wouldn't define an event category/type for it

That's fair, it was the first place I saw that looked like it made sense. I will probably try using event.action as it seems to fit the best (I'll validate this works in my test cases, but it should be fine).

I would also hold off on adding anything about AB testing.

Agreed. I had included it in the issue description as an example but removed it when I made the PR. I have been putting these values into labels for now since they are just key/value keyword types.

Next steps

These sound good — I'll make the recommended changes/additions, including removing click fields from this PR 👍🏼

search.results.page: notice I'm suggesting under .results, was there a reason for it to be under .query initially?

The page is a parameter of the query and not the response, so I thought it made logical sense to group it with those fields. For example, you would "ask for" page 1 or 2 of a search results page by setting the limit and offset in the Query DSL. The response is then the results for that requested page.
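
To make the relationship between a requested page and the limit/offset concrete, here is a minimal Python sketch. It assumes 1-based page numbers and a fixed page size, and maps them to the `from` and `size` parameters of an Elasticsearch search request; the function name is hypothetical.

```python
def page_to_from_size(page: int, page_size: int) -> dict:
    """Translate a 1-based page number into Query DSL `from`/`size` values."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    # Page 1 starts at offset 0, page 2 at offset page_size, and so on.
    return {"from": (page - 1) * page_size, "size": page_size}


first_page = page_to_from_size(1, 10)
second_page = page_to_from_size(2, 10)
print(first_page, second_page)
```

The page number is recoverable from the offset only when the page size is constant, which is one reason an offset and a page are not interchangeable.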

I'm reviewing the metrics I want to provide based on page and I'm thinking that I might actually drop this completely for now. We can use a custom field (😉) for now and add it later when/if it becomes more standard.

I do see a need for a "click" field set eventually. But I think I'd wait until we have time to look at this kind of analytics in general in the context of ECS...

This makes sense to me, and I want to add some context that may help in this PR discussion or in future discussions about clicks/behavioural events. In the case of search metrics, the proposed search.* (without click) is only useful up to a point. You can calculate some very basic metrics, but nothing that really tells you anything useful about search relevance. This is fine, and we will collect clicks with the help of a custom field set for now, but the trick here is that the two need to work in concert in order to generate useful metrics. When a click event occurs and is captured, it needs to have a search.query.id in it, so that we can group the clicks together with the original query event. From this, we calculate search relevance metrics. There's nothing in ECS that prevents us from having clicks using custom fields, but I'm not sure how well this will sit beside other types of behavioural events, if this generalizes in some way. This is the main rationale for also putting click fields under search: they kind of don't mean much on their own 😄.
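
The grouping described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the transform used in the project: events are flat dictionaries, the "search.query" and "search.click" action values are example names, and `projectname.click.result.rank` is a hypothetical custom field. The metric shown is a simple click-through rate, the share of queries that received at least one click.

```python
from collections import defaultdict


def click_through_rate(events):
    """Return the fraction of query events that have at least one click.

    Clicks are joined to queries through the shared search.query.id value.
    """
    queries = set()
    clicks_by_query = defaultdict(list)
    for event in events:
        query_id = event["search.query.id"]
        action = event["event.action"]
        if action == "search.query":
            queries.add(query_id)
        elif action == "search.click":
            clicks_by_query[query_id].append(event)
    if not queries:
        return 0.0
    clicked = sum(1 for query_id in queries if query_id in clicks_by_query)
    return clicked / len(queries)


events = [
    {"event.action": "search.query", "search.query.id": "q1"},
    {"event.action": "search.query", "search.query.id": "q2"},
    # Hypothetical custom click field; only the query ID is needed for the join.
    {"event.action": "search.click", "search.query.id": "q1",
     "projectname.click.result.rank": 3},
]
ctr = click_through_rate(events)
print(ctr)
```

Without search.query.id on the click event there is nothing to join on, which is the dependency between the two event types that the comment points out.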

joshdevins · 2020-02-20T09:29:13Z

I'm just looking at moving click fields out and the other thing that strikes me is that all the fields of click are specific to search: result.id, result.rank. So again, it seems to me that it's so specific to search metrics that I don't know if it generalizes and should be its own top-level field set.

I think the metrics I'm talking about are not the same as collecting events with metrics in them already (as in #474). We will be taking all these events and doing a transform into another (non-ECS) index. The metrics are then partially stored there and partially calculated in aggregations to show in Kibana. So what we want to collect in click events aren't metrics yet.

joshdevins · 2020-02-20T10:13:54Z

I also recognize that some of the limits on keyword fields will have to go. We don't want to skip storing events or fields if they happen to have long query values, for example. I'll be updating that as well. Happy to hear your thoughts on that change. I know 1024 is the default value if no explicit ignore_above is provided, but is there a way to not set ignore_above for keywords?

joshdevins · 2020-02-20T11:37:05Z

@webmat Have a look again. It's a much smaller diff this time since I've removed all the click subset for now. Have a look at the recent comments as well if you could.

MikePaquette

@joshdevins sorry for the delay on this review.
I am in favor of adding the search.* field set, but have these few questions/comments:

- The description suggests optionally populating the ECS source.* fields with context such as user or geo information. Do you expect to commonly have information about the source of the search? For example, user.id does not need to be nested under source.*, and could just be populated as top level user.id. Where would the geo information come from?
- I think results should be singular result, even though it is an array.
- I think .ids should be singular .id, even though it is an array.
- Are we agreed upon the datatype for search.query.value being keyword?

joshdevins · 2020-03-04T15:45:44Z

@MikePaquette thanks for having a look, hoping to get this in for 1.5.0 release.

Re singular vs plural: I'm pretty indifferent and am happy to comply with the norms of ECS.

Do you expect to commonly have information about the source of the search? For example, user.id does not need to be nested under source.*, and could just be populated as top level user.id.

I would expect that most apps/sites with logged-in or anonymous users will have that as part of the search context, and it's possibly useful to log. It's really up to the business to decide if it's useful to log user IDs; there is no prerequisite to doing so. I can imagine also just logging a hash so you can do simple metrics like calculating unique search users, for example. There is also no prerequisite to where that information is logged in the schema, so I can also elide it from the comments and let people decide. I liked it in source since it sits beside other user context like geo.

Where would the geo information come from?

Again there is no prerequisite here but it's very common to capture user context information in a search query. The geo information usually comes from geo-IP services that are in a layer above search, or can be from a user profile. Geo information can be important to understand search performance across a range of locations. For example, do my users in North America (and English speaking?) have a better experience than users in CJK language countries? More granular geo information (cities) might be less useful of course. I don't want to put any restrictions on what people log, but being able to view search performance from various angles is what I'm after, and geo is one of those, so I included it in demos and in the comments.

Are we agreed upon the datatype for search.query.value being keyword?

What alternative(s) do you imagine? It's a keyword right now because it needs to be logged as-is and un-analyzed so the user/business can decide how to normalize it for metrics (a stage later). There's no need at query logging time to normalize the query strings and they don't need to be searched over at the event-logging stage. In theory, someone could also log it separately as a custom text field if they had use-cases for that.

I'm confident that we should include the query value, but it's also possible that a search solution has no query string, or even multiple. I don't know what percentage that is, but I'd like to include it to enable things like per-query-string metrics, which are quite common.
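
As a small illustration of normalizing raw query strings at a later stage for per-query-string metrics, consider the Python sketch below. The normalization shown (lowercasing and collapsing whitespace) is only an assumed example; as noted above, each business would choose its own rules, and the function names are hypothetical.

```python
from collections import Counter


def normalize_query(raw: str) -> str:
    """Example normalization: lowercase and collapse runs of whitespace."""
    return " ".join(raw.lower().split())


def query_counts(raw_queries):
    """Count how often each normalized query string was searched."""
    return Counter(normalize_query(raw) for raw in raw_queries)


# The raw values are logged untouched; normalization happens afterwards.
logged = ["Red Shoes", "red  shoes ", "blue hat"]
counts = query_counts(logged)
print(counts.most_common())
```

Keeping the logged value unmodified means the normalization rules can change later without re-collecting events.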

MikePaquette · 2020-03-04T17:30:12Z

@joshdevins thanks for the replies. @webmat is trying to close 1.5.0 today, so I'm not sure we have time to get this finalized and included in 1.5.0.

joshdevins · 2020-03-05T08:05:29Z

@MikePaquette These seem like pretty small changes that can be made very easily, so I'm happy to jump on Slack/Zoom to just resolve these. I have two teams waiting on this to move forward with other work.

webmat

The part I struggle a bit with is understanding how best to advise or implement the mapping then. Do solutions provide a mapping file that is based on ECS but includes the custom fields they want? Or do they rely on ECS mappings and dynamic mappings for the custom fields?

ECS doesn't mandate anything there. Users who are fine with dynamic mappings are free to use the sample Elasticsearch templates provided in the repo directly (provided they adjust the template settings, which are geared towards experimentation, not production use). But my recommendation is for users to build their templates exactly how they need them, as usual. They should include all of the ECS field definitions they need; they're also free to omit ECS fields they will never use. A good compromise here is to make that decision at the field set level, not for every single field.

One of the ways users can build their templates based on ECS is to leverage the tooling in the repo. This isn't documented yet, but with Python 3:

python scripts/generator.py --help
usage: generator.py [-h] [--intermediate-only]
                    [--include INCLUDE [INCLUDE ...]]
                    [--subset SUBSET [SUBSET ...]] [--out OUT]

optional arguments:
  -h, --help            show this help message and exit
  --intermediate-only   generate intermediary files only
  --include INCLUDE [INCLUDE ...]
                        include user specified directory of custom field
                        definitions
  --subset SUBSET [SUBSET ...]
                        render a subset of the schema
  --out OUT             directory to store the generated files

In short, you can curate your custom fields in YAML files elsewhere, then run something like this:

python scripts/generator.py --include ../myproject/fields/ --out ../myproject/ecs-artifacts/

To learn more about curating your own ECS-based artifacts, you can start here #497.

page is a parameter of the query and not the response

Ah yes, makes total sense. I agree with search.query.page if you'd like to add it back 👍

(about search and click events) the two need to work in concert

Totally fair. That's one of the challenges of ECS. There's always a temptation to nest things closely for a very precise use case. But that would lead to eventually having "click" details in many many places in ECS. The approach I think will work best in the long run is to define "click" on its own, and try to consider the various broad use cases where click events are tracked. Then each use case that correlates with clicks (search being one of them) leverages the subset of click.* that makes sense to it. Clicks in search results, emails, arbitrary web pages (thinking UX heat map) will have different needs, but generally the set of fields that make sense may end up being in a reasonable range, like 10-20. Then each of these kinds of tracking uses those that make sense.

Right now there isn't a good structure to clearly document these relationships between field sets, but using the main field set description like you do is a great start.

is there a way to not set ignore_above for keywords?

Not at this time. Raising it like you did is ok, though. Just to clarify, however, the keyword is optimized for exact matches and aggregations. Supporting 8K long values in there for this purpose seems excessive, no?

A document where this field goes over the ignore_above limit would still be indexed. But queries on the keyword field would simply not return the document. If we do a multi-field (see comments below), we could mitigate this, however.

Now for more mundane, but concrete feedback on the PR :-)

- Please enter a changelog
- Please wrap text in schemas/search.yml somewhere around 80-100 characters.
- See also a few comments below

schemas/search.yml

webmat · 2020-03-05T11:55:00Z

schemas/search.yml

+      ignore_above: 8191
+      short: The query string being searched on.
+      description: >
+        The query string being search on. This field is not analyzed and should not be pre-processed in any way in the event (e.g. normalization list lowercasing). This is useful for search use-cases that use a one-box style search interface. Other interfaces will have to rely on additional custom fields or labels to represent things like filters applied, extra parameters, user context, etc.


Suggested change

The query string being search on. This field is not analyzed and should not be pre-processed in any way in the event (e.g. normalization list lowercasing). This is useful for search use-cases that use a one-box style search interface. Other interfaces will have to rely on additional custom fields or labels to represent things like filters applied, extra parameters, user context, etc.

The query string being searched on. This field is not analyzed and should not be pre-processed in any way in the event (e.g. normalization list lowercasing). This is useful for search use-cases that use a one-box style search interface. Other interfaces will have to rely on additional custom fields or labels to represent things like filters applied, extra parameters, user context, etc.

I'm curious why you wouldn't want the field to be analyzed. I totally get that the source field in the document should be the unmodified search query. Someone investigating issues of their search solution will want this field untouched. 👍

But also having an analyzed field in the Elasticsearch mapping would be great as well, to let users investigate a subset of their search events.

So if you'd like to add an analyzed multi-field, you can do so with:

type: keyword
multi_fields:
  - type: text
    name: text

You'll notice that ECS follows the reverse convention vs Elasticsearch's dynamic mappings: the keyword field is always the canonical one, and the analyzed field is the nested one.

joshdevins · 2020-03-10

This is typically done after an aggregation on something like query string or query ID. You wouldn't tend to investigate search queries in their raw form but really only after some aggregation. The kind of normalization/analysis that you would do is also pretty custom to the business, I would argue. The standard analyzer is probably fine for most people of course, but many places will have custom stop words, etc. that they may want to also use to normalize their queries before analysis.

I'm not opposed to adding multi-fields here, but I'm not convinced it's necessary. I'd opt to hold off and add it later if there is a larger observed need. As you say, someone can always index the query in a custom text field. WDYT?

joshdevins · 2020-03-10T09:40:58Z

One of the ways users can build their templates based on ECS is to leverage the tooling in the repo.

This tooling looks great. Can we advise use of this or not yet?


webmat

This tooling looks great. Can we advise use of this or not yet?

Yeah I think we can, with the proper caveats in place. Not officially supported, and still missing some features (e.g. adjusting template settings). But I think it's useful already, and that it's a pretty good way to manage one's templates.

Only very minor adjustments left to do, see below. Thanks for the adjustments :-)

schemas/search.yml


webmat

LGTM


vbohata · 2020-10-29T21:00:28Z

I think there is one field missing here: search.query.offset which indicates the "record from" number which is not the same as the page.

vbohata · 2020-10-29T21:07:31Z

One more note about search.query.value. We have a use case where we need to do aggregations per search.query.value when analysing one of our apps. I do not know if you plan to make it searchable, but in our case we need it if possible. If not, there should be a .text alternative if the value is too long.

joshdevins · 2020-10-30T08:27:12Z

@vbohata Unfortunately it was decided to remove these fields from ECS before the 1.5 release (#812). We're still talking about bringing them back, but nothing concrete.

joshdevins added enhancement New feature or request discuss labels Jan 24, 2020

joshdevins requested a review from webmat January 24, 2020 10:58

joshdevins requested a review from MikePaquette January 24, 2020 11:07

joshdevins mentioned this pull request Feb 11, 2020

New top-level object: search #666

Closed

joshdevins added 1.5.0 and removed discuss labels Feb 27, 2020

joshdevins self-assigned this Feb 27, 2020

MikePaquette reviewed Mar 2, 2020

View reviewed changes

webmat reviewed Mar 6, 2020

View reviewed changes

joshdevins added 4 commits March 10, 2020 11:06

Fixes formatting and adds page back

5f24245

Newlines at 80 cols and add the page field back in.

Shortens the search query field length

46b2493

8k was probably excessive so we've shortened it to 4k.

Adds comment for search.* fields to next CHANGELOG

e32bc9a

Next CHANGELOG includes a reference to the new field set as well as the PR that introduced the change.

joshdevins added 1.6.0 and removed 1.5.0 labels Mar 10, 2020

joshdevins requested a review from webmat March 10, 2020 11:10

Fixes example action names

7d053ad

They should be prefixed to make it obvious they are search actions.

webmat reviewed Mar 10, 2020

View reviewed changes

schemas/search.yml Outdated Show resolved Hide resolved

schemas/search.yml Show resolved Hide resolved

schemas/search.yml Outdated Show resolved Hide resolved

schemas/search.yml Outdated Show resolved Hide resolved

joshdevins and others added 4 commits March 10, 2020 17:27

Fixes spelling error

3ddf0cd

Co-Authored-By: Mathieu Martin <[email protected]>

Makes event.action naming explicitly examples

a8c2130

We want to make it clear that these are examples of usage only and not required. Co-Authored-By: Mathieu Martin <[email protected]>

Clarifies which timestamp/date field to use

c8fa34d

@timestamp corresponds to the timestamp of the actual event at source while event.created is when the first agent picks up the event.

Adds missing generated files back

0e4026f

webmat approved these changes Mar 11, 2020

View reviewed changes

webmat merged commit 142ab78 into elastic:master Mar 11, 2020

joshdevins deleted the adds-search-fields branch March 12, 2020 07:47

joshdevins mentioned this pull request Apr 9, 2020

Reverts addition of search field set #812

Merged

dcode pushed a commit to dcode/ecs that referenced this pull request Apr 15, 2020

Add a search field set to represent search events and pagination thro…

fa7bd20

…ugh results (elastic#729)

ebeahan removed the 1.6.0 label Aug 4, 2020
