Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a search field set to represent search events #729

Merged
merged 9 commits into from
Mar 11, 2020
Merged

Adds a search field set to represent search events #729

merged 9 commits into from
Mar 11, 2020

Conversation

joshdevins
Copy link
Member

@joshdevins joshdevins commented Jan 24, 2020

Search events (query, page, click) have a common set of fields that are
useful to collect to create metrics from. Search metrics can be
calculated with transforms based on this schema. We expect applications
to emit these events, not Elasticsearch itself (for now).

This change is part of a larger project to collect search events and
calculate metrics in a standard way.

Closes: #666

@joshdevins joshdevins added enhancement New feature or request discuss labels Jan 24, 2020
@joshdevins
Copy link
Member Author

joshdevins commented Jan 24, 2020

I'm not sure where (or if) to make these changes, but I would also like to add search as a valid event.category paired with event.actions (or event.type): query, page and click. See inline description of the field set for more details.

@joshdevins
Copy link
Member Author

Ping @webmat — can you review/comment or suggest another reviewer?

@webmat
Copy link
Contributor

webmat commented Feb 17, 2020

Hey @joshdevins, thanks for opening #666 and this PR. I love the idea of adding support for capturing search analytics in ECS.

I'll start with some high level comments, then suggest a few next steps.

High level comments

Actually, let's start at the highest level possible. How about a bit of philosophy 🙃 When you say "The alternative is to not use ECS at all"; I'd say the design of ECS is specifically to allow you to use the parts of ECS that already make sense, and doing anything else that's not supported yet via custom fields.

As an example, if using service.* and/or host.* makes sense for your use case, but search.* isn't defined in ECS yet, "using ECS" looks like this:

  1. populating host.* and service.* as ECS describes
  2. capturing all details of your search analytics in custom fields

Of course when custom fields capture information that are broadly applicable elsewhere than your project, you're more than welcome to propose adding them to ECS. As you're doing here 👍

When I say "capturing in custom fields", of course there's a risk of conflicting with a future version of ECS that may add the same fields. The next release of ECS lays out a few approaches to adding custom fields while reducing the risk of future conflicts. You can check it out here: Custom Fields in ECS.

So concretely, you could apply some of the thoughts in there and have your search analytics at projectname.search.* for now. This way when we settle on the approach to use in ECS, you'll have to migrate your fields; but you'll be able to do so with "legacy" fields and ECS fields at the same time in your events, which will make the transition smoother.

If your project follows the above recommendations, you're free to proceed right away with implementation. And we're free to take the time to discuss the right way to add this to ECS, without rushing.

Adding this to ECS

Which takes us to how I would see adding search analytics in ECS :-)

  • I like the search.* field set. I think it's a fundamental type of activity we can observe and analyze. The field set makes sense to me.
    • For now I wouldn't define an event category/type for it (or for clicks) just yet. Let's wait until we've fleshed this out and used it more.
    • If you need a way to filter for these events, you can use a mix of event.module, event.dataset, and event.action (e.g. "search.query", "search.results"). None of these fields have prescribed values, which frees you up to categorize your events with the content of these fields.
  • Clicks are also a fundamental type of activity we may want to analyze. As such, I wouldn't nest it under search.*. I'd rather think about it as a full-fledged field set.
    • Doing it this way will let us design a field set that's useable in many kinds of user behaviour analysis. Heat maps (click coordinates), search analysis, email engagement, etc.
  • Continuing on behavioural analysis, I would also hold off on adding anything about AB testing. Just like clicks, this is a fundamental concept we may want to define independently.
  • Whenever an event has a duration it should be captured at event.duration, so I wouldn't add search.results.took to ECS
  • There's been a desire to add more support for metrics in ECS (Schema for metrics #474). Either via many specific fields, or general recommendations on "how to do it right". But we haven't been able to dedicate resources to take this to the finish line. So I'd say metrics that are very specific to a field set are 👍. But we don't have a general recommendation on custom metrics yet.

Next steps

My recommendations for a "search" field set:

  • search.query.id
  • search.query.value
  • search.results.page: notice I'm suggesting under .results, was there a reason for it to be under .query initially?
  • search.results.size
  • search.results.total
  • search.results.ids
  • Please add a mention of tracking search durations at event.duration in the field set description
  • Just like for field sets, fields also support a short and a description. Any time the field description gets a bit long, please add a short as well. It's used in places like the CSV, may be used in tooltips & so on. Check out everything that's supported in field definitions at schemas/README.md
  • We recently added a way to mark fields that are expected to be arrays. Please add the following to the .results.ids field:
normalize:
  - array

I do see a need for a "click" field set eventually. But I think I'd wait until we have time to look at this kind of analytics in general in the context of ECS, to formulate a clearer plan. So my recommendation for clicks is custom fields all the way. They will inform our thinking, when we get around to that in ECS.

Conclusion

Please feel free to respond to any of my suggestions above. You have more context on the need than I do.

So clarifications or further discussion are welcome :-)

@joshdevins
Copy link
Member Author

High level comments

Actually, let's start at the highest level possible. How about a bit of philosophy

Always appreciated! I'm new to the ECS landscape.

When you say "The alternative is to not use ECS at all"; I'd say the design of ECS is specifically to allow you to use the parts of ECS that already make sense, and doing anything else that's not supported yet via custom fields.

This makes a lot of sense to me and I'm happy to keep things as custom fields until we have a clear way forward to inclusion in ECS (as with clicks).

The part I struggle a bit with is understanding how best to advise or implement the mapping then. Do solutions provide a mapping file that is based on ECS but includes the custom fields they want? Or do they rely on ECS mappings and dynamic mappings for the custom fields?

Adding this to ECS

For now I wouldn't define an event category/type for it

That's fair, it was the first place I saw that looked like it made sense. I will probably try using event.action as it seems to fit the best (I'll validate this works in my test cases, but it should be fine).

I would also hold off on adding anything about AB testing.

Agreed. I had included it in the issue description as an example but removed it when I made the PR. I have been putting these values into labels for now since they are just key/value keyword types.

Next steps

These sound good — I'll make the recommended changes/additions, including removing click fields from this PR 👍🏼

search.results.page: notice I'm suggesting under .results, was there a reason for it to be under .query initially?

The page is a parameter of the query and not the response, so I thought it made logical sense to group it with those fields. For example, you would "ask for" page 1 or 2 of a search results page by setting the limit and offset in the Query DSL. The response is then the results for that requested page.

I'm reviewing the metrics I want to provide based on page and I'm thinking that I might actually drop this completely for now. We can use a custom field (😉) for now and add it later when/if it becomes more standard.

I do see a need for a "click" field set eventually. But I think I'd wait until we have time to look at this kind of analytics in general in the context of ECS...

This makes sense to me, and I want to add some context that maybe helps in this PR discussion or for future discussions about clicks/behavioural events. In the case of search metrics, the proposed search.* (without click) is only useful up to a point. You can calculate some very basic metrics, but nothing that really tells you anything useful about search relevance. This is fine, and we will collect clicks with the help of a custom field set for now, but the trick here is that the two need to work in concert, in order to generate useful metrics. When a click event ocurrs and is captured, it needs to have a search.query.id in it, so that we can group the clicks together with the original query event. From this, we calculate search relevance metrics. There's nothing in ECS that prevents us from having clicks using custom fields, but I'm not sure how well this will sit beside other types of behavioural events, if this generalizes in some way. This is the main rationale to put click fields also under search — they kind of don't mean much on their own 😄.

@joshdevins
Copy link
Member Author

joshdevins commented Feb 20, 2020

I'm just looking at moving click fields out and the other thing that strikes me is that all the fields of click are specific to search: result.id, result.rank. So again, it seems to me that it's so specific to search metrics, I don't know if it generalizes and should be it's own top-level field set.

I think the metrics I'm talking about are not the same as collecting events with metrics in them already (as in #474). We will be taking all these events and doing a transform into another (non-ECS) index. The metrics are then partially stored there and partially calculated in aggregations to show in Kibana. So what we want to collect in click events aren't metrics yet.

@joshdevins
Copy link
Member Author

joshdevins commented Feb 20, 2020

I also recognize that some of the limits on keyword fields will have to go. We don't want to skip storing events or fields if they happen to have long query values, for example. I'll be updating that as well. Happy to hear your thoughts on that change. I know 1024 is the default value if no explicit ignore_above is provided, but is there a way to not set ignore_above for keywords?

@joshdevins
Copy link
Member Author

joshdevins commented Feb 20, 2020

@webmat Have a look again. It's a much smaller diff this time since I've removed all the click subset for now. Have a look at the recent comments as well if you could.

@joshdevins joshdevins added 1.5.0 and removed discuss labels Feb 27, 2020
@joshdevins joshdevins self-assigned this Feb 27, 2020
Copy link
Contributor

@MikePaquette MikePaquette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshdevins sorry for the delay on this review.
I am in favor of adding the search.* field set, but have these few questions/comments:

  • The description suggests optionally populating the ECS source.* fields with context such as user or geo information. Do you expect to commonly have information about the source of the search? For example, user.id does not need to be nested under source.*, and could just be populated as top level user.id. Where would the geo information come from?
  • I think results should be singular result, even though it is an array.
  • I think .ids should be singluar .id, even thought it is an array.
  • Are we agreed upon the datatype for search.query.value being keyword?

@joshdevins
Copy link
Member Author

joshdevins commented Mar 4, 2020

@MikePaquette thanks for having a look, hoping to get this in for 1.5.0 release.

Re singular vs plural: I'm pretty indifferent and am happy to comply with the norms of ECS.

Do you expect to commonly have information about the source of the search? For example, user.id does not need to be nested under source.*, and could just be populated as top level user.id.

I would expect that most apps/sites with logged-in or anonymous users will have that as part of the search context, and it's possibly useful to log. It's really up to the business to decide if it's useful to log user ID's — there is no prerequisite to doing so. I can imagine also just logging a hash so you can do simple metrics like calculating unique search users, for example. There is also no prerequisite to where that information is logged in the schema, so I can also elide it from the comments and let people decide. I liked it in source since it sits beside other user context like geo.

Where would the geo information come from?

Again there is no prerequisite here but it's very common to capture user context information in a search query. The geo information usually comes from geo-IP services that are in a layer above search, or can be from a user profile. Geo information can be important to understand search performance across a range of locations. For example, do my users in North America (and English speaking?) have a better experience than users in CJK language countries? More granular geo information (cities) might be less useful of course. I don't want to put any restrictions on what people log, but being able to view search performance from various angles is what I'm after, and geo is one of those, so I included it in demos and in the comments.

Are we agreed upon the datatype for search.query.value being keyword?

What alternative(s) do you imagine? It's a keyword right now because it needs to be logged as-is and un-analyzed so the user/business can decide how to normalize it for metrics (a stage later). There's no need at query logging time to normalize the query strings and they don't need to be searched over at the event-logging stage. In theory, someone could also log it separately as a custom text field if they had use-cases for that.

I'm confident that we should include query value but it's also possible that a search solution has no query string or multiple even. I don't know what percentage though, but I'd like to include it to enable things like per-query-string metrics which is quite common.

@MikePaquette
Copy link
Contributor

@joshdevins thanks for the replies. @webmat is trying to close 1.5.0 today, so I'm not sure we have time to get this finalized and included in 1.5.0.

@joshdevins
Copy link
Member Author

@MikePaquette These are seem like pretty small changes that can be made very easily, so I'm happy to jump on Slack/Zoom to just resolve these. I have two teams waiting on this to move forward with other work.

Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part I struggle a bit with is understanding how best to advise or implement the mapping then. Do solutions provide a mapping file that is based on ECS but includes the custom fields they want? Or do they rely on ECS mappings and dynamic mappings for the custom fields?

ECS doesn't mandate anything there. Users who are fine with dynamic mappings are free to use the sample Elasticsearch templates provided in the repo directly (provided they adjust the template settings, which are geared towards experimentation, not production use). But my recommendation is for users to build their templates exactly how they need them, as usual. They should include all of the ECS field definitions they need; they're also free to omit ECS fields they will never use. A good compromise here is to make that decision at the field set level, not for every single field.

One of the ways users can build their templates based on ECS is to leverage the tooling in the repo. This isn't documented yet, but with Python 3:

python scripts/generator.py --help
usage: generator.py [-h] [--intermediate-only]
                    [--include INCLUDE [INCLUDE ...]]
                    [--subset SUBSET [SUBSET ...]] [--out OUT]

optional arguments:
  -h, --help            show this help message and exit
  --intermediate-only   generate intermediary files only
  --include INCLUDE [INCLUDE ...]
                        include user specified directory of custom field
                        definitions
  --subset SUBSET [SUBSET ...]
                        render a subset of the schema
  --out OUT             directory to store the generated files

In short, you can curate your custom fields in yaml files elsewhere, then run something to that effect:

python scripts/generator.py --include ../myproject/fields/ --out ../myproject/ecs-artifacts/

To learn more about curating your own ECS-based artifacts, you can start here #497.

page is a parameter of the query and not the response

Ah yes, makes total sense. I agree with search.query.page if you'd like to add it back 👍

(about search and click events) the two need to work in concert

Totally fair. That's one of the challenges of ECS. There's always a temptation to nest things closely for a very precise use case. But that would lead to eventually having "click" details in many many places in ECS. The approach I think will work best in the long run is to define "click" on its own, and try to consider the various broad use cases where click events are tracked. Then each use case (search being one of them) that correlates with clicks, leverage the subset of click.* that makes sense to them. Clicks in search results, emails, arbitrary web pages (thinking UX heat map) will have different needs, but generally the set of fields that make sense may end up being in a reasonable range, like 10-20. Then each of these kinds of tracking uses those that make sense.

Right now there isn't a good structure to clearly document these relationships between field sets, but using the main field set description like you do is a great start.

is there a way to not set ignore_above for keywords?

Not at this time. Raising it like you did is ok, though. Just to clarify, however, the keyword is optimized for exact matches and aggregations. Supporting 8K long values in there for this purpose seems excessive, no?

A document where this field goes over the ignore_above limit would still be indexed. But queries on the keyword field would simply not return the document. If we do a multi-field (see comments below), we could mitigate this, however.

Now for more mundane, but concrete feedback on the PR :-)

  • Please enter a changelog
  • Please wrap text in schemas/search.yml somewhere around 80-100 characters.
  • See also a few comments below

schemas/search.yml Outdated Show resolved Hide resolved
ignore_above: 8191
short: The query string being searched on.
description: >
The query string being search on. This field is not analyzed and should not be pre-processed in any way in the event (e.g. normalization list lowercasing). This is useful for search use-cases that use a one-box style search interface. Other interfaces will have to rely on additional custom fields or labels to represent things like filters applied, extra parameters, user context, etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The query string being search on. This field is not analyzed and should not be pre-processed in any way in the event (e.g. normalization list lowercasing). This is useful for search use-cases that use a one-box style search interface. Other interfaces will have to rely on additional custom fields or labels to represent things like filters applied, extra parameters, user context, etc.
The query string being searched on. This field is not analyzed and should not be pre-processed in any way in the event (e.g. normalization list lowercasing). This is useful for search use-cases that use a one-box style search interface. Other interfaces will have to rely on additional custom fields or labels to represent things like filters applied, extra parameters, user context, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why you wouldn't want the field to be analyzed. I totally get that the source field in the document should be the unmodified search query. Someone investigating issues of their search solution will want this field untouched. 👍

But also having an analyzed field in the Elasticsearch mapping would be great as well, to let users investigate a subset of their search events.

So if you'd like to add an analyzed multi-field, you can do so with:

      type: keyword
      multi_fields:
      - type: text
        name: text

You'll notice that ECS follows the reverse convention vs Elasticsearch' dynamic mappings: the keyword field is always the canonical one, and the analyzed field is the nested one.

Copy link
Member Author

@joshdevins joshdevins Mar 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is typically done after an aggregation on something like query string or query ID. You wouldn't tend to investigate search queries in their raw form but really only after some aggregation. The kind of normalization/analysis that you would do it also pretty custom to the business, I would argue. The standard analyzer is probably fine for most people of course, but many places will have custom stop words, etc. that they may want to also use to normalize their queries before analysis.

I'm not opposed to adding multi-fields here, but I'm not convinced it's necessary. I'd opt to hold off and add it later if there is a larger observed need. As you say, someone can always index the query in a custom text field. WDYT?

@joshdevins
Copy link
Member Author

One of the ways users can build their templates based on ECS is to leverage the tooling in the repo.

This tooling looks great. Can we advise use of this or not yet?

Search events (query, page, click) have a common set of fields that are
useful to collect to create metrics from. Search metrics can be
calculated with transforms based on this schema. We expect applications
to emit these events, not Elasticsearch itself (for now).

This change is part of a larger project to collect search events and
calculate metrics in a standard way.

Closes: #666
Newlines at 80 cols and add the page field back in.
8k was probably excessive so we've shortened it to 4k.
Next CHANGELOG includes a reference to the new field set as well as the
PR that introduced the change.
@joshdevins joshdevins added 1.6.0 and removed 1.5.0 labels Mar 10, 2020
@joshdevins joshdevins requested a review from webmat March 10, 2020 11:10
They should be prefixed to make it obvious they are search actions.
Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tooling looks great. Can we advise use of this or not yet?

Yeah I think we can, with the proper caveats in place. Not officially supported, and still missing some features (e.g. adjusting template settings). But I think it's useful already, and that it's a pretty good way to manage one's templates.

Only very minor adjustments left to do, see below. Thanks for the adjustments :-)

schemas/search.yml Outdated Show resolved Hide resolved
schemas/search.yml Show resolved Hide resolved
schemas/search.yml Outdated Show resolved Hide resolved
schemas/search.yml Outdated Show resolved Hide resolved
joshdevins and others added 4 commits March 10, 2020 17:27
Co-Authored-By: Mathieu Martin <[email protected]>
We want to make it clear that these are examples of usage only and not required.

Co-Authored-By: Mathieu Martin <[email protected]>
@timestamp corresponds to the timestamp of the actual event at source
while event.created is when the first agent picks up the event.
Copy link
Contributor

@webmat webmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@webmat webmat merged commit 142ab78 into elastic:master Mar 11, 2020
@joshdevins joshdevins deleted the adds-search-fields branch March 12, 2020 07:47
dcode pushed a commit to dcode/ecs that referenced this pull request Apr 15, 2020
@ebeahan ebeahan removed the 1.6.0 label Aug 4, 2020
@vbohata
Copy link

vbohata commented Oct 29, 2020

I think there is one field missing here: search.query.offset which indicates the "record from" number which is not the same as the page.

@vbohata
Copy link

vbohata commented Oct 29, 2020

One another note about search.query.value. We have a use case where we need to do aggregations per search.query.value during analysing one of our app. I do not know if you plan to make it searchable, but in our case we need it if possible. If not there should be .text alternative if the value is too long.

@joshdevins
Copy link
Member Author

@vbohata Unfortunately it was decided to remove these fields from ECS before the 1.5 release (#812). We're still talking about bringing them back, but nothing concrete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New top-level object: search
5 participants