-
Notifications
You must be signed in to change notification settings - Fork 418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a search field set to represent search events #729
Conversation
I'm not sure where (or if) to make these changes, but I would also like to add |
Ping @webmat — can you review/comment or suggest another reviewer? |
Hey @joshdevins, thanks for opening #666 and this PR. I love the idea of adding support for capturing search analytics in ECS. I'll start with some high level comments, then suggest a few next steps. High level commentsActually, let's start at the highest level possible. How about a bit of philosophy 🙃 When you say "The alternative is to not use ECS at all"; I'd say the design of ECS is specifically to allow you to use the parts of ECS that already make sense, and doing anything else that's not supported yet via custom fields. As an example, if using
Of course when custom fields capture information that are broadly applicable elsewhere than your project, you're more than welcome to propose adding them to ECS. As you're doing here 👍 When I say "capturing in custom fields", of course there's a risk of conflicting with a future version of ECS that may add the same fields. The next release of ECS lays out a few approaches to adding custom fields while reducing the risk of future conflicts. You can check it out here: Custom Fields in ECS. So concretely, you could apply some of the thoughts in there and have your search analytics at If your project follows the above recommendations, you're free to proceed right away with implementation. And we're free to take the time to discuss the right way to add this to ECS, without rushing. Adding this to ECSWhich takes us to how I would see adding search analytics in ECS :-)
Next stepsMy recommendations for a "search" field set:
I do see a need for a "click" field set eventually. But I think I'd wait until we have time to look at this kind of analytics in general in the context of ECS, to formulate a clearer plan. So my recommendation for clicks is custom fields all the way. They will inform our thinking, when we get around to that in ECS. ConclusionPlease feel free to respond to any of my suggestions above. You have more context on the need than I do. So clarifications or further discussion are welcome :-) |
Always appreciated! I'm new to the ECS landscape.
This makes a lot of sense to me and I'm happy to keep things as custom fields until we have a clear way forward to inclusion in ECS (as with clicks). The part I struggle a bit with is understanding how best to advise or implement the mapping then. Do solutions provide a mapping file that is based on ECS but includes the custom fields they want? Or do they rely on ECS mappings and dynamic mappings for the custom fields?
That's fair, it was the first place I saw that looked like it made sense. I will probably try using
Agreed. I had included it in the issue description as an example but removed it when I made the PR. I have been putting these values into
These sound good — I'll make the recommended changes/additions, including removing
The page is a parameter of the query and not the response, so I thought it made logical sense to group it with those fields. For example, you would "ask for" page 1 or 2 of a search results page by setting the I'm reviewing the metrics I want to provide based on
This makes sense to me, and I want to add some context that maybe helps in this PR discussion or for future discussions about clicks/behavioural events. In the case of search metrics, the proposed |
I'm just looking at moving I think the metrics I'm talking about are not the same as collecting events with metrics in them already (as in #474). We will be taking all these events and doing a transform into another (non-ECS) index. The metrics are then partially stored there and partially calculated in aggregations to show in Kibana. So what we want to collect in |
I also recognize that some of the limits on |
@webmat Have a look again. It's a much smaller diff this time since I've removed all the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joshdevins sorry for the delay on this review.
I am in favor of adding the search.* field set, but have these few questions/comments:
- The description suggests optionally populating the ECS
source.*
fields with context such as user or geo information. Do you expect to commonly have information about the source of the search? For example,user.id
does not need to be nested undersource.*
, and could just be populated as top leveluser.id
. Where would the geo information come from? - I think
results
should be singularresult
, even though it is an array. - I think
.ids
should be singluar.id
, even thought it is an array. - Are we agreed upon the datatype for
search.query.value
being keyword?
@MikePaquette thanks for having a look, hoping to get this in for 1.5.0 release. Re singular vs plural: I'm pretty indifferent and am happy to comply with the norms of ECS.
I would expect that most apps/sites with logged-in or anonymous users will have that as part of the search context, and it's possibly useful to log. It's really up to the business to decide if it's useful to log user ID's — there is no prerequisite to doing so. I can imagine also just logging a hash so you can do simple metrics like calculating unique search users, for example. There is also no prerequisite to where that information is logged in the schema, so I can also elide it from the comments and let people decide. I liked it in
Again there is no prerequisite here but it's very common to capture user context information in a search query. The geo information usually comes from geo-IP services that are in a layer above search, or can be from a user profile. Geo information can be important to understand search performance across a range of locations. For example, do my users in North America (and English speaking?) have a better experience than users in CJK language countries? More granular geo information (cities) might be less useful of course. I don't want to put any restrictions on what people log, but being able to view search performance from various angles is what I'm after, and geo is one of those, so I included it in demos and in the comments.
What alternative(s) do you imagine? It's a I'm confident that we should include query value but it's also possible that a search solution has no query string or multiple even. I don't know what percentage though, but I'd like to include it to enable things like per-query-string metrics which is quite common. |
@joshdevins thanks for the replies. @webmat is trying to close 1.5.0 today, so I'm not sure we have time to get this finalized and included in 1.5.0. |
@MikePaquette These are seem like pretty small changes that can be made very easily, so I'm happy to jump on Slack/Zoom to just resolve these. I have two teams waiting on this to move forward with other work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The part I struggle a bit with is understanding how best to advise or implement the mapping then. Do solutions provide a mapping file that is based on ECS but includes the custom fields they want? Or do they rely on ECS mappings and dynamic mappings for the custom fields?
ECS doesn't mandate anything there. Users who are fine with dynamic mappings are free to use the sample Elasticsearch templates provided in the repo directly (provided they adjust the template settings, which are geared towards experimentation, not production use). But my recommendation is for users to build their templates exactly how they need them, as usual. They should include all of the ECS field definitions they need; they're also free to omit ECS fields they will never use. A good compromise here is to make that decision at the field set level, not for every single field.
One of the ways users can build their templates based on ECS is to leverage the tooling in the repo. This isn't documented yet, but with Python 3:
python scripts/generator.py --help
usage: generator.py [-h] [--intermediate-only]
[--include INCLUDE [INCLUDE ...]]
[--subset SUBSET [SUBSET ...]] [--out OUT]
optional arguments:
-h, --help show this help message and exit
--intermediate-only generate intermediary files only
--include INCLUDE [INCLUDE ...]
include user specified directory of custom field
definitions
--subset SUBSET [SUBSET ...]
render a subset of the schema
--out OUT directory to store the generated files
In short, you can curate your custom fields in yaml files elsewhere, then run something to that effect:
python scripts/generator.py --include ../myproject/fields/ --out ../myproject/ecs-artifacts/
To learn more about curating your own ECS-based artifacts, you can start here #497.
page is a parameter of the query and not the response
Ah yes, makes total sense. I agree with search.query.page
if you'd like to add it back 👍
(about search and click events) the two need to work in concert
Totally fair. That's one of the challenges of ECS. There's always a temptation to nest things closely for a very precise use case. But that would lead to eventually having "click" details in many many places in ECS. The approach I think will work best in the long run is to define "click" on its own, and try to consider the various broad use cases where click events are tracked. Then each use case (search being one of them) that correlates with clicks, leverage the subset of click.*
that makes sense to them. Clicks in search results, emails, arbitrary web pages (thinking UX heat map) will have different needs, but generally the set of fields that make sense may end up being in a reasonable range, like 10-20. Then each of these kinds of tracking uses those that make sense.
Right now there isn't a good structure to clearly document these relationships between field sets, but using the main field set description like you do is a great start.
is there a way to not set
ignore_above
for keywords?
Not at this time. Raising it like you did is ok, though. Just to clarify, however, the keyword
is optimized for exact matches and aggregations. Supporting 8K long values in there for this purpose seems excessive, no?
A document where this field goes over the ignore_above limit would still be indexed. But queries on the keyword field would simply not return the document. If we do a multi-field (see comments below), we could mitigate this, however.
Now for more mundane, but concrete feedback on the PR :-)
- Please enter a changelog
- Please wrap text in
schemas/search.yml
somewhere around 80-100 characters. - See also a few comments below
schemas/search.yml
Outdated
ignore_above: 8191 | ||
short: The query string being searched on. | ||
description: > | ||
The query string being search on. This field is not analyzed and should not be pre-processed in any way in the event (e.g. normalization list lowercasing). This is useful for search use-cases that use a one-box style search interface. Other interfaces will have to rely on additional custom fields or labels to represent things like filters applied, extra parameters, user context, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The query string being search on. This field is not analyzed and should not be pre-processed in any way in the event (e.g. normalization list lowercasing). This is useful for search use-cases that use a one-box style search interface. Other interfaces will have to rely on additional custom fields or labels to represent things like filters applied, extra parameters, user context, etc. | |
The query string being searched on. This field is not analyzed and should not be pre-processed in any way in the event (e.g. normalization list lowercasing). This is useful for search use-cases that use a one-box style search interface. Other interfaces will have to rely on additional custom fields or labels to represent things like filters applied, extra parameters, user context, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why you wouldn't want the field to be analyzed. I totally get that the source field in the document should be the unmodified search query. Someone investigating issues of their search solution will want this field untouched. 👍
But also having an analyzed field in the Elasticsearch mapping would be great as well, to let users investigate a subset of their search events.
So if you'd like to add an analyzed multi-field, you can do so with:
type: keyword
multi_fields:
- type: text
name: text
You'll notice that ECS follows the reverse convention vs Elasticsearch' dynamic mappings: the keyword
field is always the canonical one, and the analyzed field is the nested one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is typically done after an aggregation on something like query string or query ID. You wouldn't tend to investigate search queries in their raw form but really only after some aggregation. The kind of normalization/analysis that you would do it also pretty custom to the business, I would argue. The standard
analyzer is probably fine for most people of course, but many places will have custom stop words, etc. that they may want to also use to normalize their queries before analysis.
I'm not opposed to adding multi-fields here, but I'm not convinced it's necessary. I'd opt to hold off and add it later if there is a larger observed need. As you say, someone can always index the query in a custom text
field. WDYT?
This tooling looks great. Can we advise use of this or not yet? |
Search events (query, page, click) have a common set of fields that are useful to collect to create metrics from. Search metrics can be calculated with transforms based on this schema. We expect applications to emit these events, not Elasticsearch itself (for now). This change is part of a larger project to collect search events and calculate metrics in a standard way. Closes: #666
Newlines at 80 cols and add the page field back in.
8k was probably excessive so we've shortened it to 4k.
Next CHANGELOG includes a reference to the new field set as well as the PR that introduced the change.
They should be prefixed to make it obvious they are search actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tooling looks great. Can we advise use of this or not yet?
Yeah I think we can, with the proper caveats in place. Not officially supported, and still missing some features (e.g. adjusting template settings). But I think it's useful already, and that it's a pretty good way to manage one's templates.
Only very minor adjustments left to do, see below. Thanks for the adjustments :-)
Co-Authored-By: Mathieu Martin <[email protected]>
We want to make it clear that these are examples of usage only and not required. Co-Authored-By: Mathieu Martin <[email protected]>
@timestamp corresponds to the timestamp of the actual event at source while event.created is when the first agent picks up the event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I think there is one field missing here: search.query.offset which indicates the "record from" number which is not the same as the page. |
One another note about search.query.value. We have a use case where we need to do aggregations per search.query.value during analysing one of our app. I do not know if you plan to make it searchable, but in our case we need it if possible. If not there should be .text alternative if the value is too long. |
Search events (query, page, click) have a common set of fields that are
useful to collect to create metrics from. Search metrics can be
calculated with transforms based on this schema. We expect applications
to emit these events, not Elasticsearch itself (for now).
This change is part of a larger project to collect search events and
calculate metrics in a standard way.
Closes: #666