New Kibana Query Language #12282

Closed
Bargs opened this issue Jun 12, 2017 · 10 comments

Comments

@Bargs
Contributor

Bargs commented Jun 12, 2017

Part of #10789. Motivations and overall goals are described in that ticket. This ticket is only for the implementation of the language itself; additional enhancements like autocomplete will be handled separately.

This new query language will be merged as an experimental feature. It'll likely evolve over future iterations, so these are just our initial ideas.

The new language should have certain characteristics:

  • consistent and simple - it should be easy to get started with the language without needing to learn a lot of complex syntax. Special cases should be rare. Queries should be readable without much prior knowledge.
  • expandable/pluggable - Elasticsearch supports a lot of different types of queries. The language would turn into a mess if we tried to develop a special syntax for each one. We should support the addition of new query types in a scalable manner.
  • embeddable - it should be easy to reuse these queries anywhere queries/filters are supported in Kibana.
  • abstracted from ES syntax - The language should never rely on blindly passing through raw query DSL syntax. There should always be a layer of abstraction that allows us to help users seamlessly migrate their queries. If we need to support some sort of custom query that allows query DSL or lucene query syntax there should be big disclaimers stating that the user is on their own in handling backward compatibility breaks.

With that said, what should the language look like?

Here's my current thinking:

The new filter editor introduced a nice way to build queries in plain English.

[Image: screen shot 2017-06-12 at 12 33 56 pm]

I think we could mimic this in the query language with a more functional syntax. The general pattern would be <function>(<params>).

is("response", 200)
isNot("response", 404) (or: !is("response", 404))
isOneOf("response", 401, 403)
exists("response")

Support for named parameters:

range("response", gte='400', lt='500')

Provides a better way to support advanced options:

contains("message", "fox quick", proximity=5)

It's easy to read and understand, it follows a consistent pattern, and allows for an infinite number of query types. How might we support a geo bounding box query?

geoBoundingBox("coordinates", "40.73, -74.1", "40.01, -71.12");

If we decide after testing that this is too verbose for the simplest cases, we could introduce shorthand aliases for some of the most common queries. A colon (:) could be an alias for is, so that you can still write response:200, for example.
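
To make the <function>(<params>) pattern and the colon alias a bit more concrete, here is a minimal Python sketch of how a parsed query might be represented. Everything in it is an assumption for illustration: FunctionNode, coerce and expand_shorthand are hypothetical names, not part of Kibana, and representing a field-less term with a field of None is just one possible choice.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class FunctionNode:
    """One query function call (hypothetical shape): name, positional and named args."""
    name: str
    args: Tuple[Any, ...] = ()
    named: Dict[str, Any] = field(default_factory=dict)


def coerce(raw: str) -> Any:
    """Turn a bare token into an int when possible, otherwise keep the string."""
    try:
        return int(raw)
    except ValueError:
        return raw


def expand_shorthand(term: str) -> FunctionNode:
    """Expand `field:value` into is(field, value)."""
    field_name, sep, value = term.partition(":")
    if not sep:
        # Assumption: a term without a field is stored with field None.
        return FunctionNode("is", (None, coerce(term)))
    return FunctionNode("is", (field_name, coerce(value)))


# range("response", gte='400', lt='500') written out as a node by hand.
range_node = FunctionNode("range", ("response",), {"gte": "400", "lt": "500"})

print(expand_shorthand("response:200"))
print(range_node)
```

Keeping every query as the same kind of node is what lets new query types be added without new syntax: only the function name and its arguments change.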

Query Syntax

This is in development, but I'll try to keep this up to date as I flesh out the language.

Queries are represented as functions. Many functions take a field name as their first argument. Extremely common functions have shorthand notations.

is("response", 200) will match documents where the response field matches the value 200.
response:200 does the same thing. The colon (:) is essentially an alias for the is function.

Multiple search terms are separated by whitespace:

response:200 extension:php will match documents where response matches 200 and extension matches php.

All terms must match by default. The language supports boolean logic with and/or operators. The above query is equivalent to response:200 and extension:php

We can make terms optional by using or.

response:200 or extension:php will match documents where response matches 200, extension matches php, or both.

By default, and has a higher precedence than or.

response:200 and extension:php or extension:css will match documents where response is 200 and extension is php OR documents where extension is css and response is anything.

We can override the default precedence with grouping.

response:200 and (extension:php or extension:css) will match documents where response is 200 and extension is either php or css.

Terms can be inverted by prefixing them with !.

!response:200 will match all documents where response is not 200.

Entire groups can also be inverted.

response:200 and !(extension:php or extension:css)
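
The precedence rules above (and binds tighter than or, whitespace means and, parentheses group, ! negates) can be sketched as a small recursive-descent parser. This is a minimal Python sketch under stated assumptions, not the actual implementation: the nested-tuple output format is made up for illustration, and quoted strings containing spaces are not handled.

```python
import re

# Parentheses and `!` are their own tokens; everything else is split on whitespace.
TOKEN = re.compile(r"\(|\)|!|[^\s()!]+")


def parse(query: str):
    """Parse a query into nested tuples such as ("and", left, right)."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        # Lowest precedence: a chain of and-expressions joined by `or`.
        node = parse_and()
        while peek() == "or":
            advance()
            node = ("or", node, parse_and())
        return node

    def parse_and():
        # An explicit `and` or plain whitespace both join terms with AND.
        node = parse_unary()
        while peek() not in (None, "or", ")"):
            if peek() == "and":
                advance()
            node = ("and", node, parse_unary())
        return node

    def parse_unary():
        tok = advance()
        if tok == "!":
            return ("not", parse_unary())
        if tok == "(":
            node = parse_or()
            if advance() != ")":
                raise ValueError("expected ')'")
            return node
        if tok in ("and", "or", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok  # a plain term such as response:200

    tree = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree


print(parse("response:200 and extension:php or extension:css"))
print(parse("response:200 and !(extension:php or extension:css)"))
```

Because parse_or calls parse_and for each operand, every and-chain is fully built before or gets to combine anything, which is what gives and its higher precedence.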

Some query functions have named arguments.

range("bytes", gt=1000, lt=8000) will match documents where the bytes field is greater than 1000 and less than 8000.

Notes: Terms without fields will be matched against all fields. For example, a query for 200 will search for the value 200 across all fields in your index.

Function Reference

Function name: and
Purpose: Match all given sub-queries
Alias: and as a binary operator
Example: and(response:200, extension:php) or response:200 and extension:php

Function name: or
Purpose: Match one or more sub-queries
Alias: or as a binary operator
Example: or(extension:css, extension:php) or extension:css or extension:php

Function name: not
Purpose: Negates a sub-query
Alias: ! as a prefix operator
Example: not(response:200) or !response:200
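
As a rough illustration of how and, or and not could translate to Elasticsearch, here is a Python sketch that turns nested tuples into a bool query. The tuple input format and the to_es name are hypothetical. Two choices in it are assumptions rather than anything decided in this issue: and is mapped to the filter clause (must would also work and would contribute to scoring), and is is mapped to a match query.

```python
from typing import Any, Dict


def to_es(node) -> Dict[str, Any]:
    """Translate ("and", ...), ("or", ...), ("not", x), ("is", field, value) to query DSL."""
    op, *operands = node
    if op == "and":
        # Assumption: use filter context, so every sub-query must match.
        return {"bool": {"filter": [to_es(n) for n in operands]}}
    if op == "or":
        return {
            "bool": {
                "should": [to_es(n) for n in operands],
                "minimum_should_match": 1,
            }
        }
    if op == "not":
        return {"bool": {"must_not": [to_es(operands[0])]}}
    if op == "is":
        field_name, value = operands
        # Assumption: `is` becomes a match query on the given field.
        return {"match": {field_name: value}}
    raise ValueError(f"unknown function: {op}")


query = ("and", ("is", "response", 200), ("not", ("is", "extension", "php")))
print(to_es(query))
```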

Function name: is
Purpose: Matches a field with a given term
Alias: :
Example: is("response", 200) or response:200

Function name: range
Purpose: Match a field against a range of values.
Alias: :[]
Example: range("bytes", gt=1000, lt=8000) or bytes:[1000 to 8000]
Named arguments:
gt - greater than
gte - greater than or equal to
lt - less than
lte - less than or equal to
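
The four named arguments line up with the keys of the Elasticsearch range query, so a translation could be as small as the Python sketch below. The range_query helper is a hypothetical name, and rejecting unknown argument names is an assumption about how errors might be handled. The bytes:[1000 to 8000] shorthand is left out because the description above does not say whether its bounds are inclusive.

```python
from typing import Any, Dict

ALLOWED_BOUNDS = ("gt", "gte", "lt", "lte")


def range_query(field_name: str, **bounds: Any) -> Dict[str, Any]:
    """Build an Elasticsearch range query from gt/gte/lt/lte named arguments."""
    if not bounds:
        raise ValueError("range needs at least one of: " + ", ".join(ALLOWED_BOUNDS))
    unknown = sorted(set(bounds) - set(ALLOWED_BOUNDS))
    if unknown:
        raise ValueError("unknown named arguments: " + ", ".join(unknown))
    return {"range": {field_name: dict(bounds)}}


print(range_query("bytes", gt=1000, lt=8000))
```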

Function name: exists
Purpose: Match documents where a given field exists
Example: exists("response")

Function name: geoBoundingBox
Purpose: Creates a geo_bounding_box query
Example: geoBoundingBox("coordinates", topLeft="40.73, -74.1", bottomRight="40.01, -71.12")
Named arguments:
topLeft - the top left corner of the bounding box as a "lat, lon" string
bottomRight - the bottom right corner of the bounding box as a "lat, lon" string

Function name: geoPolygon
Purpose: Creates a geo_polygon query given 3 or more points as "lat, lon"
Example: geoPolygon("geo.coordinates", "40.97, -127.26", "24.20, -84.375", "40.44, -66.09")
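
For the two geo functions, a translation might look like the following Python sketch. The helper names are hypothetical, top_left and bottom_right stand in for the topLeft and bottomRight named arguments, and parsing each "lat, lon" string into a lat/lon object (rather than passing the string through) is an assumption made to keep the output unambiguous.

```python
from typing import Any, Dict


def parse_point(text: str) -> Dict[str, float]:
    """Parse a "lat, lon" string into a lat/lon object."""
    lat, lon = (float(part) for part in text.split(","))
    return {"lat": lat, "lon": lon}


def geo_bounding_box(field_name: str, top_left: str, bottom_right: str) -> Dict[str, Any]:
    """Build a geo_bounding_box query from two corner points."""
    return {
        "geo_bounding_box": {
            field_name: {
                "top_left": parse_point(top_left),
                "bottom_right": parse_point(bottom_right),
            }
        }
    }


def geo_polygon(field_name: str, *points: str) -> Dict[str, Any]:
    """Build a geo_polygon query; a polygon needs 3 or more points."""
    if len(points) < 3:
        raise ValueError("geoPolygon needs at least 3 points")
    return {"geo_polygon": {field_name: {"points": [parse_point(p) for p in points]}}}


print(geo_bounding_box("coordinates", top_left="40.73, -74.1", bottom_right="40.01, -71.12"))
print(geo_polygon("geo.coordinates", "40.97, -127.26", "24.20, -84.375", "40.44, -66.09"))
```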

@trevan
Contributor

trevan commented Jun 12, 2017

My 2 cents: I'd rather it not be "functional syntax" but more "natural language". The problem with "functional syntax" is that it looks too foreign to non-programmers/engineers, so business users will be frightened away from it. The timelion syntax is an example of a "functional syntax": I have several users who don't like it because it looks too "weird"/"complicated".

@weltenwort
Member

@trevan it is a good point that we should make sure to make the language as accessible as possible to non-technical users. The main problem when getting close to "natural language" often is the ambiguity of the grammar. But it should be possible to get quite far as long as we stick to something like (<field> <operator> <operands...>) (similar to natural-language subject-predicate-object):

  • (response is 200)
  • (response is not 404) (I always liked that particular python expression, see the syntax reference)
  • (message is between 400 and 500)
  • (response exists) and (tags are one of 401, 403)
  • (coordinates are within 40.73, -74.1 to 40.01, -71.12) (just a thought)

That looks like it should still be generatable by a context-free grammar.
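
A single clause of the (<field> <operator> <operands...>) form can be split apart with very little code, as in this Python sketch. The operator list and the parse_clause name are assumptions for illustration only; no such list has been agreed on in this thread. The operand splitting also shows where the ambiguity creeps in: a value that itself contains " and " or a comma would be split incorrectly unless it were quoted.

```python
import re

# Assumed operator phrases; longer ones come first so "is not" wins over "is".
OPERATORS = ["are one of", "is one of", "is between", "is not", "exists", "is"]


def parse_clause(clause: str):
    """Split "(field operator operands...)" into (field, operator, [operands])."""
    body = clause.strip()
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1].strip()
    field_name, _, rest = body.partition(" ")
    rest = rest.strip()
    for op in OPERATORS:
        if rest == op or rest.startswith(op + " "):
            operand_text = rest[len(op):].strip()
            # Operands are separated by commas or by the word "and".
            operands = [t for t in re.split(r"\s*,\s*|\s+and\s+", operand_text) if t]
            return field_name, op, operands
    raise ValueError(f"no known operator in clause: {clause!r}")


for example in ["(response is 200)", "(response is not 404)", "(response exists)"]:
    print(parse_clause(example))
```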

@Bargs
Contributor Author

Bargs commented Jun 12, 2017

I agree, it's worth playing around with an even more natural syntax. One thing I don't like about languages that try to go too far in that direction is that they become more difficult to understand at a glance. Without consistent separators like . and () it can be difficult for people to read an expression and quickly tell which parts are fields, which are operators, and which are operands. Maybe we could fix that with syntax highlighting. I think sometimes the extra verbosity alone can make them harder to read as well.

@trevan do the users who dislike timelion's syntax also dislike the lucene query syntax?

@trevan
Contributor

trevan commented Jun 12, 2017

@Bargs, is the lucene query syntax https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax or https://lucene.apache.org/core/2_9_4/queryparsersyntax.html?

I guess since we have an _all field, most just query for the values that they want and use AND/OR and grouping. Something like "(404 405) AND homepage". That syntax is pretty "natural" or at least Google and other search engines have instructed them in that manner.

I think I've only heard disagreements about the range (field:[x TO y]) and gt/lt syntax (field:x). When I do point out that "exists:field" is possible or "field:value" is possible, they seem to grasp that fairly quickly.

@darkmoon03

I agree that most people are comfortable with the Google syntax. In fact, that's the phrase I most use to help my users: "It's not Splunk. Think of it like the Google search bar." That works very well.

I've also been trying to get my users to use +/- instead of AND/OR, with some success.

@darkmoon03

How about field comparisons:

(flow.bytes >= (flow.duration *100)) ?

(even if we build natural forms, please keep the arithmetic forms for readability.)

@epixa
Contributor

epixa commented Jun 15, 2017

Another way to look at this is that if non-technical users need to fall back to writing a text query in Kibana, then we have a deficiency in our search UI. I agree that the timelion syntax is challenging for non-technical users, but I also think it is highly effective for technical users. If Timelion had a UI to build expressions in addition to the ability to fall back to the query language, then both sets of users might be nearly completely satisfied.

Personally, I'd prefer a powerful, unambiguous query language for raw input alongside a higher level UI for constructing and representing queries.

@lukasolson
Member

Here's a list of syntaxes used by other software (for comparison purposes):

Query String
Field names: title:quick, book.\*:quick
Field exists: _exists_:title
Wildcards: qu?ck bro*
Regular expressions: title:/quic?k(gr[ae]y)/
Fuzziness: quikc~ brwn~ foks~, quikc~1
Proximity searches: "fox quick"~5

Simple Query String
AND operation: quick +brown
OR operation: quick brown, quick | brown
Negate: quick -brown
Phrases: "quick brown fox"
Prefix queries: quic*
Precedence: (quick OR brown) +fox
Fuzziness: quick~1
Slop: "quick brown fox"~1

Google Inbox
AND operation: quick brown, quick AND brown
OR operation: quick OR brown
Negate: quick -brown
Phrases: "quick brown fox"
Precedence: (quick OR brown) fox

Google Search
AND operation: quick brown, quick AND brown
OR operation: quick OR brown
Negate: quick -brown
Phrases: "quick brown fox"
Wildcards: "largest * in the world"
Ranges: 50..100
Fields: site:youtube.com

GitHub
AND operation: quick brown, quick AND brown
OR operation: quick OR brown
Negate: hello NOT world, -language:javascript
Phrase: "quick brown fox"
Greater/less than: stars:>10, created:>=2012-04-30
Range: stars:10..50, stars:10..*
Missing: no:assignee

Slack
OR operation: or
Range: before:01/01/2017, after:01/01/2017, on:01/01/2017, during:01/01/2017
Fields: from:@lukas, to:me, in:#kibana, has:link
Phrases: "quick brown fox"
Prefix queries: quic*

@Bargs
Contributor Author

Bargs commented Jun 15, 2017

We talked about this on Zoom a bit today. For the first iteration, I'm going to work towards the syntax I outlined in the issue description, along with some shorthand aliases for very common queries, like : for is() and > for greaterThan(). I'll add more detail to the description as I flesh out the syntax.

While even more natural looking queries seem appealing in theory, we think we'd run into a few issues:

  • Too verbose
  • Possible ambiguity in the language
  • Trouble with internationalization (while the more "functional" syntax might also need translations, we're talking about single words instead of entire phrases)
  • Making a system appear smarter than it is can actually reduce user confidence when they run into edge cases.

Note that none of this is set in stone though. Like I said in the issue description, this will be an experimental feature that will change over time. This is just the direction I'll head in first.

Also important to note, I've been developing #11915 in such a way that we should be able to add support for new languages via plugins in the future. If we decide to stay away from natural language queries but someone else really wants them, they could develop their own plugin for it.

@jccq

jccq commented Jul 2, 2017

@Bargs In line with the forthcoming SQL interface in Elasticsearch (and the general trend in big data), I believe it would make a lot of sense if the basic syntax were compatible with the SQL WHERE clause. SQL-like functions could be used for the extras. my2c. :)


7 participants