Skip to content
This repository has been archived by the owner on Nov 18, 2021. It is now read-only.

Proposal: CUE Querying extension #165

Closed
mpvl opened this issue Oct 24, 2019 · 20 comments
Closed

Proposal: CUE Querying extension #165

mpvl opened this issue Oct 24, 2019 · 20 comments
Labels
Proposal roadmap/language-changes Specific tag for roadmap issue #339

Comments

@mpvl
Copy link
Contributor

mpvl commented Oct 24, 2019

We propose an extension to the query capabilities of CUE. Querying, in this context, is more than just retrieving data. It also means the selection of nodes to which to apply constraints and, as we will explain, interpretation of lists as association lists.

This proposal is meant to give a high-level design for these capabilities, with more detailed design to be flushed out later. The main point of it up to this point is to show the coherency of the syntax and semantics of these new features.

Objective

Add minimal syntax and semantics to CUE to allow:

  1. one-line queries for retrieving data,
  2. more fine-grained control over which nodes in the configuration tree
    have constraints applied,
  3. support for association lists,
  4. a better solution for shadowing issues,
  5. future compatibility with integer map types.

Allowing closer control over pushing out of constraints is needed to support JSON Schema compatibility down the line.

A guideline for CUE is to keep the spec considerably smaller than that of YAML. Ideally, these changes should be added with little or no growth to the spec. We achieve this by 1) exploiting the commonalities between these constructs, and 2) removing constructs that are no longer necessary.

Background

We assume the merits of a query language are clear. What is probably less clear is how a query language relates to two key aspects of CUE’s view on proper configuration and schema definition. In this section, we describe the various aspects of CUE related to querying and how it's possible to design a query language that fits well with the other aspects of the CUE language.

The aspect-oriented nature of CUE

CUE can be seen as an aspect-oriented language in addition to be a constraint-based language. Just as a JSON configuration can be modeled as a collection of path-leaf value pairs, a CUE configuration can be seen as a collection of paths-constraint pairs. So instead of specifying a single concrete value for one path, CUE defines a constraint for many points in a tree at once.

To draw the parallels with aspect-oriented programming, path selections correspond to pointcuts and constraint correspond to advice.

Current limitations

As an aspect-oriented language, CUE's path selection, or pointcut, abilities are currently somewhat lacking.

CUE’s schema model is quite close to OpenAPI and JSON Schema, which makes CUE a good fit for processing those standards. However, CUE currently cannot represent certain JSON Schema constraints, most notably patternProperties. This is a result of CUE's limited pointcut abilities.

Queries to the rescue

A query language for CUE, or any JSON-based query language, selects data from the JSON data tree. Existing languages have become quite powerful at that. Even though it doesn't support querying other than in the form of comprehensions, it already possesses many of the constructs needed to define a query language.

It turns out that the parts where CUE lacks constructs to construct a JSON-query-language-like construct closely correspond to the parts where CUE is lacking in supporting JSON Schema. Keeping the syntax for these queries close to the syntax for selection in label expressions will aid both the user and keep the spec small.

Shadowing

Command-line tools like jq are more useful if queries can be specified on a single line. The same would hold for CUE. A problem in CUE that thwarts the ability to make single-line queries is field shadowing. The current alias construct provides a way around this, but requires multiple
lines to write.

X=foo // make foo visible within struct declared by nested foo
foo: foo: bar: X.baz // access within outer foo

A good querying language will benefit from being able to avoid shadowing.

Abstraction issues and associative lists

CUE solves a longstanding problem in configuration design that plagues most configuration languages: whether to use abstractions or not. The idea behind using an abstraction layer is that it can 1) hide complexity from the user and 2) protect the user against misuse. But an abstraction layer is prone to "drift". The abstraction layer starts lagging as new features are introduced. This poses a maintenance issue. This problem is often so severe that abstraction layers are a bad idea. But not using an abstraction layer and configuring an API directly means exposing the user to potential mistakes.

CUE solves this issue in two ways. CUE allows defining constraints directly on an API, making it unnecessary to introduce abstraction layers in the first place. But where abstraction layers are used, CUE's composition model allows mixing it with direct-API use without foregoing the protections of the abstraction layer. It allows combining the best of both worlds.

For this to work well, though, it is important that defining CUE directly on an API is possible in the first place. This is currently not the case. The problem is lists. In many configurations, lists are interpreted with set-like properties. For instance, a list of strings is interpreted as a set of strings, whereby each element must be unique, such as in:

// Some APIs may expect the following two strings to be merged to [ "foo", "bar", "baz" ]
stringSet: [ "foo", "bar" ]
stringSet: [ "bar", "baz" ]

Or the elements of lists are structs of a certain type, whereby a struct with the same name may appear only once. For instance, merging these two lists

objects: [{
    kind:   "Job"
    name: "frontend"
    val1: 1
}, {
    kind:   "Job"
    name: "backend"
    val2: 2
}]
objects: [{
    kind:   "Job"
    name: "frontend"
    val3: 3
}]

is expected to result in

objects: [{
    kind:   "Job"
    name: "frontend"
    val1: 1
    val3: 3
}, {
    kind:   "Job"
    name: "backend"
    val2: 2
}]

To deal with this effectively in CUE, one currently has to convert such lists to structs and back, violating the assumption that CUE can be applied to the native API.

Proposal

We address all of the issues above by introducing changes to the language in several steps. These steps are grouped by functionality and are chronological.

Changes to field syntax

The first set of changes relates to changing the syntax for defining fields. We assume the adoption of colons (:) instead of spaces separating fields in the shorthand mode (more on this in the discussion).

The semantics of optional and required fields with respect to CUE's value model remains unchanged.

Optional fields

CUE currently allows specifying an optional field using the following notation:

foo?: bar

We introduce the ability to specify optional fields in bulk using the notation

[expr]: value

defining an optional field x: value for any label x that unifies (matches) with expr.

The existing notation for optional fields foo?: bar remains as syntactic sugar for ["foo"]: bar. The meaning of optional fields remains the same otherwise.

As with normal fields, constraints for optional fields are additive.

[=~"^[a-z]$"]: int
[<"r"]:        <75
[>"k"]:        >50

a: 10  // <75
m: 60  // >50 & <75
z: 100 // >50

Note that even though it is not presented as such in the spec, CUE currently already allows specifying bulk optional fields using the <Name>: value notation. This would become [string]: value in this proposal (modulo the Name alias, more on that later).

Required fields

CUE currently allows specifying required fields using the following syntax:

foo: value

We introduce the ability to specify a single required field using the notation:

(expr): value

with a label resulting from evaluating expr and value value. In this case foo: bar is syntactic sugar for ("foo"): bar.

NOTE: we may or may not actually introduce this notation, the "\(x)": bar notation gives most of the benefits and may be clearer. However, this notation helps to understand the consistency of the resulting syntax and where the language may develop.

Aliases

We introduce the ability to alias labels and field values. The notation X=foo: bar binds X to the same value as foo, namely bar, and is scoped using the same rules as for foo.

An alias used within parentheses or square brackets ((Y=expr): bar or [Y=expr]: bar) binds to the label value and is visible within the value of the field.

Example:

// This is equivalent to <Name>: { name: Name } in current CUE 
[Name=string]: {
    name: Name
}

jobs: {
    X="main-service": {
        replicas: 2
    }
    sidecar: {
        replicas: X.replicas // set to same number of replicas as “main-service”
    }
}

Elimination of quoted identifiers

CUE currently allows backticks to create identifiers that are otherwise not referenceable. With the proposed aliasing construction this is no longer necessary. Back-quoted identifiers were necessary to allow referring to fields with keyword names or to declare definitions with names that are not valid identifiers (a[x] cannot be used to look up definitions).

With aliases, instead of referencing a field with an invalid identifier as name with a quoted identifier, one can just alias this field:

X="foo-bar"
a: X.baz

This does not solve the case of accessing a field with an invalid identifier name within another reference. Currently, these can be referenced using quoted identifiers:

x.`foo-bar`

The aliasing technique doesn’t apply here (aliases are local only and are not accessible from outside the scope in which they are declared). A solution to this is to allow strings as selectors. In this case, the field called foo-bar could be referenced as:

x."foo-bar"

Unlike using x["foo-bar"], this would also work for definitions.

To be consistent, we should at this point also allow x.0 to select the first element in a list. This proposal makes almost any expression possible as a selector, and it would be strange to exclude integers.Also, if we allow constraints to be applied to list indices on the LHS, it is consistent to allow selection using such indices as well. To make things even more consistent with the LHS, allowing foo.(x+y).bar would make selection almost fully symmetric with the LHS.

Examples

A map type from strings to integers is written as

[string]: int

Assuming a definition for JobID that defines valid IDs for jobs, and a definition Job for valid jobs. A map of jobs can be defined as:

jobs: [JobID]: Job

The old bind or template syntax (<Name>: { name: Name }) will be phased out. It can be replaced with

[Name=string]: { name: Name }

or

[Name=_]: { name: Name }

A field with a label that is not a valid identifier, or with just a really long label, can be aliased using a more convenient identifier:

X="left-hand": value

where the X here refers to value.

This example also shows why an alias for a field value is at the start of a field and not at the value:

FB="foo-bar": settings: { name: FB.name }

If the notation had been, "foo-bar": FB=settings: { name: FB.name }, it would have appeared that FB were solely bound to the settings section, and not the whole struct. It also would suggest that FB only binds to that value, and not the result of the entire field, which may be
a more restricted definition as the result of other declarations for the same field.

The common notation for writing single-value string interpolations for generated fields

"\(k)": foo

can now be written as

(k): foo

Field aliases can also be used to unshadow a reference.

X=foo: {
   foo: {
       outer: X.bar
       inner: foo.bar
   }
}

or, alternatively, noting that labels with string literals are not referenceable:

foo: {
   X="foo": {
       outer: foo.bar
       inner: X.bar
   }
}

The proposal also allows selectively applying constraints based on the value of a label. For instance, to apply a constraint only to fields that end with “Test”, one could write:

[=~"Test$"]: T

A mapping from JSON Schema to CUE would require supporting such a capability, which is cumbersome to implement in CUE with the current capabilities.

Unifying the concepts of lists and structs

In CUE lists and structs are strikingly similar at the implementation level. CUE stores lists as integer maps, for which there must be a field corresponding to each element in the list up to the largest defined index. This may not be the most efficient, but it greatly simplifies many aspects.

However, there remain significant differences in syntax and functionality. It is worth considering the benefits of unifying lists and structs at the syntactic level as well.

Types for all elements versus remaining elements

In CUE, [X, Y, ...T] defines a list with two elements that also allows any number of additional elements which must be of type T. However, X and Y themselves need not be of type T. This definition corresponds to the additionalItems keyword in JSON Schema.

There is no equivalent for the corresponding additionalProperties keyword, which applies to structs. As structs already sport the ... notation for open-endedness, it is clear, syntactically at least, what that analogue would be.

Conversely, in structs we allow defining a type that must apply to all elements by means of the generalized optional field syntax: [expr]: T (or previously <Name>: T). There is, however, no equivalent for this in lists. As CUE already treats lists and structs equivalently semantically, such a construct for lists can analogously be written as:

list: [int]: T

Integers in label expressions

Viewing lists as integer maps would extend the same power and flexibility of selecting labels and applying constraints to lists.

For instance:

list: 6: T          // Only apply a constraint to the element at position 6
list: [int]: string
list: [>10]: string // Apply a constraint to elements at positions >10
list: (x+y): value

Access to optional elements

CUE currently doesn't allow access to optional values. For instance, this results in an error:

foo: {
    bar:  int
    baz?: int
}
val: foo.baz // cannot access optional field

We may consider allowing this in the future, either by not making it an error or requiring an added question mark foo.baz?. For structs this would be an easy addition: foo[expr] is guaranteed to only give access to regular required fields, while foo.label gives access to those same fields, definitions and then possibly optional values.

For lists currently only indexing is allowed. Interpreting lists as integer maps, or even allowing integer maps altogether, would make it reasonable to allow also integer positions in selection positions:

list.3.name

This also be more consistent with allowing integer values for labels.

Integer maps

Some of CUE's integrations, most notably YAML and Protocol Buffers, support maps with integral keys. It is not sure CUE should go there, but it is worth keeping this option open.

Value filters

With the notation for optional fields above, we provided a way to apply constraints to a subset of fields based on the field label. One may also want to apply constraints based on the value of a field.
This can currently be done with comprehensions. Comprehensions can be a bit clunky, however, and cannot be used as a query shorthand. We would like a notation that is consistent and convenient for LHS selection and querying.

Consider this definition.

map: [string]: {
    name: string
    port: int
}

We want to now further constrain that any value with a port in the range 50000 to 50100
must start with the name "home-". For this we borrow the [?expr] notation from JMESPath and JSONPath to filter a value based on a boolean expression:

PortMap: [? @.port>=50000 && @.port<50100]: name: =~"^home-"

The ? notation indicates that we are filtering a value. The @ notation is a special identifier that refers to the "current" value under consideration (the value that will land after the colon). Like JSONPath and unlike JMESPath, the use of @ is required in a CUE program to access fields in the RHS value. (This requirement might be dropped for command-line queries.)

Label expressions and value filters can be combined, allowing one of each:

foo: [>"field"][? @.name>"foo")]: value

Examples:
In this example we use value filter to set some qualifiers
to existing data.

AppleKind: string
appleKinds: [AppleKind]: {
    size: "large" | "small"
}

appleBags: [...{
    contents: AppleKind
    numUnits: int

    heavy: *false | true
}]

// We call more than 50 apples heavy.
appleBags: [? @.numUnits>50]: heavy: true

// Less apples go in a bag if it is large.
appleBags: [? appleKinds[@.contents].size=="large"]: numUnits: <=70

If a bag has more than 50 apples, we consider it heavy. And if an apple is large, we reduce the capacity of the bag. As with comprehensions, the constraints are unified after evaluating the value and checking all conditions.

Associative lists

An associative list is a list that defines a key for each of its values, effectively turning it into a map. In CUE terms, elements with the same key are unified and collapsed onto a single element as if it were a struct.

We can introduce associative lists by generalizing the concepts introduced in the previous sections. Most importantly, we allow the @ notation referring to the right-hand side to also be used in a label expression, allowing it to specify how to derive a key from its elements.

Suppose we have the following two lists:

a: [{
    name: "foo"
    val1: 2
}, {
    name: "bar"
    val1: 3
}]
a: [{
    name: "foo"
    val2: 4
}]

What we would like these two lists merge into:

a: [{
    name: "foo"
    val1: 2
    val2: 4
}, {
    name: "bar"
    val1: 3
}]

We can achieve this by defining a as an associative list:

a: [@.name]: {...}

This would tell CUE to interpret the lists as maps keyed by the name field of its struct elements. Theoretically, it could also be used as a validation check on map keys. We intend to initially disallow a mixture of structs and lists to define a value and will only allow this notation for lists.

Associative lists can also be used to define sets. For instance, a set of strings can be defined as

stringSet: [@]: string

such that

stringSet: [@]: string
stringSet: [ "foo", "bar", "baz", "foo" ]

evaluates to

stringSet: [ "foo", "bar", "baz" ]

(consistent with how structs work, it would not be an error).

A slightly more complex example involving environment variables:

// Cmd merges environment variables based on their name.
Cmd :: env: [strings.Split(@, "=")[0]]: string
cmd1: Cmd & {
    env: [ "FOO=3", "BAR=4" ]
}
cmd2: cmd2 & {
    env: [ "BAZ", "QUX=4", "FOO=3" ]
}

yields

cmd1: env: [ "FOO=3", "BAR=4" ]
cmd2: env: [ "BAZ", "QUX=4", "FOO=3", "BAR=4" ]

Indexing into associative lists would only be possible by means of the key value (cmd1.args.FOO); the integer index would be meaningless. Defining an associative list indicates that the value should be output as a list. An open question is whether one should also allow keys to be specified explicitly, such as { FOO: "FOO=3" } or perhaps [ FOO: "FOO=3" ]. This is not planned for an initial implementation.

In some APIs the order of elements in lists is important even if it is interpreted as an associative list. Such semantics would be incompatible with CUE's value model. It is, however, in the realm of possibilities to give some guarantees on topological sorting on a best-effort basis outside of the usual value model. The above example of the command line arguments applied such an algorithm; in that case there was only one possible order.

Querying

In this proposal we have suggested increasing the symmetry between the selector operator and what can be specified as a label. For instance:

Label             Corresponding selector
foo: bar: 2       foo.bar
foo: "a-b": 3     foo."a-b"
foo: (a+b): 4     foo.(a+b)
foo: [5]          foo.0

We propose that the LHS square bracket notation extends analogously to become a query operator.

In this proposal, any selector with an expression containing a square bracket it is a query. The result will be a possibly empty list containing all matching values. Any subsequent selector operating on the result applies to each individual result. This is called projection in JMESPath. If the expression is followed by anything other than a dot, the sequence terminates and a list results. The result is always a list even if it yields a single value.

A query has one of the following forms:

expr.[key_filter]
expr.[key_filter][? value_filter]
expr.key[? value_filter]
expr.[? value_filter]

This makes projections, selector expressions that selects 0 or more entries into a list, easy to recognize: they all start with .[ or include a [?, or both. For clarity, we could disallow the third case. Having a clear syntactic distinction between starting and ending a projection avoids the ambiguities one can find in JMESPath, where foo[:5][:3] and (foo[:5])[:3] mean different things.

Examples:
Data:

a: {
    b: [1, 2, 3]
    c: [4, [5, 6]]]
}

Queries:

a.[]            // [ [1, 2, 3], [4, [5, 6] ]
a.[<="b"]       // [ [1, 2, 3] ]
a.[][? @[0]>3]  // [ [4, [5, 6] ]

a.b.0           // 1
a.["b"].0       // [1]
a.[].0          // [1, 4]
a.[].[]         // [1, 2, 3, 4, [5, 6]]
a.[].[>=1]      // [2, 3, [5, 6]] // if we allow selection on list indices

Flattening

JMESPath has a flattening operator, denoted foo[], which flattens out one layer of lists. For instance, [1, 2, [3, 4]][] could mean [1, 2, 3, 4]. It may be worth considering providing the same. Unlike with JMESPath, such an operator would not start a projection, but rather terminate one. A subsequent query operator can be used to restart projection if needed.

Initially, implementation can just use the struct.FlattenN builtin.

Recursive descent

The proposal currently does not address a recursive descent operator, as available in JSONPath and jq. This may be considered as a future option, but performance considerations and semantic consequences have to be carefully weighed.

Construction

JMESPath and jq allow constructing new values from collected results. The square bracket operator is too overloaded in the proposed query syntax to be useful for construction and it is desirable to not break the simple rule that .[ always starts a projection.

A possibility in CUE is to piggyback on the emit value syntax. This would allow .{expr} to be used for generating values. The obvious use for this is to construct structs, but other values, like lists or strings, can be used using the emit value syntax.

Examples:

names: [ “Jane”, “Jack” ]
msg: names.[].{"Hello \(@)"}  // outputs: [ “Hello Jane”, “Hello Jack” ]

A big advantage of this approach is that it also allows for the construction of strings and other values, not just lists or structs. It also clearly distinguishes the .[] notation for projection, resulting in a list, the .x notation for selecting a single value, and .{} for the creation of values.

Open Question: Aliases in queries

If we allow construction, it becomes useful to allow aliases in queries as well. The proposed syntax looks a bit awkward with the dot notation. For instance in foo.X=bar.baz, it appears as if X is bound to bar.baz. Allowing aliases only in front of brackets, such as in foo.X=[].baz, foo.X=("bar").baz, may mitigate this issue, but it is still does not have the clarity as using the : notation.

It may be sufficient to say that for these cases one would need to use comprehensions. Note that neither JMESPath nor jq allow this kind of flexibility in construction.

Slice

JMESPath, JSONPath and jq all implement a variant of the slice operator. For consistency, it may be worth reintroducing this operator in CUE. It could be introduced as a normal operator as well as a selector operation allowing it in a .[]. Note that the latter is somewhat redundant, as .[] already allows selection on index values. Adopting slices in that situation may make more sense when adopting Python style semantics (allowing negative index and steps, just like JMESPath).

Evaluation order

Although many of the proposed constructs already have an equivalent in existing constructs, it is good to consider once more how all of this is evaluated.

Evaluation of a node starts by evaluating the nodes on the path leading to the respective node and collecting the results in a single expression. Note that CUE evaluates lazily, which is necessary to implement the constraints implied by the value lattice.

The evaluation of such an expression commences as follows:

  1. Collate conjunctions by type specification (for associative lists, mostly), data, and validators.
  2. Unify the type specifications to determine the merge strategy for lists, if applicable.
  3. Unify the data nodes.
  4. Apply constraints from optional fields whose label and value filters match that of the current value.
  5. Verify the resulting value for each of the validators.

To preserve commutativity, constraints applied at Step 4 only consider the value obtained at Step 3. A validator will either pass the current value or return bottom (a failure). Running validators at the end of the validation allows validators to be non-monotonic, allowing CUE some leaway in implementing non-monotonic validation such as the cardinality of a set or a not operator or builtin.

Note that although CUE only evaluates one node at a time, it defines a struct to fail if any of its children fails. So it still needs to descend down all nodes to validate that a configuration is valid for structs.

Discussion

Label expression notation

The notation for referring to a label value on the right-hand side has become slightly more verbose. Previously it was

<Name>: { name: Name }

now it is

[Name=_]: { name: Name }

It is also a little bit awkward.

In practice, however, the need for this construct seems less common than originally anticipated. Moreover, where it is used, it seems to be the result of fitting a native API onto a different CUE mapping when the native API is not convenient in CUE. With associative lists, the need for this construct should lessen more.

On the other hand, the need to constrain keys to a certain value, or to associate a set of constraints only with a certain subset of keys, seems much more common than originally anticipated. The [expr]: foo notation works much better for this use case than the <labelIdent>: foo notation, which did not allow for either case.

Another issue with the angular bracket notation is its parseability. With the introduction of unary comparators, most notable < and >, parsing these labels became complicated. With the introduction of : separators, the use of angular brackets also seems to have become esthetically less pleasing.

The square bracket notation has precedence in other languages. A map definition will look quite similar to one in TypeScript. The square bracket notation is also used in JSONPath, JMESPath, and jq to indicate field selection, very similar to the semantics in the way fields are selected in CUE.

The [expr]: foo notation is somewhat unfortunate for people coming from Jsonnet, where it means the equivalent of (expr): foo proposed here. However, the (expr): foo notation is used in jq and seems to more intuitively fit with the interpolation syntax and CUE as a whole.

The aliasing construct is very powerful and solves very common shadowing issues. For instance, a field called bytes of type bytes. Reserving identifiers starting with __ and backtick identifiers have proven to be insufficient to resolve even common cases.

Consolidation of list and struct concepts

The syntactic regularity that comes from consolidating the list and struct types may be pleasing, but it may also be harder to differentiate between the two cases. The question is how much this matters. For instance, it is already not possible to see whether src is a list or struct in src[x] or even for x, y in src.

The big advantage in consolidating the type is with querying. When using projection, the process of continuing selection on a collection of values, it is useful if it is crystal clear where a projection starts or ends. Having the same syntax for lists and structs means less to learn in this case.

In many cases, though, the type will be clear from the context, as in foo.[>"X"] and foo.[>10]. Where this is not the case, users could always clarify whether operating on list or struct by explicitly mentioning the key type, as in foo.[string] or foo.[int].

Overall, the regularity of the syntax seems to be desirable here.

Comparison to other query languages

Although we've taken elements from jq, JMESPath, and JSONPath in the design of the query language, none was quite suitable for use in CUE.

JMESPath seemed the best candidate, especially because it is well defined. Syntactical elements such as using | for pipe, backticks or single quotes for literal JSON make it incompatible with CUE. JMESPath also allows identifies to resolve to top-level fields in the "current object". This is possible as JMESPath has no other context. In CUE this is not possible, so CUE adopts the convention of JSONPath of always needing to refer to the current object explicitly.

Projection also has some unexpected properties. For instance, in JMESPath

list[:5][:3]

is not the same as

(list[:5])[:3]

Although this outcome is well defined and an understandable outcome, it is not quite a desirable property. A similar effect exists with any of the operations allowed as a projection. This is exasperated by the fact that it may not be clear from the immediate context whether an operator acts on a projection or not. For instance, the outcome of expr[0] differs depending on whether expr is a projection or not. In the proposed query syntax, a projection can only be continued using a selector, no exception. As such the dot in a query corresponds strictly to the colon on the left-hand side.

In the current proposal this effect still exists, but is limited to the selector operator, where such semantics may be expected. We could eliminate the distinction between (s.[x]).[y] and s.[x].[y] by adopting a JSONPath semantics requiring a projection be “captured” by enclosing brackets. This would make the (s.[x]).[y] form illegal, which would have to be written as [s.[x] ].[y]. This approach would allow an alternative syntax for appending lists, namely [ x.[_], y.[_] ] instead of x + y. Given that lists can be open or closed, the latter is a bit unclear: do the constraints of remaining elements of x get applied to y or not? What about vice versa? Is the result open or closed? The [ x.[_], y.[_] ] notation strongly implies the resulting list is closed and that the constraints for remaining elements of x and y do not get applied. To open the list, one would write [ x.[_], y.[_], … ]. Using + for list concatenation could then be deprecated.

There are more reasons to deviate from JMESPath, though. Although the semantics of || and && seem fine for a query language, they seem dangerous for a language like CUE. Furthermore, JMESPath provides an expression type, denoted &expr, that can be used to pass an unevaluated expression to be used as, say, a sort key for a sorting function. The use of & may be confusing in CUE. Using backticks might be a possibility.

Interaction with comprehensions

There is a lot of overlap between comprehensions and the querying syntax. The semantics should subsequently be kept in sync and implementations should implement one in terms of the other. It may be worth considering getting rid of comprehensions, although this seems unfeasible for various reasons.

Value filters

We used boolean expressions here to be consistent with existing query languages. In CUE, however, it would in this case be more natural to say, something like

PortMap: [& {port: >=50000&<50100}]: name: =~"^home-"

The boolean expression, however, is more general and a first goal was to be more familiar with existing query languages. Also, the desired semantics using unification is not eminently clear. For instance, in the above example using plain unification would match any value for which port could be within that value, so it matches values that do not have port defined.

An alternative is to use builtins together with boolean expressions, such as

PortMap: [? is_a({port: >=50000&<50100})]: name: =~"^home-"

This would allow for a builtin with a more expected instance relation and would allow for [?is_a(v1.Service)] to have the expected meaning.

Implementation-wise, boolean expressions are also similar to comprehensions, so focussing on boolean expressions simplifies the implementation initially and allows sharing the spec between them.

Construction syntax

In queries one will often reconstruct a similar value. It is thereby common to replace a field like this: { foo: foo }. In CUE this results in an evaluation cycle. As string labels are not referenceable, a trivial workaround is to write { "foo": foo }. As this is a common case, this gets tedious quickly, though. There should probably be a shorthand for this at some point. As { foo } already means embedding in CUE, a possibility is { :foo }.

Using colons : versus dots versus spaces for paths

A common complaint for CUE was that using spaces as separators for labels as a shorthand syntax for nested structs was confusing. The most common suggestion to fix it was to use dots. This indeed seems like an obvious choice. The proposed query syntax seems to indicate this even more. Given that a CUE program is structured as a collection of <node selection> ':' <constraint> declarations, the query syntax fits in nicely.

However, the dot syntax has its own issues. Firsty, CUE allows the RHS to reference the LHS. For instance,

Foo: {
    field: Foo | null
}

is allowed. This is analogous to Go allowing structs with fields of that struct itself, as long as they are pointers. Given the dot syntax, however, it seems non-obvious that the b after the colon in a.b.c: b.d refers to the b in left-hand side.

Also, the colon in CUE signifies a definition. With the new proposal, a map of map of integers is written intuitively as

IntMap :: [string]: [string]: int

Compare this to using the dot syntax, where this would become

IntMap :: [string].[string]: int

which is arguably even more awkward than using spaces as a separator.

Another big advantage of using colons is the ability to alternate regular values and definitions within a single chain. CUE uses both :: and : for defining values. In CUE, a definition must always be indicated with a double colon. Using colons as separators allows one to write

foo: D:: bar: int

which would not be possible, or at least ambiguous using spaces or dots. The current implementation of CUE disallows such chaining for definitions.

Compared to spaces, using colons (or dots) allows one to break a long chain of declarations across multiple lines. For instance

foo: bar: baz:
    qux: quux: <10

The comma elision rules prevented that possibility using spaces as separator.

Transition

This proposal is large and needs to be implemented in several stages. Many of the concepts may to existing ones, making its implementation not too hard, but each step will still require thorough revisiting of the proposed syntax and semantics.

Phase 1: Implement backwards incompatible changes.

The first step is characterized by making minimal changes for things we are sure of and to get backwards compatible changes out of the way.
Initially, we intend to only support the square bracket notation and not the parenthesis notation. Using string interpolations gets most of the functionality.

In the very early stages of the transition, we may only support string as a valid value. This allows mapping starting early to phase out the current templating notation (<Name>: value). This means initially only adding the [expr]: value notation.

The aliasing construct makes the recently introduced back-quoted identifiers superfluous. We suggest removing them from the language. Backquoted identifiers are semantically easier, but syntactically relatively verbose to describe.

Phase 2: Optional field selection and associative lists

The ability to apply constraints selectively based on the label value is often requested and moreover is somewhat of a blocker in supporting JSON Schema compatibility. This phase may skip selection on values.

Associative lists also fall in this category of making CUE more useful for its original intent and purposes.

Phase 3: Querying

The query language itself can be implemented in several phases. The first step would be to apply key and value filters to lists and structs. This would also be a good point to add value filters LHS optional fields.

This may also include allowing selection based on any kind of literal.

Phase 4: Deprecating old constructs

This includes:

  • quoted identifiers
  • template/bind syntax (<Name>: value)

Phase 5: Consider optional parts of this spec

This includes:

  • (expr): value notation,
  • whether to allow integer maps,
  • type of additional properties (...T), needed for JSON schema compatibility,
  • access to optional elements,
  • identifier for root struct ($ or root()), or an identifier for,
    the current struct, allowing aliasing at the top level,
  • value selection based on instance relation.
@rudolph9
Copy link
Contributor

NOTE: we may or may not actually introduce this notation, the "(x)": bar notation gives most of the benefits and may be clearer. However, this notation helps to understand the consistency of the resulting syntax and where the language may develop.

Isn't this already supported?

bar = 5
x = "foo"
"\(x)": bar 
:! cue eval %                                                                                                                                                                                                                                                                                                        
foo: 5 

@rudolph9
Copy link
Contributor

rudolph9 commented Nov 20, 2019

Integers in label expressions

Viewing lists as integer maps would extend the same power and flexibility of selecting labels and applying constraints to lists.

For instance:

list: 6: T          // Only apply a constraint to the element at position 6
list: [int]: string
list: [>10]: string // Apply a constraint to elements at positions >10
list: (x+y): value

Could we generalize the use of any constraint for use as a field?? The scope of this request is much larger than support json schema but would capture the quoted use case and would allow for something like the following:

boolToString: {
  (bool): *"bool" | string
  (true): "true"
  (false): "false"
}

trueString: boolToString[true]

Obviously implementing a runtime that support this is no easy feat but it's arguably the biggest missing construct convey a value lattice

@rudolph9
Copy link
Contributor

rudolph9 commented Nov 20, 2019

Integers in label expressions
Viewing lists as integer maps would extend the same power and flexibility of selecting labels and applying constraints to lists.
For instance:

list: 6: T          // Only apply a constraint to the element at position 6
list: [int]: string
list: [>10]: string // Apply a constraint to elements at positions >10
list: (x+y): value

Could we generalize the use of any constraint for use as a field?? The scope of this request is much larger than support json schema but would capture the quoted use case and would allow for something like the following:

boolToString: {
  (bool): *"bool" | string
  (true): "true"
  (false): "false"
}

trueString: boolToString[true]

Obviously implementing a runtime that support this is no easy feat but it's arguably the biggest missing construct convey a value lattice

This would also eliminate the need for additional notation described in the following:

Value filters

With the notation for optional fields above, we provided a way to apply constraints to a subset of fields based on the field label. One may also want to apply constraints based on the value of a field.
This can currently be done with comprehensions. Comprehensions can be a bit clunky, however, and cannot be used as a query shorthand. We would like a notation that is consistent and convenient for LHS selection and querying.

Consider this definition.

map: [string]: {
   name: string
   port: int
}

We want to now further constrain that any value with a port in the range 50000 to 50100
must start with the name "home-". For this we borrow the [?expr] notation from JMESPath and JSONPath to filter a value based on a boolean expression:

PortMap: [? @.port>=50000 && @.port<50100]: name: =~"^home-"

The above example could be written:

   map: [string]: {
      name: string
      port: int
   }

  PoartMap: ([map]  & [_]: port: >= 5000 & [_]: port: < 50100): name: =~"^home-"

Given map is constrained to something concrete; PoartMap: ([map] & [_]: port: >= 5000 & [_]: port: < 50100): name: =~"^home-" would yield the same result as ``PortMap: [? @.port>=50000 && @.port<50100]: name: =~"^home-"` without introducing any new notation.

The Array section already touches on this when it describes the (x+y): notation but taking it further, the capability of constraints to be used as fields (distinctly different from optional fields) could serve as the keystone to much more and notation such @ could be syntactic sugar instead of a core operator.

NOTE: there is something missing though as the notion of an array is ambiguous in this example and [map] is trying to say that all fields in map are optional so for example suppose:

map: foo: {
  name: "home-foo"
  port: 50000
}

then [map] in the example could also be described

map_: [map]
map_: ["foo"]: {
  name: "home-foo"
  port: 50000
}

then it follows that PoartMap: ([map] & [_]: port: >= 5000 & [_]: port: < 50100): name: =~"^home-" could be written:

map_: ["foo"]: {
  name: "home-foo"
  port: 50000
}

(map_ & [_]: port: >= 5000 & [_]: port: < 50100): name: =~"^home-"

Where similar to the brackets in [map] making the top level fields in the struct optional () notation would make them not optional and it follows that:

([map]) == map // true

similar to the idea that !!true == true

@rudolph9
Copy link
Contributor

rudolph9 commented Nov 20, 2019

More examples from the rest of the doc

Associative lists

An associative list is a list that defines a key for each of its values, effectively turning it into a map. In CUE terms, elements with the same key are unified and collapsed onto a single element as if it were a struct.

We can introduce associative lists by generalizing the concepts introduced in the previous sections. Most importantly, we allow the @ notation referring to the right-hand side to also be used in a label expression, allowing it to specify how to derive a key from its elements.

Suppose we have the following two lists:

a: [{
    name: "foo"
    val1: 2
}, {
    name: "bar"
    val1: 3
}]
a: [{
    name: "foo"
    val2: 4
}]

What we would like these two lists merge into:

a: [{
    name: "foo"
    val1: 2
    val2: 4
}, {
    name: "bar"
    val1: 3
}]

We can achieve this by defining a as an associative list:

a: [@.name]: {...}

Would this example break the property that expressions can be evaluated in any order and yield the same result?

Since the following would yield _|_ (and should) it would become necessary to keep track of all array union operations in order to be able to apply the a: [@.name]: {...}

a: [{
    name: "foo"
    val1: 2
}, {
    name: "bar"
    val1: 3
}]
a: [{
    name: "foo"
    val2: 4
}]

@mpvl
Copy link
Contributor Author

mpvl commented Nov 27, 2019

@rudolph9 "\(x)": foo is supported, but not (x): foo. Admittedly that's redundant. The main reason to support it would be consistency.

@mpvl
Copy link
Contributor Author

mpvl commented Nov 27, 2019

@rudolph9 Allowing any type as key: needs thorough investigation. Would be interesting to see how YAML resolves it and what kind of issues they run in to. The idea to use this instead of value pointcuts is interesting. I will need to think about that a bit more.

Associative lists: it is indeed true that lists cannot evaluate until all information for an expression is selected (and the type is clear), but it should not affect evaluation order other than that. This is not far off from how evaluation works in CUE anyway. If it does break commutativity than this is broken. It is key for CUE to stay commutative.

@rudolph9
Copy link
Contributor

rudolph9 commented Nov 28, 2019

@mpvl Ultimately my concern is around adding functionality which doesn't have basis in foundational set theory aspects of cuelang. Associtive list could defined within those set theory bounds if we were to treat arrays as syntactic sugar for something like the following:

//a: [{
//    name: "foo"
//    val1: 2
//}, {
//    name: "bar"
//    val1: 3
//}]
a: *{
  0: {
    name: "foo"
    val1: 2
  } 
  1: {
    name: "bar"
    val1: 3
  }
} | {
  0: | close({ // Added to disjuntion set
    name: "foo"
    val1: 2
  }
  1: | close({ // Added to disjuntion set
    name: "bar"
    val1: 3
  })
}

//a: [{
//    name: "foo"
//    val2: 4
//}]
a: *{ 
  0: {
    name: "foo"
    val2: 4
  }
} | {
  0: | close({ // Added to disjuntion set
    name: "foo"
    val2: 4
  })
}

a: {
  associativeFiled=@?: _
  if (associativeFiled) { //i.e. associativeFiled is set to something
    thing=[? int]: _ // match and scope the value of int to the alias thing
    (thing[associativeFiled]): thing
  }
}

a: @: "name" // => makes it an associative list with name as the field being associated

By default the above would yield the current functionality and only by setting the optional filed @ do you get something different.

One big missing feature I've found in cuelang is ability to add to a disjunction associated with a filed, (hence my 0: | ... 1: | ... lines) perhaps I'm missing something obvious 🤔

@mpvl
Copy link
Contributor Author

mpvl commented Nov 29, 2019

@rudolph9 I'm not sure what you mean that you cannot currently associate a default with a field. You can...

But yes, in order for associative lists to work, lists would be syntactic sugar for maps with implied keys, either being an integer index derived from its position or a key derived from the value (for associative lists). So lists would be maps with the additional constraint that the number of elements is equal to the largest index. In the current implementation it is very close to this interpretation already.

@mpvl
Copy link
Contributor Author

mpvl commented Nov 30, 2019

@rudolph9 Regarding your example to eliminate value constraints, I assume you meant to use square brackets (as parentheses would require the value to be concrete). I'm unclear as to how you would distinguish between keys and filters. For instance, how would you filter on a string value versus selecting a set of keys?

The two kinds of filters would need to be clearly distinguished. What I could imagine is allowing key and value filters within a single square bracket separated by a colon:

PortMap: map
PortMap: [string: port: >= 5000 & < 50100]: name: =~"^home-"

It is a bit weird, though, as the interpretation of key may be a bit unclear. Also it gets awkward for filtering when the key constraints is a string, (["foo": port: ...]).

An alternative to the proposed syntax I had in mind myself was to write value filters as:

PortMap: map
PortMap: [string]?[{port: >= 5000 & < 50100}]: name: =~"^home-"

using ?[] to filter by instance and ?() to filter by boolean expression. This would allow constructs like objects: foo?[v1.Service]: name: ..., meaning.

@rudolph9
Copy link
Contributor

rudolph9 commented Dec 2, 2019

@rudolph9 I'm not sure what you mean that you cannot currently associate a default with a field. You can...

I think I was missing something obvious here is the relevant slack thread.

@rudolph9
Copy link
Contributor

@mpvl here is related article on Graph-Relational Object Queries might offer some inspiration.

This was referenced Apr 10, 2020
mpvl added a commit that referenced this issue Apr 20, 2020
This now also allows any of the non-JSON keywords
to be used as references. Previously, these were already
supported as field names.

Issue #339
Issue #165

Change-Id: I721d054c8220ba3536f680fe2e3e502a62f99b6b
Reviewed-on: https://cue-review.googlesource.com/c/cue/+/5683
Reviewed-by: Marcel van Lohuizen <[email protected]>
@mpvl mpvl mentioned this issue Dec 5, 2020
@mpvl
Copy link
Contributor Author

mpvl commented Dec 11, 2020

@rudolph9 : we have looked a bit more into constraining using unification or subsumption, instead of boolean expressions, for instance:

PortMap: [_: {port: >= 5000 & < 50100}]: name: =~"^home-"
objects: [_: #Service]: { ... }

I'm using [] here instead of () in your example, as that is the syntax we use for pattern matching.

But the idea is the same. Note that _: {port: foo} is a generalization of the current field syntax: it just now allows _ as an identifier as well, whereas previously this was explicitly disallowed.

The problem with this approach, though, is that it really doesn't make sense to use unification here. Subsumption makes more sense, but computation of subsumption is hard to do precise, and probably not tractable in all cases, even for resolved values. It also seems unreasonable to require fields to be concrete (maybe resolved, but not concrete). For instance, for k8s, many fields of a service will probably remain undefined.

The great advantage of using a boolean expression, is that all these issues go away. Note that GROQ also uses boolean expressions, as do all other query languages.

We do want to be able to reserve the ability to come up with a way to do subsumption reasonably.

One way to keep this open, is to piggy back on issue #575, and only allow queries of the form [_: must(boolExpr)]: T for now, instead of [_:?boolExr]. This keeps our options open as to whether to still support special syntax or only allow unification/subsumption down the road.

That said, I'm not saying it is impossible to use subsumption. We can possibly get there by adding some restrictions to struct concreteness, implement precise simplifications of ranges, and verify the feasibility of some regular expression math. CUE only allows RE2 regexps, which means it is theoretically possible to determine the intersection of them in O(n) time, which is one thing that would be needed. Theoretically, only the latter is needed for unification (although it resolves itself when concrete values are involved).

myitcv pushed a commit that referenced this issue Jan 13, 2021
List operators are confusing. For instance,
is the result of [a, ...] + [b, ...] open or close?
What about [a] + [b, ...]? And so on.

With the current semantics of comprehensions, they
are also unnecessary. The above can be written as
    [ for x in a {x}, for x in b {x} ].

Though more verbose, it is very clear here that
regardless of whether a and b are open or closed
the result is always closed.

With the query extension, this can also be written
more succinctly, like `[ a.[_], b.[_] ]` or perhaps
`[a.*, b.*]`, either case is clearer than using list
operators.

See Issue #165.

In order to move to being able to give backwards
compatiblity guarantees, we propose getting rid of list
addition and multiplication. An automatic rewriter could
rewrite the old use using `list.Repeat` and `list.Concat`,
the latter of which could refer to the spec to indicate
its equivalence to using comprehensions or, later, queries.

Change-Id: I374bfd59775d66d3da9feb28e1940f8bd3c255e8
Reviewed-on: https://cue-review.googlesource.com/c/cue/+/8063
Reviewed-by: CUE cueckoo <[email protected]>
Reviewed-by: Paul Jolly <[email protected]>
Reviewed-by: Jonathan Amsterdam <[email protected]>
@extemporalgenome
Copy link
Contributor

Why wouldn't patternProperties be supportable in the language that exists today? My novice understanding is that [=~ "^.*$"]: ... is an equivalent concept.

@myitcv
Copy link
Contributor

myitcv commented Feb 3, 2021

In the v0.3.0-beta.3 release notes, [ a.*, b.* ] is proposed a query-based replacement for [a] + [b] (the latter is being removed).

I'm guessing that .* would behave as expected in this case?

[ (if x { [1,2,3] }).*, b.* ]

@myitcv
Copy link
Contributor

myitcv commented Feb 3, 2021

Replying to self following discussion with @mpvl. That should actually be written as:

[ if x { [1,2,3].* }, b.* ]

@mpvl
Copy link
Contributor Author

mpvl commented Feb 4, 2021

Why wouldn't patternProperties be supportable in the language that exists today? My novice understanding is that [=~ "^.*$"]: ... is an equivalent concept.

@extemporalgenome this is indeed a similar concept. The main difference is that it is defined to apply to all matching fields, also existing ones, and not just additional ones. This makes encoding JSONSchema in CUE, well, annoying, We thought about changing the semantics and transition CUE to be more in line with JS, but decided against that. It turns out that the CUE semantics is actually used more and quite convenient.

Instead, we think that [...Pattern]: expression could be allowed in addition. It is an extra feature, but just requires generalizing the syntax.

@mpvl
Copy link
Contributor Author

mpvl commented Feb 4, 2021

@rudolph9 regarding using subsumption, we think it may be possible if we the use of X in [_: X]: expr using the following rules:

  1. a CUE value without optional fields (non-concrete values are okay), so say {a: int, b: "str"}, but not {[string]: int}, would be allowed for general subsumption.
  2. a CUE value with optional fields, like v1.#Service would be allowed, but would only match if the RHS explicitly unified with v1.#Service or any derivative of it.
  3. As a consequence of 1 and 2, a literal {[string]: int} should be disallowed X as a value can never have unified with a value that is out of scope.

These rules are hopefully easy enough to explain. The implementation of this would also be straightforward
and performant.

It would be quite neat to be able to query k8sObjects.[:v1.#Service].

@mpvl
Copy link
Contributor Author

mpvl commented Feb 4, 2021

@jlongtine I recall you asked about recursive queries. We have looked on some algorithmic improvements for CUE and one interesting structure sharing approach may make this possible. Translated from YAML, it would allow any input to be processed in O(n). So that means validating the YAML example in the https://en.wikipedia.org/wiki/Billion_laughs_attack (fully expanded Given a well-behaving CUE program (like no comprehensions, discriminated disjunctions), this means that you could such still validate such an exploding YAML file in bounded time.

We can use the same principle to search for values in a tree in bounded time. That makes no sense when capturing in a list, or when the query has a relation with parent nodes, but at least it would be possible to do static analysis and optimize cases where this is not necessary.

Anyway, so if this is functionality people are interested in, it seems a possibility.

@jlongtine
Copy link
Contributor

@mpvl Definitely interested in this tool. I have some places in https://github.com/cue-sh/cfn-cue that could be helped by recursive queries. Happy to lay out the use case for you soon.

@cueckoo
Copy link

cueckoo commented Jul 3, 2021

This issue has been migrated to cue-lang/cue#165.

For more details about CUE's migration to a new home, please see cue-lang/cue#1078.

@cueckoo cueckoo closed this as completed Jul 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Proposal roadmap/language-changes Specific tag for roadmap issue #339
Projects
None yet
Development

No branches or pull requests

6 participants