Syntactic sugar for distinguishing value from closure arguments #513

Closed
nicowilliams opened this issue Jul 29, 2014 · 24 comments

@nicowilliams
Contributor

def f($a_value, a_closure, $another_value, another_closure):
    ...;

should be equivalent to:

def coerce(closure):
    [first(closure)] | if length == 0 then error("empty value argument") else .[0] end;
def f(a_val, a_closure, b_val, another_closure):
    coerce(a_val) as $a_value |
    coerce(b_val) as $another_value |
    ...;
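
As a rough check of what coerce would do on its own, here is a minimal sketch. It assumes a jq build that ships first/1 with a generator argument (not the 1.4 release), and the COERCE shell variable is only a convenience introduced for this example.

```shell
# The coerce definition from the proposal above, kept in a shell variable
# (COERCE is a hypothetical name used only in this sketch).
COERCE='def coerce(closure): [first(closure)] | if length == 0 then error("empty value argument") else .[0] end;'

# A generator argument is cut down to its first value.
jq -nc "$COERCE coerce(1, 2, 3)"             # prints 1

# An argument that produces nothing raises the error, caught here for display.
jq -nc "$COERCE try coerce(empty) catch ."   # prints "empty value argument"
```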
@nicowilliams nicowilliams added this to the 2.0 release milestone Jul 29, 2014
@nicowilliams nicowilliams self-assigned this Jul 29, 2014
@nicowilliams
Contributor Author

If we do not bother checking "prototypes" at call sites then this is a fairly simple feature to implement in the parser/compiler. We can't actually check prototypes at call sites anyways, since we have no easy way to distinguish a "generator" expression from an expression that would produce just one value as desired.

@nicowilliams
Contributor Author

Given that this is just syntactic sugar, and that we can't actually do anything interesting with it at call sites, I think this should be closed.

But perhaps we could adopt this for documentation purposes. I.e., document has/1 as has($value), while limit/2 as limit($n; generator). This would allow the documentation to be much clearer to users.

@stedolan What think ye?

Leaving this open for now.

@nicowilliams
Contributor Author

Actually, there is something we can do at the call sites (at link time): we can check that the closure being passed to a $ident function argument is either a constant expression, a collect or mkdict expression, or an expression that calls first. I'm not necessarily thrilled with that idea, but it's interesting.

@stedolan
Contributor

Hmm. Why do we need this distinction? There's nothing wrong with passing closures. There are essentially no contexts in jq where an "expression that produces exactly one value" is what's needed.

In general in jq, the comma operator (,) distributes over most operations. That is, if you use a generator that produces multiple values in some context, it's the same as using the context multiple times (except for those contexts that do something special with generators like [...]).

So, passing closures to has is perfectly fine and fits with the rest of jq: has("a", "b", "c") is equivalent to has("a"), has("b"), has("c"). Similarly, (a, b) == c is the same as a == c, b == c, and .[a,b] is the same as .[a], .[b], and so on.
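
The distribution described in this comment can be checked from the command line. This is only a sketch: the input object {"a": 1} and the surrounding [...] (used to collect the outputs onto one line) are illustrative choices, not part of the thread.

```shell
# has/1 yields one result per key produced by its argument.
jq -nc '{"a": 1} | [has("a", "b")]'        # prints [true,false]

# The same results, written as separate calls joined with ",".
jq -nc '{"a": 1} | [has("a"), has("b")]'   # prints [true,false]

# A generator on the left of == distributes the same way.
jq -nc '[(1, 2) == 1]'                     # prints [true,false]
```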

@nicowilliams
Contributor Author

@stedolan I've been finding that closure arguments often confuse users. Certainly a way to document that a given [closure] argument to a given function is intended to produce a single value... would be nice. Sure, we can say so in the function's description. But a syntactic way to do the same would be even better.

Also, having a short-hand that captures one value from the closure and binds it to a $IDENT is convenient (though not necessary) syntactic sugar.

Finally, if we could statically detect and warn about generators passed where only one value will be consumed, that would be even better.

Basically, I love that all function arguments are closures, potentially generators, but it's not that easy to explain.

Also, I myself often mistype the argument separator ; as , and now that we have multi-arity functions jq will often happily compile such code. With this feature, that mistake might sometimes yield a useful warning.

@stedolan
Contributor

> Certainly a way to document that a given [closure] argument to a given function is intended to produce a single value... would be nice
>
> ...
>
> Finally, if we could statically detect and warn about generators passed where only one value will be consumed, that would be even better.

My point is that there shouldn't be any of those. There should be no functions that operate on "one value", only those that operate on one value at a time, and produce as many results as their arguments produce.

The function shouldn't "intend" that its argument produce only one value. It shouldn't care how many values the argument produces, and just produce a result for each value.

> Also, having a short-hand that captures one value from the closure and binds it to a $IDENT is convenient (though not necessary) syntactic sugar.

arg as $IDENT does this. If the closure produces more than one value, then this gets run more than once, but that's the point.
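
A small sketch of that behaviour, with made-up values: the body after the binding runs once per value, and two bindings in a row nest like loops.

```shell
# The body after "as $x" runs once for each value the generator produces.
jq -nc '[(1, 2) as $x | $x * 10]'                    # prints [10,20]

# Two bindings nest, so the results form a cartesian product,
# with the first binding as the outer loop.
jq -nc '[(1, 2) as $x | (10, 20) as $y | $x + $y]'   # prints [11,21,12,22]
```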

The ; for separating arguments is regrettable. I don't see any sensible way of changing it (making , separate arguments would be very confusing), but we could tweak the syntax so that f(a,b) was a syntax error and require extra parentheses when the programmer intentionally passes a,b as a single argument.

@pkoppstein
Contributor

@stedolan - I'd just like to point out that this thread began as the result of a previous discussion regarding the distinction between "regular filters" and "special filters", the difference between the two being as follows, taking the case of arity-1 filters for simplicity.

Let us define f/1 to be regular if for all generators, s, (null | f(s)) emits the stream s | f(.). Let us say f/1 is special if it is not regular.

(My example of a "regular" filter was has/1, and the example of a "special" filter was limit/2.)
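
To make the limit/2 case concrete, here is a sketch that assumes a jq build which ships limit/2 (the thread notes it is not in 1.4). The second command shows what a filter that was regular in its second argument would produce instead.

```shell
# limit/2 is "special" in its second argument: it consumes the stream as a whole.
jq -nc '[limit(2; 1, 2, 3)]'                # prints [1,2]

# If it were regular in that argument, it would be applied to each value in turn:
jq -nc '[(1, 2, 3) as $x | limit(2; $x)]'   # prints [1,2,3]
```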

The point I was trying to make -- and the point I'd ask you to consider -- is that neither the name of a filter nor the way in which it is called makes it obvious whether it is regular or special. This may have been by design, but some programming languages make it easy (or even mandatory) to distinguish between "functions" and "macros" at the point at which they are called, e.g. by some naming convention.

So far as jq is concerned, I realize the cat is already out of the bag. Still, for new "special" builtins, it would be possible to adopt a naming convention. I am not advocating "!" specifically, but it wouldn't be too late to rename "limit" to "limit!", in that limit is not part of jq 1.4.

[EDITED to correct definition of "regular".]

@nicowilliams
Contributor Author

@stedolan We do have a few such functions. Check out limit(n; expr), currently in master.

When we need to pass multiple values to a function the only reliable way to do it is by collecting them into an array or object to pass as the input of that function. It looks weird, and in particular it's weird to newbies. Explaining that function arguments are closures, in a way that newbies are likely to understand and keep in mind, is hard (though you've done a much better job of it in the manual than I could have). And in practice it's not difficult to use single-value producing closures as function arguments, but it takes care, and I'm thinking we could do more to a) signal an interface's contract to use a single value, b) document it, and c) [with some limitations] check this at compile/link time.

@nicowilliams
Contributor Author

@stedolan I do agree that we could require parens around closure expressions that use commas. I'd thought of that but it seemed a bit hacky; if you're OK with it though, then I am as well.

@nicowilliams
Contributor Author

@pkoppstein ! generally is taken to mean "destructive", "has side effects", which for jq is impossible (welllll, I/O will change that!). Still, not a terrible idea, but first I want to get @stedolan's opinion as to the builtins that have snuck in that have some arguments expecting a single-valued expression, like the regexp builtins, and limit! Perhaps @stedolan will insist that we rework them so that all argument values come in as part of the input to the function -- a very reasonable proposition, and if he says to jump on that, we'll do it.

@pkoppstein
Contributor

@nicowilliams -- Not knowing what the alternatives really are, I just gave "!" as an example since it does seem to be an actual possibility.

But "!" would not be a bad choice -- it connotes surprise! Pay attention! Be careful!

Leading underscores already have another significance, and perhaps trailing underscores should be left for users to play with. "@" already has found a niche in jq. C uses UPPERCASE, and I suppose that would be a possibility, but between these possibilities, my preference would be "!" (at least as I write :0). Are there any other likely candidates?

As for regex filters -- as I recall, they all take their "target" string argument from input, precisely so that one can present a stream of target strings to them. What other realistic possibility is there?

Maybe I've totally missed your point, but it almost seems as though you've momentarily forgotten that in jq it's a feature that s | has(t) will produce a stream of |s| * |t| items (assuming s and t are streams).

@nicowilliams
Contributor Author

@pkoppstein The modifiers, for example, in the regex functions, are expected to be single-valued.

@nicowilliams
Contributor Author

@pkoppstein Also, what has/1 does is cool, but it's not clear how to do that for, say, match/2.

What should match($patterns[]; $modes[]) do?? Should it do a cartesian product of patterns and modes? Should it consume pairs of pattern+mode and stop when one of the two closures stops producing outputs?

Nor is it the case that the closures must be used as generators. They can be filters (e.g., select/1) and other things (e.g., a comparator for a jq-coded sort, though since closures can't take arguments, the values to compare must be provided as an array of two values as the input to the comparator).

@pkoppstein
Contributor

@nicowilliams wrote:

> The modifiers, for example, in the regex functions, are expected to be single-valued.

Yes, but so what? I think you missed the point about |s| * |t| outputs for s|has(t).

test/2, for example, already has the desired characteristics:

$ jq -n '"s" | test("s"; "i")'
true

$ jq -n '"s" | test("s"; "g")'
true

$ jq -n '"s" | test("s"; ("i","g"))'
true
true

@wtlangford and @stedolan might like to chime in, but I think @wtlangford got it exactly right.

@pkoppstein
Contributor

@nicowilliams asked:

> What should match($patterns[]; $modes[]) do?? Should it do a cartesian product of patterns and modes?

Well, that's the jq way -- here, there, and just about everywhere. I can almost hear Alec Guinness intone: "Embrace the streams, Nico."

Are you suggesting that we should go back to 'test([ s, flags])'?

@nicowilliams
Contributor Author

@pkoppstein Ah, yes, we do get cartesian product right now when passing two or more generators to a C-coded function.

I know that limit does treat n as a generator too (of course); it's just that the result will be surprising!

@nicowilliams
Contributor Author

I'll close this.

@stedolan
Contributor

stedolan commented Aug 1, 2014

@pkoppstein your distinction between "special" and "regular" functions is exactly right. There's no syntax to distinguish them, and I think the only sensible approach is to try and limit the number of "special" functions as much as possible.

@nicowilliams yep, match($patterns[]; $modes[]) should be a cartesian product. Everything should be a cartesian product unless there's a really good reason not to.

I regret that there are a few more "special" operations than there really need to be, which makes things confusing. I'm not sure whether "select" should really be special, or the // operator.
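
The cartesian-product rule can be seen with a two-argument builtin. This sketch uses range/2 rather than match/2 so that it does not depend on regex support being compiled in; that substitution is a choice made for this example, and it assumes a jq release in which range/2 is regular in both arguments (1.5 or later, as far as I know).

```shell
# Each (from, upto) pair is visited in turn: (0,3), (0,4), (1,3), (1,4).
jq -nc '[range(0, 1; 3, 4)]'
# prints [0,1,2,0,1,2,3,1,2,1,2,3]

# The same stream, spelled out with explicit bindings.
jq -nc '[(0, 1) as $from | (3, 4) as $upto | range($from; $upto)]'
# prints [0,1,2,0,1,2,3,1,2,1,2,3]
```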

@nicowilliams
Contributor Author

There are only two jq-coded builtins that take more than one argument in master right now, and both are irregular but easily fixed to be regular:

def limit(n; exp): if n < 0 then exp else foreach exp as $item ([n, null]; if .[0] < 1 then break else [.[0] -1, $item] end; .[1]) end;
def nth(n; g): if n < 0 then error("nth doesn't support negative indices") else last(limit(n + 1; g)) end;

This patch makes them work more regularly in @pkoppstein's sense:

diff --git a/builtin.c b/builtin.c
index 238a1fa..4544c17 100644
--- a/builtin.c
+++ b/builtin.c
@@ -997,10 +997,10 @@ static const char* const jq_builtins[] = {
   "     def _while: "
   "         if cond then ., (update | _while) else empty end; "
   "     try _while catch if .==\"break\" then empty else . end;",
-  "def limit(n; exp): if n < 0 then exp else foreach exp as $item ([n, null]; if .[0] < 1 then break else [.[0] -1, $item] end; .[1]) end;",
+  "def limit(n; exp): n as $n | if $n < 0 then exp else foreach exp as $item ([$n, null]; if .[0] < 1 then break else [.[0] -1, $item] end; .[1]) end;",
   "def first(g): foreach g as $item ([false, null]; if .[0]==true then break else [true, $item] end; .[1]);",
   "def last(g): reduce g as $item (null; $item);",
-  "def nth(n; g): if n < 0 then error(\"nth doesn't support negative indices\") else last(limit(n + 1; g)) end;",
+  "def nth(n; g): n as $n | if $n < 0 then error(\"nth doesn't support negative indices\") else last(limit($n + 1; g)) end;",
   "def first: .[0];",
   "def last: .[-1];",
   "def nth(n): .[n];",

I'll push it.

Then we should look at 1-argument builtins that could be doing something similar but aren't.
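
The effect of the "n as $n" change in the patch can be illustrated with a much smaller pair of definitions. Both sq_twice and sq_once are hypothetical helpers written for this sketch, not builtins: the first mentions its closure argument twice, the second binds it once.

```shell
# x is a closure and is evaluated at each mention, so a generator
# argument is run twice and the results multiply out.
jq -nc 'def sq_twice(x): x * x; [sq_twice(1, 2)]'             # prints [1,2,2,4]

# Binding the argument once with "as" gives one result per argument value.
jq -nc 'def sq_once(x): x as $x | $x * $x; [sq_once(1, 2)]'   # prints [1,4]
```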

@nicowilliams
Contributor Author

range/1 is also more special than it needs to be. It should be rewritten as def range(x): x as $x | range(0;$x);.

We should inspect:

  • indices
  • index
  • rindex
  • any/1
  • all/1
  • join/1
  • flatten

EDIT: any/1 and all/1 are special forms indeed, and they must be.

@nicowilliams
Contributor Author

See #521.

@nicowilliams
Contributor Author

BTW, I'd like to think of "regular" and "special" by relation to Lisp. "regular" will be like any defun in Lisp, and "special" will be like a macro or special form in Lisp. Thus limit/2 is special as to the generator argument, but not as to the n argument; ditto nth/2. while is special as well, since it's a looping (recursion, optimized tail recursion) control structure.

Before I stepped all over jq the number of special forms was smaller, and some had special syntax (e.g., reduce), but even then it had recurse/1, which is special as well. Special forms are not bad per se, but their special-ness has to be clear. Which makes me think that some special syntax for declaring which arguments are consumed fully and in sequence (cartesian product like) would be useful!

Something like:

# a and b are not special; c is
def f($a; $b; c): [., $a, $b] | c;

which would be equivalent to:

def f(a; b; c):
    a as $a |
    b as $b |
    [.,$a,$b] | c;

It'd be syntactic sugar to help avoid #521 and it'd help document what is special about any one def.
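
For readers who want to try the two forms side by side, here is a sketch. It assumes a jq release that accepts $name parameters in def (1.5 or later, as far as I know), and the input 1 and the arguments (2, 3), 4 and .[] are arbitrary values picked for illustration.

```shell
# Sugared form: $a and $b are bound once each; c stays a closure.
jq -nc 'def f($a; $b; c): [., $a, $b] | c; [1 | f(2, 3; 4; .[])]'
# prints [1,2,4,1,3,4]

# Hand-expanded form, which should behave identically.
jq -nc 'def f(a; b; c): a as $a | b as $b | [., $a, $b] | c; [1 | f(2, 3; 4; .[])]'
# prints [1,2,4,1,3,4]
```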

@stedolan
Contributor

stedolan commented Aug 1, 2014

@nicowilliams I like that proposal much, much more than the original one.

@nicowilliams
Contributor Author

@stedolan But it's really just syntactic sugar, so don't expect it anytime soon :) (OTOH, it's probably very easy to implement!)

@dtolnay dtolnay added this to the 1.5 release milestone Sep 11, 2015
@dtolnay dtolnay removed this from the 2.0 release milestone Sep 11, 2015
pkoppstein added a commit to pkoppstein/jq that referenced this issue Jun 29, 2023
… uniq(stream)

The primary purpose of this commit (which supersedes PR
jqlang#2624) is to rectify most problems
with `gsub` (and also `sub` with the "g" option), in particular jqlang#1425
('\b'), jqlang#2354 (lookahead), and jqlang#2532 (regex == "^(?!cd ).*$|^cd ";"")).

This commit also partly resolves jqlang#2148 and jqlang#1206 in that `gsub` no
longer loops infinitely; however, because the new `gsub` depends
critically on match(_;"g"), the behavior when regex == "" is sometimes
non-standard. [*1]

Since the new sub/3 relies on uniq/1, that has been added as well [*2].

The documentation has been updated to reflect the fact that `sub` and
`gsub` are intended to be regular in the second argument. [*3]

Also, _nwise/1 has been tweaked to take advantage of TCO.

Footnotes:

[*1] Using the new gsub, '"a" | gsub( ""; "a")' emits "aa" rather than
"aaa" as would be standard.  This is nevertheless better than the
infinite loop behavior of jq 1.6 in this case.

With one exception (as explained in [*2]), the new gsub is implemented
as though match/2 behavior is correct.  That is, bugs in `gsub`
behavior will most likely have their origin in `match/2`.

[*2] `uniq/1` adopts the Unix/Linux name and semantics; it is needed for the following test case:

gsub("(?=u)"; "u")
"qux"
"quux"

Without this functionality:

Test jqlang#23: 'gsub("(?=u)"; "u")' at line number 100
*** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u")

The root of the problem here is `match`: if `match` is fixed, then gsub would not need `untie`.

The addition of `uniq` as a top-level function should be a non-issue
relative to general concern about builtins.jq bloat: the line count of
the new builtin.jq is significantly reduced overall, and the number of
defs is actually reduced by 1 (from 111 (ignoring a redundant def) to 110).

[*3] See e.g. jqlang#513 (comment)