Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex schemas #317

Merged
merged 84 commits into from
Jan 15, 2021
Merged

Regex schemas #317

merged 84 commits into from
Jan 15, 2021

Conversation

nilern
Copy link
Contributor

@nilern nilern commented Dec 11, 2020

Changed implementation technique, this PR replaces #312.

WIP:

  • transformers
  • cljs support
  • sufficient tests
  • benchmarking
  • (more) optimization
  • cleanup
  • docstrings
  • documentation (README etc.)

@nilern
Copy link
Contributor Author

nilern commented Jan 11, 2021

Is there a way to inline/embed a recursive Schema into a sequence schema? I believe this is needed to describe f.e. hiccup syntax. Or is there another way to do this: ...

Indeed there is. :alt* always specifies a sequence of nodes, so it must be done like this instead:

(explain
  [:schema {:registry {"hiccup" [:or
                                 [:cat*
                                  [:name keyword?]
                                  [:props [:? [:map-of keyword? any?]]]
                                  [:children [:* [:ref "hiccup"]]]]
                                 [:or nil? boolean? number? string?]]}}
   "hiccup"]
  [:div {:class [:foo :bar]}
   "Hello, world of data"])

As a slightly confusing aside, while recursive tree grammars like that are fine extending the regex support to context-free grammars would probably be a bad idea. We would have to choose between supporting every CFG, even ambiguous ones or only supporting deterministic grammars (e.g. LL(k) or LR(k)), which fail to include quite a few unambiguous grammars...

@nilern
Copy link
Contributor Author

nilern commented Jan 11, 2021

what is the difference in using :nested and :schema?

Nothing useful, removed :nested.

@ikitommi
Copy link
Member

Hmm. Shoudn't the refs be looked over? (e.g. m/deref):

spec:

(require '[clojure.spec.alpha :as s])

(s/def ::ints (s/+ int?))
(s/def ::bools (s/+ boolean?))

(s/valid?
  (s/* (s/cat :i ::ints :b ::bools))
  [1 true 2 2 false])
; => true

malli:

(m/validate
  [:* [:cat [:+ int?] [:+ boolean?]]]
  [1 true 2 2 false])
; => true

(m/validate
  [:schema {:registry {"ints" [:+ int?]
                       "bools" [:+ boolean?]}}
   [:* [:cat "ints" "bools"]]]
  [1 true 2 2 false])
; => false

@nilern
Copy link
Contributor Author

nilern commented Jan 11, 2021

Indeed they should. But that opens the door to general recursion and arbitrary CFGs. Hmmm.

(s/def ::as (s/alt :more (s/cat :a int? :as ::as) :done (s/cat)))
(s/conform ::as [1 2 3])
; => [:more {:a 1, :as [:more {:a 2, :as [:more {:a 3, :as [:done {}]}]}]}]

Seems that Spec emergently supports some sort of breadth-first LL(*) (GRDP?) parsing. Which naturally blows up on left recursion:

(s/def ::as (s/alt :more (s/cat :as ::as :a int?) :done (s/cat)))
(s/conform ::as [1 2 3])
; => Execution error (StackOverflowError) at (REPL:1).

It wouldn't be too difficult to extend the current design to GLL or Packrat parsing or we could make recursion in regex schemas an error. But what do we actually want here, with ambiguous schemas in particular? Rule them out? Make conform return the first valid parse (like PEG or ANTLR)? Return all the parses in a (possibly singleton or infinite) seq (like GLL, GLR, Earley etc.)?

@ikitommi
Copy link
Member

I don't have a clear opinion on this, don't fully understand the consequences. There is some sort of recursion checker in malli.generator, not sure if that is any way applicable here.

Is it possible to make the option configurable, without performance penalties? e.g. option to fail fast on ambiguous schemas? default to first valid parse? what is a simplest possible ambiguous schemas?

@nilern
Copy link
Contributor Author

nilern commented Jan 12, 2021

I don't have a clear opinion on this, don't fully understand the consequences. There is some sort of recursion checker in malli.generator, not sure if that is any way applicable here.

I gathered that that just forces generation to terminate. A similar solution does not seem possible here, as it would cause some valid inputs to be flagged as invalid?

Is it possible to make the option configurable, without performance penalties? e.g. option to fail fast on ambiguous schemas? default to first valid parse?

We could add such options to encode, decode and #330. Any speed penalty on regular grammars could be avoided, but at the cost of more code (to test and ship to browsers). I suspect most people do not need CFG:s or even complicated regexes and definitely not all valid parses of an ambiguous grammar. As I said previously, ambiguity in general is very difficult to detect and the restriction to deterministic grammars is famously annoying (shift/reduce conflicts). I have some research interest in these things but this is a very practical project.

what is a simplest possible ambiguous schemas?

[:cat [:* int?] [:* int?]]. But currently we follow regex semantics where the greediness of * disambiguates:

(m/decode [:cat [:* :keyword] [:* :symbol]] ["foo" "bar"] (mt/string-transformer))
; => [:foo :bar]

And [:* [:cat]] is a trivial infinitely ambiguous grammar for the empty seq. At the moment the [:cat] matches 0 times but it could match any number of times.


I would say either disallow non-regular grammars or use GLL and take the first valid parse (in basic APIs). Ambiguous grammars are typical in natural language processing but probably bugs in API definitions and despite Instaparse most people know regexes far better than context-free parsing. So I think general parsing would be overkill and possibly even harmful. On the other hand I don't know what is expected when Malli is used to build linters and stuff.

@ikitommi
Copy link
Member

So I think general parsing would be overkill and possibly even harmful.

Agree.

I would say either disallow non-regular grammars or use GLL and take the first valid parse (in basic APIs).

what is non-regular grammar? code size is relevant in cljs, so would the dissallowing be (much) more code? I think first valid parse is ok anyway with malli. Maybe @borkdude could comment on whatever would be needed/good for tooling.

would inlined recursion be disallowed? The non-recursive references should support inlining (e.g. :schema and :malli.core/schema):

(m/validate
  [:schema {:registry {"ints" [:+ int?]
                       "bools" [:+ boolean?]}}
   [:* [:cat "ints" "bools"]]]
  [1 true 2 2 false])
; => true (now: false)

if that works, I'm happy with whatever you conclude with.

@ikitommi
Copy link
Member

.... but as :schema is now a marker not to inline the result, I think just the :malli.core/schema should inline the contents, it's used internally handle the non-recursive registry references.

@borkdude
Copy link
Contributor

borkdude commented Jan 12, 2021

Maybe @borkdude could comment on whatever would be needed/good for tooling.

Personally I would use this for the function arg specs, but I think that is going a different direction now? Spec directly couples sequences regexes with argument specs, but I'm not sure if that is a good idea.

Also the spec sequence regexes can be used to describe Clojure code itself. I use this in grasp for example:

https://github.com/borkdude/grasp

How does JSON schema implement this (if this is something they support at all)? Can you describe function inputs using that? I'm blissfully ignorant about JSON Schema so far.

@nilern
Copy link
Contributor Author

nilern commented Jan 13, 2021

what is non-regular grammar?

Regular grammars are isomorphic to regular expressions and finite state machines. Basically a non-regular grammar requires nontail recursion or some other sort stack for recognition (validate) and parsing (conform, encode/decode).

code size is relevant in cljs, so would the dissallowing be (much) more code? I think first valid parse is ok anyway with malli.

Cycle recognition requires only a tiny amount of bookkeeping. Surely the GLL support would take more code, especially if it would only be activated when necessary.

would inlined recursion be disallowed?

The cycle detection would disallow "inlined" recursion and only that:

The non-recursive references should support inlining (e.g. :schema and :malli.core/schema): ...

(m/validate
  [:schema {:registry {"ints" [:+ int?]
                       "bools" [:+ boolean?]}}
   [:* [:cat "ints" "bools"]]]
  [1 true 2 2 false])
; => true (now: false)

Yes nonrecursive regex "inlining" like that would work (just as in e.g. Lex).

Non-seqex recursion like this (and my version of the Hiccup schema) already works:

(m/validate
  [:schema {:registry {"ints" [:or int? [:* [:ref "ints"]]]}}
   "ints"]
  [[1] [2 2]])
; => true

@nilern
Copy link
Contributor Author

nilern commented Jan 13, 2021

Personally I would use this for the function arg specs, but I think that is going a different direction now? Spec directly couples sequences regexes with argument specs, but I'm not sure if that is a good idea.

Non-vararg functions don't really need this regex stuff and even keyword arguments don't need recursion: [:cat int? [:* [:alt [:cat [:= :foo] string?] [:cat [:= :bar] double?]]]] etc. Actually there could be something more concise (and faster?) for that 🤔.

Also the spec sequence regexes can be used to describe Clojure code itself. I use this in grasp for example:

https://github.com/borkdude/grasp

I think even the arguments of even the most complicated macros are either regular or the recursive structure is implemented with nested collections. The Common Lisp loop madness seems like a thing of the past.

How does JSON schema implement this (if this is something they support at all)? Can you describe function inputs using that? I'm blissfully ignorant about JSON Schema so far.

I don't think they have something like this at all. They do have recursion but it does not seem possible to use that to describe complicated flat arrays...

@nilern
Copy link
Contributor Author

nilern commented Jan 13, 2021

I think I'll fix the refs but implement the restriction to actually regular expressions. I don't think it makes sense to delay this further :shipit: and make it bigger and slower only to support fringe usages. If anything, this (like everything else) should get smaller and faster than it is.

@ikitommi
Copy link
Member

Sounds good.

@nilern nilern requested a review from ikitommi January 14, 2021 14:58
(-deref [_] (-ref))))))))
(def -ref-schema
(let [into-ref-schema
(memoize
Copy link
Member

@ikitommi ikitommi Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this introduce a memory leak? All schemas are persisted forever in the memoization cache? Also, as options are part of the memoization args and can have mutable state (just as recursion counters with gen), the same schema can appear multiple times in the cache,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just a quick hack to make = (and thus set membership) work for refs and I did suspect it wouldn't fly. I suppose some more complicated or nonextensible solution is in order, any ideas? (Too bad the memoization cache is so naive, OTOH weak maps have other issues and so on...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how to do that elegantly, but few guesses:

  • schema form is immutable, could that be used as equality (like there is malli.util/equals)
  • there is already the local -ref, which is memoized for caching purposes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and yes, ´clojure.core/memoize` is awful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, memoize bit me too once in clj-kondo... huge memory leak ;)

@@ -122,6 +119,8 @@

(defn -update [m k f] (assoc m k (f (get m k))))

(defn -memoize [f] (let [value (atom nil)] (fn [] (or @value) (reset! value (f)))))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted from -ref-schema, but why not just use delay?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no idea. maybe it should.

@nilern
Copy link
Contributor Author

nilern commented Jan 15, 2021

So now I took out the actual cycle detection and instead banned :ref:s altogether as seqex children. Nonrecursive :refs are unnecessary (maybe it should have been called :rec...) and recursively nested sequences have to be wrapped in :schema or something else anyway:

(m/validate
  [:schema {:registry {::ints [:* [:or int? [:ref ::ints]]]}}
   ::ints]
  [[1 2 3]])
; => true

@nilern nilern requested a review from ikitommi January 15, 2021 14:24
@ikitommi
Copy link
Member

This is good.

@ikitommi ikitommi merged commit 6693d39 into metosin:master Jan 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants