-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regex schemas #317
Regex schemas #317
Conversation
Indeed there is. (explain
[:schema {:registry {"hiccup" [:or
[:cat*
[:name keyword?]
[:props [:? [:map-of keyword? any?]]]
[:children [:* [:ref "hiccup"]]]]
[:or nil? boolean? number? string?]]}}
"hiccup"]
[:div {:class [:foo :bar]}
"Hello, world of data"]) As a slightly confusing aside, while recursive tree grammars like that are fine extending the regex support to context-free grammars would probably be a bad idea. We would have to choose between supporting every CFG, even ambiguous ones or only supporting deterministic grammars (e.g. LL(k) or LR(k)), which fail to include quite a few unambiguous grammars... |
Nothing useful, removed |
Hmm. Shoudn't the refs be looked over? (e.g. spec: (require '[clojure.spec.alpha :as s])
(s/def ::ints (s/+ int?))
(s/def ::bools (s/+ boolean?))
(s/valid?
(s/* (s/cat :i ::ints :b ::bools))
[1 true 2 2 false])
; => true malli: (m/validate
[:* [:cat [:+ int?] [:+ boolean?]]]
[1 true 2 2 false])
; => true
(m/validate
[:schema {:registry {"ints" [:+ int?]
"bools" [:+ boolean?]}}
[:* [:cat "ints" "bools"]]]
[1 true 2 2 false])
; => false |
Indeed they should. But that opens the door to general recursion and arbitrary CFGs. Hmmm. (s/def ::as (s/alt :more (s/cat :a int? :as ::as) :done (s/cat)))
(s/conform ::as [1 2 3])
; => [:more {:a 1, :as [:more {:a 2, :as [:more {:a 3, :as [:done {}]}]}]}] Seems that Spec emergently supports some sort of breadth-first LL(*) (GRDP?) parsing. Which naturally blows up on left recursion: (s/def ::as (s/alt :more (s/cat :as ::as :a int?) :done (s/cat)))
(s/conform ::as [1 2 3])
; => Execution error (StackOverflowError) at (REPL:1). It wouldn't be too difficult to extend the current design to GLL or Packrat parsing or we could make recursion in regex schemas an error. But what do we actually want here, with ambiguous schemas in particular? Rule them out? Make conform return the first valid parse (like PEG or ANTLR)? Return all the parses in a (possibly singleton or infinite) seq (like GLL, GLR, Earley etc.)? |
I don't have a clear opinion on this, don't fully understand the consequences. There is some sort of recursion checker in malli.generator, not sure if that is any way applicable here. Is it possible to make the option configurable, without performance penalties? e.g. option to fail fast on ambiguous schemas? default to first valid parse? what is a simplest possible ambiguous schemas? |
I gathered that that just forces generation to terminate. A similar solution does not seem possible here, as it would cause some valid inputs to be flagged as invalid?
We could add such options to encode, decode and #330. Any speed penalty on regular grammars could be avoided, but at the cost of more code (to test and ship to browsers). I suspect most people do not need CFG:s or even complicated regexes and definitely not all valid parses of an ambiguous grammar. As I said previously, ambiguity in general is very difficult to detect and the restriction to deterministic grammars is famously annoying (shift/reduce conflicts). I have some research interest in these things but this is a very practical project.
(m/decode [:cat [:* :keyword] [:* :symbol]] ["foo" "bar"] (mt/string-transformer))
; => [:foo :bar] And I would say either disallow non-regular grammars or use GLL and take the first valid parse (in basic APIs). Ambiguous grammars are typical in natural language processing but probably bugs in API definitions and despite Instaparse most people know regexes far better than context-free parsing. So I think general parsing would be overkill and possibly even harmful. On the other hand I don't know what is expected when Malli is used to build linters and stuff. |
Agree.
what is non-regular grammar? code size is relevant in cljs, so would the dissallowing be (much) more code? I think first valid parse is ok anyway with malli. Maybe @borkdude could comment on whatever would be needed/good for tooling. would inlined recursion be disallowed? The non-recursive references should support inlining (e.g. (m/validate
[:schema {:registry {"ints" [:+ int?]
"bools" [:+ boolean?]}}
[:* [:cat "ints" "bools"]]]
[1 true 2 2 false])
; => true (now: false) if that works, I'm happy with whatever you conclude with. |
.... but as |
Personally I would use this for the function arg specs, but I think that is going a different direction now? Spec directly couples sequences regexes with argument specs, but I'm not sure if that is a good idea. Also the spec sequence regexes can be used to describe Clojure code itself. I use this in grasp for example: https://github.com/borkdude/grasp How does JSON schema implement this (if this is something they support at all)? Can you describe function inputs using that? I'm blissfully ignorant about JSON Schema so far. |
Regular grammars are isomorphic to regular expressions and finite state machines. Basically a non-regular grammar requires nontail recursion or some other sort stack for recognition (
Cycle recognition requires only a tiny amount of bookkeeping. Surely the GLL support would take more code, especially if it would only be activated when necessary.
The cycle detection would disallow "inlined" recursion and only that:
(m/validate
[:schema {:registry {"ints" [:+ int?]
"bools" [:+ boolean?]}}
[:* [:cat "ints" "bools"]]]
[1 true 2 2 false])
; => true (now: false) Yes nonrecursive regex "inlining" like that would work (just as in e.g. Lex). Non-seqex recursion like this (and my version of the Hiccup schema) already works: (m/validate
[:schema {:registry {"ints" [:or int? [:* [:ref "ints"]]]}}
"ints"]
[[1] [2 2]])
; => true |
Non-vararg functions don't really need this regex stuff and even keyword arguments don't need recursion:
I think even the arguments of even the most complicated macros are either regular or the recursive structure is implemented with nested collections. The Common Lisp
I don't think they have something like this at all. They do have recursion but it does not seem possible to use that to describe complicated flat arrays... |
I think I'll fix the refs but implement the restriction to actually regular expressions. I don't think it makes sense to delay this further and make it bigger and slower only to support fringe usages. If anything, this (like everything else) should get smaller and faster than it is. |
Sounds good. |
src/malli/core.cljc
Outdated
(-deref [_] (-ref)))))))) | ||
(def -ref-schema | ||
(let [into-ref-schema | ||
(memoize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this introduce a memory leak? All schemas are persisted forever in the memoization cache? Also, as options are part of the memoization args and can have mutable state (just as recursion counters with gen), the same schema can appear multiple times in the cache,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was just a quick hack to make =
(and thus set membership) work for refs and I did suspect it wouldn't fly. I suppose some more complicated or nonextensible solution is in order, any ideas? (Too bad the memoization cache is so naive, OTOH weak maps have other issues and so on...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how to do that elegantly, but few guesses:
- schema form is immutable, could that be used as equality (like there is
malli.util/equals
) - there is already the local
-ref
, which is memoized for caching purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and yes, ´clojure.core/memoize` is awful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, memoize bit me too once in clj-kondo... huge memory leak ;)
@@ -122,6 +119,8 @@ | |||
|
|||
(defn -update [m k f] (assoc m k (f (get m k)))) | |||
|
|||
(defn -memoize [f] (let [value (atom nil)] (fn [] (or @value) (reset! value (f))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted from -ref-schema
, but why not just use delay
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no idea. maybe it should.
So now I took out the actual cycle detection and instead banned (m/validate
[:schema {:registry {::ints [:* [:or int? [:ref ::ints]]]}}
::ints]
[[1 2 3]])
; => true |
This is good. |
Changed implementation technique, this PR replaces #312.
WIP: