Regex schemas #317

nilern · 2020-12-11T11:31:10Z

Changed the implementation technique; this PR replaces #312.

WIP:

nilern · 2021-01-11T10:05:48Z

> Is there a way to inline/embed a recursive Schema into a sequence schema? I believe this is needed to describe e.g. hiccup syntax. Or is there another way to do this: ...

Indeed there is. :alt* always specifies a sequence of nodes, so it must be done like this instead:

(explain
  [:schema {:registry {"hiccup" [:or
                                 [:cat*
                                  [:name keyword?]
                                  [:props [:? [:map-of keyword? any?]]]
                                  [:children [:* [:ref "hiccup"]]]]
                                 [:or nil? boolean? number? string?]]}}
   "hiccup"]
  [:div {:class [:foo :bar]}
   "Hello, world of data"])

As a slightly confusing aside: while recursive tree grammars like that are fine, extending the regex support to context-free grammars would probably be a bad idea. We would have to choose between supporting every CFG, even ambiguous ones, or only supporting deterministic grammars (e.g. LL(k) or LR(k)), which fail to include quite a few unambiguous grammars...

nilern · 2021-01-11T10:16:26Z

> what is the difference in using :nested and :schema?

Nothing useful, removed :nested.

ikitommi · 2021-01-11T13:51:24Z

Hmm. Shouldn't the refs be looked through (e.g. with m/deref)?

spec:

(require '[clojure.spec.alpha :as s])

(s/def ::ints (s/+ int?))
(s/def ::bools (s/+ boolean?))

(s/valid?
  (s/* (s/cat :i ::ints :b ::bools))
  [1 true 2 2 false])
; => true

malli:

(m/validate
  [:* [:cat [:+ int?] [:+ boolean?]]]
  [1 true 2 2 false])
; => true

(m/validate
  [:schema {:registry {"ints" [:+ int?]
                       "bools" [:+ boolean?]}}
   [:* [:cat "ints" "bools"]]]
  [1 true 2 2 false])
; => false

nilern · 2021-01-11T14:59:15Z

Indeed they should. But that opens the door to general recursion and arbitrary CFGs. Hmmm.

(s/def ::as (s/alt :more (s/cat :a int? :as ::as) :done (s/cat)))
(s/conform ::as [1 2 3])
; => [:more {:a 1, :as [:more {:a 2, :as [:more {:a 3, :as [:done {}]}]}]}]

Seems that Spec emergently supports some sort of breadth-first LL(*) (GRDP?) parsing. Which naturally blows up on left recursion:

(s/def ::as (s/alt :more (s/cat :as ::as :a int?) :done (s/cat)))
(s/conform ::as [1 2 3])
; => Execution error (StackOverflowError) at (REPL:1).

It wouldn't be too difficult to extend the current design to GLL or Packrat parsing, or we could make recursion in regex schemas an error. But what do we actually want here, with ambiguous schemas in particular? Rule them out? Make conform return the first valid parse (like PEG or ANTLR)? Return all the parses in a (possibly singleton or infinite) seq (like GLL, GLR, Earley etc.)?

ikitommi · 2021-01-12T06:07:41Z

I don't have a clear opinion on this, as I don't fully understand the consequences. There is some sort of recursion checker in malli.generator, not sure if that is in any way applicable here.

Is it possible to make the option configurable, without performance penalties? E.g. an option to fail fast on ambiguous schemas? Default to the first valid parse? What is the simplest possible ambiguous schema?

nilern · 2021-01-12T10:24:32Z

> I don't have a clear opinion on this, don't fully understand the consequences. There is some sort of recursion checker in malli.generator, not sure if that is any way applicable here.

I gathered that that just forces generation to terminate. A similar solution does not seem possible here, as it would cause some valid inputs to be flagged as invalid.

> Is it possible to make the option configurable, without performance penalties? e.g. option to fail fast on ambiguous schemas? default to first valid parse?

We could add such options to encode, decode and #330. Any speed penalty on regular grammars could be avoided, but at the cost of more code (to test and ship to browsers). I suspect most people do not need CFGs or even complicated regexes, and definitely not all valid parses of an ambiguous grammar. As I said previously, ambiguity in general is very difficult to detect, and the restriction to deterministic grammars is famously annoying (shift/reduce conflicts). I have some research interest in these things, but this is a very practical project.

> what is a simplest possible ambiguous schemas?

[:cat [:* int?] [:* int?]]. But currently we follow regex semantics where the greediness of * disambiguates:

(m/decode [:cat [:* :keyword] [:* :symbol]] ["foo" "bar"] (mt/string-transformer))
; => [:foo :bar]

And [:* [:cat]] is a trivial infinitely ambiguous grammar for the empty seq. At the moment the [:cat] matches 0 times but it could match any number of times.
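To make the ambiguity of [:cat [:* int?] [:* int?]] concrete, here is a minimal sketch in plain Clojure (no Malli involved; `splits` and `parses` are hypothetical helper names introduced only for this illustration). It enumerates every way the input can be divided between the two `:*` parts and shows that the greedy reading is simply the one where the first part takes the longest prefix.

```clojure
(defn splits
  "All ways to cut the vector xs into a prefix and a suffix, longest prefix first."
  [xs]
  (for [i (range (count xs) -1 -1)]
    [(subvec xs 0 i) (subvec xs i)]))

(defn parses
  "Every split where both halves contain only ints, i.e. every valid
  parse of xs under the shape [:cat [:* int?] [:* int?]]."
  [xs]
  (filter (fn [[a b]] (and (every? int? a) (every? int? b)))
          (splits xs)))

(count (parses [1 2 3]))
;; => 4

;; The greedy parse: the first :* consumes everything it can.
(first (parses [1 2 3]))
;; => [[1 2 3] []]
```

An input of n ints has n + 1 valid parses here, which is why some disambiguation rule (greediness, first parse, or an error) has to be chosen.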

I would say either disallow non-regular grammars or use GLL and take the first valid parse (in basic APIs). Ambiguous grammars are typical in natural language processing but are probably bugs in API definitions, and despite Instaparse, most people know regexes far better than context-free parsing. So I think general parsing would be overkill and possibly even harmful. On the other hand, I don't know what is expected when Malli is used to build linters and stuff.

ikitommi · 2021-01-12T17:48:29Z

> So I think general parsing would be overkill and possibly even harmful.

Agree.

> I would say either disallow non-regular grammars or use GLL and take the first valid parse (in basic APIs).

What is a non-regular grammar? Code size is relevant in cljs, so would the disallowing be (much) more code? I think first valid parse is ok anyway with malli. Maybe @borkdude could comment on whatever would be needed/good for tooling.

Would inlined recursion be disallowed? The non-recursive references should support inlining (e.g. :schema and :malli.core/schema):

(m/validate
  [:schema {:registry {"ints" [:+ int?]
                       "bools" [:+ boolean?]}}
   [:* [:cat "ints" "bools"]]]
  [1 true 2 2 false])
; => true (now: false)

if that works, I'm happy with whatever you conclude with.

ikitommi · 2021-01-12T17:50:41Z

... but as :schema is now a marker not to inline the result, I think just the :malli.core/schema should inline the contents. It's used internally to handle the non-recursive registry references.

borkdude · 2021-01-12T20:50:41Z

> Maybe @borkdude could comment on whatever would be needed/good for tooling.

Personally I would use this for the function arg specs, but I think that is going a different direction now? Spec directly couples sequences regexes with argument specs, but I'm not sure if that is a good idea.

Also the spec sequence regexes can be used to describe Clojure code itself. I use this in grasp for example:

https://github.com/borkdude/grasp

How does JSON schema implement this (if this is something they support at all)? Can you describe function inputs using that? I'm blissfully ignorant about JSON Schema so far.

nilern · 2021-01-13T10:39:45Z

> what is non-regular grammar?

Regular grammars are isomorphic to regular expressions and finite state machines. Basically, a non-regular grammar requires non-tail recursion or some other sort of stack for recognition (validate) and parsing (conform, encode/decode).
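As an illustration of the finite-state-machine side of that equivalence, here is a minimal sketch in plain Clojure of a hand-written recognizer for the shape [:* [:cat [:+ int?] [:+ boolean?]]] from earlier in the thread. This is not how Malli compiles seqexes; the state names and the `step` and `matches?` helpers are assumptions made for this example. The point is that recognition needs only a current state and no stack.

```clojure
(defn step
  "Transition function: returns the next state, or nil when x is not allowed."
  [state x]
  (case state
    :start (when (int? x) :ints)
    :ints  (cond (int? x) :ints
                 (boolean? x) :bools)
    :bools (cond (boolean? x) :bools
                 (int? x) :ints)))

(defn matches?
  "Runs the machine over xs; accepts in :start (empty input) or :bools."
  [xs]
  (contains? #{:start :bools}
             (reduce (fn [state x]
                       (or (step state x) (reduced nil)))
                     :start
                     xs)))

(matches? [1 true 2 2 false])
;; => true
(matches? [1 true 2])
;; => false
```

A grammar that needed to match arbitrarily deep nesting within one flat sequence could not be written this way, because the machine would have to remember an unbounded amount of context.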

> code size is relevant in cljs, so would the dissallowing be (much) more code? I think first valid parse is ok anyway with malli.

Cycle recognition requires only a tiny amount of bookkeeping. Surely the GLL support would take more code, especially if it would only be activated when necessary.

> would inlined recursion be disallowed?

The cycle detection would disallow "inlined" recursion and only that:

> The non-recursive references should support inlining (e.g. :schema and :malli.core/schema): ...

(m/validate
  [:schema {:registry {"ints" [:+ int?]
                       "bools" [:+ boolean?]}}
   [:* [:cat "ints" "bools"]]]
  [1 true 2 2 false])
; => true (now: false)

Yes, nonrecursive regex "inlining" like that would work (just as in e.g. Lex).

Non-seqex recursion like this (and my version of the Hiccup schema) already works:

(m/validate
  [:schema {:registry {"ints" [:or int? [:* [:ref "ints"]]]}}
   "ints"]
  [[1] [2 2]])
; => true

nilern · 2021-01-13T11:14:03Z

> Personally I would use this for the function arg specs, but I think that is going a different direction now? Spec directly couples sequences regexes with argument specs, but I'm not sure if that is a good idea.

Non-vararg functions don't really need this regex stuff and even keyword arguments don't need recursion: [:cat int? [:* [:alt [:cat [:= :foo] string?] [:cat [:= :bar] double?]]]] etc. Actually there could be something more concise (and faster?) for that 🤔.
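For comparison, the same keyword-argument shape can be sketched with clojure.spec regex ops, which ship with Clojure 1.9+ and are already used elsewhere in this thread. This is only an analogue of the Malli schema above, not Malli code; `::args` and the key names inside `s/cat` are chosen for this example.

```clojure
(require '[clojure.spec.alpha :as s])

;; One leading int, then any number of :foo <string> or :bar <double> pairs.
(s/def ::args
  (s/cat :n int?
         :kwargs (s/* (s/alt :foo (s/cat :k #{:foo} :v string?)
                             :bar (s/cat :k #{:bar} :v double?)))))

(s/valid? ::args [1 :foo "a" :bar 2.0])
;; => true
(s/valid? ::args [1])
;; => true
(s/valid? ::args [1 :foo 2.0])
;; => false
```

No recursion is involved: the repetition is a plain `*` over a finite choice, so the grammar stays regular.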

> Also the spec sequence regexes can be used to describe Clojure code itself. I use this in grasp for example:
>
> https://github.com/borkdude/grasp

I think the arguments of even the most complicated macros are either regular or the recursive structure is implemented with nested collections. The Common Lisp loop madness seems like a thing of the past.

> How does JSON schema implement this (if this is something they support at all)? Can you describe function inputs using that? I'm blissfully ignorant about JSON Schema so far.

I don't think they have something like this at all. They do have recursion but it does not seem possible to use that to describe complicated flat arrays...

nilern · 2021-01-13T11:35:47Z

I think I'll fix the refs but implement the restriction to actually regular expressions. I don't think it makes sense to delay this further and make it bigger and slower only to support fringe usages. If anything, this (like everything else) should get smaller and faster than it is.

ikitommi · 2021-01-13T11:57:26Z

Sounds good.

ikitommi · 2021-01-14T16:41:17Z

src/malli/core.cljc

-           (-deref [_] (-ref))))))))
+(def -ref-schema
+  (let [into-ref-schema
+        (memoize


Doesn't this introduce a memory leak? All schemas are persisted forever in the memoization cache. Also, as options are part of the memoization args and can have mutable state (such as the recursion counters used by the generator), the same schema can appear multiple times in the cache.

This was just a quick hack to make = (and thus set membership) work for refs and I did suspect it wouldn't fly. I suppose some more complicated or nonextensible solution is in order, any ideas? (Too bad the memoization cache is so naive, OTOH weak maps have other issues and so on...)

Not sure how to do that elegantly, but a few guesses:

The schema form is immutable; could that be used for equality (like malli.util/equals)?

There is already the local -ref, which is memoized for caching purposes.

And yes, `clojure.core/memoize` is awful.

Yep, memoize bit me too once in clj-kondo... huge memory leak ;)
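The concern about mutable options can be reproduced with `clojure.core/memoize` alone. In this minimal sketch, `count-compilations` and the `compile-schema` stand-in are hypothetical names (no Malli code is involved). Atoms compare by identity, so two options maps that look the same but hold different atoms are different cache keys, and every such entry stays in the cache for as long as the memoized function is reachable.

```clojure
(defn count-compilations
  "Calls a memoized stand-in for schema construction three times and
  returns how many times the underlying function actually ran."
  []
  (let [calls          (atom 0)
        compile-schema (memoize (fn [form options]
                                  (swap! calls inc)
                                  [:compiled form]))
        shared-opts    {:counter (atom 0)}]
    ;; Same form, but a fresh atom in the options: a cache miss.
    (compile-schema [:+ 'int?] {:counter (atom 0)})
    ;; First use of shared-opts: another miss.
    (compile-schema [:+ 'int?] shared-opts)
    ;; Identical options object: a hit.
    (compile-schema [:+ 'int?] shared-opts)
    @calls))

(count-compilations)
;; => 2
```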

nilern · 2021-01-15T10:02:35Z

src/malli/core.cljc

@@ -122,6 +119,8 @@

 (defn -update [m k f] (assoc m k (f (get m k))))

+(defn -memoize [f] (let [value (atom nil)] (fn [] (or @value) (reset! value (f)))))


Extracted from -ref-schema, but why not just use delay?

no idea. maybe it should.
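A small sketch of the difference, in plain Clojure. The function names here are made up for the comparison. Read literally, the `-memoize` line quoted in the diff closes `(or @value)` before the `reset!`, so the cached value is computed and then discarded, and `f` runs on every call; the second variant moves the paren, and the third uses `delay` as suggested.

```clojure
(defn memoize-as-quoted [f]
  ;; Literal transcription of the quoted line: (or @value) is evaluated
  ;; and thrown away, then reset! always runs.
  (let [value (atom nil)]
    (fn [] (or @value) (reset! value (f)))))

(defn memoize-with-or [f]
  ;; reset! only runs while the cached value is nil or false.
  (let [value (atom nil)]
    (fn [] (or @value (reset! value (f))))))

(defn memoize-with-delay [f]
  ;; delay runs f at most once, even if f returns nil or false.
  (let [value (delay (f))]
    (fn [] @value)))

(defn call-count
  "How many times the wrapped function runs across three calls."
  [memoizer]
  (let [calls (atom 0)
        g     (memoizer (fn [] (swap! calls inc) :result))]
    (g) (g) (g)
    @calls))

(call-count memoize-as-quoted)   ;; => 3
(call-count memoize-with-or)     ;; => 1
(call-count memoize-with-delay)  ;; => 1
```

Besides being shorter, `delay` also caches a nil or false result, which the `or`-based variant would recompute each time.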

nilern · 2021-01-15T14:24:28Z

So now I took out the actual cycle detection and instead banned :refs altogether as seqex children. Nonrecursive :refs are unnecessary (maybe it should have been called :rec...) and recursively nested sequences have to be wrapped in :schema or something else anyway:

(m/validate
  [:schema {:registry {::ints [:* [:or int? [:ref ::ints]]]}}
   ::ints]
  [[1 2 3]])
; => true

ikitommi · 2021-01-15T18:34:26Z

This is good.

ikitommi and others added 30 commits December 7, 2020 11:51

wip

b720f6f

wip

355277b

Initial NFA code drop (from my Seqexp 'perf' branch).

619f7a1

Reimplement regex validator.

ccc1fd6

Seq regex parsing, first draft.

05eb30d

Make regex parser behave passably and add malli.regex/parse(r).

b4e1e5c

re/fn -> re/is

99766f4

Disable some broken stuff.

f71beba

Add some regex explainer sketches.

ebba0e0

Push path and in into ExplanatoryVM.

95fa6e7

Make regex schemas work with regular validator and explainer.

5de3895

Add explain instruction.

a18f729

Disallow trailing seq via end instruction.

873c350

Fix ::end-of-input and ::input-remaining schema args.

7c5a0e7

Add missing regex validation end clause.

9390282

Extract regex-validator and regex-explainer.

8d6876c

Improve regex LensSchema impls.

dfc13b5

Add regex validate tests (imitating Seqexp tests).

3a935e1

Move bool coercion inside exec-recognizer.

680e669

Add seqexp generators.

ed62e6b

Remove fixed FIXME.

cd2a84b

Move regex macros to separate namespace.

75a1cf1

Move regex compiler to separate namespace.

d392ea5

Make everything compile on cljs.

8b32557

Fix regex VM on cljs.

cdf92d8

Add :nested schema for preventing regex schema 'inlining'.

046169d

Use list for regex parse stack.

7168669

Add seqexp transformers.

9c329d4

Fix regex-transformer self-enter/leave.

2efd016

Unify encoder-regex and decoder-regex into transformer-regex.

aae9cb7

ikitommi mentioned this pull request Jan 10, 2021

Support Schema defn syntax #125

Closed

Fix -sequence-entry-schema type name.

4e4ae75

Remove :nested, already had :schema.

1c87a24

Actually use quadratic probing (how embarrassing).

911709a

Pauli Jaakkola added 2 commits January 14, 2021 16:21

Add missing :cat* and :alt* transform tests.

76b3503

Read through (nonrecursive) RefSchemas in seqex validator etc. constr…

6c48077

…uction

nilern requested a review from ikitommi January 14, 2021 14:58

ikitommi reviewed Jan 14, 2021


Preent recursive seqexen more conservatively.

1cad6df

nilern commented Jan 15, 2021


Pauli Jaakkola added 3 commits January 15, 2021 12:46

RegexSchema cleanups.

0c51c3e

Fix seqex generators.

9732f87

Add int tree seqex test.

674710d

nilern requested a review from ikitommi January 15, 2021 14:24

ikitommi merged commit 6693d39 into metosin:master Jan 15, 2021
