Allow the RefResolver to be pickled #682
Conversation
- This requires removing the LRU caches from the object being pickled, as the wrappers are ultimately not picklable
- Recreate the LRU caches when unpickling
It might be useful if this showed the logs for the failed test. Not sure what the issue is.
Hi! Thanks for the contribution, it's very much appreciated (regardless of the following :)

I don't actually want to support folks using pickle. I recognize there are valid uses of it, but it is a minuscule amount (of valid uses) compared to the ones out there. If you're in that tiny tiny bucket, generally you also should know how to evade cases like this as well. But yeah, I don't really want to maintain code supporting pickle, unfortunately.

(On the CI, I've never actually seen GHA fail that hard, but it looks like some transient failure in even cloning the repo or something: https://github.com/Julian/jsonschema/actions/runs/81874473; rekicking I bet would work.)

And as I say, definitely appreciate the work no matter what!
I guess I'll continue to maintain my fork with this in it, then.

The main driver for this is that I am using openapi-core to create API clients from Swagger/OpenAPI definitions, and it heavily uses jsonschema for its validations. However, each time I load up, parse and validate one of those files, it takes 2-20s (depending mainly on how many remote refs there are). This makes for a fairly lengthy load time, so I'm pickling the fully validated specs and loading them (if fresh) instead on startup. That takes 2-5ms each. The LRU caches in jsonschema were the main thing blocking this, hence this commit.

If you know of a better way to store the entire class structure to allow for rapid reload that doesn't require pickle, I'm happy to take a look at it, but for now, this is the best option. I tried just removing and recreating the RefResolver itself, but that just didn't work right. :)

Anyways, if you do feel like reconsidering, it's here for you if you want it. I'm OK with having pip pull my fork instead of the official repo, at least for now. Thanks for taking a look.
Which parsing / validating is taking 2-20s? jsonschema doesn't really parse anything, and if you're talking about validating a schema: if your schema's been validated, why re-validate it? Not sure I follow where the time is being spent.

But if you did want to store and reload objects, the supported way to do so is to dump what's needed to a proper serialization format (JSON, say) and then re-create them: the LRU cache can be stored simply as a dict (and then serialized to JSON) and then reloaded to create a new resolver.
I'm not clear on what is taking the time. I could add a pile of instrumentation to figure that out, but I think it's maybe the pulling of references over the internet? The point here is to use this as an API: the swagger files (YAML) do need to be loaded up as specs, and more validation is done on every API request and every API response, which is why the RefResolver is still necessary, it seems. I'm not OK with recreating all of the spec tree and everything it pulls in, as it takes so long.

The whole point of using pickling rather than JSON or CBOR is that on unpickling, the data is already re-created into the correct classes; there is no extra data parsing required and very few, if any, custom serialization bits required.

I just tried converting the LRU caches to dicts to store them, and reloading from the dicts. Unfortunately, we have no access to the contents of the cache (there are no hooks for that in functools._lru_cache_wrapper), so we end up with empty caches anyways. I therefore see no point in not just creating new empty caches. I'm not up for monkeypatching functools, or attempting to replace it (both would be potentially very risky changes for everyone), so as it stands, if we marshal this object into any format at all, we lose the contents of the LRU caches.

I don't see any real difference between using pickle or JSON in the end, other than an increased requirement for my code to handle custom JSON serialization and deserialization for every class used in the API spec object, whereas the standard Python library covers pickling and unpickling just about everything there except for wrapped functions, which is where the LRU caches come in. Also, this isn't state I wish to share to a different platform or even to a different machine, so there's no particular need to go with something completely portable like JSON or CBOR. Pickling is portable enough anyways, other than not being backwards compatible with older versions of Python.
Marshalling to JSON in particular would almost assuredly be slower anyways, and far more problematic. I realize that my use case here is an edge case, but as this change doesn't affect anything at all except for the users of getstate and setstate, which to my knowledge is only pickling, I don't see how it would be harmful either.

For reference, the LRU caches for one swagger file (this one takes around 12s to create a spec from) before dumping it:

After attempting to reload from a pickled dict of the LRU caches:

After reloading as written in this patch:
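The limitation described above is easy to demonstrate with the standard library alone. In this sketch, `compute` is a made-up function; the wrapper exposes statistics via `cache_info()` but no way to enumerate the cached entries, and a wrapper stored in a variable at runtime (as an instance attribute would be) cannot be pickled.

```python
import functools
import pickle


def compute(x):
    return x * 2


# Wrapped at runtime and stored in a variable, the way an instance
# attribute on a resolver would be:
cached = functools.lru_cache(maxsize=128)(compute)
cached(21)

# The wrapper exposes statistics, but not the cached contents:
print(cached.cache_info().currsize)  # 1

# ...and the wrapper itself refuses to pickle:
try:
    pickle.dumps(cached)
    picklable = True
except Exception:
    picklable = False
print(picklable)  # False
```

(A module-level function decorated directly with `@functools.lru_cache` pickles by name; it is the runtime-created, attribute-stored wrapper, as in RefResolver, that fails.)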
It's hard for me to know what to suggest without the performance data, but given what you've said, I suspect the thing I would support is something more like fully resolving schemas at runtime (which is what e.g. #371 was doing). If what's happening is you've got lots of remote schemas, and want to fetch all of them and save doing that repeatedly on startup, that's a simple schema transformation, and one I definitely agree can/should be supported.
I don't really want to argue this piece :), that's precisely the reason pickle is bad and will not be supported, so we should just agree to disagree here.
Again with respect: I know you're trying to get your piece done, but this is a common thing for users (of this library or others; I'm likely guilty of it myself). They think "if just this one thing was added it'd fix my issue, and hurt nothing else", but that's never the case. Ever. Code lives forever, supported functionality stays supported indefinitely, things have snowball effects, and documentation, tests and expectations are never fully managed. Saying "it's just this one little thing" is unfortunately just always wrong.

As a maintainer there's a need to create clear lines. For this library, pickling objects is not supported. (Same as some other things: threading is not supported either, in the sense that objects are not safe to be shared across threads. If someone showed up contributing support and tests to add it with no effects on other parts of the code, I'd still turn it down for the same reasons as these.) I hope even if you don't like it you can see where I'm coming from.

And yeah, as I said, I'm happy to see solutions that support prefetching, or ways to make it trivial to serialize and deserialize what you need explicitly (e.g. at some point
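The prefetching idea floated in this thread can be sketched with nothing but the standard library. The `fetch` function and URIs below are hypothetical placeholders; the point is that resolved remote refs are just a `{uri: schema}` mapping, which serializes to JSON and can seed a resolver on the next startup (jsonschema's RefResolver historically accepted such a mapping via its `store` argument).

```python
import json


def fetch(uri):
    # Hypothetical fetch step; in practice this would be an HTTP
    # request for the remote schema document.
    return {"$id": uri, "type": "object"}


REMOTE_REFS = [
    "https://example.com/schemas/a.json",
    "https://example.com/schemas/b.json",
]

# One-time prefetch: resolve every remote ref and keep the results.
store = {uri: fetch(uri) for uri in REMOTE_REFS}

# Persist the store as plain JSON...
blob = json.dumps(store)

# ...and on the next startup, rebuild it with no network I/O at all.
reloaded = json.loads(blob)
```

This is the "simple schema transformation" framing: the expensive part (fetching) happens once, and what gets persisted is plain data rather than live objects.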
Sorry, one other thing which I didn't explicitly call out and should have mentioned, just in case: when I alluded to being able to do this without upstream support, what I was referring to was the copyreg standard library module. If you did want to use pickle for your own purposes, you don't need to rely on a fork if you don't want to; downstream code is free to define how objects are pickled by registering hooks there, regardless of whether the objects themselves "support" it.
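The copyreg route mentioned here could look like the following sketch. `ThirdPartyResolver` is a hypothetical stand-in for a class the downstream code doesn't control; the registered reducer returns a `(callable, args)` pair, so the unpicklable cache attribute is simply never serialized.

```python
import copyreg
import pickle


class ThirdPartyResolver:
    """Hypothetical stand-in for a class we don't control."""

    def __init__(self, base_uri):
        self.base_uri = base_uri
        self._cache = lambda: None  # a genuinely unpicklable attribute


def _pickle_resolver(resolver):
    # Return (callable, args).  pickle stores the pair, and unpickling
    # calls the callable, so the unpicklable cache is skipped and
    # simply rebuilt by __init__.
    return ThirdPartyResolver, (resolver.base_uri,)


# Register the hook without touching the class itself:
copyreg.pickle(ThirdPartyResolver, _pickle_resolver)

r = ThirdPartyResolver("https://example.com/schema.json")
r2 = pickle.loads(pickle.dumps(r))
print(r2.base_uri)  # https://example.com/schema.json
```

Without the `copyreg.pickle` registration, `pickle.dumps(r)` would fail on the lambda attribute; with it, downstream code gets pickle support with no change to the upstream library.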
Fair enough. Hmm, I had not run across copyreg. That might be the best overall plan, you are right. :) I completely understand where you are coming from here, and I appreciate the dialog. I'll see if I can't get this to work with a copyreg methodology, as then there's nothing to maintain in this package at all, fork or otherwise. :) Thanks, and keep healthy.
Glad to hear (and extra glad the follow-up clarified things). I would love, by the way, if you ever do the profiling exercise, to see what it shows (even just running
You and your loved ones as well, all the best!
I'll try that soon as well, good thought. I found copyreg to be poorly documented... it makes it obvious how to do something about the serialization side (pickle.dump), but I fail to see how to tweak deserialization (pickle.load). For the moment, I have moved the two functions into my own package and monkeypatch them in (yuck) while I try to determine how to properly use copyreg. :)

Not sure if pyinstrument is going to like me much (I'm using Stackless Python), but it's sure worth a shot. Sounds like a good tool to learn anyways.
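For what it's worth, the confusion above has a short answer: copyreg needs no separate deserialization hook, because the first element of the tuple the reducer returns is the callable that `pickle.load` invokes to rebuild the object. A small sketch with hypothetical names:

```python
import copyreg
import pickle


class Spec:
    """Hypothetical stand-in for a parsed spec object."""

    def __init__(self, data, resolver=None):
        self.data = data
        self.resolver = resolver  # imagine an unpicklable helper


def _rebuild_spec(data):
    # Runs at *unpickling* time: pickle.load calls whatever callable
    # the reducer named, so reconstruction logic lives here.
    return Spec(data, resolver="fresh resolver")


def _reduce_spec(spec):
    # Runs at *pickling* time: only `data` is stored.
    return _rebuild_spec, (spec.data,)


copyreg.pickle(Spec, _reduce_spec)

s = pickle.loads(pickle.dumps(Spec({"openapi": "3.0.0"})))
print(s.resolver)  # fresh resolver
```

So "tweaking deserialization" amounts to pointing the reducer at a rebuild function of your choosing; no monkeypatching is required.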
`23.270 testmud.py:3`

Any way to make it NOT skip those thousands of frames to get more detail, I wonder... Still learning this tool; it's great!

That's the section for the slower of the swagger files, which can be pulled from here, if interested: https://raw.githubusercontent.com/Beirdo/eos/develop/plugins/chain_api_plugin/chain.swagger.yaml

BTW, I'm calling openapi_core.create_spec(object, "https://eosio.github.io/schemata/v2.0/oas/").
BTW, I just added a SHA512 digest of the pickled file data for validation on startup, to avoid someone messing with it... I guess I could get even more paranoid and encrypt it. :)
Oh my. It seems that openapi_core, when creating specs, goes massively recursive... I think I'll also look into whether they have effective LRUs in place, as some of the references MUST be repeated. But that's why I don't want to do this on every startup. :)
Fork deleted. :) Thanks for your time.
Sure thing, and thanks for the dumps. I'll see if I can have a look, but it will probably take a bit to get some time.