Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get rid of JsonPathFinder and Boxes #58

Merged
merged 4 commits into from
Jun 4, 2024
Merged

Conversation

xMAC94x
Copy link
Contributor

@xMAC94x xMAC94x commented Dec 5, 2023

Hi @besok I started using your crate and got into performance issues, now I did some looks into it and tried to simplify the interface a bit.
Removed the JsonPathFinder and just had the methods there as stand-alone methods, like common in the rust env.

Want to get your opinion on this, what do you think ?

@besok
Copy link
Owner

besok commented Dec 6, 2023

Great. Thanks for the pull request. I will try to come back to it this week( I am currently on vacation without access to my laptop unfortunately:( )

Btw, getting rid of the given struct has boosted performance or it is rather due to redundancy of the layouts?

@xMAC94x
Copy link
Contributor Author

xMAC94x commented Dec 6, 2023

I wanted to search in some existing json Value however i can't pass a &Value to the struct, so i would need to copy it.
One could think about extending the struct to also allow & but its way easier to get rid of it entirely, it's not doing too much anyways.

Additionally i think having a sleek api that makes it easy to reuse code and doesn't encouraging cloning is a good way to make it hard for consumers of this crate to missuse.
Ideally it should be straight forward to use, and the straight forward approach should come with no hidden cost.

But for now enjoy the vacation :)

@besok
Copy link
Owner

besok commented Dec 7, 2023

Yeah. thanks. :)

I agree the supplementary costs should be avoided, the struct JsonPathFinder was created simply intending to use it directly. like having two strings (JSON and js path) and calculating the path "on a fly".

Maybe yo are right. I will need to recall what the struct does now actually =)

@besok
Copy link
Owner

besok commented Dec 7, 2023

closes #59

Copy link
Owner

@besok besok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response. Just came back.
To me looks good. Let's remove the finder.

@besok
Copy link
Owner

besok commented Dec 23, 2023

It has some fmt and clippy failures. You can fix it locally just kicking off the appropriate commands and correct the code

@Norlock
Copy link

Norlock commented May 15, 2024

@besok can you fix those maybe? I would like to use this otherwise I will fork it no problem but since its only clippy fixes its probably done quickly.

@besok
Copy link
Owner

besok commented May 15, 2024

There are merge conflicts since the aforementioned in the PR structure obtained a new field (cfg)!
I can take a look at the weekend and either resolve the conflicts or discard the pr :)

@xMAC94x
Copy link
Contributor Author

xMAC94x commented May 17, 2024

Just had a look, it seems like #62 adds a cache for regex, imo this is is not necessary when doing it like in this PR.
You dont need a cache because you just use the existing regex, and this can cache the regex at the caller.

Simplifies this code and lets the caller do stuff more efficient.

I can do a rebase, but we first have to discuss what to do with those caching behavior.

@xMAC94x xMAC94x force-pushed the xMAC94x/functions branch from ff66e94 to 7cd697c Compare May 17, 2024 10:27
@besok
Copy link
Owner

besok commented May 17, 2024

You dont need a cache because you just use the existing regex, and this can cache the regex at the caller.

I don't grasp fully what you mean tbh, can you give an example or expand the answer?

@xMAC94x
Copy link
Contributor Author

xMAC94x commented May 17, 2024

I'll try to explain, I don't remember the internals from this crate as the PR is a bit older.

So mostly #61 complained about the crate beeing slow. And they even said what is slow, recompiling the regex.

By looking in this crate, example:

// https://github.com/besok/jsonpath-rust/blob/main/src/lib.rs#L150
let json: Value = serde_json::from_str("{}").unwrap();
let path = json.path("$..book[?(@.author size 10)].title").unwrap();

Lets assume the following speed behavior:

  1. creating a json object: slow
  2. creating a json path object: slow
  3. actually searching: fast

line 1 does step 1
line 2 does step 2 and 3 combined.

When people complain about something beeing slow, they prob do it for alot of things, e.g. have the same json data and search different things inside, or have different json data but search the same thing inside. (or even both, the important stuff is, often one thing stays the same).

The way this is currently build does not profit when one thing stays constant, it has to redo step 1 or step 2 in either case. Those are the slow steps.

With this PR:

let json: Value = serde_json::from_str("{}")?;
let path: JsonPathInst = JsonPathInst::from_str("$..book[?(@.author size 10)].title")?;

let v = find_slice(&path, &json);

You see that we need an additional line, one line per step. And you'll notice that the find_slice actually just takes references, that means it does not consume the data as its only accessed read-only.

This allows a cunsumer of this librarby, like the person from #61 to precompile the regex easily, and reuse it as often as they want. They can even go multihtreading if they need.

The idea is: rather to inject the cache behavior into the whole stack: just make this library work simply, and thus allows devs to reuse (=effectively building their own cache) regex that are already compiled

xMAC94x added 2 commits May 17, 2024 18:30
 - JsonPathFinder interface does not really benefit from storing the json or path internally
 - trying to get rid of the Box<> that is used inside of JsonPathFinder
@xMAC94x xMAC94x force-pushed the xMAC94x/functions branch from 7cd697c to beabd2d Compare May 17, 2024 17:00
xMAC94x added 2 commits May 17, 2024 19:05
equal bench with reuse     time:   [510.30 ns 512.16 ns 514.26 ns]
equal bench without reuse  time:   [21.436 µs 21.456 µs 21.479 µs]
regex bench with reuse     time:   [58.875 µs 58.925 µs 58.975 µs]
regex bench without reuse  time:   [85.324 µs 85.416 µs 85.517 µs]
JsonPathInst generation    time:   [23.988 µs 24.019 µs 24.052 µs]
@xMAC94x xMAC94x force-pushed the xMAC94x/functions branch from beabd2d to e85553c Compare May 17, 2024 17:37
@xMAC94x
Copy link
Contributor Author

xMAC94x commented May 17, 2024

Okay, just had some refactoring and mid into it i noted what we were talking about, the regex is actually executed as part of the search on not when creating the jsonpathinst. thus getting rid of the cache is a regression here.

Though anyway maybe I can convince you, rather than building a cache, evaluate the respective regexs on creation of the jsonpathinst.

Getting rid of all the boxes with this PR improves the default case on my machine from:
139.82 µs down to 86.049 µs
When keeping the data, even down to 56.392 µs.
Thats not quite yet the 24.638 µs

But on the other hand, every other operation, that is not a regex, gets a speeup from
23.448 µs to 512.16 ns (e.g. see equal benchmark).

When combining both we might even see sub us for the regex case

@besok
Copy link
Owner

besok commented May 17, 2024

The introduced cache handles a specific situation.
The compilation of regex is a voracious operation and takes quite some time.

Thus, if a user has either multithreading or a stream of queries with the same regex in the long-living app,
the cache will boost performance. It can be also serialized and spreaded further (I doubt it needs to anybody but the opportunity exists )

Though anyway maybe I can convince you, rather than building a cache, evaluate the respective regexs on creation of the jsonpathinst.

This is an unconnected process or at least I don't see a bond.

Still, I would not say the cache and removing boxes and other auxiliary structures is a mutually exclusive process.
And grooming the API and removing deprecated (so to say) structures also would be good to merge into.
Therefore, I would not remove the cache but instead introduce a new method like find_slice_with_cfg and so on.

@xMAC94x
Copy link
Contributor Author

xMAC94x commented May 17, 2024

Here is some ugly code that statically compiles the regex, when the JsonPathInst is generated: e527958

compilation time goes up by:
JsonPathInst generation time: [76.777 µs 76.844 µs 76.915 µs]
change: [+2.6997% +3.0563% +3.3790%] (p = 0.00 < 0.05)
Performance has regressed.

while now execution time improves by:
regex bench with reuse time: [591.67 ns 594.80 ns 598.18 ns]
change: [-98.975% -98.969% -98.964%] (p = 0.00 < 0.05)
Performance has improved.

this is an other 50-times fold improvement.

The only thing i have: It currently only works with static regex, idk if dynamic regex is a thing.
Do you know if dynamic regex is a thing. I also dont know if there is an less-ugly way to encode it rather than in the FilterSign.

@besok
Copy link
Owner

besok commented May 17, 2024

But that is precisely the case.
We parse a string that represents regex.
The compilation of this string into the particular regular expression is what the cache is summoned to solve.

In your example, you precompile it beforehand outside of the jspath.

@xMAC94x
Copy link
Contributor Author

xMAC94x commented Jun 3, 2024

sorry, non-native speaker here. Is this now a good thing that its compiled with the jspath or a bad thing ?

@besok
Copy link
Owner

besok commented Jun 3, 2024

It is not bad or good. It does not solve the initial problem.
The problem is the following:
Given - we have 1K requests for jspath like $.[?(@.field == '[a-zA-Z]')].
Target - compile it only once and then use already compiled regex.

You precompile a regular expression outside of the library. But it should be done inside the library because we parse a string into regex inside the library. So, you need something inside the library that will keep track of the regexes that are already compiled.

@xMAC94x
Copy link
Contributor Author

xMAC94x commented Jun 3, 2024

in that example:

let path = JsonPathInst::from_str('$.[?(@.field == '[a-zA-Z]')]').unwrap();
for _ in 0..1000 {
  // actually 1000 different jsons:
  let json = json!({
        "field":"abcXYZ",
  });
  let x = jsonpath_rust::find(&path, &json);
}

we can solve the problem with regex compilation in 2 ways:

  • add a cache that espeically only caches regexes and stores them.
  • just compile the regex when the path is created, that way we can leverage both effects. the regex is only compiled ones, but the rest is only compiled once. Even higher speed increase. no need for an internal hashmap with expensive hashmap lookups.

Or do I get it wrong somehow ?

@besok
Copy link
Owner

besok commented Jun 3, 2024

I think in general you are correct.

add a cache that espeically only caches regexes and stores them.

Yeah, that is how it is implemented now.

just compile the regex when the path is created, that way we can leverage both effects. the regex is only compiled ones, but the rest is only compiled once. Even higher speed increase. no need for an internal hashmap with expensive hashmap lookups.

Maybe, I missed something but a couple of points.
You want the users to implement a so to say, user cache with jsonpathinst instead of regexes, don't you?
It will work if you have the same queries but in general, a user can have something like that:

  • $.[?(@.field == '[a-zA-Z]')]
  • $.field1[?(@.field == '[a-zA-Z]')]
  • $.[?(@.field == '[a-zA-Z]')].field2.length
  • $.[?(@ == '[a-zA-Z]')]

and so on. The queries will be different and the regex will be calculated again.

@xMAC94x
Copy link
Contributor Author

xMAC94x commented Jun 4, 2024

Good point, I haven't though about that yet.

I don't necessarily want the user to build a cache in the sense of a HashMap.
But I want the user be able to reuse jsonpathinst for multiple json data easily. That has a similary effect on speed (and technically could also be called a cache) because it doesn't need to be reevaluated on each .find()

Mentioning different Json-Paths but with the same regex inside is truely something we should consider.
Such a scenario would definitely benefit from a cache. The question now is how common is that for users of this crate. Because while a cache helps if you have many matches, it also adds costs for all other queries that are not different-json-path-containing-same-regex.

It now gets a bit tricky, because we have to think how a user of this crates uses it and find a good tradeoff.

IMO I would prob see a user have some different-json-path-containing-same-regex like paths, but I don't see them scaling. what I want to say is, maybe they have 10 of this queries, but it would rather be uncommon that they have thousands.

I would probably expect most use cases are either: The application has a fixed number of paths and many scaling data, or the application has a fixed number of data and is checking many paths. In both cases, beeing able to reuse via reference would provide a huge benefit (see above -98,9%, that 100x faster).

That would allow, even if there are hundreds of jsonpath where the same regex is used, we would still be faster in the end.

The only exception is probably a application where every jsonpath, without exception is different from each other but share the same regex

@besok
Copy link
Owner

besok commented Jun 4, 2024

Very good point. I don't know why but the idea of caching the whole jspath slipped through my fingers. I will think about it in the evening a bit, searching for pitfalls.

@besok
Copy link
Owner

besok commented Jun 4, 2024

I have taken a look to the benches and truly it seems the benefit of having a regex cache is leveled by the alternative of caching jspathinst itself.

Thanks for the great PR!
The solution is definitely better and cleaner than the one with the regex cache.

@besok besok merged commit 1d375e6 into besok:main Jun 4, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants