
Make MAX_DIMENSIONS configurable via a system property. #12306

Closed
wants to merge 3 commits

Conversation

dsmiley
Contributor

@dsmiley dsmiley commented May 17, 2023

The principal objective of this PR is to make it easier for users to use Lucene with higher dimensions. While it's technically possible to circumvent the existing limit, it's a non-obvious, awkward hack that the user would either need to figure out themselves or have baked into a higher-level search platform. It's so obscure that most Lucene committers didn't even know it was possible! Thus the practical effect is that, for most users, raising the limit is impossible today, so they don't use Lucene.

The system property proposed here is "lucene.hnsw.maxDimensions".

I also deprecated the field in anticipation that it will move to a codec-specific place. Regardless of if/when that happens, I don't think we want to advertise this limit where it is now: a surface-level Lucene API, with a value based on the default codec that may not make sense for other vector codecs. Moving it is out of scope for this PR.

AFAICT, this limit is merely an ergonomics kind of limit to keep the user from shooting themselves in the foot. The underlying codec we have can read/write an arbitrary number of dimensions. CheckIndex validates that the number of dimensions is greater than zero but doesn't enforce a maximum.
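As a rough sketch of what the description proposes (not the PR's actual code): the property name "lucene.hnsw.maxDimensions" is taken from this PR, while the class and method names below are purely illustrative, not Lucene's API.

```java
// Hypothetical sketch of the configurable limit described above. The
// property name "lucene.hnsw.maxDimensions" is from this PR; the class
// and method names are illustrative, not Lucene's actual API.
public final class MaxDimensionsConfig {

  /** Default limit, matching FloatVectorValues.MAX_DIMENSIONS at the time (1024). */
  public static final int DEFAULT_MAX_DIMENSIONS = 1024;

  /** Reads the limit from the system property, falling back to the default. */
  public static int readMaxDimensions() {
    String value = System.getProperty("lucene.hnsw.maxDimensions");
    if (value == null) {
      return DEFAULT_MAX_DIMENSIONS;
    }
    int dims = Integer.parseInt(value);
    if (dims <= 0) {
      throw new IllegalArgumentException(
          "lucene.hnsw.maxDimensions must be positive, got: " + dims);
    }
    return dims;
  }

  public static void main(String[] args) {
    System.out.println(readMaxDimensions()); // prints 1024 when the property is unset
    System.setProperty("lucene.hnsw.maxDimensions", "2048");
    System.out.println(readMaxDimensions()); // prints 2048
  }
}
```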

Deprecate FloatVectorValues.MAX_DIMENSIONS because we anticipate a better home.
Member

@rmuir rmuir left a comment


Sorry, I'm -1 against using a system property for this due to backwards compatibility in the future:

  • If a user can set this to 1 billion with the default codec, then we must always support 1 billion dimensions going forward in future codecs. It's easy to lock ourselves out of implementations. These values should be constants, like the dimensions limit of the BKD tree.
  • Such insane values are not tested by anything and may cause overflows. The 9.x releases already had overflows here, twice, due to situations like this. Supported values must be tested.
  • The system property is not used appropriately (e.g. proper security checks). If there is a security manager and it doesn't allow reading this property, classes will fail to load.
  • Backwards compatibility/lockout IS THE MAIN CONCERN: yes, there are serious performance problems, and we need to be able to improve and fix them.
  • Correctness is the SECONDARY CONCERN: we should only support the number of dimensions that actually work.

I'm aware the current limit is not implemented correctly or validated, but it should be. Just because someone can bypass it doesn't change our back-compat guarantee. At least today, if we make a new, faster codec, we know we must support 1k dimensions for it to be the default codec. Non-default codecs can have higher limits because they don't imply this strict back compatibility.

@dsmiley
Contributor Author

dsmiley commented May 17, 2023

I get your point on supportability, and I appreciate that you and others take that so seriously! Then let's have documentation around this setting to make it explicitly clear that it isn't "supported" (may or may not work; you're on your own). We could even go one step further by logging such a warning to ensure the user is aware. WDYT?

@rmuir
Member

rmuir commented May 17, 2023

Sorry, I really don't believe we should do this in the default codec. Again, you can just make an alternative, non-default codec in lucene/codecs with a different limit than the default codec. I would still use a constant and test up to the max limit if you want this codec to have any correctness.

@dsmiley
Contributor Author

dsmiley commented May 17, 2023

I could also imagine some users hosting search for others might want to configure the limit lower!

@rmuir
Member

rmuir commented May 17, 2023

The default codec, with the index back-compat guarantees that we have, is just no place for leniency or "you are on your own". You should trust that the index is supported.

The lucene/codecs package is the place for doing more experimental stuff without the back-compat hassle/sign-ups/guarantee/locking. But I still argue users should get clear exceptions and behavior there too, not integer overflows or failures to load classes. The supported limits should be tested, hence constants.

@dsmiley
Contributor Author

dsmiley commented May 17, 2023

BTW thanks for your prompt and friendly responses.

I have two lines of inquiry here:

(A) I don't accept the rule/belief you hold that our default codecs shall not have expert/unsupported options. Why? It seems like a preference or opinion rather than something technically grounded. If such unsupported options existed, would the project or its users be harmed somehow?

(B) I could imagine a trivial change to this PR in which we actually test 2048 dimensions by having the test runner use this system property. Maybe there will be hurdles/gotchas; I may explore this. WDYT?

@rmuir
Member

rmuir commented May 17, 2023

I gave you the technical reasons I am against the system property. I don't want to argue on and on because you don't agree with it, we don't have to agree.

@dsmiley
Contributor Author

dsmiley commented May 18, 2023

The only argument you've given against a system property specifically is:

system property is not used appropriately (e.g. proper security checks). If there is a security manager, and it doesn't allow read to this property, classes will fail to load.

Okay, good point. I can improve the PR to catch a SecurityException and use the default value; maybe log a warning as well.

@michaelwechner
Contributor

Just to make sure I understand the -1 correctly: the concern is that a user can set the dimension themselves because the current Lucene version works well with that dimension, but a future Lucene version may no longer work (or no longer work well enough) with it, which means the user cannot upgrade to that new Lucene version. Is this the main concern?

@uschindler
Contributor

The only argument you've given against a system property specifically is:

system property is not used appropriately (e.g. proper security checks). If there is a security manager, and it doesn't allow read to this property, classes will fail to load.

Okay, good point. I can improve the PR to catch a SecurityException and use the default value; maybe log a warning as well.

Please look at the code of MMapDirectory; it reads a sysprop there, and the code there is the "most modern" variant for reading it. Also wrap it with AccessController.doPrivileged! This is not easily visible in MMapDirectory, because the whole code block is wrapped by AccessController.
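A minimal sketch of that pattern, assuming the approach Uwe describes (property read wrapped in doPrivileged, with a SecurityException fallback); everything except the property name is illustrative, not Lucene's actual code:

```java
import java.security.AccessController;
import java.security.PrivilegedAction;

// Sketch of a defensive system-property read in the style Uwe describes:
// wrap the read in AccessController.doPrivileged and fall back to the
// default if a security manager denies access, instead of failing class
// initialization. Class and method names here are illustrative.
public final class SafeSysProp {

  public static int readLimit(String prop, int defaultValue) {
    try {
      String v = AccessController.doPrivileged(
          (PrivilegedAction<String>) () -> System.getProperty(prop));
      return (v == null) ? defaultValue : Integer.parseInt(v);
    } catch (SecurityException se) {
      // Property read denied: keep the safe default rather than fail to load.
      System.err.println("Cannot read " + prop + "; using default " + defaultValue);
      return defaultValue;
    }
  }

  public static void main(String[] args) {
    System.setProperty("lucene.hnsw.maxDimensions", "2048");
    System.out.println(readLimit("lucene.hnsw.maxDimensions", 1024)); // prints 2048
  }
}
```

(Note: AccessController is deprecated on newer JDKs along with the security manager, but this was the idiomatic pattern at the time of this discussion.)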

@mikemccand
Member

Just to make sure I understand the -1 correctly: the concern is that a user can set the dimension themselves because the current Lucene version works well with that dimension, but a future Lucene version may no longer work (or no longer work well enough) with it, which means the user cannot upgrade to that new Lucene version. Is this the main concern?

This is also my concern.

I am less worried about properly implementing sysprops, catching security exceptions etc. :) We should not be adding this sysprop in the first place.

Sure, it is possible today to sneakily bypass Lucene's aKNN dimension limit check, with the unfortunate subclassing workaround Mike S posted on the dev list, but we should not be making it easier for users to fall into this dangerous, long-term trap (which is what this PR does). Rather, we should make it harder, e.g. by moving the limit down into the Codec itself, where the HNSW algorithm determines how it scales. If the limit is implemented in the Codec, a user would have to swap in a different Codec, or privately fork Lucene's default HNSW Codec component, making it very clear that they are entering territory "not supported by Lucene's back compat on its default Codec".

@alessandrobenedetti
Contributor

As expressed in the poll, I like this idea; thanks @dsmiley for opening a draft pull request while we collect more opinions in the poll.
My main question for everyone is: how do we decide on a meaningful limit?
Why is '1024' reasonable?

The reason I am supporting the configurability option is that I would have no idea of a reasonable limit to enforce on our users.
I also doubt we should enforce any limit at all right now, given it has no technical grounding: there is no data structure or typing choice optimized around the '1024' limit.
I'll think further about this later on today!

@michaelwechner
Contributor

Just to make sure I understand the -1 correctly: the concern is that a user can set the dimension themselves because the current Lucene version works well with that dimension, but a future Lucene version may no longer work (or no longer work well enough) with it, which means the user cannot upgrade to that new Lucene version. Is this the main concern?

This is also my concern.

I am less worried about properly implementing sysprops, catching security exceptions etc. :) We should not be adding this sysprop in the first place.

Sure, it is possible today to sneakily bypass Lucene's aKNN dimension limit check, with the unfortunate subclassing workaround Mike S posted on the dev list, but we should not be making it easier for users to fall into this dangerous, long-term trap (which is what this PR does). Rather, we should make it harder, e.g. by moving the limit down into the Codec itself, where the HNSW algorithm determines how it scales. If the limit is implemented in the Codec, a user would have to swap in a different Codec, or privately fork Lucene's default HNSW Codec component, making it very clear that they are entering territory "not supported by Lucene's back compat on its default Codec".

Thanks for clarifying!

I think it is great that Lucene provides back compatibility, but at the same time I think it should be possible to change the limit at your own risk (and, as mentioned by David Smiley, also to configure it lower than MAX_DIMENSIONS).

It would be great if we could accomplish this by moving MAX_DIMENSIONS to the HNSW implementation while still somehow giving the user the possibility to change MAX_DIMENSIONS.

Does anything speak against this?

@dsmiley
Contributor Author

dsmiley commented May 18, 2023

There are non-mutually-exclusive things that could happen to help a user wanting to work with higher dimensions; this PR focuses on the configurability of an implementation that already exists. If this PR is somehow to become non-veto-worthy to others, I see that the limit needs to be a supported number of dimensions, not merely any number a user might set it to (even though IMO that would be fine).

Thanks for the tip Uwe RE MMapDirectory.

Just to make sure I understand the -1 correctly: the concern is that a user can set the dimension themselves because the current Lucene version works well with that dimension, but a future Lucene version may no longer work (or no longer work well enough) with it, which means the user cannot upgrade to that new Lucene version. Is this the main concern?

This is addressable by actually setting it to 2048 in tests, to show that we support it, even though its default value should remain what it is now.

@nknize
Member

nknize commented May 18, 2023

-1 to even introducing the system property option, only because I think it's too trappy. We only just released Lucene 9.6 seven days ago. Why do we need to rush this in a PR when we haven't even started the 9.7 release process? Do we not have enough time for someone enthusiastic here to open a PR for #12309?

My main question for everyone is: how do we decide on a meaningful limit?
Why is '1024' reasonable?

Exactly!! Where are the benchmarks!? For all these claims about OpenAI embeddings and GPT this/that/etc., we don't even have empirical evidence to pick a limit, let alone to help guide the user community on the traps associated with increasing it!

@FranciscoBorges, on #11507 you said "We benchmarked a patched Lucene/Solr. We fully understand (we measured it :-P)". Do you have any restrictions preventing you from posting your numbers, to help guide this community on performance expectations at different dimensional limits? It could be a big help in grounding all these opinions.

@dsmiley
Contributor Author

dsmiley commented May 18, 2023

Vetoes are serious business, balancing harm to the project and users against the benefit the change brings to both. Thus the change needn't be perfect in every possible way, but it needs to be a net positive. I hope those using their veto powers weigh the practical upside of this change. The upside/reason for this change has been expressed very well by users, so I don't plan to repeat it.

Please elaborate on your "trappy" argument, Nick. Is a user trapped into setting this somehow, or does some bad thing happen? Of course, if they find the performance to be slow, then they had it coming. There are plenty of ways to get slow indexing and OOMs, as I'm sure you all appreciate :-). I don't see why the Lucene release timeline matters; this PR aims to improve the Lucene we have today in a simple way, even if new/exciting changes are happening. Even if some new codec were to suddenly appear with a higher limit, why veto this improvement to this codec?

Please keep this PR discussion focused on configurability and not on other topics, like whether the default number of dimensions should change, including benchmarks. It won't change in this PR. The dev list would be a good forum for that.

@nknize
Member

nknize commented May 18, 2023

Thus the change needn't be perfect in every possible way but needs to be a net positive.

💯 So let's "progress not perfection" on #12309? I still haven't heard a reason to justify the rush to a system property seven days after the last minor release.

Please elaborate on your "trappy" argument, Nick.

A user configures this system property and indexes a 10-billion-dimension vector. Goes to sleep. The Lucene community wakes up, realizes empirical benchmarks and informed decisions matter, benchmarks the implementation, finds the debt ceiling shouldn't have been raised and that it gave the user community a cyanide pill wrapped as a Tic Tac, and adds a different mechanism with strict restrictions to avoid a JVM crumble. The user wakes up, upgrades Lucene, which now imposes this new smart safety mechanism, and now has to jump through stupid hoops to upgrade their incompatible index. THAT kind of trap. This kind of experience should be avoided at all costs, even if it's only likely to happen 1% of the time, because (to me) giving a terrible experience to even the 1% user is unacceptable.

@nknize
Member

nknize commented May 18, 2023

Given how simple this change is, is there harm in putting it in our back pocket as an alternative if no other option pans out closer to the 9.7 release? I don't see a justifiable reason to rush this right now.

@alessandrobenedetti
Contributor

alessandrobenedetti commented May 18, 2023

Thus the change needn't be perfect in every possible way but needs to be a net positive.

💯 So lets "progress not perfection" on #12309 ? I still haven't heard a reason to justify the rush to a system property seven days after the last minor release.

Please elaborate on your "trappy" argument, Nick.

A user configures this system property and indexes a 10-billion-dimension vector. Goes to sleep. The Lucene community wakes up, realizes empirical benchmarks and informed decisions matter, benchmarks the implementation, finds the debt ceiling shouldn't have been raised and that it gave the user community a cyanide pill wrapped as a Tic Tac, and adds a different mechanism with strict restrictions to avoid a JVM crumble. The user wakes up, upgrades Lucene, which now imposes this new smart safety mechanism, and now has to jump through stupid hoops to upgrade their incompatible index. THAT kind of trap. This kind of experience should be avoided at all costs, even if it's only likely to happen 1% of the time, because (to me) giving a terrible experience to even the 1% user is unacceptable.

I get your point, but upgrades in general are tricky and care is necessary.
If a new Lucene version no longer supports a 10-billion-dimension vector, for valid technical reasons (data structures, typing, etc.), at that point it will be the responsibility of the user to:
1. not upgrade, OR
2. stick to some dimensionality-reduction approach, OR
3. abandon Lucene for other software

With no configurability, we currently leave only option 3 to our users anyway, because we decided that 1024 is a golden limit for performance to be acceptable.
The intention behind configurability is to leave the user the responsibility of deciding what performance is acceptable for their use case (at least, that's my intention).

I will spend more time on this later, as a lot of good stuff has been added, but in relation to upgrades and back compatibility, I am not sure I see them as a critical blocker (happy to change my mind).

@nknize
Member

nknize commented May 18, 2023

With no configurability we currently leave only option 3 to our users anyway because we decided that 1024 is a golden limit for performance to be acceptable.

But this isn't a perpetual one-way door. I don't think anyone is saying we shouldn't make it configurable. I'm proposing we do it via Lucene95HnswVectorsFormat, as per #12309, so we can provide better bwc guarantees, instead of via a global system property for which bwc is harder to provide.

in relation to upgrades and back compatibility I am not sure I see them as a critical blocker

I don't think anyone is definitive on anything right now. And we have time. No need to rush this and make a knee-jerk decision because the GPT Kool-Aid is so strong.

@dsmiley
Contributor Author

dsmiley commented May 18, 2023

A user configures this system property, indexes a 10Billion dimension vector.

We can block that with a maximum of 2048, the tested and thus supported maximum limit. I'm presently looking at incorporating this.

Given how simple this change is, is there harm in putting it in our back pocket as an alternative if no other option pans out closer to the 9.7 release? I don't see a justifiable reason to rush this right now.

At the moment, we're just discussing (no merge) because there are vetoes. I want to continue this discussion now. If the vetoes were lifted, then sure, it could be delayed, but I wouldn't want to jeopardize it by missing the release.

I don't think anyone is saying we don't make it configurable.

I haven't quite heard anyone say that either, but I haven't heard you or Rob suggest an alternative way to make it configurable. I've heard suggestions of writing new codecs or subclassing codecs, but that's not configurability; it's customizability at best. And IMO it's too much to ask of most users.

I'm proposing we do it via Lucene95HnswVectorsFormat as per #12309 so we can provide better bwc guarantees instead of a global system property that's harder to provide bwc.

I just saw #12309, which is strictly about moving the limit, not about making it configurable. I like that it's focused on that, and I welcome that change. This PR here (#12306) is about the configurability characteristic.

How is a global system property hard to provide backwards compatibility for?

Use a logger instead of stderr
Log a warning about unsupportability beyond 2048
Test 2048, and test that we test it :-)
@alessandrobenedetti
Contributor

Mmm, where is 2048 coming from?
I am generally not a big fan of magic numbers, but I'm happy to change my mind if there's any rationale behind it.

@dsmiley
Contributor Author

dsmiley commented May 18, 2023

All limits are magic numbers :-). Maybe you mean, let's add a bit of documentation in the code to reflect why it is what it is? From some of the conversation threads, I recall 2048 is the highest number I saw thrown around as potentially useful. What matters here in this PR, I gather from some folks, is that it's a supportable number: tests pass with it. If you recommend a different number, then let's hear it, but I admit to not being too concerned with it. The default of 1024 matters much more, and that's not changing here.

In my latest commit, I had it log a warning if 2048 is exceeded. My thinking is that maybe somebody wants to benchmark higher values just to see if they work, without actually using them in a production setting. I don't love a hard (exception) upper limit, but I suspect some people might consider not having one a veto-worthy objection. You've got to wear armor in this community to be a part of it for long :-/
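The warn-but-don't-block behavior described here could be sketched roughly as follows; this is not the PR's actual code, and the class, method, and logger names are illustrative assumptions:

```java
import java.util.logging.Logger;

// Sketch of warn-above-the-tested-maximum behavior: values beyond 2048
// are allowed but logged as unsupported. All names here are illustrative.
public final class DimensionCheck {

  /** Highest dimension count exercised by tests, per the discussion. */
  static final int TESTED_MAX_DIMENSIONS = 2048;

  private static final Logger LOG = Logger.getLogger(DimensionCheck.class.getName());

  /** Accepts any positive dimension count, warning above the tested maximum. */
  static int checkDimensions(int dims) {
    if (dims <= 0) {
      throw new IllegalArgumentException("dimensions must be positive: " + dims);
    }
    if (dims > TESTED_MAX_DIMENSIONS) {
      LOG.warning("vector dimensions " + dims + " exceed the tested maximum of "
          + TESTED_MAX_DIMENSIONS + "; this configuration is unsupported");
    }
    return dims;
  }

  public static void main(String[] args) {
    System.out.println(checkDimensions(1024)); // within the tested range, no warning
    System.out.println(checkDimensions(4096)); // logs a warning but still returns
  }
}
```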

@alessandrobenedetti
Contributor

I mean, yeah, I hate limits that are not related to typing constraints, and I agree that in that case too, they are magic numbers behind the scenes :)

Still, I would question why we'd show a warning for 2049 and not for 2047.
To be honest, I don't have a better option, except maybe "no warning at all".

Anyway, it's not something I would enforce as a veto, as long as someone can go higher than 2048 with no consequence other than the warning.

@dsmiley
Contributor Author

dsmiley commented May 27, 2023

The PR has evolved from the first iteration. Are there remaining concerns with the PR as it is today? It shows that 2048 dimensions is tested, works, and thus is supportable. It would make a lot of our users happy (Lucene exists to serve them) and attract more users, and I don't see what harm it would bring.

@rmuir
Member

rmuir commented May 27, 2023

The PR has evolved from the first iteration. Are there remaining concerns with the PR as it is today? It shows that 2048 dimensions is tested, works, and thus is supportable. It would make a lot of our users happy (Lucene exists to serve them) and lure more users, and I don't see what harm it would bring.

I still veto a system property at all. Re-read what I already wrote.

@rmuir
Member

rmuir commented May 27, 2023

Pushing new commits and pretending that they address concerns is bs. Ignoring my concerns is bs. Claiming the security manager was my only concern with the system property is an actual lie when I listed five fucking bullet points above.

stop this nonsense.

@dsmiley
Contributor Author

dsmiley commented May 28, 2023

I see you are very frustrated. I insist I'm trying to listen to you (and others), Rob; sorry if I'm clumsy in doing so. Your chief concern is supportability, both today and in perpetuity. Mike and Nick voiced this concern too. I'll put aside, for the moment, whether it's reasonable to have such a sacrosanct standard, as I don't agree with it across major releases. Nevertheless, the PR shows 2048 dimensions works today; why would it not in the future?

You seem to express that you don't really like system properties (at least not for this, anyway), but that's not really a technical argument. You expressed that you'd rather users do other things, but that's not a reason to prevent this way too. You expressed that the default codec isn't a place for leniency. I contest that that's a technical argument, and it's debatable what "leniency" means here. Unlike the first iteration of this patch, where you voiced your concerns, the version here shows 2048 is supported (unlike billions, say).

@uschindler
Contributor

Here's an example of why making the vector dimensions configurable is a bad idea: #12281
This issue shows that each added dimension makes the floating-point errors larger and sometimes also returns NaN. Do we have tests for when multiplying vectors causes NaN?
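To illustrate the general failure mode being referenced (this is a standalone sketch, not the actual #12281 bug): a plain float dot product over enough dimensions can overflow to Infinity even when every individual component and pairwise product is finite, and Infinity then turns into NaN under common follow-up operations such as dividing one overflowed value by another (as in cosine-style normalization).

```java
// Illustration of the floating-point concern: a float dot product over
// many dimensions can overflow to Infinity even though each component and
// each pairwise product is finite, and Infinity becomes NaN under
// operations like Infinity / Infinity.
public final class DotOverflow {

  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    int dims = 2048;
    float[] v = new float[dims];
    java.util.Arrays.fill(v, 3.0e18f); // finite; each product is ~9e36, also finite

    float d = dot(v, v);
    System.out.println(Float.isInfinite(d)); // true: the running sum overflowed
    System.out.println(d / d);               // NaN: Infinity / Infinity
  }
}
```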

@alessandrobenedetti
Contributor

alessandrobenedetti commented Jun 10, 2023

Here's an example of why making the vector dimensions configurable is a bad idea: #12281. This issue shows that each added dimension makes the floating-point errors larger and sometimes also returns NaN. Do we have tests for when multiplying vectors causes NaN?

I may sound like somebody who contradicts others just for the sake of it, but I genuinely believe these kinds of discoveries support the idea that making it configurable is actually a good one:
We are not changing a production system here; we are changing a library.
Enabling more users to experiment with higher dimensions increases the probability of finding (and then solving) these sorts of issues.
I suspect we are not recommending anywhere here that anyone go to prod with untested and unbenchmarked vector sizes anyway.

@mikemccand
Member

We've since enabled the Codec to set the limit, which is very expert and, I think, a safer way to change the limit than a sysprop. So can we close this one?

@mikemccand mikemccand closed this Nov 2, 2023