Make MAX_DIMENSIONS configurable via a system property. #12306
Conversation
Deprecate FloatVectorValues.MAX_DIMENSIONS because we anticipate a better home.
Sorry, I'm -1 against using a system property for this due to backwards compatibility in the future:
- if a user can set this to 1 billion with the default codec, then we must always support 1 billion dimensions going forward in future codecs. It's easy to lock ourselves out of implementations. These values should be constants, like the dimensions limit for the BKD tree.
- such insane values are not tested by anything and may cause overflows. The 9.x releases already had overflows here, twice, due to situations like this. Supported values must be tested.
- the system property is not used appropriately (e.g. with proper security checks). If there is a security manager and it doesn't allow reading this property, classes will fail to load.
- backwards compatibility/lockout IS THE MAIN CONCERN: yes, there are serious performance problems, and we need to be able to improve and fix them.
- correctness is the SECONDARY CONCERN: we should only support the numbers of dimensions that actually work.
I'm aware the current limit is not implemented or validated correctly, but it should be. Just because someone can bypass it doesn't change our back-compat guarantee. At least today, if we make a new, faster codec, we know we must support 1k dimensions for it to be the default codec. Non-default codecs can have higher limits because they don't imply this strict back compatibility.
I get your point on supportability, and I appreciate that you and others take that so seriously! Then let's have documentation around this setting to make it explicitly clear that it isn't "supported" (may or may not work; you're on your own). We could even go one step further by logging a warning to ensure the user is aware. WDYT?
Sorry, I really don't believe we should do this in the default codec. Again, you can just make an alternative non-default codec in lucene/codecs with a different limit than the default codec. I would still use a constant and test with up to the max limit if you want this codec to have any correctness.
I could also imagine some users hosting search for others might want to configure the limit lower!
The default codec, with the index back-compat guarantees that we have, is just no place for leniency or "you are on your own". You should trust that the index is supported. The lucene/codecs package is the place for doing more experimental stuff without the back-compat hassle/signups/guarantee/locking. But I still argue users should get clear exceptions and behavior there too, not integer overflows or failures to load classes. The supported limits should be tested, hence constants.
BTW thanks for your prompt and friendly responses. I have two lines of inquiry here: (A) I don't accept the rule/belief that you have in which our default codecs shall not have expert/unsupported options. Why? It seems like a preferential/opinion sort of belief and not something technically grounded. If such unsupported options exist, would the project or its users be harmed somehow? (B) I could imagine a trivial change to this PR in which we actually test 2048 dimensions by having the test runner use this system property. Maybe there will be hurdles/gotchas; I may explore this. WDYT?
I gave you the technical reasons I am against the system property. I don't want to argue on and on because you don't agree with it; we don't have to agree.
The only argument you've given against a system property specifically is:
Okay, good point. I can improve the PR to catch a SecurityException and use the default value; maybe log a warning as well.
Just to make sure I understand the -1 correctly: the concern is that if a user can set the dimension themselves, because the current Lucene version works well with that dimension for this user, but a future Lucene version no longer works (or no longer works well enough) with the dimension the user set, then the user cannot upgrade to that new Lucene version. Is this the main concern?
Please look at the code of MMapDirectory. It reads a sysprop, and the code there is the "most modern" variant for reading one. Also wrap it with AccessController.doPrivileged! It is not easily visible in MMapDirectory, because the whole code block is wrapped by AccessController.
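A minimal sketch of that pattern, assuming a holder class and the current 1024 default (both illustrative; only the property name lucene.hnsw.maxDimensions comes from this PR):

```java
import java.security.AccessController;
import java.security.PrivilegedAction;

final class MaxDimensionsHolder {
  // Today's FloatVectorValues.MAX_DIMENSIONS value, used when the property is absent.
  private static final int DEFAULT_MAX_DIMENSIONS = 1024;

  static final int MAX_DIMENSIONS = readLimit();

  private static int readLimit() {
    try {
      // Wrap the read in doPrivileged so callers whose own protection domain lacks the
      // permission don't cause class initialization to fail.
      String value = AccessController.doPrivileged(
          (PrivilegedAction<String>) () -> System.getProperty("lucene.hnsw.maxDimensions"));
      return value == null ? DEFAULT_MAX_DIMENSIONS : Integer.parseInt(value);
    } catch (SecurityException e) {
      // A security manager denied the read: fall back to the default rather than failing.
      return DEFAULT_MAX_DIMENSIONS;
    }
  }

  private MaxDimensionsHolder() {}
}
```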
This is also my concern. I am less worried about properly implementing sysprops, catching security exceptions etc. :) We should not be adding this sysprop in the first place. Sure, it is possible today to sneakily bypass Lucene's aKNN dimension limit check, with the unfortunate subclassing workaround Mike S posted on the dev list, but we should not be making it easier for users to fall into this dangerous and long-term trap (which is what this PR does). Rather, we should make it harder, e.g. by moving the limit down into the Codec itself, where the HNSW algorithm determines how it scales. If the limit is implemented in the Codec, a user would have to swap in a different Codec, or privately fork Lucene's default HNSW Codec component, making it very clear that they are entering "not supported by Lucene's back compat on its default Codec" territory.
As expressed in the poll, I like this idea; thanks @dsmiley for opening a draft pull request while we collect more opinions there. The reason I am supporting the reconfigurability option is that I would have no idea of a reasonable limit to enforce on our users.
Thanks for clarifying! I think it is great that Lucene provides back compatibility, but at the same time I think it should be possible to change the limit at your own risk (as mentioned by David Smiley, also allowing it to be configured lower than MAX_DIMENSIONS). It would be great if we could accomplish this by moving MAX_DIMENSIONS to the HNSW implementation and still give the user some way to change MAX_DIMENSIONS. Does anything speak against this?
There are non-mutually-exclusive things that could happen to help a user wanting to work with higher dimensions; this PR focuses on configurability of an implementation that already exists. If this PR were to become non-veto-worthy to others somehow, I see it needs to be a supported number of dimensions, not merely any number a user might set it to (even though IMO that'd be fine). Thanks for the tip Uwe RE MMapDirectory.
This is addressable by actually setting it to 2048 in tests to show that we support it, even though its default value should be what it is now. |
-1 for even introducing the system property option only because I think it's too trappy. We only just released Lucene 9.6 seven days ago. Why do we need to rush to this in a PR when we haven't started the 9.7 release process? Do we not have enough time for someone enthusiastic here to open a PR for #12309?
Exactly!! Where are the benchmarks!? For all these claims about OpenAI embeddings and GPT this/that/etc., we don't even have empirical evidence to pick a limit, let alone to help guide the user community on the traps associated with increasing this limit! @FranciscoBorges on #11507 you said
Vetoes are serious business, balancing harm to the project & users against the benefit the change brings to both. Thus the change needn't be perfect in every possible way but needs to be a net positive. I hope those using their veto powers weigh the practical upside to this change. The upside / reason for this change has been expressed by users very well, so I don't plan to repeat it. Please elaborate on your "trappy" argument, Nick. A user is trapped into setting this somehow, or what bad thing happens? Of course if they find the performance to be slow then they have it coming to them. There are plenty of ways to have slow indexing and OOMs, as I'm sure you all appreciate :-). I don't see why the Lucene release timeline matters. This PR aims to improve the Lucene we have today in a simple way, even if new/exciting changes are happening. Even if some new codec were to suddenly appear with a higher limit, why veto this improvement to this codec? Please keep this PR discussion focused on configurability and not other topics, like whether the default number of dimensions should change or including benchmarks. It won't change in this PR. The dev list would be a good forum for that.
💯 So let's "progress not perfection" on #12309? I still haven't heard a reason to justify the rush to a system property seven days after the last minor release.
A user configures this system property, indexes a 10-billion-dimension vector. Goes to sleep. The Lucene community wakes up, realizes empirical benchmarks and informed decisions matter, benchmarks the implementation, finds the debt ceiling shouldn't be raised and that it gave the user community a cyanide pill wrapped as a Tic Tac, and adds a different mechanism with strict restrictions to avoid a JVM crumble. The user wakes up, upgrades Lucene, which now imposes this new smart safety mechanism, and now has to jump through stupid hoops to upgrade their incompatible index. THAT kind of trap. This kind of experience should be avoided at all costs, even if it's only likely to happen 1% of the time, because (to me) even giving a terrible experience to the 1% user is unacceptable.
Given how simple this change is, is there harm in putting it in our back pocket as an alternative if no other option pans out closer to the 9.7 release? I don't see a justifiable reason to rush this right now. |
I get your point, but upgrades in general are tricky and care is necessary. With no configurability we currently leave only option 3 to our users anyway, because we decided that 1024 is a golden limit for performance to be acceptable. I will spend more time on this later, as a lot of good stuff has been added, but in relation to upgrades and back compatibility I am not sure I see them as a critical blocker (happy to change my mind).
But this isn't a perpetual one-way door. I don't think anyone is saying we shouldn't make it configurable. I'm proposing we do it via
I don't think anyone is definitive on anything right now. And we have time. No need to rush this and make a knee-jerk decision because the GPT kool-aid is so strong.
We can block that with a maximum of 2048, the tested and thus supported maximum limit. I'm presently looking at incorporating this.
At the very moment, we're just discussing (no merge) because there are vetoes. I want to continue this discussion now. If the vetoes were lifted then sure, it could be delayed, but I wouldn't want to jeopardize it by missing the release.
I haven't quite heard anyone say that either, but I haven't heard you or Rob suggest an alternative way to make it configurable. I've heard suggestions of writing new codecs or subclassing codecs but that's not configurability; it's customizability at best. And IMO it's too much to ask of most users.
I just saw #12309 which is strictly moving the limit, not about making it configurable. I like that it's focused on that, and I welcome that change. This PR here (#12306) is about the configurable characteristic. How is a global system property hard to provide backwards compatibility for? |
- Use a logger instead of stderr
- Log a warning about unsupportability beyond 2048
- Test 2048, and test that we test it :-)
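A rough sketch of what such a test could look like, assuming the test JVM is launched with -Dlucene.hnsw.maxDimensions=2048 and the default codec honors the raised limit; the test class itself is illustrative, not the PR's actual test:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TestHighDimVectors {
  public void testIndex2048Dimensions() throws Exception {
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      // 2048 dimensions would be rejected under the default 1024 limit, but should
      // index cleanly when the limit is raised via the system property.
      float[] vector = new float[2048];
      vector[0] = 1f; // any non-degenerate content is fine for a smoke test
      Document doc = new Document();
      doc.add(new KnnFloatVectorField("v", vector, VectorSimilarityFunction.EUCLIDEAN));
      writer.addDocument(doc); // should not throw with the raised limit
      writer.commit();
    }
  }
}
```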
Mmm where is 2048 coming from?
All limits are magic numbers :-). Maybe you mean, let's add a bit of docs in the code to reflect why it is what it is? From some of the conversation threads, I recall 2048 is the highest number I saw thrown around as possibly useful. What matters here in this PR, I gather from some folks, is that it's a supportable number: tests pass with it. If you recommend a different number then let's hear it, but I admit not being too concerned with it. The default of 1024 matters much more, and that's not changing here. In my latest commit, I had it log a warning if 2048 is exceeded. My thinking is that maybe somebody wants to benchmark it just to see if it works but not actually use it in a production setting. I don't love a hard (exception) upper limit, but I suspect some people might consider this a veto-worthy objection. You've got to wear armor in this community to be a part of it for long :-/
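For concreteness, a sketch of the "warn, don't reject, above 2048" idea; the class, method, and logger choice are illustrative assumptions, and only the 2048 threshold and the warning itself come from the discussion above:

```java
import java.util.logging.Logger;

final class VectorDimensionLimits {
  private static final Logger LOG = Logger.getLogger(VectorDimensionLimits.class.getName());
  private static final int TESTED_MAX = 2048; // highest value exercised by tests

  static int checkConfigured(int configuredMax) {
    if (configuredMax > TESTED_MAX) {
      // Don't fail hard: allow benchmarking/experimentation, but make it clear the
      // value is beyond what is tested and therefore supported.
      LOG.warning("lucene.hnsw.maxDimensions=" + configuredMax
          + " exceeds the tested maximum of " + TESTED_MAX + "; this is unsupported.");
    }
    return configuredMax;
  }

  private VectorDimensionLimits() {}
}
```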
I mean, yeah, I hate limits that are not related to a typing constraint, and also in that case I agree they are magic numbers behind the scenes :) Still, I would question why we show a warning for 2049 and not for 2047. Anyway, it's not something I would enforce as a veto as long as someone can go higher than 2048 with no consequence except the warning.
The PR has evolved from the first iteration. Are there remaining concerns with the PR as it is today? It shows that 2048 dimensions is tested, works, and thus is supportable. It would make a lot of our users happy (Lucene exists to serve them) and lure more users, and I don't see what harm it would bring.
I still veto a system property at all. Re-read what I already wrote.
Pushing new commits and pretending that they address concerns is bs. Ignoring my concerns is bs. Claiming the security manager was my only concern with the system property is an actual lie when I list 5 fucking bullet points above. Stop this nonsense.
I see you are very frustrated. I insist I'm trying to listen to you (and others), Rob; sorry if I'm clumsy in doing so. Your chief concern is supportability, both today and in perpetuity, forever and ever. Mike & Nick voiced this concern too. I'll put aside, for the moment, whether it's reasonable to have such a sacrosanct standard, as I don't agree with it across major releases. Nevertheless, the PR shows 2048 dimensions works today; why would it not in the future? You seem to express that you don't really like system properties (at least not for this, anyway), but that's not really a technical argument. You expressed you'd rather users do other things, but that's not to say we should prevent this way. You expressed the default codec isn't a place for leniency. I contest that it's a technical argument, and it's debatable what "leniency" means here. Unlike the first iteration of this patch where you voiced your concerns, the version here shows 2048 is supported (unlike billions, say).
Here's an example of why making the vector dimensions configurable is a bad idea: #12281
I may sound like somebody who contradicts another just for the sake of doing so, but I genuinely believe these kinds of discoveries support the fact that making it configurable is actually a good idea:
We've since enabled Codec to set the limit, which is very expert and, I think, a safer way to change the limit than a system property.
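A sketch of what that Codec-level route might look like; the getMaxDimensions(String) hook and the Lucene95 class names reflect my understanding of the API and should be treated as assumptions to verify against the Lucene version in use:

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Delegates the actual HNSW reading/writing to the stock format and only raises the
// per-field dimension limit for indexes that explicitly opt in to this format.
public final class HighDimHnswVectorsFormat extends KnnVectorsFormat {
  private static final int MAX_DIMS = 2048;
  private final KnnVectorsFormat delegate = new Lucene95HnswVectorsFormat();

  public HighDimHnswVectorsFormat() {
    super("HighDimHnswVectorsFormat");
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    return delegate.fieldsWriter(state);
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return delegate.fieldsReader(state);
  }

  @Override
  public int getMaxDimensions(String fieldName) {
    return MAX_DIMS; // higher than the default codec's limit; back compat is now on you
  }
}
```

A format like this would presumably also need SPI registration (META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat) so that indexes written with it can be read back by name.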
The principal objective of this PR is to make it easier for users to use Lucene with higher dimensions. While it's technically possible to circumvent the existing limit, it's a non-obvious, awkward hack that either the user would need to figure out or would be baked into higher-level search platforms. It's so obscure that most Lucene committers didn't even know it was possible! Thus the practical effect for users now is that it's not possible, so they don't use Lucene.
The system property proposed here is "lucene.hnsw.maxDimensions".
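For illustration, assuming the property is read once at class initialization (as sketched earlier in this thread), a user would opt in like this:

```java
public class Example {
  public static void main(String[] args) {
    // Usually the property would be passed on the JVM command line:
    //   java -Dlucene.hnsw.maxDimensions=2048 ...
    // Setting it programmatically also works, but only before any Lucene class that
    // reads the property has been initialized (otherwise the cached value is fixed):
    System.setProperty("lucene.hnsw.maxDimensions", "2048");
  }
}
```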
I also deprecated the field in anticipation that it will move to a codec-specific place. Regardless of if/when that happens, I don't think we want to advertise this limit where it is now: a surface-level Lucene API whose present value is based on the default codec and may not make sense for other vector codecs. Moving it is out of scope of this PR.
AFAICT, this limit is merely an ergonomics kind of limit to keep the user from shooting themselves in the foot. The underlying codec we have can read/write an arbitrary number of dimensions. CheckIndex validates that the number of dimensions is greater than zero but doesn't enforce a maximum.