[WIP](POC) Add SuperBIG extension to simulate an unlimited fuzzy virtual context #1548
Conversation
This sounds amazing. It also reminds me of https://github.com/wawawario2/long_term_memory
What is this dark sorcery? 🤯
The ChromaDB instance runs locally and is stored only on your computer. Additionally, I intercepted all analytics calls to posthog yesterday, so no analytics are collected.
Not a summary. It dumps the contents as-is and uses (by default) cosine similarity search on the embeddings to retrieve the relevant portions.
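For reference, a minimal sketch of what that retrieval step looks like in general terms, not the extension's actual code; the embedding model name and helper function here are illustrative assumptions:

```python
# Minimal sketch of embedding-based retrieval with cosine similarity.
# The model name and chunk list are illustrative, not the extension's actual code.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Embed the stored chunks and the query, then rank by cosine similarity.
    chunk_vecs = model.encode(chunks)                 # shape: (n_chunks, dim)
    query_vec = model.encode([query])[0]              # shape: (dim,)
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```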
It's completely offline :)
Looks like it should provide better accuracy. Sure, I'll add it.
This is miracle-tier. I copied and pasted the Stable Diffusion paper, resulting in 36946 tokens in the input textfield. Then I asked what datasets were mentioned in the paper, and it gave a factual answer. I see that you are using a template based on Alpaca. Can the extension be generalized somehow to work with other instruction templates or in chat mode? The ability to use external data sources as input would also be cool, like a
Yes, give me time lol
Yes. I'm working on it, please give me time lol
This is very cool. Maybe a button to refresh the uploaded document in case you re-edit it? Or would it be refreshed automatically when you make changes and save it?
@snicolast If you are using the ### Data approach, it is not currently cached. Editing the text will modify the chunks. In the update I will post later, I will add the ability to refresh individual sources.
Chat characters will likely need some more accessible functionality so that appropriate context can be loaded on character switch (this is directed more at ooba than kaioken, since character support is likely a later concern). I'm assuming it's grabbing things based on ChromaDB collections, so it should be easy enough to pass that through, assuming it's accessible/hookable. I haven't looked at the extension API in a bit, so forgive me if I'm out of date or off base, but I don't believe it had any way to hook character changes.
I think some are confused about the goals of the extension, and I really just uploaded what was needed to set it up without going into depth on what the ideal state actually looks like, so let me explain. Keep in mind I'm cutting a lot out and just shitting out what's in my head.

The ability to use an embedding store as a database of text is not new. It has been around for a long time, and it is a simple concept. Cosine similarity allows us to take an input text, get its embeddings, and use those embeddings to find relevant text in an embedding store. If that was all this extension was going to be -- but for context -- I wouldn't have even bothered to write the code. You can do this with Langchain out-of-the-box with some simple steps. But that is not the only thing I intend to do here.

LLMs are notoriously bad at picking out the finer bits of context to use when generating the output. It is difficult to get them to comply with minute instructions and small details in the input prompt. Vector stores help with that by letting us filter the prompt down to only the portions that are most relevant to the input, and they can hold much more data than the LLM can in its limited context size -- but they're still very lossy. This is where Focus comes in. This is the actual goal of the extension, and I'm in the process of adding it.

What does that mean? The way vector stores are usually used is storing text information and retrieving it semantically -- using natural language. The retrieved text is injected into the prompt, and you're done with it until the next input comes along. But what if we stacked a layer in between the input and the vector store -- specifically, a layer that can be configured based on any data, not just the input prompt, and that guides the retrieval of information from the vector store? Not a where filter, but something more complex. What we could do is use a simple list of embeddings that are themselves stored in a vector store. We compare the input embeddings and/or any other miscellaneous data to retrieve the top 1 result from the list. The result is a string, but it can be mapped to anything, and we can use the mapping to further configure how, what, or even when we retrieve data from the context store.

Think of this scenario: you have a source added for the homepage of Food.com or some other recipe site. You ask the model, "What kinds of recipes can you help me make?". In the Focus store, we have a list of operators that may or may not be correlated with input intent, including the operation "list". Using the input prompt, we fetch the top 1 result from our Focus operators, and it comes back with the "list" operation. We mapped "list" to a function that parses the webpage and formats it into distinct sections of links and link labels; the function then runs that list of sections through another distance calculation to get appropriate headings for each section (so the site directory gets the heading "navigation", and the section with the recipes is marked "recipe"). Rather than returning the raw data from the context store, we instead return this transformed version, and by running a calculation again on the input and this new data, we get back the chunk that contains the recipe list. The LLM now only sees the most relevant information from the site, so it should have no problem answering the original question. That is the power of Focus -- we can program the reasoning of the LLM.
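To make that flow concrete, here is a rough, purely illustrative sketch of the routing idea. None of these names come from the extension; the similarity helper, the single "list" transform, and the split-on-blank-lines parsing are all assumptions:

```python
# Illustrative sketch of the Focus routing idea (all names hypothetical).
# Flow: input -> top-1 operator from a small "focus" store -> mapped transform
# -> retrieval over the transformed data instead of the raw context store.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def nearest(query: str, candidates: list[str]) -> str:
    # Cosine similarity between the query and each candidate string.
    vecs = model.encode(candidates)
    q = model.encode([query])[0]
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    return candidates[int(np.argmax(sims))]

def list_transform(raw_page: str) -> list[str]:
    # Hypothetical "list" operator: split the page into sections so retrieval
    # can target, e.g., the section that actually contains the recipe links.
    return [s for s in raw_page.split("\n\n") if s.strip()]

OPERATORS = {"list": list_transform}  # only one operator, for brevity

def focused_retrieve(user_input: str, raw_page: str) -> str:
    op_label = nearest(user_input, list(OPERATORS))   # top-1 focus operator
    sections = OPERATORS[op_label](raw_page)          # transform the source
    return nearest(user_input, sections)              # retrieve from the transformed view
```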
Not directly at the model level, but by limiting what it sees in the context. We can control how the LLM "reasons" by controlling what it can see with simple operations, giving us finer control on the output of the model. Let's look at another example from an anon on /lmg/:
Focus operators will solve this issue. Imagine in the future we have some sort of giant "Personality Compendium", just a huge TXT where each section is labeled with a personality trait and some characteristics of that trait defined as focus operations. We have a Character Focus layer set up that pulls from the compendium, the character personality definition file, and the character's memory file -- note that we can have multiple Focus layers pulling from different combinations of files/sources, and each Focus layer can have multiple steps.

One of the compendium sections -- "survivalist" -- contains the traits for hardy characters and has a label-type focus operation that looks like this (spitballing on the syntax): This has a top-level label "ambience", a sub-label "dangerous", and a final label "weapon". The intent is to tell the Focus layer that when the ambience of the scene is dangerous, it should find all context chunks in the character memory bucket that relate to "weapon". We can even add recency bias and also the character name if we wanted some automagic formatting behind the scenes (spitballing on the syntax): This says, if the ambience is dangerous, search all chunks that relate to "weapon + character name" in this character's memory file, with a bias towards more recent information.

A situation occurs, some text is generated, and we run it through the Character Focus layer. We first do an embedding search for the output text + the word "ambience", with a defined list of outputs, one of them being "dangerous". We could even have this step further guided by the character's personality -- maybe they have a bias towards perceiving situations as dangerous, giving it a stronger weight. Remember -- this is all configurable. If we get "dangerous" as the top 1 result, this matches the focus operator of our survivalist character, so we now do another embedding search for the word "weapon" + character name, with a bias towards more recent chunks in the memory bucket. We get 3 matches, maybe one of them referencing the character having a weapon. We then inject the retrieved chunks into the context. Blah blah blah, other chunks are generated via other operators and also added to the pool.

Now, when the character's response is generated, they can only see the personality traits that apply to their character, the situation at hand, and the fact that they have a weapon, along with other information. Because the context has been massaged, the LLM now has the optimal information for playing the character's role. In the ideal state, users would just drop in a simple character card and a compendium, and their character would stay true to their personality, no matter what. Regardless of context size, regardless of memory limitations, regardless of the inherent reasoning capabilities of the underlying model.

Do I have a personality compendium on-hand? No. Do I know what the syntax will look like? No. But that's not the point. This would allow you to program the output of the model without having to find the perfect dataset, or the best finetune, or the best generation settings, or retraining a model, or waiting for a model that has 1 bajillion context that can't fit onto 24 GB of VRAM.
Sorry for the massive wall of text, but hopefully it illustrates the schizo paradise I have concocted in my head
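Purely as a hypothetical illustration of the label-type focus operation described above: the author explicitly has not settled on a syntax, so every field name and value below is an assumption, written as a plain Python dict only to pin the idea down:

```python
# Hypothetical rendering of the "survivalist" focus operation described above.
# The syntax is not settled; this only captures the stated intent.
survivalist_focus = {
    "ambience": {
        "dangerous": {
            "search": "weapon {character_name}",  # query run against the memory bucket
            "bucket": "character_memory",
            "recency_bias": 0.5,                   # weight more recent chunks higher
            "top_k": 3,
        }
    }
}
```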
I can confidently say that I have no idea what you are trying to do, but it sounds smart and I'm rooting for you and for Focus.
Just gonna do the same and give some thoughts on project structure to consider if you haven't already. Obviously, ignore them if you've already thought about them and decided to go a different direction. The basic framework sounds logical and practical broadly speaking; I think the wider thought process here would be project structure and planning.

There are non-character use cases that people will likely want, so it's possible the baseline here (SuperBIG) is just a simple sort of dropdown-based collection switching so people can search against whatever larger dataset: a box to handle adding big dumb blocks of text into a named collection, and then either using one at a time via the dropdown or adding a

As to Focus and the broader character specifics, the overall sounds good (ignoring the tragic inability to weight context in LLMs currently). There hasn't been much discussion of making use of multiple calls to the LLM for things like classification, and even the Tavern Extras uses a specific mood model. It could absolutely be worth giving a context to the loaded model and just saying "Out of these terms () which best fits the above scene." and seeing how it does. That'd be painful for CPU users potentially (though llama.cpp uses the GPU for processing now), but with the lookup slowdowns being what they're likely to be for the ChromaDB search anyway, it's potentially worth looking into. If it is consistent, that would help cut down on manual classification.

Obviously for personalities, that's something better set in cards and pulled out based on just parsing card data. webui could have a more thorough UI for all that (something like agnai) so that the character card data is in a predictable format and queryable. Again, that's more of an ooba thing and out of scope for this PR specifically, but likely of value in the longer term for many extensions.

If the character compendiums are more something you're thinking of as examples of a given set of reactions to stimuli/different ambiences, that would likely still be worth doing for maintaining particular character speech quirks for sure. That's definitely a huge value and doesn't really preclude what I've mentioned above, I don't think. More explicit, orderly data about the characters will help optimize the other moving parts.

Given those possibilities, you could even just pass another classification request to the LLM, saying "Given the situation above, which of these personality traits is most likely to be relevant?" Or even "which personality trait is most likely to be relevant in a situation?" for a shorter-context version. Obviously, there would need to be a level of confidence in the LLM's classification prowess there, but it could save on the need for more complex character compendiums, though those would obviously still help augment it.

Godspeed. Hope any of that was helpful or thought-provoking.
@digiwombat Thanks for the feedback
I have no plans to include that kind of functionality in this extension; if users want it, they can do it using Langchain. Making multiple calls to the LLM is slow and requires rolling the dice hoping you get a properly formatted response back, and it only gets worse as the parameter count goes down. I'm not too worried about the performance of the vector search since it can be optimized, but using a whole LLM for this would just end up boomeranging the problem. Remember, the goal is to refine the context so that 1) users can pretend they have a much larger context than they actually do, and 2) the LLM can reason better since it sees properly scoped pieces of information from the fake context. Using an LLM that suffers from 2 to get around 2 isn't an approach I will use.
Yes, if Focus works, it would let you program the context of the model. Like typical programs, you would just load a pre-built API or library that focuses your character's responses and be on your way.
A vector search is more than enough. Again, thanks for the feedback and ideas!
Preprocessing text can help Focus too. Transforming English to Simple English inherently cuts the vocabulary from 170K words to 2.5K while not losing context, making it easier for the model to understand and for the vector search to find matches (less variability). One can use that parameter to adjust meta-model fidelity (how large the preprocessing vocabulary is). Using the most frequent nouns as the main nodes across the vector store map reduces the embedding space while giving both the model and the vector search a stronger set of words to find the right vector chunk (the most frequent words will appear the most in most original text). And you can compress text by using a unique data dictionary of emojis, again against the most frequent words, so your token count drops while your search speed increases. Generating a knowledge base framework first, then adding as many category maps and their subsets (definitions, examples) as possible to it, would also help create a generalized vector store across use cases. Based project.
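A minimal sketch of the frequency-based substitution idea from this comment, assuming a hand-picked symbol set; the word-length cutoff, emoji dictionary, and whitespace tokenization are all illustrative choices, not part of the extension:

```python
# Sketch of the compression idea above: map the most frequent words to
# single-character stand-ins (emoji here) so a chunk uses fewer tokens while
# the vector search keys on a smaller, more consistent vocabulary.
from collections import Counter

def build_dictionary(corpus: str, symbols: str = "🍎🍊🍋🍉🍇🍓🍒🥝", min_len: int = 4) -> dict[str, str]:
    # Pick the most frequent longer words and assign each a single symbol.
    words = [w.lower() for w in corpus.split() if len(w) >= min_len]
    most_common = [w for w, _ in Counter(words).most_common(len(symbols))]
    return dict(zip(most_common, symbols))

def compress(text: str, dictionary: dict[str, str]) -> str:
    return " ".join(dictionary.get(w.lower(), w) for w in text.split())

def decompress(text: str, dictionary: dict[str, str]) -> str:
    reverse = {v: k for k, v in dictionary.items()}
    return " ".join(reverse.get(w, w) for w in text.split())
```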
The focus operator and focus layer concept far exceeds the scope of this project, and honestly I think it's equally or more important. What you're describing basically improves the output with an exponential roll-off of diminishing returns as a product of how well it prevents chain queries (compared with chains of thought), and it seems it's already working very well for that. Writing Focus layers is an inexhaustible resource, since from what I can gather you're talking about writing an ecosystem of simple programming functions. On top of that, after the massive upgrade of having a very simple Focus ecosystem, the exponential roll-off of pushing in this direction can be delayed further once you talk about having AI manage Focus layers, and potentially even having AI write the focus operators and compendiums.

That being said, I think there's something much, much larger here. Because of that, I'd love to see you add this to a separate project instead of leaving it as a draft. I understand you don't want to work on the applications that use this themselves, but having a starting point for a Focus API with a center to rally around and a simple example of some compendiums would be a huge boon. If you don't, I'll be working on this on my own, since it seems very, very big.

Thanks for the idea mang, sounds awesome
@hydrix9 Thanks for the feedback. There are 2 parts to the extension: pseudocontext (virtual prompts) and Focus. The next update structures the project better for packaging into a separate module (much later down the road). For now, I think it is fine living here :)
@kaiokendev what do you think of merging a simplified version of the extension and then building on top of that in the future? |
@oobabooga Sorry, I've been busy working on another LoRA. I will upload what I have later today. It will not be complete, but maybe others can modify it in the meantime. |
I look forward to seeing whatever you've come up with so far.
@oobabooga Sorry for the delay, my GPU was busy doing something else for the day. I have updated the extension. Note it is not complete:
I did not even check whether it breaks if you don't provide a source. Anyway, here is an example of a prompt:
Note: it's very dirty code, please don't mock it lol
Ah, I forgot to mention I am not sure if ### Data will work in this state, since I changed so much.
Settings and Sources tabs are empty with the current pull:

```python
with gr.Tab('Sources'):
    pass
with gr.Tab('Settings'):
    pass
```

Or they are potentially compressed so efficiently as to appear to be empty. The end result for UX is the same either way. It's possible this is intentional and my brain refused to read the words in kaiokendev's post above.
@digiwombat It is intentional; I wanted to add those tabs but got sidetracked by other work. There is a long list of things I intended to add, but I haven't touched the code in a week or so, and I won't be able to work on it for at least another 2 weeks.
No worries, I thought it might be. Ooba, if you merge, those lines should likely be commented out for now, assuming you don't want to answer that question about a million times.

Mostly making notes here for when kaioken is done with 🌶️thedataset🌶️ or when ooba takes over this version of the PR. All testing was in notebook mode.

Initial testing definitely worked for URL loading; however, if I try to run anything else before loading the URL, I get an index out of range error. I tried this with both

Likewise, after loading the URL (I can load basically any URL without issue) to clear the error, it does not appear to make new sources if I just copy-paste a bunch of text in (as opposed to another URL), and it continues to say "the best source seems to be " and never tries to chunk the context without the addition of a URL.

When dealing with URLs alone, I tried loading a fairly heavy Reddit thread (using old.reddit) and got the following error:
Further attempts to ask questions didn't lead to a reattempt at scanning the URL; it just used an old URL if prompted (presumably because the hash was already present and the failure left the empty collection in place). Scanning of threads on other websites was generally successful. I was testing against 13b, so regens were necessary, but that is expected.

The console has useful outputs until the Settings and Sources UI tabs are completed. I'd consider those a high priority for baseline usability. Without them, it's hard to know what sources are being considered without reading the console output (fine for now).

EDIT: It is mentioned in the console output, but it is probably also worth disclaiming that the databases are transient and will presumably be lost when webui is closed. (They do persist through disabling and re-enabling superbig without restarting webui fully, which may or may not be desired functionality.)
Some other user mentioned they want to integrate this into SillyTavern, so I will fix these issues to get it to a usable state, add the UI for adding sources, and push tomorrow.
I have taken the liberty of hijacking this PR and simplifying it. My adapted implementation works as follows:
I'm skeptical of the benefit of OOP here; the code ends up being more about defining the structures than effectively doing something. This procedural implementation should make it straightforward to expand the extension to get data from text files or URLs in the future. I'll try to expand it to chat mode in the next few days by using the ChromaCollector to select the most relevant question/reply pairs based on the new user input.
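For illustration, a rough sketch of that chat-mode idea using the plain chromadb client rather than the extension's own ChromaCollector wrapper; the collection name, prompt formatting, and helper functions are assumptions:

```python
# Sketch: store past question/reply pairs in ChromaDB and pull back the most
# relevant ones for the new user input, to be injected into the chat prompt.
import chromadb

client = chromadb.Client()  # in-memory instance; nothing leaves the machine
history = client.create_collection("chat_history")

def remember(turn_id: int, question: str, reply: str) -> None:
    # Each stored document is one question/reply pair from the conversation.
    history.add(documents=[f"User: {question}\nBot: {reply}"], ids=[str(turn_id)])

def relevant_turns(new_input: str, k: int = 3) -> list[str]:
    # Return the past exchanges most similar to the new user input.
    if history.count() == 0:
        return []
    results = history.query(query_texts=[new_input], n_results=min(k, history.count()))
    return results["documents"][0]
```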
This is a pretty aggressive structure and feature set reduction. Should that warrant a rename in case @kaiokendev wants to roll the larger feature set out into a repo of its own so there won't be any confusion? Obviously, pending his thoughts on the matter overall.
Did more testing this morning. The boxes only show up in notebook mode; whoops for me on that one. If possible, I'd say the tab shouldn't show up in modes it doesn't support, though I'm not sure if that functionality is possible at the moment. It will likely be pretty confusing otherwise. Also, it's anecdotal, but I feel like my responses are being processed more slowly than on the original branch. Not sure if it's a change in batching or search logic, since I didn't go super deep.
I moved the unsimplified version to https://github.com/kaiokendev/superbig and will post the fixes there later, along with future changes |
Definitely recommend a name change on the simple version in light of that.
I'm freaking out about how amazing this is. Works great for pasting text from stuff like GameFAQs.
The way to build scripts is similar to WESTWORLD.
Unfortunately I am not a coder, but I have an idea: persistent storage. Bonus: I am sure everyone has seen this, but maybe a small second search-specialty model in the process could help somehow? I am trying to figure out how to write a character/mode now to help, but I have no idea what, if anything, will help get EdgeGPT working better.
This PR showcases a proof-of-concept extension that generalizes the idea of using a vector store to index large documents, faking a larger (fuzzy/lossy) context window by dumping the prompt into ChromaDB and using retrieval methods to extract the relevant portions back into the real context. It does not extend the context length of the base model or modify the underlying model architecture in any way, and it can be used with any model as a base. The settings can be tuned to yield better results.
ELI5: This extension wraps your model's context in a virtual context of unlimited size - think of it like a swapfile or pagefile.
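As a rough sketch of that "swapfile" flow, assuming ChromaDB's standard client API and an Alpaca-style instruction template; the 700-character chunking and the function names are illustrative, not the extension's actual code:

```python
# Sketch: the oversized data never enters the real context. It is chunked into
# ChromaDB, and only the chunks most relevant to the question are "paged" back in.
import chromadb

client = chromadb.Client()

def build_virtual_context(big_text: str, chunk_chars: int = 700):
    # Chunk the oversized data and index it instead of feeding it to the model.
    chunks = [big_text[i:i + chunk_chars] for i in range(0, len(big_text), chunk_chars)]
    collection = client.get_or_create_collection("pseudocontext")
    collection.add(documents=chunks, ids=[str(i) for i in range(len(chunks))])
    return collection

def real_prompt(collection, question: str, k: int = 5) -> str:
    # "Page in" only the chunks most relevant to the question.
    hits = collection.query(query_texts=[question], n_results=k)["documents"][0]
    context = "\n".join(hits)
    return f"{context}\n\n### Instruction:\n{question}\n\n### Response:\n"
```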
The PR only includes a naive method for non-instruct and instruct modes. Using this base, more sophisticated retrievers and chunking logic could be added to yield substantially better results.
The performance impact is not as severe as expected, since it does not actually put the entire contents into memory. The main performance hit comes from the indexing, so there is still performance loss.
Output generated in 9.00 seconds (5.89 tokens/s, 53 tokens, context 1008, seed 1239992403)
Output generated in 7.20 seconds (5.42 tokens/s, 39 tokens, context 1196, seed 9253882)
Output generated in 9.73 seconds (3.49 tokens/s, 34 tokens, context 1326, seed 747769285)
Below are a few tests done using the entire contents of Bulatov et al. 2023 (RMT), https://arxiv.org/pdf/2304.11062.pdf, and https://huggingface.co/tsumeone/llama-30b-supercot-4bit-128g-cuda
With the extension, the model was able to correctly answer questions about portions of the paper.
Additionally, the chunk size was set to 700 characters, with the LLaMa-Precise preset, seed 1, and the following prompt format:
As you can see, due to the naive approach the output is not able to accommodate abrupt stops in the data (but it almost finished a correct answer):
Will leave this PR in a draft state for now. More work needs to be done on better chunking/retrieval. Different types of prompting modes (novel generation, chat, QA) have fixed formats that should make it easy to create specialty chunkers/retrievers that yield consistently good results. Further, more elaborate retrieval schemes could be created, such as caching one portion of the data while looking at the output of the model to retrieve subchunks based on the progress of the answer.
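As one hedged example of what a less naive chunker could look like (packing whole sentences up to the character budget instead of cutting every N characters mid-sentence, which is what causes the abrupt stops noted above), purely illustrative and not part of this PR:

```python
# One possible direction for a better chunker: keep sentences intact up to the
# character budget. The sentence splitter here is deliberately crude.
import re

def sentence_chunks(text: str, max_chars: int = 700) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```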
edit: More sample images
Using ### Data as the entire page contents from https://animetranscript.fandom.com/wiki/Enter:_Naruto_Uzumaki!
Using ### Data as the entire page contents from https://en.wikipedia.org/wiki/Alfred_Shout
TL;DR: it works. If anyone is interested in helping, I would appreciate it.
edit: I put this as a PR to the main repo since I think most people would want this when it's in a complete state.