
[WIP](POC) Add SuperBIG extension to simulate an unlimited fuzzy virtual context #1548

Merged
merged 10 commits into oobabooga:main on May 7, 2023

Conversation

kaiokendev
Contributor

@kaiokendev kaiokendev commented Apr 25, 2023

This PR showcases a proof-of-concept extension that generalizes the idea of using a vector store to index large documents in order to fake a larger (fuzzy/lossy) context window: the prompt is dumped into ChromaDB, and retrieval methods extract the relevant portions back into the real context. It does not extend the context length of the base model or modify the underlying model architecture in any way, and it can be used with any model as a base. The settings can be tuned to yield better results.

ELI5: this extension wraps your model's context in a virtual context of unlimited size - think of it like a swapfile or pagefile.

The PR has only a naive method for non-instruct and instruct modes. Using this base, more sophisticated retrievers and chunking logic could be added to yield substantially better results.
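
Conceptually, the naive flow looks something like the sketch below (illustrative only, not the extension's actual code; it assumes the chromadb Python client and uses the 700-character chunk size from the tests further down):

# Minimal sketch of the naive pseudocontext flow (illustrative, not the
# extension's actual code): chunk the oversized prompt data, index it in a
# local ChromaDB collection, then pull back only the chunks relevant to the
# question and splice those into the real (limited) context window.
import chromadb

def chunk_text(text, size=700):
    # Naive fixed-size character chunking, matching the 700-character
    # setting used in the tests below.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_pseudocontext(data, question, n_results=3):
    client = chromadb.Client()  # in-memory and fully local
    collection = client.create_collection("pseudocontext")
    chunks = chunk_text(data)
    collection.add(documents=chunks,
                   ids=[f"chunk-{i}" for i in range(len(chunks))])
    # Retrieve only the chunks most similar to the question.
    hits = collection.query(query_texts=[question], n_results=n_results)
    return "\n".join(hits["documents"][0])

The real extension layers prompt-format handling on top of this, but the core idea is the same.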

The performance impact is not as severe as expected, since it does not actually put the entire contents into memory. The main performance hit comes from the indexing, so there is still some performance loss:
Output generated in 9.00 seconds (5.89 tokens/s, 53 tokens, context 1008, seed 1239992403)
Output generated in 7.20 seconds (5.42 tokens/s, 39 tokens, context 1196, seed 9253882)
Output generated in 9.73 seconds (3.49 tokens/s, 34 tokens, context 1326, seed 747769285)

Below are a few tests done using the entire contents of Bulatov et al. 2023 (RMT), https://arxiv.org/pdf/2304.11062.pdf, with https://huggingface.co/tsumeone/llama-30b-supercot-4bit-128g-cuda as the model.
With the extension, the model was able to correctly answer questions about portions of the paper.
Additionally, the chunk size was set to 700 characters, with the LLaMa-Precise preset, seed 1, and the following prompt format:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are ArxivGPT, trained on millions of Arxiv papers. You always answer the question, even if full context isn't provided to you. The following are snippets from an Arxiv paper. Use the snippets to answer the question. Think about it step by step

### Data:
[...contents of the paper...]

### Input:
[question]

### Response:
...
### Input:
What problems did they experience during testing?

### Response:
They experienced issues such as overfitting, lack of data diversity, and poor performance on out-of-domain examples.
...
### Input:
What were the longest input sizes possible with Longformer, Big Bird, and Long T5?

### Response:
The longest Longformer, Big Bird, and Long T5 models reported in their respective papers have a maximum length of less than 33,000 tokens.
...
### Input:
How did they evaluate reasoning with facts?

### Response:
To evaluate reasoning with facts, they used a more complicated task, where two facts were generated and positioned randomly within the input sequence. A question was then posed at the end of the sequence which could only be answered correctly by utilizing one of the facts.

As you can see, due to the naive approach the output is not able to accommodate abrupt stops in the data (but it almost finished a correct answer):

### Input:
How well does RMT generalize to different sequence lengths?

### Response:
The ability of RMT to generalize to longer sequences also emerges with a growing number of training segments. After being trained on 5 or more segments, RMT can generalize nearl

Will leave this PR in a draft state for now. More work needs to be done to make a better chunker/retriever. Different types of prompting modes (novel generation, chat, QA) have fixed formats that should make it easy to create specialty chunkers/retrievers that yield consistently good results. Further, more elaborate retrieval schemes could be created, such as caching one portion of the data while looking at the output of the model to retrieve subchunks based on the progress of the answer.

edit: More sample images

Using ### Data as the entire page contents from https://animetranscript.fandom.com/wiki/Enter:_Naruto_Uzumaki!
[screenshot: naruto scene]

Using ### Data as the entire page contents from https://en.wikipedia.org/wiki/Alfred_Shout
[screenshot: alfredshout]

TL;DR: it works. If anyone is interested in helping, I would appreciate it.
edit: I put this as a PR to the main repo since I think most people would want this once it's in a complete state.

@USBhost
Contributor

USBhost commented Apr 26, 2023

This sounds amazing; it also reminds me of https://github.com/wawawario2/long_term_memory

@kaiokendev kaiokendev changed the title from "[WIP](POC) Add pseudocontext to simulate an unlimited fuzzy virtual context" to "[WIP](POC) Add SuperBIG extension to simulate an unlimited fuzzy virtual context" on Apr 26, 2023
@Brawlence
Contributor

Brawlence commented Apr 26, 2023

What is this dark sorcery 🤯

@oobabooga
Owner

oobabooga commented Apr 26, 2023

If I understand correctly, this uses an API that takes as input the beginning of the prompt (even if it goes beyond 2048 characters) and outputs a relevant summary that is then used for generation. Is that correct? If so, I wonder if there are privacy considerations to have in mind, and if an offline alternative is possible. (edit: it runs offline, as answered below)

@kaiokendev
Contributor Author

kaiokendev commented Apr 26, 2023

an API

The ChromaDB instance runs locally and is only stored on your computer. Additionally, I intercepted all analytics calls to PostHog yesterday, so no analytics are collected.

outputs a relevant summary

Not a summary. It dumps the contents as-is and uses (by default) cosine similarity search on the embeddings to retrieve relevant portions.

and if an offline alternative is possible

It's completely offline :)
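
For the curious, that default retrieval amounts to a cosine-similarity search over chunk embeddings, roughly like the sketch below (sentence-transformers is used here for illustration; the model name is an assumption, not necessarily what the extension or ChromaDB uses internally):

# Sketch of cosine-similarity retrieval over chunk embeddings (illustrative;
# the embedding model name is an assumption).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def top_chunks(chunks, question, k=3):
    chunk_emb = model.encode(chunks, convert_to_tensor=True)
    query_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_emb)[0]  # one score per chunk
    best = scores.argsort(descending=True)[:k]      # indices of the top-k chunks
    return [chunks[int(i)] for i in best]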

@paulhandy

Not a summary. It dumps the contents as-is and uses (by default) cosine similarity search on the embeddings to retrieve relevant portions
Do you think you could use @yoheinakajima's asymmetrix in its place?

@kaiokendev
Contributor Author

Do you think you could use @yoheinakajima's asymmetrix in its place?

Looks like it should provide better accuracy. Sure, I'll add it

@oobabooga oobabooga marked this pull request as ready for review April 26, 2023 21:30
@oobabooga oobabooga marked this pull request as draft April 26, 2023 21:31
@oobabooga
Owner

oobabooga commented Apr 26, 2023

This is miracle-tier. I copied and pasted the Stable Diffusion paper, resulting in 36,946 tokens in the input text field. Then I asked what datasets were mentioned in the paper, and it gave a factual answer.

[screenshot: sd]

I see that you are using a template based on Alpaca. Can the extension be generalized somehow to work with other instruction templates or in chat mode?

The ability to use external data sources as input would also be cool, like a .txt file.

@kaiokendev
Contributor Author

kaiokendev commented Apr 26, 2023

Yes, give me time lol
I am currently adding modifiable sources (so you can just put in the URL and write [[[put_my_page_contents_here]]] in the prompt) and also modular templates (so it will automatically work no matter the format, and you can edit in a new format).
I also need to make the search function modifiable and separate from the collection db implementation (just Chroma for now) to add asymmetrix, and I am working on Focuser (which will be like virtual attention, but I'll explain that more as it is completed).

@oobabooga
Owner

I have added a suggestion: putting the data into a separate textbox created by the extension. Then the user can type some special token like <|data|> in the prompt that gets replaced with the contents of that textbox automatically at generation time.

[screenshot: input]

@kaiokendev
Contributor Author

Yes. I'm working on it, please give me time lol
The current UI will have tabs; under the Sources tab you can drop a path to a file, a URL, or raw text, and give it a name for the injection point. It will then grab the contents, cache them, and replace the relevant portion of the user input with the contents of the injection. I also plan to add the ability to crawl directories or lists of URLs on webpages and find the URL from the directory that is most likely to contain the relevant information (such as adding the entire UE5 documentation root page and having it automatically find the relevant webpage for the specific thing you're asking about).

@snicolast

This is very cool. Maybe a button to refresh the uploaded document in case you re-edit it? Or would it be refreshed automatically when you make changes and save it?

@kaiokendev
Contributor Author

@snicolast If you are using the ### Data approach, it is not currently cached. Editing the text will modify the chunks. In the update I will post later, I will add the ability to refresh individual sources

@digiwombat

Chat characters will likely need some more accessible functionality so that the appropriate context can be loaded on character switch (this is directed more at ooba than kaioken, since character support is likely a later concern). I'm assuming it's grabbing things based on ChromaDB collections, so it should be easy enough to pass that through, assuming it's accessible/hookable.

I haven't looked at the extension API in a bit so forgive me if I'm out of date or off base, but I don't believe it had any way to hook character change.

@kaiokendev
Contributor Author

I think some are confused about the goals of the extension, and I really just uploaded what was needed to set it up without going into depth on what the ideal state actually looks like, so let me explain. Keep in mind I'm cutting a lot out and just shitting out what's in my head.

The ability to use an embedding store as a database of text is not new. It has been around for a long time, and it is a simple concept. Cosine similarity allows us to take an input text, get its embeddings, and use those embeddings to find relevant text information in an embeddings store. If that was all this extension was going to be -- but for context -- I wouldn't have even bothered to write the code. You can do this with Langchain out-of-the-box with some simple steps. But that is not the only thing I intend to do here.

LLMs are notoriously bad at picking out the finer bits of context to use when generating the output. It is difficult to get them to comply with minute instructions and small details in the input prompt. Vector stores help with that by letting us filter down the prompt to only the portions that are most relevant to the input, and they can hold much more data than the LLM can in its limited context size -- but they're still very lossy.

This is where Focus comes in. This is the actual goal of the extension and I'm in the process of adding it.
Focus is configurable reasoning.

What does that mean? The way vector stores are usually used is storing text information and retrieving it semantically -- using natural language. The retrieved text is injected into the prompt, and you're done with it until the next input comes along. But what if we stacked a layer in between the input and the vector store -- specifically, a layer that can be configured based on any data, not just the input prompt, and guides the retrieval of information from the vector store? Not a where filter, but something more complex.

What we could do is use a simple list of embeddings that themselves are stored in a vector store. We compare the input embeddings and/or any other miscellaneous data to retrieve the top 1 result from the list. The result is a string, but it can be mapped to anything, and we can use the mapping to further configure how, what, or even when we retrieve data from the context store.

Think of this scenario: you have a source added for the homepage of Food.com or some other recipe site. You ask the model, "What kinds of recipes can you help me to make?". In the Focus store, we have a list that contains some operators that may or may not be correlated with input intent, including the operation "list". Using the input prompt, we fetch the top 1 result from our Focus operators, and it comes back with the "list" operation. We mapped "list" to a function that parses the webpage and formats it into distinct sections of links and link labels; the function then runs that list of sections through another distance calculation to get appropriate headings for each section (so the site directory gets the heading "navigation", and the section with the recipes is marked with "recipe").

Rather than returning the raw data from the context store, we instead return this transformed version, and by running a calculation again on the input and this new data, we would get back the chunk that contains the recipe list. The LLM now only sees the most relevant information from the site, so it should have no problem answering the original question.

That is the power of focus -- we can program the reasoning of the LLM. Not directly at the model level, but by limiting what it sees in the context. We can control how the LLM "reasons" by controlling what it can see with simple operations, giving us finer control on the output of the model.
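
Focus is not implemented in this PR yet, so purely as an illustration of the idea (the operator names, mapping, and transform functions below are made up for the example), the operator-selection step could look something like:

# Hypothetical sketch of a Focus layer: pick an operation by embedding
# similarity, then let the mapped function reshape the data before the normal
# chunk retrieval runs. Everything here is illustrative, not real extension code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def format_as_sections(page_text):
    # Stand-in for the "list" operation described above: split the page into
    # sections so the later similarity search can target the recipe list.
    return page_text.split("\n\n")

FOCUS_OPERATORS = {
    "list": format_as_sections,
    "lookup": lambda text: [text],  # placeholder for another operation
}

def apply_focus(user_input, page_text):
    ops = list(FOCUS_OPERATORS)
    op_emb = model.encode(ops, convert_to_tensor=True)
    in_emb = model.encode(user_input, convert_to_tensor=True)
    best = ops[int(util.cos_sim(in_emb, op_emb)[0].argmax())]  # top-1 operator
    # The chosen operation transforms the data; the usual similarity search
    # then runs over the transformed version instead of the raw page.
    return FOCUS_OPERATORS[best](page_text)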

Let's look at another example from an anon on /lmg/:

So for example, if a character pocketed a gun early on, and you ask if they have a gun, it might remember that. But they might be in a situation where a gun is called for, and not use it

Focus operators will solve this issue.

Imagine in the future we have some sort of giant "Personality Compendium", just a huge TXT where each section is labeled with a personality trait, and some characteristics of that trait defined as focus operations. We have a Character Focus layer set up that pulls from the compendium, the character personality definition file, and the character's memory file -- note that we can have multiple Focus layers pulling from different combinations of files/sources, and each Focus layer can have multiple steps. One of the compendium sections -- "survivalist" -- contains the traits for hardy characters and has a label-type focus operation that looks like this (spitballing on the syntax):
ambience: dangerous?: memory.weapon

This has a top-level label "ambience", a sub-label "dangerous", and a final label "weapon". The intent is to tell the Focus layer that when the ambience of the scene is dangerous, it should find all context chunks in the character memory bucket that relate to "weapon". We can even add recency bias and also the character name if we wanted some automagic formatting behind the scenes (spitballing on the syntax):
ambience: dangerous?: memory.weapon +{name} +bias:time

This says, if the ambience is dangerous, search all chunks that relate to "weapon + character name" in this character's memory file with a bias towards more recent information.
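
Purely to make that spitballed syntax concrete (nothing below is implemented; the field names are invented for the example), a parser for such an operator string might look like:

# Hypothetical parser for the spitballed focus-operator syntax above, e.g.
# "ambience: dangerous?: memory.weapon +{name} +bias:time"
def parse_focus_operator(line):
    head, _, tail = line.partition("?:")
    label, _, condition = head.partition(":")
    parts = tail.split()
    source, _, target = parts[0].partition(".")        # e.g. memory.weapon
    modifiers = [p.lstrip("+") for p in parts[1:]]     # e.g. {name}, bias:time
    return {
        "label": label.strip(),          # "ambience"
        "condition": condition.strip(),  # "dangerous"
        "source": source,                # "memory"
        "target": target,                # "weapon"
        "modifiers": modifiers,          # ["{name}", "bias:time"]
    }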

A situation occurs, some text is generated and we run this through the character focus layer. We first do an embedding search for the output text + the word "ambience", with a defined list of outputs, one of them being "dangerous". We could even have this step further guided by the character's personality -- maybe they have a bias to perceiving situations as dangerous, giving it a stronger weight. Remember -- this is all configurable. If we get "dangerous" as the top 1 result, this matches the focus operator of our survivalist character, so we now do another embedding search with the word "weapon" + character name with a bias towards more recent chunks in the memory bucket. We get 3 matches, maybe one of these referencing the character having a weapon.

We then inject the retrieved chunks into the context. Blah blah blah, other chunks are generated via other operators and also added to the pool. Now, when the character's response is generated, they can only see the personality traits that apply to their character, the situation at hand, and the fact they have a weapon, along with other information. Because the context has been massaged, the LLM now has the optimal information for playing the character's role.

In the ideal state, users would just drop in a simple character card and a compendium and their character would stay true to their personality, no matter what. Regardless of context size, regardless of memory limitations, regardless of the inherent reasoning capabilities of the underlying model.

Do I have a personality compendium on-hand? No. Do I know what the syntax will look like? No. But that's not the point. This would allow you to program the output of the model without having to find the perfect dataset, or the best finetune, or the best generation settings, or retraining a model, or waiting for a model that has 1 bajillion context that can't fit onto 24 GB of VRAM.

@kaiokendev
Contributor Author

Sorry for the massive wall of text, but hopefully it illustrates the schizo paradise I have concocted in my head

@oobabooga
Owner

I can confidently say that I have no idea what you are trying to do, but it sounds smart and I'm rooting for you and for Focus.

@digiwombat

Keep in mind I'm cutting a lot out and just shitting out what's in my head.

Just gonna do the same and give some thoughts on project structure to consider if you haven't already. Obviously, ignore them if you've already thought on them and decided to go a different direction.

The basic framework sounds logical and practical, broadly speaking. I think the wider thought process here would be project structure and planning. There are non-character use cases that people will likely want, so it's possible the baseline here (SuperBIG) is just a simple sort of dropdown-based collection switching so people can search against whatever larger dataset. A box to handle adding big dumb blocks of text into a named collection, and then either using one at a time via the dropdown or adding a ### DATA block to define which collections to search against, could be useful. That would allow for simple injection into character contexts while the more complex parts (Focus/Character) get put together. That might already be your plan; I'm just laying it out as some thoughts on structure if it's not. It might be simpler to have a baseline for Big Dumb Context sets and then build the more complex system on top, rather than one-size-fits-all from the jump.

As to Focus and the broader character specifics, the overall idea sounds good (ignoring the tragic inability to weight context in LLMs currently). There hasn't been much discussion of making use of multiple calls to the LLM for things like classification and even the Tavern Extras uses a specific mood model. It could absolutely be worth giving a context to the loaded model and just saying "Out of these terms () which best fits the above scene." and see how it does. That'd be painful for CPU users potentially (though llama.cpp uses GPU for processing now), but with the lookup slowdowns being what they're likely to be for the ChromaDB search anyway, it's potentially worth looking into. If it is consistent, that would help cut down on manual classification.

Obviously for personalities, that's something better set in cards and pulled out based on just parsing card data. webui could have a more thorough UI for all that (something like agnai) so that the character card data is in a predictable format and queryable. Again, that's more of an ooba thing and out of scope for this PR specifically, but likely of value in the longer term for many extensions. If the character compendiums are more something you're thinking of as examples of a given set of reactions to stimuli/different ambiences, that would likely still be worth doing for maintaining particular character speech quirks for sure. That's definitely a huge value and doesn't really preclude what I've mentioned above, I don't think. More explicit, orderly data about the characters will help optimize the other moving parts.

Given those possibilities, you could even just pass another classification request to the LLM, saying "Given the situation above, which of these personality traits is most likely to be relevant? " Or even "which personality trait is most likely to be relevant in a situation?" for a shorter context version. Obviously, there would need to be a level of confidence in the LLM's classification prowess there, but it could save on the need for more complex character compendiums, though those would obviously still help augment it.

Godspeed. Hope any of that was helpful or thought-provoking.

@kaiokendev
Contributor Author

kaiokendev commented Apr 27, 2023

@digiwombat Thanks for the feedback

There hasn't been much discussion of making use of multiple calls to the LLM for things like classification and even the Tavern Extras uses a specific mood model. It could absolutely be worth giving a context to the loaded model and just saying "Out of these terms () which best fits the above scene." and see how it does [...] but with the lookup slowdowns being what they're likely to be for the ChromaDB search anyway, it's potentially worth looking into

I have no plans to include that kind of functionality in this extension, and if users want it, they can do it using Langchain. Making multiple calls to the LLM is slow and requires rolling the dice hoping you get a properly formatted response back, and it only gets worse as the parameter count gets smaller. I'm not too worried about the performance of the vector search, as it can be optimized, but using a whole LLM for this would just end up boomeranging the problem. Remember, the goal is to refine the context so that 1) users can pretend they have a much larger context than they actually do, and 2) the LLM can reason better since it sees properly scoped pieces of information from the fake context. Using an LLM that suffers from 2) to get around 2) isn't an approach I will use.

If the character compendiums are more something you're thinking of as examples of a given set of reactions to stimuli/different ambiences, that would likely still be worth doing for maintaining particular character speech quirks for sure. That's definitely a huge value and doesn't really preclude what I've mentioned above, I don't think. More explicit, orderly data about the characters will help optimize the other moving parts.

Yes, if Focus works, it would let you program the context of the model.
Chain approaches, like what you suggested, are focused on refining the output of the model by doing multiple deep passes over the input. Focus instead tries to achieve a better output by managing the context better. The benefits would be 1) speed -- it's much faster to do the vector search in N phases and call the LLM once than to call the LLM N times, 2) control -- unlike chained outputs, Focus is more directed, so you don't need to mess with generation parameters or finetuning, and 3) size -- because Focus does the magic outside of the LLM, you're not limited to the context size of the model.

Like typical programs, you would just load a pre-built API or library that focuses your character's responses and be on your way.

Obviously, there would need to be a level of confidence in the LLM's classification prowess there, but it could save on the need for more complex character compendiums, though those would obviously still help augment it.

A vector search is more than enough

Again, thanks for the feedback and ideas!

@grexzen

grexzen commented Apr 28, 2023

Preprocessing text can help Focus too. Transforming English to Simple English inherently cuts the vocabulary from 170K to 2.5K while not losing context, making it easier for the model to understand and for the vector search to find matches (less variability). One can use that parameter to adjust meta-model fidelity (how large the preprocessing vocabulary is).

Using the most frequent nouns as the main nodes across the vector store map reduces the embedding space while giving both the model and the vector search a stronger set of words with which to find the right vector chunk (the most frequent words will appear the most in most original text).

And you can compress text by using a unique data dictionary of emojis, again keyed to the most frequent words, so your token count drops while your search speed increases.

Generating a knowledge base framework first, then adding as many category maps and their subsets (definitions, examples) as possible to it, would also help create a generalized vector store across use cases.
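
As a toy illustration of the frequency-dictionary idea above (not part of the extension; the emoji mapping is arbitrary, and whether it actually saves tokens depends on the tokenizer):

# Toy sketch of the frequency-dictionary compression idea: map the most
# frequent words to single emoji stand-ins before indexing. Not part of the
# extension; actual token savings depend on the tokenizer.
from collections import Counter
import re

def build_compression_dict(text, n=50):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    common = [w for w, _ in Counter(words).most_common(n)]
    # Map each of the n most frequent words to one emoji code point.
    return {w: chr(0x1F600 + i) for i, w in enumerate(common)}

def compress(text, table):
    return " ".join(table.get(w.lower(), w) for w in text.split())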

Based project.


@hydrix9

hydrix9 commented Apr 28, 2023

The focus operator and focus layer concept far exceeds the scope of this project, and honestly I think it's equally or more important. What you're describing basically improves the output, with an exponential roll-off of diminishing returns as a product of how well it prevents chained queries (compared with chains of thought), and it seems it's already working very well for that. Writing Focus layers is an inexhaustible resource, since from what I can gather you're talking about writing an ecosystem of simple programming functions.

On top of that, after the massive upgrade of having a very simple Focus ecosystem, the exponential roll off of pushing in this direction can be delayed further once you talk about having AI manage focus layers, and potentially even having AI write the focus operators and compendiums. That being said, I think there's something much, much larger here.

Because of that, I'd love to see you add this to a separate project instead of leaving it as a draft. I understand you don't want to work on the applications that use this themselves, but having a starting point for a Focus API with a center to rally around and a simple example of some compendiums would be a huge boon. If you don't, I'll be working on this on my own, since it seems very, very big. Thanks for the idea mang, sounds awesome

@kaiokendev
Contributor Author

kaiokendev commented Apr 28, 2023

@hydrix9 Thanks for the feedback

There are two parts to the extension: pseudocontext (virtual prompts) and Focus.
Pseudocontext by itself is useful, but put simply, Focus is like a mini search engine we can run on the pseudocontext, with control over the behavior of the search. Because the behavior is configurable and can take into account the output of the model as well as the recursive relationships between the meanings of the input and even the search results (from the input, get the top related chunks, then get chunks related to those, and so on, or get chunks related to the input + search result 2 at depth N), it will be very powerful when fully realized.

The next update structures the project better for packaging into a separate module (much later down the road). For now, I think it is fine living here :)
Since it only needs the prompt + model output to work, it would be easy to add as an external library in other systems (llama.cpp, for example).
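
A rough sketch of the recursive, depth-N retrieval described above (hypothetical, not the extension's code; it reuses the same ChromaDB-style collection.query interface as earlier):

# Hypothetical sketch of depth-N recursive retrieval: fetch chunks related to
# the input, then chunks related to those results, and so on.
def recursive_retrieve(collection, query, depth=2, k=3):
    seen, frontier = [], [query]
    for _ in range(depth):
        next_frontier = []
        for q in frontier:
            hits = collection.query(query_texts=[q], n_results=k)
            for doc in hits["documents"][0]:
                if doc not in seen:
                    seen.append(doc)       # keep in discovery order
                    next_frontier.append(doc)
        frontier = next_frontier
    return seen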

@oobabooga
Owner

@kaiokendev what do you think of merging a simplified version of the extension and then building on top of that in the future?

@kaiokendev
Contributor Author

@oobabooga Sorry, I've been busy working on another LoRA. I will upload what I have later today. It will not be complete, but maybe others can modify it in the meantime.

@oobabooga
Owner

I look forward to seeing whatever you've come up with so far.

@kaiokendev
Contributor Author

kaiokendev commented May 6, 2023

@oobabooga Sorry for the delay, my GPU was busy doing something else for the day. I have updated the extension. Note it is not complete:

  • I still haven't tested it for chat, only notebook mode, but the latest commit adds paging, cached sources, automatic source creation from posted URLs, and automatic cleaning of webpages down to their bare text content for better search (see the sketch after this list).
  • I added source formatters (Alpaca only for now)
  • I intended to add PageRank and a light knowledge graph over each document when it is added, so we can start doing some fancier stuff.
  • I was planning on making it shrink the lowest-ranking page by 1 token for every token generated by the model, to allow for any generation length.
  • There is also a concept of windows, whereby each source's pages are contained in their own window and can be scrolled through separately at varying speeds (so in the future we can automatically scroll pages as their content is implied by the output of the model).
  • I wanted to get rid of ChromaDB since I think this dependency is not necessary -- we can make our own collector wrapper over the chosen search algorithm easily
  • Still did not add Asymmetrix (but I want to!)
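
The webpage-cleaning step mentioned in the first bullet could be approximated along these lines (a sketch using requests and BeautifulSoup; the extension's actual cleaning logic may differ):

# Sketch of fetching a URL source and reducing it to bare text for indexing
# (requests + BeautifulSoup; the extension's real cleaning may differ).
import requests
from bs4 import BeautifulSoup

def fetch_clean_text(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # drop non-content elements
    # Collapse whitespace so chunking operates on clean prose.
    return " ".join(soup.get_text(separator=" ").split())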

I did not even check whether it breaks if you don't provide a source. Anyway, here is an example prompt:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Here is a webpage:
https://www.sbert.net/docs/pretrained_models.html

### Input:
What is all_mpnet_base_v2?

### Response:
All_mpnet_base_v2 refers to a pretrained language model by Sentence Transformers which has been fine-tuned on multiple datasets such as natural questions, trivia qa, webquestions, curatedtrec etc. It is a multi-task model capable of performing several NLP tasks like semantic search, text summarization, machine translation, sentiment analysis, named entity recognition, and others.

Note: it's very dirty code, please don't mock it lol

@kaiokendev
Contributor Author

Ah, I forgot to mention I am not sure if ### Data will work in this state, since I changed so much.

@digiwombat

The Settings and Sources tabs are empty with the current pull.

    with gr.Tab('Sources'):
        pass
    with gr.Tab('Settings'):
        pass

Or they are potentially compressed so efficiently as to appear to be empty. The end result for UX is the same either way. It's possible this is intentional and my brain refused to read the words in kaiokendev's post above.

@kaiokendev
Contributor Author

@digiwombat It is intentional; I wanted to add those tabs, but I got sidetracked by other work. There is a long list of things I intended to add, but I haven't touched the code in a week or so, and won't be able to work on it for at least another 2 weeks.

@digiwombat

digiwombat commented May 6, 2023

No worries, I thought it might be. Ooba, if you merge, those lines should likely be commented out for now, assuming you don't want to answer that question about a million times.

Mostly making notes here for when kaioken is done with 🌶️thedataset🌶️ or when ooba takes over this version of the PR. All testing was in notebook mode.

Initial testing definitely worked for URL loading; however, if I try to run anything else before loading the URL, I get an index out of range error. I tried this with both ### Data and the formatting used in the example, no dice.

Likewise, after loading the URL (I can load basically any URL without issue) to clear the error, it does not appear to make new sources if I just copy-paste a bunch of text in (as opposed to another URL); it continues to say "the best source seems to be " and never tries to chunk the context without the addition of a URL.

When dealing with URLs alone, I tried loading a fairly heavy Reddit thread (using old.reddit) and got the following error:

chromadb.errors.NotEnoughElementsException: Number of requested results 3 cannot be greater than number of elements in index 2

Further attempts to ask questions didn't lead to a reattempt at scanning the URL, just reuse of an old URL if prompted (presumably because the hash was already present and the failure left the empty collection in place).

Scanning of threads on other websites was generally successful. I was testing against 13b, so regens were necessary, but that is expected.

The console has useful outputs until the Settings and Sources UI tabs are completed. I'd consider those a high priority for baseline usability. Without them, it's hard to know what sources are being considered without reading the console output (fine for now).

EDIT: It is mentioned in the console output, but it is probably also worth disclaiming that the databases are transient and will presumably be lost when webui is closed. (They do persist across disabling and re-enabling superbig without restarting webui fully, which may or may not be desired functionality.)
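
For what it's worth, the NotEnoughElementsException above is the kind of thing that can be guarded by clamping n_results to the collection size before querying; a sketch (not the actual patch):

# Sketch of guarding the query so n_results never exceeds the number of
# indexed chunks (the condition behind Chroma's NotEnoughElementsException).
def safe_query(collection, question, n_results=3):
    available = collection.count()  # number of items currently in the index
    if available == 0:
        return []
    hits = collection.query(query_texts=[question],
                            n_results=min(n_results, available))
    return hits["documents"][0]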

@kaiokendev
Contributor Author

Some other user mentioned they want to integrate this into SillyTavern, so I will fix these issues to get it to a usable state, add the UI for adding sources, and push tomorrow:

  • if I try to run anything else before loading the URL, I get an index out of range error

  • after loading the URL (I can load basically any URL without issue) to clear the error, it does not appear to make new sources if I just copy-paste a bunch of text in

  • never tries to chunk the context without the addition of a URL

  • When dealing with URLs alone, I tried loading a fairly heavy Reddit thread (using old.reddit) and got the following error: chromadb.errors.NotEnoughElementsException: Number of requested results 3 cannot be greater than number of elements in index 2

  • Further attempts to ask questions didn't lead to a reattempt at scanning the URL, just reuse of an old URL if prompted

  • until the Settings and Sources UI tabs are completed. I'd consider those a high priority for baseline usability. Without them, it's hard to know what sources are being considered without reading the console output

@oobabooga oobabooga marked this pull request as ready for review May 7, 2023 06:49
@oobabooga oobabooga merged commit 5a4bd39 into oobabooga:main May 7, 2023
@oobabooga
Owner

I have taken the liberty of hijacking this PR and simplifying it. My adapted implementation works as follows:

  • The user input should be written between <|begin-user-input|> and <|end-user-input|> tags
  • The injection point should be specified with <|injection-point|> in the prompt
  • The data should be entered in a textbox created by the extension

I'm skeptical of the benefit of OOP. The code ends up being more about defining the structures than effectively doing something. This procedural implementation should make it straightforward to expand the extension to get data from text files or URLs in the future.

I'll try to expand it to chat mode in the next few days by using the ChromaCollector to select the most relevant question/reply pairs based on the new user input.
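
In rough form, the tag handling described above works something like the sketch below (helper names and structure are assumptions; see the merged extension code for the actual implementation):

# Sketch of the tag-based flow: pull the user input out of the
# <|begin-user-input|>...<|end-user-input|> markers, retrieve the chunks most
# relevant to it, and splice them in at <|injection-point|>.
# Helper names are illustrative; see the merged extension for the real code.
import re

def apply_superbig(prompt, collection, n_results=5):
    match = re.search(r"<\|begin-user-input\|>(.*?)<\|end-user-input\|>",
                      prompt, re.DOTALL)
    if match is None:
        return prompt  # no marked user input, nothing to do
    user_input = match.group(1).strip()
    hits = collection.query(query_texts=[user_input], n_results=n_results)
    injected = "\n".join(hits["documents"][0])
    # Strip the marker tags and fill the injection point with retrieved text.
    prompt = prompt.replace(match.group(0), user_input)
    return prompt.replace("<|injection-point|>", injected)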

@digiwombat

This is a pretty aggressive structure and feature set reduction. Should that warrant a rename in case @kaiokendev wants to roll the larger feature set out into a repo of its own so there won't be any confusion? Obviously, pending his thoughts on the matter overall.

@digiwombat

digiwombat commented May 7, 2023

Merged version seems to be non-functional (no box for data) and visual formatting is broken (5a4bd39).

Did more testing this morning. Boxes only show up in notebook mode; whoops for me on that one. If possible, I'd say the tab shouldn't show up in modes it doesn't support, though I'm not sure if that functionality is possible at the moment. It will likely be pretty confusing otherwise.

Otherwise, it's anecdotal, but I feel like my responses are being processed slower than the original branch. Not sure if it's a change in batching or search logic since I didn't go super deep.

@kaiokendev
Contributor Author

I moved the unsimplified version to https://github.com/kaiokendev/superbig and will post the fixes there later, along with future changes

@digiwombat

Definitely recommend a name change on the simple version in light of that. simplebig maybe?

@TFWol

TFWol commented May 8, 2023

I'm freaking out about how amazing this is. Works great for pasting text from stuff like GameFAQs.

@QHYAHFY

QHYAHFY commented Jul 5, 2023

[quotes kaiokendev's earlier explanation of pseudocontext and Focus in full]

The way to build scripts is similar to WESTWORLD.

@angrysky56

Unfortunately I am not a coder, but I have an idea:
The old data is wiped with every new entry of data, and it isn't stored (at least in oobabooga's UI), so you can't grow your own database. It needs a filing system to save referenced data in, with subfolders too. That would be an easy way to have a character/mode that references the data you want the AI to use, so you could have specialist information that isn't commonly available, or special instructions at a large scale, right? You could have a list of saved subjects and click a topic folder to have the AI use it for reference.

Persistent Storage
By default DuckDB operates on an in-memory database. That means that any tables that are created are not persisted to disk. Using the .connect method a connection can be made to a persistent database. Any data written to that connection will be persisted, and can be reloaded by re-connecting to the same file.
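
For illustration, persisting with DuckDB is roughly this (a generic DuckDB example, not something the extension currently does; the file and table names are made up):

# Generic DuckDB persistence example (not something the extension does yet):
# connecting to a file instead of the default in-memory database makes the
# stored data survive restarts. File and table names are hypothetical.
import duckdb

con = duckdb.connect("superbig_sources.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS sources(name TEXT, content TEXT)")
con.execute("INSERT INTO sources VALUES (?, ?)", ["ue5_docs", "...page text..."])
rows = con.execute("SELECT name FROM sources").fetchall()
con.close()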

Bonus-
Perhaps it could be combined with EdgeGPT (which I haven't gotten working, except maybe once), or have some of Bing's functionality too, so you don't have to manually enter each item or tell the AI when to insert data? It would be great if it could ingest all kinds of data from a URL too: charts, pics, PDFs, etc.
https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/query-parameters

I am sure everyone has seen this but maybe a small second search specialty model in the process could help somehow? I am trying to figure out how to write a character/mode now to help but have no idea what or if anything will help get EdgeGPT working better.
https://www.searchenginejournal.com/how-bing-ai-search-uses-web-content/480643/#close

@truedat101 truedat101 mentioned this pull request Oct 11, 2023