BobNet is a modular, portable large language model framework built by the community on consumer-grade hardware. The framework ingests training data and emits shareable "Bob" files, each representing an itty-bitty (<5M parameter) TensorFlow language model trained on a single brief text (one training data file per model). The same emitted models are also stored locally in a SQLite database that serves as a vector store.
When you run inference, a vector search is done with your query first, the most relevant models are retrieved, and then at each round the highest-probability result across those models is selected as the next token. It feels like #ai is always coming up with fancy new terms I don't understand, so I propose "mixture of Bob", or M.O.B. for short, as the name for this design.
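To make the M.O.B. loop concrete, here is a toy, self-contained sketch of the idea. It is not the real BobNet code: the ToyBob class below stands in for a trained .bob TensorFlow model, and the word-overlap retrieve() stands in for the SQLite vector search, but the control flow is the one described above (retrieve the best-matching Bobs, then take the single most confident next-token vote each round).

```python
# Toy sketch of "mixture of Bob" (M.O.B.) inference. The real BobNet trains
# tiny TensorFlow models and stores them in a SQLite vector store; here each
# "Bob" is a stand-in that only knows next-word counts from its one training text.
from collections import Counter, defaultdict

class ToyBob:
    """Stand-in for one .bob model: next-word counts from a single text."""
    def __init__(self, text):
        words = text.lower().split()
        self.vocab = set(words)                      # used for crude retrieval
        self.next = defaultdict(Counter)             # word -> Counter of followers
        for a, b in zip(words, words[1:]):
            self.next[a][b] += 1

    def most_likely_next(self, last_word):
        followers = self.next.get(last_word)
        if not followers:
            return None, 0.0
        token, count = followers.most_common(1)[0]
        return token, count / sum(followers.values())  # confidence for that token

def retrieve(bobs, query, top_k=2):
    """Crude stand-in for the vector search: rank Bobs by word overlap."""
    q = set(query.lower().split())
    return sorted(bobs, key=lambda b: len(q & b.vocab), reverse=True)[:top_k]

def mob_generate(bobs, query, max_tokens=8):
    candidates = retrieve(bobs, query)               # most appropriate models
    context = query.lower().split()
    out = []
    for _ in range(max_tokens):
        # Every retrieved Bob votes; the single most confident vote wins the round.
        token, prob = max((b.most_likely_next(context[-1]) for b in candidates),
                          key=lambda pair: pair[1])
        if token is None:
            break
        out.append(token)
        context.append(token)
    return " ".join(out)

bobs = [ToyBob("two plus two equals four"), ToyBob("the sky is blue")]
print(mob_generate(bobs, "two plus two equals"))     # -> four
```

Running the example prints "four", because the math-flavored Bob is both the best retrieval match and the most confident voter.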
The goal is to crowd-source training of individual topics into Bob files that can be shared with others. Because Bob files can be emitted, shared and then ingested elsewhere, you can build a BobNet specific to your use case, and it will only be aware of what you provide. Fully modular.
BobNet was created to try to address the following problems I see in the current LLM ecosystem:
- Capable models can only currently be built at great expense using specialized hardware.
- A model built in this way is a monolith, where you have to either take it or leave it in its entirety.
- A model built this way contains training data that is not controlled by the consumer.
- Communities and individuals must rely on the good favor of large, for-profit corporations rather than building something themselves.
BobNet is built and tested CPU-only on consumer-grade hardware (currently an HP Z640). Rather than forcing users to acquire more and more VRAM to run a model, BobNet has a very small resource footprint; its most limiting resource is storage space (cheap!). Every individual .bob file trained on a text is shareable and portable, and a BobNet can be built selectively at the discretion of the user. This means you can include just the content you want (general conversation, specialized information for your organization, general facts, etc.) on an opt-in basis, rather than relying on prompts to protect your users from uninformed responses.
Help build the BobNet! Join the revolution!
FUTURE WORK
- "Pet Store" - interface between BobNet instances to allow truly distributed, specialized inference
- Ability to interface locally (for example, in a Raspberry Pi cluster on a LAN)
- Ability to interface across networks
- General optimization and improvement
- Training currently takes 5 minutes per 256 characters of text on an Intel Xeon E5-2690
- Persisting to a .bob file currently takes ~50MB of storage space per 256 characters of text
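Worked out at those rates: a 1 KB training text is about four 256-character chunks, so training it would take roughly 20 minutes and the resulting .bob file would come to roughly 200 MB, which is why smaller, focused files are recommended below.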
This is Bob:
Bob isn't very strong on his own:
But when he and his friends get together, they can do great things:
Also, Bob is portable - you can share him with your friends!
Take the following steps to start using BobNet:
- Clone this repo and run "install.bat".
- Copy text training data under the "ingest" subfolder
- Each individual file will become its own "Bob" language model
- Smaller, focused files are best.
- To train, run "run_bob_net.bat" with no arguments
- BobNet will ingest the training data you provided.
- It will store a language model per training file in a vector store.
- It will also emit a *.bob file per training file in the "share" subdirectory that you can share with others.
- To do inference, run "run_bob_net.bat" with one argument: the question you are trying to answer
- Example: run_bob_net.bat "What is 2 + 2?"
- BobNet will do a vector search to find which internal language models best fit your question.
- BobNet will use each identified model to do inference, providing only the most confident result.
- Models will be penalized to the degree to which they are unfamiliar with any part of the question text (see the sketch after these steps).
- You can share *.bob files with other people
- Each file represents the work output of training on a single text input file
- You can import *.bob files shared by others by putting them in the "import" subdirectory
- As a result, you can build your BobNet in a modular fashion, only including approved sources
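The "unfamiliarity penalty" mentioned in the inference steps can be pictured with a small sketch. This is a hypothetical illustration of the idea rather than BobNet's actual scoring code; the function names and the simple word-overlap measure are made up for the example.

```python
# Hypothetical illustration of penalizing a model for unfamiliar question text.
# Not BobNet's actual scoring: confidence is simply scaled by the fraction of
# question words the model has seen before.

def familiarity(question, known_words):
    """Fraction of the question's words that the model saw during training."""
    words = question.lower().split()
    if not words:
        return 0.0
    return sum(w in known_words for w in words) / len(words)

def penalized_confidence(raw_confidence, question, known_words):
    """Scale a model's raw confidence by how familiar it is with the question."""
    return raw_confidence * familiarity(question, known_words)

# A math-trained Bob vs. a weather-trained Bob answering a math question.
math_vocab = {"what", "is", "two", "plus", "equals", "four"}
weather_vocab = {"the", "sky", "is", "blue", "today"}
question = "what is two plus two"

print(penalized_confidence(0.80, question, math_vocab))     # stays high (~0.8)
print(penalized_confidence(0.95, question, weather_vocab))  # heavily penalized (~0.19)
```

A penalty like this keeps a confident but off-topic Bob from outvoting a Bob that actually saw the question's vocabulary during training.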