The Kingdom of the Crystal Kek, the sequel to Raiders of the Lost Kek. The legendary GPT-4chan returns, rebuilt with a selective state space model (SSM).
We provide a simple `setup.sh` to install the Conda environment. You need to satisfy the following prerequisites:
- Linux
- NVIDIA GPU
- GPU driver supporting CUDA 12 or newer
- Miniforge
Then simply run `source ./setup.sh` to get started.
We utilized the Raiders of the Lost Kek dataset, which contains over 3.3 million threads and 134.5 million posts from /pol/. Each dataset entry is a JSON file representing a /pol/ thread.
The dataset is preprocessed by reformatting each entry into the following structure, with two newlines after each thread:

```
--- <post No.>
<post content>
-----
```

The `--- <post No.>` line marks the start of a post, and `-----` marks the end of a thread.
Here's an example thread in the reformatted style:
```
--- 943264
Hi /pol/
--- 943265
>>943264
Hi anon
-----
```
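For reference, here's a minimal preprocessing sketch. The field names (`posts`, `no`, `com`) are assumptions based on the 4chan API schema the dataset follows; the actual notebook may differ in details such as HTML cleanup.

```python
import json
import re
from html import unescape

def reformat_thread(path: str) -> str:
    """Convert one thread JSON file into the '--- <post No.>' text format."""
    with open(path, encoding="utf-8") as f:
        thread = json.load(f)
    lines = []
    for post in thread.get("posts", []):      # "posts"/"no"/"com" are assumed field names
        lines.append(f"--- {post['no']}")
        body = unescape(post.get("com", ""))  # decode HTML entities
        body = body.replace("<br>", "\n")     # preserve line breaks
        body = re.sub(r"<[^>]+>", "", body)   # drop remaining HTML tags
        lines.append(body)
    lines.append("-----")                     # thread-end marker
    return "\n".join(lines) + "\n\n"          # two newlines after each thread
```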
The preprocessed dataset is then tokenized using the tokenizer from GPT-NeoX and stored as NumPy memmap files with `uint16` dtype. These steps reduce the dataset size from 106 GB to 11 GB, making distribution much easier. You can generate the memmap file using `generate_dataset.ipynb`, or you can download the pre-generated memmap:
| Raw Text Download | Num. of Chars | Tokenized Download | Num. of Tokens |
|---|---|---|---|
| Download | 21B | Download | 6B |
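As a rough illustration of the tokenize-and-memmap step (file names here are placeholders, and a real run would stream the 21B-character corpus in chunks rather than reading it at once):

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Read the reformatted corpus and tokenize it; the GPT-NeoX vocabulary
# (~50k ids) fits comfortably in uint16.
with open("pol_reformatted.txt", encoding="utf-8") as f:
    ids = tokenizer(f.read())["input_ids"]

# Write the token ids to a memmap file for cheap, lazy loading at train time.
arr = np.memmap("pol_tokenized.bin", dtype=np.uint16, mode="w+", shape=(len(ids),))
arr[:] = np.asarray(ids, dtype=np.uint16)
arr.flush()
```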
We provide the following fine-tuned models, each trained for one epoch on the tokenized dataset using a single RTX A6000 with a context size of 2048 tokens. Mixed precision (bf16) was used for training, while the model weights were stored in fp32. We will release more models and improved versions as opportunities arise.
| Name | Model Dim. | Num. of Layers | Batch Size | Gradient Acc. | Download | Fine-tuning Log |
|---|---|---|---|---|---|---|
| Mamba 4chan 130M | 768 | 24 | 20 | 60 | Download | log |
| Mamba 4chan 370M | 1024 | 48 | 12 | 100 | Download | log |
We provide `mamba_4chan_train.ipynb`, which contains all the necessary code to train a Mamba 4chan model and log the training progress. The logged parameters can be modified in `model.py`.
The base model's hyperparameters are stored in `model_config.py`, and you can adjust them as needed. When further training our model, note that all hyperparameters are saved directly in the checkpoint file; for more information, refer to PyTorch Lightning's documentation. The same applies to inference: PyTorch Lightning automatically restores all parameters when loading our model.
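For example, resuming training from a checkpoint could look like the following sketch. The `Trainer` arguments are illustrative, and we assume the LightningModule supplies its own dataloaders.

```python
import pytorch_lightning as pl
from model import mamba_4chan

# Hyperparameters are restored from the checkpoint, so no config is passed.
model = mamba_4chan.load_from_checkpoint("path_to.ckpt")

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision="bf16-mixed",  # mixed precision, matching the released runs
    max_epochs=1,
)
trainer.fit(model)  # assumes the module defines its own train_dataloader()
```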
Here's a sample code snippet to perform inference with Mamba 4chan:

```python
from transformers import AutoTokenizer
from model import mamba_4chan

# Load the fine-tuned checkpoint (hyperparameters are restored from it)
# and move the model to the GPU in eval mode.
model = mamba_4chan.load_from_checkpoint("path_to.ckpt")
model.cuda()
model.eval()

# The dataset was tokenized with the GPT-NeoX tokenizer, so use it here too.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Prompt the model in the thread format it was trained on.
text = "--- 94326400\nHi /pol/, lets have a thread about".rstrip()
pred = model.generate_text(tokenizer, text, 256)  # last argument sets the generation length
```
You can also use this Colab notebook for a quick demo.
Our work builds upon the remarkable achievement of Mamba <3.
Some code for dataset preprocessing is taken from here.