The Kingdom of the Crystal Kek, the sequel to Raiders of the Lost Kek. The legendary GPT-4chan returns, rebuilt with a selective state space model (SSM).
We provide a simple `setup.sh` to install the Conda environment. You need to satisfy the following prerequisites:
- Linux
- NVIDIA GPU
- GPU driver supporting CUDA 12 or newer
- Miniforge
Then simply run `source ./setup.sh` to get started.
We utilized the Raiders of the Lost Kek dataset, which contains over 3.3 million threads and 134.5 million posts from /pol/. Each dataset entry is a JSON file representing a /pol/ thread.
The dataset is preprocessed by reformatting each entry into the following structure, with two newlines after each thread:

```
--- <post No.>
<post content>
-----
```

The `--- <post No.>` line marks the start of a post, and `-----` marks the end of a thread.
Here's an example thread in the reformatted style:
```
--- 943264
Hi /pol/
--- 943265
>>943264
Hi anon
-----
```
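For reference, here's a minimal preprocessing sketch. The field names (`posts`, `no`, `com`) are assumptions based on the 4chan API schema the dataset follows; the actual notebook may differ in details such as HTML cleanup.

```python
import json
import re
from html import unescape

def reformat_thread(path: str) -> str:
    """Convert one thread JSON file into the '--- <post No.>' text format."""
    with open(path, encoding="utf-8") as f:
        thread = json.load(f)
    lines = []
    for post in thread.get("posts", []):      # "posts"/"no"/"com" are assumed field names
        lines.append(f"--- {post['no']}")
        body = unescape(post.get("com", ""))  # decode HTML entities
        body = body.replace("<br>", "\n")     # preserve line breaks
        body = re.sub(r"<[^>]+>", "", body)   # drop remaining HTML tags
        lines.append(body)
    lines.append("-----")                     # thread-end marker
    return "\n".join(lines) + "\n\n"          # two newlines after each thread
```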
The preprocessed dataset is then tokenized using the tokenizer from GPT-NeoX and stored as NumPy memmap files with `uint16` dtype. These steps reduce the dataset size from 106 GB to 11 GB, making distribution much easier. You can generate the memmap file using `generate_dataset.ipynb`, or you can download the pre-generated memmap:
| Raw Text Download | Num. of Chars | Tokenized Download | Num. of Tokens |
|---|---|---|---|
| Download | 21B | Download | 6B |
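As a rough illustration of the tokenize-and-memmap step (file names here are placeholders, and a real run would stream the 21B-character corpus in chunks rather than reading it at once):

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Read the reformatted corpus and tokenize it; the GPT-NeoX vocabulary
# (~50k ids) fits comfortably in uint16.
with open("pol_reformatted.txt", encoding="utf-8") as f:
    ids = tokenizer(f.read())["input_ids"]

# Write the token ids to a memmap file for cheap, lazy loading at train time.
arr = np.memmap("pol_tokenized.bin", dtype=np.uint16, mode="w+", shape=(len(ids),))
arr[:] = np.asarray(ids, dtype=np.uint16)
arr.flush()
```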
We provide the following fine-tuned models, each trained for one epoch on the tokenized dataset using a single RTX A6000 with a context size of 2048 tokens. Mixed precision (bf16) was used for training, while the model weights were stored in fp32. We will release more models and improved versions as opportunities arise.
| Name | Model Dim. | Num. of Layers | Batch Size | Gradient Acc. | Download | Fine-tuning Log |
|---|---|---|---|---|---|---|
| Mamba 4chan 130M | 768 | 24 | 20 | 60 | Download | log |
| Mamba 4chan 370M | 1024 | 48 | 12 | 100 | Download | log |
We provide `mamba_4chan_train.ipynb`, which contains all the necessary code to train a Mamba 4chan model and log the training progress. The logged parameters can be modified in `model.py`.
The base model's hyperparameters are stored in `model_config.py`, and you can adjust them as needed. When further training our model, note that all hyperparameters are saved directly in the checkpoint file; for more information, refer to PyTorch Lightning's documentation. The same applies to inference: PyTorch Lightning automatically restores all parameters when loading our model.
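For example, resuming training from a checkpoint could look like the following sketch. The `Trainer` arguments are illustrative, and we assume the LightningModule supplies its own dataloaders.

```python
import pytorch_lightning as pl
from model import mamba_4chan

# Hyperparameters are restored from the checkpoint, so no config is passed.
model = mamba_4chan.load_from_checkpoint("path_to.ckpt")

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision="bf16-mixed",  # mixed precision, matching the released runs
    max_epochs=1,
)
trainer.fit(model)  # assumes the module defines its own train_dataloader()
```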
Here's a sample code snippet to perform inference with Mamba 4chan:

```python
from transformers import AutoTokenizer
from model import mamba_4chan

# Load the fine-tuned checkpoint (hyperparameters are restored from it)
# and move the model to the GPU in eval mode.
model = mamba_4chan.load_from_checkpoint("path_to.ckpt")
model.cuda()
model.eval()

# The dataset was tokenized with the GPT-NeoX tokenizer, so use it here too.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Prompt the model in the thread format it was trained on.
text = "--- 94326400\nHi /pol/, lets have a thread about".rstrip()
pred = model.generate_text(tokenizer, text, 256)  # last argument sets the generation length
```
You can also use this Colab notebook for a quick demo.
Our work builds upon the remarkable achievement of Mamba <3.
Some code for dataset preprocessing is taken from here.