Theoretically this is a universe of code for playing with embeddings. In reality it contains one file. More to come, I hope.
This file benchmarks various embeddings using the Enron email corpus. Once you install the libraries it needs, you can run it with python bench.py. It will:
- Download the Enron email dataset.
- Unzip it.
- Attempt to run embeddings on it. OpenAI's embedder is the default; you can change that at the end of the file to T5 or some other engine.
- Cluster the embeddings.
- Label the clusters by sampling the subject lines from the clusters and sending them to GPT-3.
- Show you a pretty chart, like the one you see above.
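The labeling step can be sketched roughly as follows. This is a minimal illustration, not the code in bench.py: it assumes you already have one cluster id per email, and the names `sample_subjects` and `build_label_prompt`, the sample size, and the prompt wording are all hypothetical. Sending the prompt to a model is left out.

```python
import random
from collections import defaultdict


def sample_subjects(subjects, cluster_ids, per_cluster=5, seed=0):
    """Group subject lines by cluster id and draw up to per_cluster from each.

    subjects and cluster_ids are parallel lists (assumed layout).
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    by_cluster = defaultdict(list)
    for subject, cid in zip(subjects, cluster_ids):
        by_cluster[cid].append(subject)
    return {
        cid: rng.sample(items, min(per_cluster, len(items)))
        for cid, items in sorted(by_cluster.items())
    }


def build_label_prompt(sampled):
    """Turn sampled subject lines into a prompt asking for a short label.

    The wording is a placeholder, not the prompt bench.py uses.
    """
    lines = "\n".join(f"- {s}" for s in sampled)
    return (
        "These email subject lines come from one cluster:\n"
        f"{lines}\n"
        "Give a short label (2-4 words) for the cluster."
    )
```

You would call `sample_subjects` once after clustering, then pass each cluster's sample through `build_label_prompt` and send the result to whichever model does the labeling.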
Visualization helper. This file helps you go from "a list of embeddings" to "something pretty to look at".
- Make longer inputs work by chunking them and averaging the resulting embeddings.
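Chunking and averaging could look something like this. It is a sketch under assumptions: `embed_fn` stands in for whatever embedder you use (it takes a string and returns a list of floats), the names `chunk_text` and `embed_long` are made up for illustration, and the split is by character count, whereas real models limit tokens.

```python
def chunk_text(text, max_chars=2000):
    """Split text into consecutive pieces of at most max_chars characters."""
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return chunks or [text]  # empty input still yields one (empty) chunk


def embed_long(text, embed_fn, max_chars=2000):
    """Embed each chunk with embed_fn and return the element-wise mean.

    embed_fn is a hypothetical callable: str -> list of floats.
    """
    vectors = [embed_fn(chunk) for chunk in chunk_text(text, max_chars)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

A plain mean treats every chunk equally, including a short final one; weighting each chunk by its length is a reasonable alternative if that matters for your data.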