SALT is a Streamlit app for textual data analysis 📚🔬
Just give it a dataset of texts, and it will allow you to:
- Perform semantic search over the dataset 🔎
- Label examples efficiently through active learning 🔄
- Create a simple & fast (yet surprisingly good) classifier 🤖
- Find clusters of similar examples (lexically and/or semantically) 🗄️
- Clone the repository
git clone https://github.com/AI21Labs/salt.git
cd salt
- (Optional) Set up virtual environment
pyenv virtualenv 3.9.0 salt
pyenv activate salt
pip install --upgrade pip
- Install python dependencies
pip install -r requirements.txt
- Run the app
python -m streamlit run salt/view/main.py
Here is the basic flow (for more advanced options see the FAQ section below)
- Load a CSV or JSON-lines file
- Select the relevant column and choose a name for the project
- Click on "Create project" and wait until the creation process is completed
- Go to the "Clusters" step
- Choose similarity type and number of clusters
- Click on "Run clustering" and wait for the clustering process to finish
- Review the results (clustering overview / by cluster)
- Go to the "Review 📖" step
- Define the classes: provide at least one "seed example" for each class
- You can use "Search 🔎" to easily find relevant examples
- Assign a label to each example by editing the "label" column
- Go to the "Labeling 🖊️️" step
- Label examples one by one (chosen by the active-learning classifier)
- After labeling 10 examples, you'll start to see a live graph of the prediction change rate
- It can help you decide when to stop (once each class has converged to a stable state)
- At any point, you can go back to "Review 📖" to:
- Download all labels and predictions
- View labels/predictions for specific examples
- Add a new class (by providing its "seed example")
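
To make the stopping signal from the "Labeling 🖊️️" step more concrete, here is a minimal sketch of one way a prediction change rate could be computed. It assumes the rate is the fraction of examples whose predicted label differs between two consecutive rounds; the function name `change_rate` and the sample predictions are hypothetical, and SALT's own calculation may differ.

```python
from typing import Sequence


def change_rate(previous: Sequence[str], current: Sequence[str]) -> float:
    """Fraction of examples whose predicted label differs between two rounds."""
    if len(previous) != len(current):
        raise ValueError("both rounds must cover the same examples")
    if not previous:
        return 0.0
    changed = sum(p != c for p, c in zip(previous, current))
    return changed / len(previous)


# Hypothetical predictions for the same five examples after three labeling rounds.
rounds = [
    ["pos", "neg", "neg", "pos", "neg"],
    ["pos", "pos", "neg", "pos", "pos"],
    ["pos", "pos", "neg", "pos", "pos"],
]

# Compare each round with the one before it.
rates = [change_rate(a, b) for a, b in zip(rounds, rounds[1:])]
print(rates)  # [0.4, 0.0]
```

A rate that stays near zero over several rounds suggests that additional labels are no longer changing the predictions much.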
Yes! Go to the "Inference 🔦" step, which provides several options for running the classifier:
- Insert any text 🔤
- Upload a file of texts 📃
- From code 💻 (export the model and use the code sample to run it)
Yes! This can be done by creating a new project that extends the current one:
- Go to the "Setup ⚙️️" step, upload the new examples file, select the relevant column and choose the project name
- Click on "Optional settings", select the base project and then click on "Create project"
Yes! When you create the project, click on "Optional settings" and select the label column
Yes! If you insert a seed example with multiple labels (separated by a comma, e.g. "pos,neg"), your classifier and the labeling interface mode will turn from "Single-label 📎️" into "Multi-label 🖇️️"
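
The comma-separated format can be pictured with a short sketch. The helpers `parse_labels` and `is_multi_label` are hypothetical and only illustrate the convention described above; how SALT parses labels internally (for example, whether it trims spaces) is an assumption here.

```python
def parse_labels(cell: str) -> list[str]:
    """Split a comma-separated label cell into clean label names."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def is_multi_label(cells) -> bool:
    """True if at least one cell carries more than one label."""
    return any(len(parse_labels(cell)) > 1 for cell in cells)


print(parse_labels("pos,neg"))          # ['pos', 'neg']
print(is_multi_label(["pos", "neg"]))   # False
print(is_multi_label(["pos,neg"]))      # True
```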
So you've labeled some examples, and now when you look at the predictions you see that a lot of them are wrong. To improve the classifier, you can try one of the following:
- Go to the "Labeling 🖊️️" step and keep labeling. More labeled examples usually improve the classifier
- Go to the "Review 📖️️" step, find some wrong predictions and insert their correct labels. Then run a few more "Labeling 🖊️️" iterations to stabilize the classifier
- Create a new project with a simpler version of the texts, to make the classification task easier, e.g.:
- For classification of emails, you may remove signatures (or other decorators) to let the classifier focus on the content
- For texts with domain-specific entities, you may normalize each entity into some canonical form that conveys its meaning
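
The two simplifications above can be sketched as a preprocessing pass that you run before uploading the file. Everything here is an assumption made for illustration: the sign-off phrases, the entity patterns (including the `ORD-` order-ID format) and the placeholder names are hypothetical and should be adapted to your own data.

```python
import re

# Assumption: a signature starts at a "--" delimiter line or at a common sign-off line.
SIGNATURE_START = re.compile(
    r"^(--\s*|best regards,?|kind regards,?|thanks,?|sent from my .*)$",
    re.IGNORECASE | re.MULTILINE,
)

# Assumption: each pattern maps a domain-specific entity to a canonical placeholder.
ENTITY_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "<EMAIL>"),
    (re.compile(r"\bORD-\d+\b"), "<ORDER_ID>"),
    (re.compile(r"\$\d+(?:\.\d{2})?"), "<AMOUNT>"),
]


def strip_signature(email_body: str) -> str:
    """Drop everything from the first signature-like line onward."""
    match = SIGNATURE_START.search(email_body)
    body = email_body[: match.start()] if match else email_body
    return body.strip()


def normalize_entities(text: str) -> str:
    """Replace each matched entity with its canonical placeholder."""
    for pattern, placeholder in ENTITY_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


email = "Hi,\nMy order never arrived.\n\nBest regards,\nDana\n"
print(strip_signature(email))

sentence = "Refund $19.99 for ORD-12345 to dana@example.com"
print(normalize_entities(sentence))
# Refund <AMOUNT> for <ORDER_ID> to <EMAIL>
```

Keeping the placeholders short and distinct lets the classifier treat every order ID or email address as the same token instead of as a rare, unique string.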
If you have any questions, comments or suggestions - please reach out to Oded Avraham 👋🏼
For bug reports and feature requests - please visit our GitHub page