Rousette

Scalable Document Vectorizer

Overview

Rousette is designed to be an app that can be used to vectorize large numbers of documents. It runs as a web app and has four primary endpoints.

POST /docs
- {"doc_id": "document identifier", "body": "The actual document text"}
- This adds a document to be vectorized.
POST /models
- {"num_topics": numberOfTopics}
- This generates a model for vectorizing, where num_topics is the number of features the model will generate for each doc
- There must be documents previously added
GET /docs/doc_id/repr
- This returns the internal representation of the document
GET /docs/doc_id/vector/model_id
- This returns the vectorized model

Internals

Documents are stored in a bag of words format in an internal queue. Models are build using Latent Dirichlet allocation. Internal representations of documents and their vectorizations are stored in a queue. The system is designed to use a distributed queue system such as Apache Kafka, however at the current time only an in memory queue is supported

Install

pip install -r requirements.txt
pip install ./

Running

ROUSETTE_ENV=$(pwd)/config/dev.py FLASK_APP=rousette flask run
A test script is provided to generate dummy documents in bulk. PYTHONPATH=./ python test/test_runner.py --length 1000 --num_docs 200 will generate 200 documents of 1000 words each and submit them to the api.
A document can be viewed using curl, curl localhost:5000/docs/sample/doc_0/repr
A model can be generaged curl -d '{"num_topics": 3}' -H "Content-Type: application/json" localhost:5000/models
Wait for the model loaded and the docs to be vectorized (with the dev config, there is a 60 second refresh time)
Then the vector representation can be retrieved curl localhost:5000/docs/sample/doc_0/vector/1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Rousette

Overview

Internals

Install

Running

Files

README.md

Latest commit

History

README.md

File metadata and controls

Rousette

Overview

Internals

Install

Running