This repository has been archived by the owner on Apr 30, 2019. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 7
Home
michaelku edited this page Mar 4, 2011
·
3 revisions
- We do not actually need AMQP if we use Redis’ BLPOP
- maintain list of new messages
- Locking can be done in Redis
- cluster size can be maintained in Redis
- millions of collections of documents, with 10s to millions of items each
- each document has an associated vector representation for clustering (used by Mahout)
- Each collection maintains a dictionary of terms to vector indices (used by Mahout). Can be stored in one big cell.
- each document should contain the name of its cluster, mainly used for Redis-reconstriction.
- 10s to 10000s of clusters for each collection, with 10s to 10000s of documents each
- one list of clusters per collection (Mahout generates labels for them)
- one list of document-ids per cluster
- random inserts of new messages and clusters: some collections will be very active at any given time
- stream inserts when doing full cluster rebuilds (e.g. to try out new clustering algorithms, parameters etc)
- random reads when updating clusters
- bulk reads of clusters (when rebuilding Redis) and documents (for full cluster rebuilds)
- persistent k/v stores: riak, cassandra (with consistent hashing)
- Benefit: no hotspotting on busy collections
- Riak: Riak M/R, links
- Problem: heavy random I/O for full rebuilds (even when using Riak buckets)
- column DBs: hbase, cassandra (with ordered storage)
- Benefit: range scans to retrieve collections efficiently
- Benefit: Mahout plays along nicely (there is a classification impl that operates directly on hbase)
- problem: hotspotting (but: can be avoided without sacrificing range scans by using smart row keys)