This repository has been archived by the owner on Apr 30, 2019. It is now read-only.

Home

Jump to bottom

michaelku edited this page Mar 4, 2011 · 3 revisions

Architecture Considerations

architecture draft

Messaging

We do not actually need AMQP if we use Redis’ BLPOP

Shared Memory

maintain list of new messages
Locking can be done in Redis
cluster size can be maintained in Redis

Persistent Storage

What is stored

millions of collections of documents, with 10s to millions of items each
- each document has an associated vector representation for clustering (used by Mahout)
- Each collection maintains a dictionary of terms to vector indices (used by Mahout). Can be stored in one big cell.
- each document should contain the name of its cluster, mainly used for Redis-reconstriction.
10s to 10000s of clusters for each collection, with 10s to 10000s of documents each
- one list of clusters per collection (Mahout generates labels for them)
- one list of document-ids per cluster

Write patterns

random inserts of new messages and clusters: some collections will be very active at any given time
stream inserts when doing full cluster rebuilds (e.g. to try out new clustering algorithms, parameters etc)

Read patterns

random reads when updating clusters
bulk reads of clusters (when rebuilding Redis) and documents (for full cluster rebuilds)

Alternatives

persistent k/v stores: riak, cassandra (with consistent hashing)
- Benefit: no hotspotting on busy collections
- Riak: Riak M/R, links
- Problem: heavy random I/O for full rebuilds (even when using Riak buckets)
column DBs: hbase, cassandra (with ordered storage)
- Benefit: range scans to retrieve collections efficiently
- Benefit: Mahout plays along nicely (there is a classification impl that operates directly on hbase)
- problem: hotspotting (but: can be avoided without sacrificing range scans by using smart row keys)