Skip to content
This repository has been archived by the owner on Apr 30, 2019. It is now read-only.
michaelku edited this page Mar 4, 2011 · 3 revisions

Architecture Considerations

Messaging

  • We do not actually need AMQP if we use Redis’ BLPOP

Shared Memory

  • maintain list of new messages
  • Locking can be done in Redis
  • cluster size can be maintained in Redis

Persistent Storage

What is stored

  • millions of collections of documents, with 10s to millions of items each
    • each document has an associated vector representation for clustering (used by Mahout)
    • Each collection maintains a dictionary of terms to vector indices (used by Mahout). Can be stored in one big cell.
    • each document should contain the name of its cluster, mainly used for Redis-reconstriction.
  • 10s to 10000s of clusters for each collection, with 10s to 10000s of documents each
    • one list of clusters per collection (Mahout generates labels for them)
    • one list of document-ids per cluster

Write patterns

  • random inserts of new messages and clusters: some collections will be very active at any given time
  • stream inserts when doing full cluster rebuilds (e.g. to try out new clustering algorithms, parameters etc)

Read patterns

  • random reads when updating clusters
  • bulk reads of clusters (when rebuilding Redis) and documents (for full cluster rebuilds)

Alternatives

  • persistent k/v stores: riak, cassandra (with consistent hashing)
    • Benefit: no hotspotting on busy collections
    • Riak: Riak M/R, links
    • Problem: heavy random I/O for full rebuilds (even when using Riak buckets)
  • column DBs: hbase, cassandra (with ordered storage)
    • Benefit: range scans to retrieve collections efficiently
    • Benefit: Mahout plays along nicely (there is a classification impl that operates directly on hbase)
    • problem: hotspotting (but: can be avoided without sacrificing range scans by using smart row keys)