zagorsky edited this page Sep 12, 2016 · 16 revisions

Beiwe scalability

Data storage

Beiwe app data are stored as flat files on Amazon Web Services' Simple Storage Service (S3), rather than in a conventional database. Conventional databases run into scaling problems once they grow too large to fit on a single machine: at that point the database has to be "sharded" into multiple pieces that run on multiple machines. Beiwe's flat-file storage architecture avoids that problem.

Beiwe generates a massive amount of data; a participant in a study with all data streams turned on could generate about 1 gigabyte of data per month. Most of that is accelerometer data, since every accelerometer datapoint is about 50 bytes of data, and the accelerometer records about 5 to 100 datapoints per second. (Other data streams produce less data simply because they record less frequently.) This means that a study using the accelerometer data stream on 100 participants for 10 months could generate about 1 terabyte of data.
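The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in the paragraph (50 bytes per datapoint, 5 to 100 datapoints per second, about 1 gigabyte per participant per month); the 30-day month, the decimal units, and all function and constant names are assumptions made for illustration.

```python
# Back-of-the-envelope data volume estimate, using the figures from the text.
BYTES_PER_DATAPOINT = 50                 # "about 50 bytes" per accelerometer datapoint
SECONDS_PER_MONTH = 60 * 60 * 24 * 30    # assumption: a 30-day month
GB = 10**9                               # assumption: decimal gigabytes
TB = 10**12                              # assumption: decimal terabytes


def accelerometer_bytes_per_month(datapoints_per_second):
    """Bytes of accelerometer data one participant produces in a month."""
    return BYTES_PER_DATAPOINT * datapoints_per_second * SECONDS_PER_MONTH


def study_bytes(participants, months, bytes_per_participant_month):
    """Total bytes for a study, assuming every participant produces the same amount."""
    return participants * months * bytes_per_participant_month


low = accelerometer_bytes_per_month(5)      # 648,000,000 bytes, about 0.65 GB
high = accelerometer_bytes_per_month(100)   # 12,960,000,000 bytes, about 13 GB
study = study_bytes(100, 10, 1 * GB)        # 1,000,000,000,000 bytes, 1 TB

print(low / GB, high / GB, study / TB)
```

Under these assumptions, the "about 1 gigabyte per month" figure corresponds to a sampling rate near the low end of the quoted range; a participant sampled at 100 datapoints per second would produce roughly 13 gigabytes per month.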

Conventional databases on AWS can only scale up to about 6 terabytes before they need to be sharded onto multiple machines, and a single 6 TB database can cost over $2,000 per month. Instead, Beiwe uses Amazon S3, which is a file storage service of effectively unlimited capacity, and 6 terabytes of data stored on S3 cost much less: about $180 per month. Instagram stored its billions of photos on Amazon S3 (before Instagram was acquired by Facebook); this allowed Instagram to sign up hundreds of millions of users without having to worry about scaling its data storage.
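To make the cost comparison concrete, the sketch below derives per-gigabyte prices from the two figures quoted above ($2,000 and $180 per month for 6 terabytes). These are the numbers given in the text, not current AWS pricing, and the constant and function names are illustrative.

```python
# Cost comparison using only the monthly figures quoted in the text.
CAPACITY_GB = 6000            # 6 TB, assuming decimal units
DB_COST_PER_MONTH = 2000.0    # USD; the text says "over $2,000", so this is a lower bound
S3_COST_PER_MONTH = 180.0     # USD; "about $180"

s3_price_per_gb = S3_COST_PER_MONTH / CAPACITY_GB   # about $0.03 per GB per month
db_price_per_gb = DB_COST_PER_MONTH / CAPACITY_GB   # about $0.33 per GB per month
cost_ratio = DB_COST_PER_MONTH / S3_COST_PER_MONTH  # database is roughly 11x more expensive


def s3_monthly_cost(stored_gb):
    """Monthly S3 storage cost in USD at the per-GB price implied by the text."""
    return stored_gb * s3_price_per_gb


# A study that has accumulated 1 TB would cost about $30 per month to store.
print(round(s3_monthly_cost(1000), 2), round(cost_ratio, 1))
```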

There are two downsides of storing the data in flat files on S3:

  1. You can't run complex queries, like you can with a SQL database. You can only request blocks of data by date range, participant ID, and data stream type. To run complex queries, you have to download data blocks from S3, put them in a new database, and run queries on that new database.

  2. Data retrieval from S3 is slower than from a conventional database. Conventional databases are optimized for read/write speed, with response times measured in milliseconds. Pulling an individual file from S3 takes closer to a whole second. (Downloading many gigabytes takes minutes, but the bottleneck there is the server, not S3.) When we decided to go with S3, we figured that storing massive amounts of data was more important than optimizing data read speeds. Also, increasing the amount of Beiwe data stored in S3 should have no effect on data read speeds.
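The workflow in the first downside (select blocks by participant ID, data stream, and date range, then load them into a new database for complex queries) might look like the sketch below. It is not Beiwe's actual code: the key layout `<participant_id>/<stream>/<YYYY-MM-DD>.csv`, the column names, and the in-memory dictionary standing in for an S3 bucket are all assumptions made so the example runs without AWS credentials.

```python
import csv
import io
import sqlite3
from datetime import date

# Hypothetical stand-in for an S3 bucket: object key -> CSV file contents.
FAKE_BUCKET = {
    "p001/accelerometer/2016-09-01.csv": "timestamp,x,y,z\n1472688000,0.1,0.0,9.8\n1472688001,0.2,0.1,9.7\n",
    "p001/accelerometer/2016-09-02.csv": "timestamp,x,y,z\n1472774400,1.5,0.3,9.1\n",
    "p001/gps/2016-09-01.csv": "timestamp,latitude,longitude\n1472688000,42.3,-71.1\n",
    "p002/accelerometer/2016-09-01.csv": "timestamp,x,y,z\n1472688000,0.0,0.0,9.8\n",
}


def select_keys(bucket, participant_id, stream, start, end):
    """Return keys for one participant and stream whose date is in [start, end]."""
    prefix = f"{participant_id}/{stream}/"
    selected = []
    for key in sorted(bucket):
        if not key.startswith(prefix):
            continue
        day = date.fromisoformat(key[len(prefix):-len(".csv")])
        if start <= day <= end:
            selected.append(key)
    return selected


def load_into_sqlite(bucket, keys):
    """Copy the selected accelerometer files into a fresh in-memory SQL database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accel (timestamp INTEGER, x REAL, y REAL, z REAL)")
    for key in keys:
        reader = csv.DictReader(io.StringIO(bucket[key]))
        rows = [(int(r["timestamp"]), float(r["x"]), float(r["y"]), float(r["z"]))
                for r in reader]
        conn.executemany("INSERT INTO accel VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


keys = select_keys(FAKE_BUCKET, "p001", "accelerometer", date(2016, 9, 1), date(2016, 9, 2))
conn = load_into_sqlite(FAKE_BUCKET, keys)
# A query the flat files alone cannot answer: count readings with x above a threshold.
count = conn.execute("SELECT COUNT(*) FROM accel WHERE x > 1.0").fetchone()[0]
print(keys, count)
```

The selection step only inspects keys, which is why requests are limited to participant, stream, and date; anything that depends on the contents of the files has to wait until the data are loaded into the new database.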

The current Beiwe data storage architecture (flat files on S3) can scale almost without limit; we could dump petabytes of data onto S3 and it should behave the same as it does now.

Server architecture

App

The participant/user experience of using an individual Beiwe app should be relatively unaffected by the scale of the Beiwe system, or by nearly anything that happens on the server.

Most of the app's (at least the Android app's) functionality is separate from any connection to the server. Passive data collection (accelerometer, GPS, WiFi, etc., which the participant/user doesn't see or interact with) runs the same no matter how or whether the app is connected to the server; the data collection process is entirely disconnected from the data upload process. Audio recordings and surveys are designed to run offline, so they happen purely locally on the phone; there's no back-and-forth with the server. That means that if the server connection slows down, a participant/user won't notice any difference when taking a survey or making a voice recording. The app's password login process also happens locally on the phone with no server interaction, so if the connection to the server is slow, a participant/user won't experience any slowdown when logging in to their app.
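The separation between data collection and data upload can be illustrated with a small sketch. This is not the app's code (the app is not written in Python); the class name `LocalBuffer`, its methods, and the `server_reachable` flag are hypothetical and only show the idea that recording never waits on the network.

```python
class LocalBuffer:
    """Illustrative model of on-phone storage that is decoupled from upload."""

    def __init__(self):
        self.pending = []    # datapoints recorded locally but not yet uploaded
        self.uploaded = []   # stands in for data the server has received

    def record(self, datapoint):
        # Collection writes locally and never touches the network,
        # so a slow or unreachable server cannot delay it.
        self.pending.append(datapoint)

    def try_upload(self, server_reachable):
        # Upload is a separate step; if it cannot run, data simply stay queued.
        if not server_reachable:
            return 0
        count = len(self.pending)
        self.uploaded.extend(self.pending)
        self.pending.clear()
        return count


buffer = LocalBuffer()
buffer.record({"stream": "accelerometer", "x": 0.1})
sent_offline = buffer.try_upload(server_reachable=False)   # nothing sent, nothing lost
buffer.record({"stream": "gps", "latitude": 42.3})
sent_online = buffer.try_upload(server_reachable=True)     # both queued datapoints sent
print(sent_offline, sent_online, len(buffer.pending))
```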

There are two app functions (in the Android app) that do depend on a real-time connection to the server, and therefore are affected by a slowdown on the server:

  1. The signup/registration process, in which the participant/user enters a participant ID and a password, and the app checks with the server to see whether that participant ID is valid and not yet taken. This has a few back-and-forth steps; it also involves exchanging app configuration data for the study and encryption keys for that participant. If the server slows down, new participants will notice the slowdown when registering for a new study.

  2. Viewing past survey responses in the app. For privacy and security reasons, survey response data are asymmetrically encrypted when they're stored on the phone, and even the app can't read its own data. Because of this, when a participant/user views their own past survey answers, the app downloads (via an HTTPS connection) the data from the server. If the server slows down, the graph of past survey responses will load more slowly.

Because app functionality is so separate from the server's functionality, no significant part of the app needs to change to accommodate massive scale.
