
Beiwe scalability

When we designed Beiwe, we planned for it to scale to tens or hundreds of thousands of participants/users in months-long or years-long studies. There are three basic parts of the Beiwe system: the mobile app, the data storage, and the server that sits between the two. The mobile app and data storage inherently support massive scale. Scaling the server is straightforward but does require the server to be resized as traffic increases.

Data storage

Beiwe app data are stored as flat files on Amazon Web Services' Simple Storage Service (S3), rather than in a conventional database. The main reason for this is that a conventional database runs into scaling issues at certain sizes, specifically when the database grows too large to be stored on a single machine and needs to be "sharded" into multiple pieces that can be run on multiple machines. Beiwe's flat file storage architecture avoids that problem.

Beiwe generates a massive amount of data; a participant in a study with all data streams turned on could generate about 1 gigabyte of data per month. Most of that is accelerometer data, since every accelerometer datapoint is about 50 bytes of data, and the accelerometer records about 5 to 100 datapoints per second. (Other data streams produce less data simply because they record less frequently.) This means that a study using the accelerometer data stream on 100 participants for 10 months could generate about 1 terabyte of data.
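As a rough sanity check of those figures, here is a small back-of-envelope calculation; the 10 datapoints per second used below is an assumed mid-range sampling rate (the text above cites 5 to 100 per second):

```python
# Back-of-envelope check of the data-volume figures above (a sketch, not a measurement).
BYTES_PER_DATAPOINT = 50        # approximate size of one accelerometer datapoint
DATAPOINTS_PER_SECOND = 10      # assumed mid-range rate; real phones record 5-100 per second
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

bytes_per_month = BYTES_PER_DATAPOINT * DATAPOINTS_PER_SECOND * SECONDS_PER_MONTH
print(f"~{bytes_per_month / 1e9:.1f} GB per participant per month")   # -> ~1.3 GB

study_bytes = bytes_per_month * 100 * 10    # 100 participants for 10 months
print(f"~{study_bytes / 1e12:.1f} TB for the whole study")            # -> ~1.3 TB
```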

Conventional databases on AWS can only scale up to about 6 terabytes before they need to be sharded onto multiple machines, and a single 6 TB database can cost over $2,000 per month. Instead, Beiwe uses Amazon S3, which is a file storage service of effectively unlimited capacity, and 6 terabytes of data stored on S3 cost much less: about $180 per month. Instagram stored its billions of photos on Amazon S3 (before Instagram was acquired by Facebook); this allowed Instagram to sign up hundreds of millions of users without having to worry about scaling its data storage.
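The S3 cost figure works out the same way; a one-line sketch, assuming the roughly $0.03 per GB-month standard-storage price that applied at the time (current AWS prices differ):

```python
S3_PRICE_PER_GB_MONTH = 0.03            # assumed 2016-era standard-storage price
print(f"~${6 * 1000 * S3_PRICE_PER_GB_MONTH:.0f} per month for 6 TB")   # -> ~$180
```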

There are two downsides of storing the data in flat files on S3:

  1. You can't run complex queries, like you can with a SQL database. You can only request blocks of data by date range, participant ID, and data stream type. To run complex queries, you have to download data blocks from S3, put them in a new database, and run queries on that new database (see the sketch after this list).

  2. Data retrieval from S3 is slower than from a conventional database. Conventional databases are optimized for read and write latencies measured in milliseconds, whereas pulling an individual file from S3 can take closer to a whole second. (Downloading many gigabytes takes minutes, but the bottleneck there is the server, not S3.) When we decided to go with S3, we judged that storing massive amounts of data was more important than optimizing data read speeds. Also, increasing the amount of Beiwe data stored in S3 should have no effect on data read speeds.
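To make the first downside concrete, the sketch below shows what "requesting a block of data" looks like in practice using the boto3 AWS SDK: list the flat files for one participant, data stream, and date range, download them, and then load them into whatever database or analysis tool you want to query. The bucket name and key layout are hypothetical stand-ins, not Beiwe's actual storage layout.

```python
# A hedged sketch of downside 1: no SQL queries, just listing and downloading flat files.
# The bucket name and the key layout (study/participant/stream/date) are hypothetical.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "example-beiwe-data"                                 # hypothetical bucket name
PREFIX = "study_123/participant_abc/accelerometer/2016-10-"   # one participant, stream, month

os.makedirs("downloads", exist_ok=True)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        local_path = os.path.join("downloads", os.path.basename(obj["Key"]))
        s3.download_file(BUCKET, obj["Key"], local_path)

# Any query more complex than "give me this block of data" then runs against the
# downloaded files, for example after loading them into a local database.
```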

The current Beiwe data storage architecture (flat files on S3) can scale effectively infinitely; we could dump petabytes of data onto S3 and it should behave the same as it does now, without any necessary change in architecture.

Server architecture

The server is the one part of the Beiwe platform that needs to be modified to handle increased scale. The server has several main functions which are relatively independent of each other:

  • Receiving data uploads from the app and storing the data in S3. This also involves decrypting the asymmetrically-encrypted app data and re-encrypting the data with symmetric encryption for storage on S3 (this will be explained in more detail in a dedicated documentation page on encryption). The scaling concern here is that the server needs to be able to handle all the simultaneous data uploads from the installed Beiwe mobile apps, or else the Beiwe mobile apps will experience a slower upload connection.

  • Consolidating raw data imported from the Beiwe app. Raw data are imported in files covering varying durations, and this process sorts the raw data into standardized data files of uniform, one-hour length; we call the process "file chunking". The scaling concern here is that the server runs the file chunking operation every hour, and if file chunking takes longer than an hour, one chunking operation can delay the next. When the server gets more new data every hour, file chunking takes longer, so the server's RAM and CPU need to increase to keep the hourly file chunking under an hour (a simplified sketch of the chunking idea appears after this list).

  • Enabling researchers to download and decrypt data stored on S3. The download functionality isn't directly affected by the number of study participants, but if enough researchers are simultaneously downloading data, it could consume resources that would otherwise be devoted to other functionality on the server.

  • Running the Beiwe researcher/administrator portal/web interface. Running the web interface is much less resource-intensive than the other functionality, because there are many fewer researchers than study participants. If there are ever more than a few hundred researchers simultaneously using the web interface, we may need to increase the size of the server to handle that traffic.
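The file chunking step mentioned above can be pictured with a simplified sketch: read raw files that cover varying durations, group their rows by clock hour, and write one consolidated file per hour. The file names, directory layout, and CSV format below are illustrative assumptions, not Beiwe's actual implementation.

```python
# Simplified, hypothetical illustration of hourly "file chunking". Assumes raw CSV rows
# whose first column is a millisecond timestamp; real Beiwe file formats differ.
import csv
import glob
import os
from collections import defaultdict
from datetime import datetime, timezone

rows_by_hour = defaultdict(list)

for raw_path in glob.glob("raw_uploads/*.csv"):          # raw files cover varying durations
    with open(raw_path, newline="") as f:
        for row in csv.reader(f):
            ts = datetime.fromtimestamp(int(row[0]) / 1000, tz=timezone.utc)
            rows_by_hour[ts.strftime("%Y-%m-%d_%H")].append(row)

os.makedirs("chunked", exist_ok=True)
for hour_key, rows in rows_by_hour.items():              # one uniform, one-hour file each
    with open(f"chunked/accelerometer_{hour_key}.csv", "w", newline="") as f:
        csv.writer(f).writerows(sorted(rows, key=lambda r: int(r[0])))
```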

Of these functions, the first two (receiving app data uploads and consolidating data on S3) are the ones that most tax server resources, and that therefore determine how large the server needs to be.

All server functionality is currently running on one Amazon Elastic Compute Cloud (EC2) server, which has a fixed size. It's relatively easy to increase the size (RAM, hard drive, and CPU) of the EC2 server by taking a "snapshot" or "machine image" of the EC2 server and copying it onto a larger EC2 server; this creates a few minutes of server downtime.
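As a rough sketch of that resize procedure with the boto3 AWS SDK (the instance ID and instance type below are placeholders, and real resizes involve extra steps, such as moving the Elastic IP or DNS entry to the new server):

```python
# Sketch of "image the server, then relaunch it on a larger instance type."
# The instance ID and target instance type are placeholders.
import boto3

ec2 = boto3.client("ec2")

# 1. Take a machine image (AMI) of the current backend server.
image = ec2.create_image(InstanceId="i-0123456789abcdef0", Name="beiwe-backend-resize")

# 2. Wait for the image to become available, then launch it on a larger instance type.
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])
ec2.run_instances(ImageId=image["ImageId"], InstanceType="m4.2xlarge", MinCount=1, MaxCount=1)
```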

At some point, Beiwe will become too large to run on a single EC2 server. There are three ways to handle that:

  1. Run the Beiwe backend application on multiple, identical EC2 servers behind a Load Balancer (also an AWS product). Every time a Beiwe mobile app tries to connect to the server, the Load Balancer chooses one of the EC2 servers and hands the connection to it. We have previously configured an Amazon Load Balancer on another project unrelated to Beiwe.

  2. Run the Beiwe backend application on Amazon Elastic Beanstalk. Elastic Beanstalk is an automatically-scaling server technology that we have previously used and configured on other projects unrelated to Beiwe.

  3. Run a customized scale-out of the various components of Beiwe. Some components, such as the database and the data consolidation processes, are trivially separable; others, such as the website components, would require elements of option 1 or option 2 to implement effectively.

We may also separate out the data consolidation functionality so that it runs on its own server; this would make it easier to tailor each server's size to its particular workload.

We have been monitoring server resources and handling small scale increases as necessary, and we will continue to do so. But if a massive Beiwe scale increase is anticipated, a few weeks of advance planning would help us prepare for a sudden spike. Beiwe doesn't need any radical architectural changes to handle greater scale, but parts of the server do need to be reconfigured.

Mobile app

The participant/user experience of using an individual Beiwe app should be relatively unaffected by the scale of the Beiwe system, or by nearly anything that happens on the server.

Most app functionality is unaffected by server speed or scale

Most of the app's (at least the Android app's) functionality is separate from any connection to the server. Passive data collection (accelerometer, GPS, WiFi, etc., which the participant/user doesn't see or interact with) runs the same no matter how or whether the app is connected to the server; the data collection process is entirely disconnected from the data upload process. Audio recordings and surveys are designed to run offline, so they happen purely locally on the phone; there's no back-and-forth with the server. That means that if the server connection slows down, a participant/user won't notice any difference when taking a survey or making a voice recording. The app's password login process also happens locally on the phone with no server interaction, so if the connection to the server is slow, a participant/user won't experience any slowdown when logging in to their app.

There are two app functions (in the Android app) that do depend on a real-time connection to the server, and therefore are affected by a slowdown on the server:

  1. The signup/registration process, in which the participant/user enters a participant ID and a password, and the app checks with the server to see whether that participant ID is valid and not yet taken. This has a few back-and-forth steps; it also involves exchanging app configuration data for the study and encryption keys for that participant. If the server slows down, new participants will notice the slowdown when registering for a new study.

  2. Viewing past survey responses in the app. For privacy and security reasons, survey response data are asymmetrically encrypted when they're stored on the phone, and even the app can't read its own data. Because of this, when a participant/user views their own past survey answers, the app downloads (via an HTTPS connection) the data from the server. If the server slows down, the graph of past survey responses will load more slowly.

Because app functionality is so separate from the server's functionality, no significant part of the app needs to change to accommodate massive scale.

How the app handles server downtime

The app is also built to handle server downtime: if the server goes offline for a few minutes or hours for maintenance or upgrades, participants/users will not be able to register new phones or view past survey responses, but everything else will work normally for them. During server downtime, Beiwe apps simply hold on to their data until they are able to upload again, then resume uploading as normal once the server is back online.
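That hold-and-retry behavior can be illustrated with a short sketch (written in Python for readability, even though the app itself is an Android app; the endpoint and directory names are placeholders):

```python
# Illustration of "hold data until upload succeeds": a file is deleted from the device
# only after the server confirms receipt, so downtime delays uploads instead of losing data.
import glob
import os
import requests

UPLOAD_URL = "https://example.invalid/upload"      # placeholder, not the real endpoint

def try_upload_pending_files(directory="pending_uploads"):
    for path in glob.glob(os.path.join(directory, "*")):
        try:
            with open(path, "rb") as f:
                response = requests.post(UPLOAD_URL, files={"file": f}, timeout=30)
            if response.ok:
                os.remove(path)                    # delete only after a successful upload
            # otherwise keep the file and retry on the next scheduled upload attempt
        except requests.RequestException:
            pass                                   # server unreachable: keep file, try later
```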
