Split the server for greater scalability #292
Here are the current servers that e-mission is running.
The OTP and nominatim servers seem to be fine. The habitica server sometimes has registration issues (https://github.com/e-mission/e-mission-server/issues/522), but that doesn't seem to be related to performance. The biggest issue is in the webapp. The performance of the webapp + server (without the pipeline running) seems acceptable. So the real issue is the pipeline + the database running on the same server. To be reasonable, we should probably split the server into three parts.
Technically, the pipeline can later become a really small launcher for serverless computation if that's the architecture that we choose to go with. For now, we want a memory-optimized instance for the database, since mongodb caches most results in memory. The webapp and pipeline can probably remain as general-purpose instances, but a bit more powerful. |
wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346752431, we probably want the following:
|
Looking at the configuration in greater detail:
In this case, though, since the database is already on an EBS disk, the overhead should be low. |
EBS storage costs are apparently unpredictable, because we pay for both storage and I/O. |
I'm unsure whether General Purpose (SSD) EBS volumes are 10 cents/GB-month. So the additional cost for going from
So the additional cost is minimal. Also, all the documentation says that instance storage is ephemeral, but I know for a fact that when I shut down and restart my m3 instances, the data in the root volume is retained. The root volumes, except for the one on the special database instance, are typically 8 GB in size. Does this mean that m3 instances now include EBS storage by default? Am I paying for them? I guess so, but 8 GB is so small (< 10 cents a month max) that I probably don't notice. Also, it looks like the EBS-backed instances do have ephemeral storage (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/RootDeviceStorage.html). So we should go with the |
wrt ephemeral storage for instances: instance-store volumes can apparently be added at the time the instance is launched (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/add-instance-store-volumes.html)
|
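For reference, here is a minimal sketch (not from the original thread) of how instance-store volumes can be requested at launch time with boto3; the AMI ID, key pair name, instance type, and device names are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",        # placeholder AMI
    InstanceType="m3.large",       # an instance family that ships with instance store
    KeyName="my-key",              # placeholder key pair name
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        # Map the first ephemeral disk; this only has an effect on instance
        # types that actually come with instance-store volumes.
        {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
    ],
)
print(response["Instances"][0]["InstanceId"])
```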
Asked a question on stackoverflow. But empirically, it looks like there is ephemeral storage on m3 instances but not on m4. So the m3 instance has a 32 GB instance-store volume.
m3
m4
|
I am going to create
Note that the EBS volume that hosts the database is currently associated with 9216 IOPS.
The volume is 3072 GB, so this is 3072 * 3 = 9216, i.e. the baseline performance. |
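As a sanity check on that arithmetic, a tiny helper (my own addition, assuming the standard gp2 rule of 3 IOPS per provisioned GB with a floor of 100) reproduces the number:

```python
# gp2 baseline rule of thumb: 3 IOPS per provisioned GB, floor of 100,
# capped at a ceiling that AWS has raised over the years (10000 at the time).
def gp2_baseline_iops(size_gb, ceiling=10000):
    return max(100, min(3 * size_gb, ceiling))

print(gp2_baseline_iops(3072))  # -> 9216, matching the value shown in the console
```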
Given those assumptions, the monthly budget for one installation is:
Storage:
So current total per month:
When I provision the servers for the eco-escort project, the costs will go up by
to
Storage details
Current mounts on the server:
From the UI, the EBS block devices are:
So it looks like we have 3 EBS devices:
And we have two ephemeral volumes:
|
wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346866854,
It turns out that m4.* is actually cheaper than m3.* (https://serverfault.com/a/885060/437264). The difference for
and we can add ephemeral disks to |
Creating a staging environment first. This can be the open data environment used by the test phones. Since this is an open data environment, we need an additional server that runs the public ipython notebook server. We can't re-use the analysis server, since we need a read-only connection to the database. There is now a new
Turns out that we can't create ephemeral storage for these instances, though. I went to the Add Storage tab and tried to add a volume, and the only option was the
We also need to set up a VPC between the servers so that the database cannot be accessed from the general internet. It looks like the VPC is free as long as we don't need a VPN or a NAT. Theoretically, though, we can just configure the incoming security policy for mongodb, even without a VPC.
I have created:
|
After deploying the servers, we need to set them up. The first big issue in setup is securing the database server. We will use two methods to secure the server:
Restricting network access (at least naively) is pretty simple - we just need to set up the firewall correctly. Later, we should explore the creation of a VPC for greater security. wrt authentication, the viable options are:
The first two are both username/password based authentication, which I am really reluctant to use. There is no classic public-key authentication mechanism. |
I am reluctant to use the username/password based authentication because then I would need to store the password in a filesystem somewhere and make sure to copy/configure it every time. But in terms of attack vector, it seems around the same as public-key based authentication. If the attacker gets access to the connecting hosts (webapp or analysis), it seems like she would have access to both the password and the private key. The main differences are:
|
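For concreteness, a username/password connection from a service host might look roughly like the sketch below; the config file path and field names are hypothetical, not the actual e-mission configuration format.

```python
import json
from pymongo import MongoClient

# Hypothetical config file holding the secret we would rather not keep on disk.
with open("conf/storage/db.conf") as f:
    conf = json.load(f)

client = MongoClient(
    conf["host"],                # private address of the database host
    username=conf["username"],
    password=conf["password"],
    authSource="admin",
)
print(client.admin.command("ping"))
```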
We can do this, but we need to get SSL certificates for TLS-based encryption. I guess a self-signed certificate should be fine, since mongodb will only be accessed from the analysis and webapp hosts, which we control. But we can also probably avoid it if all communication is through an internal subnet on the VPC. Basically, it seems like there are multiple levels of hardening possible:
Adding authentication
If we use option 2+ above, adding authentication does not appear to provide much additional protection from external hackers. Assuming no firewall bugs, if a hacker wants to access the database, they need to first hack into one of the service hosts to generate the appropriate source header. And if they do that, they can always just see the auth credentials in the config file. However, it can prevent catastrophic issues if there really is a firewall or VPC bug, and a hacker is able to inject malicious packets that purportedly come from the service hosts. Unless there is an encryption bug, moving to option (3) will harden the setup further. Authentication seems most useful when it is combined with Role-Based Access Control (RBAC). RBAC can be used to separate read-only exploration (e.g. on a public server) from read-write computation. But it can go beyond that - we can make the webapp write to the timeseries and read-only from the aggregate, but make the analysis server read-only from the timeseries and able to write to the analysis database |
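A hedged sketch of what option (3) would look like from the client side, assuming mongod is started with TLS enabled and a certificate signed by our self-signed CA; the host address and CA file path are placeholders.

```python
from pymongo import MongoClient

client = MongoClient(
    "mongodb://10.0.1.10:27017/",           # placeholder private address of the DB host
    tls=True,                               # encrypt traffic to mongod
    tlsCAFile="/etc/ssl/e-mission-ca.pem",  # placeholder self-signed CA bundle
)
print(client.admin.command("ping"))
```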
wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-350631885,
Concrete proposal
listen only to the private IP, all communication to/from the database is in the VPC, no auth |
It looks like all instances created in the past year are assigned to the same VPC and the same subnet in the VPC (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/default-vpc.html). In general, we don't want to share the subnet with other servers, because then if a hacker got access to one of the other servers, they could sniff all the traffic on the subnet and potentially read the data. For the open data servers, this may be OK since the data is open, and we have firewall restrictions on where we can accept messages from. But what about packet spoofing and potentially deleting data? Let's just make another (small) subnet. |
I can't seem to find a way to list all the instances in a particular subnet. Filed https://serverfault.com/questions/887552/aws-how-do-i-find-the-list-of-instances-associated-with-a-particular-subnet |
Ok, just to experiment with this for the future, we will set up a small subnet that hosts only the database and the analysis server. From https://aws.amazon.com/vpc/faqs/
So basically, this scenario: wait, the analysis server cannot be in the private subnet then, because it needs to talk to external systems such as habitica and the real-time bus API etc. We should really split the analysis server into two subnets too - external-facing and internal-facing. But since that will require some additional software restructuring, let's just put it in the public subnet for now. I won't provision a NAT gateway for now - I will explore ipv6-only options, which do not require a (paid) NAT gateway and can use the (free) egress-only internet gateway: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/egress-only-internet-gateway.html |
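A rough boto3 sketch of the pieces mentioned here (a small subnet plus an egress-only internet gateway for IPv6-only outbound access); the VPC ID and CIDR block are placeholders, and the actual setup below was done through the VPC wizard rather than the API.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Small private subnet for the database (and, eventually, internal-facing analysis).
subnet = ec2.create_subnet(
    VpcId="vpc-xxxxxxxx",      # placeholder: the VPC created by the wizard
    CidrBlock="10.0.1.0/28",   # placeholder: 16 addresses is plenty
)

# Egress-only internet gateway: free IPv6-only outbound access, no inbound.
eigw = ec2.create_egress_only_internet_gateway(VpcId="vpc-xxxxxxxx")

print(subnet["Subnet"]["SubnetId"])
print(eigw["EgressOnlyInternetGateway"]["EgressOnlyInternetGatewayId"])
```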
Ok so followed the VPC wizard for scenario 2 and created
Only
|
The default wizard configuration turns off "Auto-assign Public IP" because the assumption appears to be that we will use elastic IPs. Testing this scenario by editing the network interface for our provisioned servers and then turning it on later or manually assigning IPs. |
Service instances
Turns out you can't edit the network interface, but you can create a new one and attach the volumes.
Before migration
IP: 54.196.134.233
Migrate
After migration
|
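Assuming "create a new one" refers to a replacement network interface, the equivalent boto3 calls would look roughly like this sketch; all IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a fresh ENI in the target subnet with the service security group...
eni = ec2.create_network_interface(
    SubnetId="subnet-xxxxxxxx",   # placeholder
    Groups=["sg-xxxxxxxx"],       # placeholder
)

# ...and attach it to the running instance as a second interface.
ec2.attach_network_interface(
    NetworkInterfaceId=eni["NetworkInterface"]["NetworkInterfaceId"],
    InstanceId="i-xxxxxxxxxxxxxxxxx",  # placeholder
    DeviceIndex=1,
)
```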
Database instance
Migration
Ah!
No matter - that is what I want.
Ensure that the security group allows ssh from the webserver.
Try to ssh from the webserver.
Try to ssh from the analysis server.
Try to ssh to the private address from outside.
Tighten up the outbound rules on all security groups to be consistent with
Couple of modifications needed for this to work.
|
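A sketch of the kind of security group rule implied by "allow ssh from the webserver": instead of an IP range, the ingress rule references the service hosts' security group. Group IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-database",          # placeholder: the database host's group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # Allow SSH only from instances that belong to the service group,
        # rather than from an IP range.
        "UserIdGroupPairs": [{"GroupId": "sg-services"}],  # placeholder
    }],
)
```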
Attaching the database volumes back, and then I think that setup is all done. I'm a bit unhappy about the NAT, but figuring out how to do DNS for ipv6 addresses is a later project, I think. |
Cannot attach the volumes because they are in a different availability zone. |
And our volumes don't have snapshots. Creating snapshots to explore this option... Restoring... Getting started with code now... |
Main code changes required:
|
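The gist of the change (as I understand it, with the file name and keys shown here as assumptions rather than the actual e-mission config schema) is to read the database URL from a config file instead of hard-coding localhost:

```python
import json
from pymongo import MongoClient

def get_db_client(conf_path="conf/storage/db.conf"):  # hypothetical path and schema
    try:
        with open(conf_path) as f:
            url = json.load(f)["timeseries"]["url"]
    except FileNotFoundError:
        # Mirrors the fallback messages seen in the logs later in this thread.
        print("storage not configured, falling back to sample, default configuration")
        url = "localhost"
    print("Connecting to database URL %s" % url)
    return MongoClient(url)
```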
changes to server done (e-mission/e-mission-server#535) |
installing mongodb now... |
Ok so now back to cleaning up data (based on https://github.com/e-mission/e-mission-server/issues/530#issuecomment-354711521).
Dump is done! Remaining steps:
|
Re-attach the volume to the new stack
I can see a couple of things that we should clean up. One is easy, and the other should be done later.
However, it seems like a bad idea to have weird data sitting around. Is it
So it looks like we ran the pipeline in April, back when we were still storing
Delete now or delete later? Let's just delete now, while we still have backups sitting around.
It looks like the first run was pre-cleaned trips, so we only have to delete raw-*
Ok, so now the oldest entries in the timeseries should be different.
Oops. It is now another generated result. Let's query and delete these as well.
Ok, so now the oldest entries in the timeseries should be different. Argh, we missed
Ok, so now we have the correct oldest entries in the timeseries.
|
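The deletions above were done interactively; a pymongo sketch of the same idea is below, with the database and collection names and the key prefix as assumptions based on the discussion.

```python
from pymongo import MongoClient

db = MongoClient("localhost").Stage_database      # hypothetical database name
analysis = db.Stage_analysis_timeseries           # hypothetical collection name

raw_query = {"metadata.key": {"$regex": "^segmentation/raw_"}}  # assumed key prefix

# Count first so we know what we are about to throw away...
print(analysis.count_documents(raw_query))

# ...then delete the stale raw-* results while the backups still exist.
result = analysis.delete_many(raw_query)
print("deleted", result.deleted_count)
```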
Ok, so now that we believe that the database is fine, we can run the pipeline again for the first time in forever.
Note that we now have entries in the timeseries that are client stats only and
A quick check shows that we just read from the UUID database.
There are other methods that still use
Ok, so let's run this script!
It's taking a long time to even just get started. I wonder if we are using
Looking at the launcher logs, it is still iterating through the users and querying for the number of entries in the usercache. I don't even think we use that functionality and can probably get rid of it in the next release. Ah, so now processes have been launched. Hopefully this first run will be done by tomorrow morning.
|
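A rough reconstruction of the slow start-up step described above (iterate over every registered UUID and count its usercache entries); database and collection names are assumptions for illustration.

```python
from pymongo import MongoClient

db = MongoClient("localhost").Stage_database      # hypothetical database name

# One count query per registered user; with many users this dominates start-up time.
for entry in db.Stage_uuids.find():
    uuid = entry["uuid"]
    count = db.Stage_usercache.count_documents({"user_id": uuid})
    print(uuid, count)
```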
We should also remove the |
I'm also seeing some errors with saving data, need to make a pass through the errors.
|
AMPLab is using too many resources, so I have to trim my consumption. |
Copied
But the new dataset is only 20GB.
So what is missing? |
Yup!
And
|
Shutting down the old server now. RIP! You were a faithful friend and will be missed. |
In the past 4 days, the compute has increased by $50. The storage has increased by $200. We need to turn off some storage. Wah! Wah! What if I lose something important?! I guess you just have to deal with it... |
Deleted all related storage. |
Even with all the deletions, we spent ~ $50/day. This is a problem, because we will then end up spending an additional $1040 for the rest of the month, and we have already spent ~ $1500. This also means that we won't be under $1000 for next month. Since our reserved instances already cost $507 and the m3/m4 legacy servers cost ~ $639, we have to keep our storage budget under $500 to stay at my preferred 50% of my available budget. The storage is mostly going towards the provisioned IOPS storage. I don't think I actually need 3 TB. Current storage is
We should be able to drop to:
If we still need to reduce after that, we can change to:
Let's see how easy it is to resize EBS volumes |
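A back-of-the-envelope comparison, using prices I believe were current for us-east-1 at the time (gp2 at $0.10/GB-month; io1 at $0.125/GB-month plus $0.065 per provisioned IOPS-month); treat the numbers as assumptions, not quotes.

```python
# Assumed us-east-1 prices at the time: gp2 $0.10/GB-month,
# io1 $0.125/GB-month + $0.065 per provisioned IOPS-month.
def io1_monthly(size_gb, piops):
    return 0.125 * size_gb + 0.065 * piops

def gp2_monthly(size_gb):
    return 0.10 * size_gb

print("io1, 3 TB @ 9216 IOPS: ~$%.0f/month" % io1_monthly(3072, 9216))  # ~$983
print("gp2, 1 TB:             ~$%.0f/month" % gp2_monthly(1024))        # ~$102
```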
Not too hard, you just have to copy the data around. |
This was a bit trickier than one would expect because
Note that our data is also on an encrypted filesystem, so our steps were:
At this point, the only diff between them is
which makes sense. So now it's time to reverse the steps and attach new_data back to the server |
Reversed steps, restarted server. No errors so far. |
Done. Closing this issue for now. |
Burn rate is now $33/day. Should go down after we turn off the air quality server |
Burn rate is now roughly $13/day (1497 - 1419 = 78 over 6 days = $13/day) |
This fixes https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352197949 Also add a new test case that checks for this. Also fix a small bug in the extraction script
… in the query This fixes https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352206464 Basically, if two sections are back to back, then the last point of the first section will overlap with the first point of the second section. So a query based on the start and end time for the first section will return the first point of the second section as well, which causes a mismatch between the re-retrieved and stored speeds and distances. We detect and drop the last point in this case.
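An illustrative (not verbatim) sketch of the detect-and-drop logic; the field names are assumptions.

```python
def points_for_section(points, start_ts, end_ts, next_section_start_ts=None):
    """Return the location points for a section, assuming each point is a dict
    with a 'ts' timestamp (field name is an assumption for illustration)."""
    selected = [p for p in points if start_ts <= p["ts"] <= end_ts]
    # If the last retrieved point coincides with the start of the next section,
    # it really belongs to that section; drop it so that re-computed speeds and
    # distances match the stored ones.
    if (next_section_start_ts is not None and selected
            and selected[-1]["ts"] == next_section_start_ts):
        selected = selected[:-1]
    return selected
```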
This fixes https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352219808 dealing with using pymongo in a multi-process environment
```
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe> "MongoClient opened before fork. Create MongoClient "
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe> "MongoClient opened before fork. Create MongoClient "
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe> "MongoClient opened before fork. Create MongoClient "
```
Spawning instead of forking ensures that the subprocesses don't inherit the MongoClient object from the parent and create new ones instead.
```
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
debug not configured, falling back to sample, default configuration
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
```
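A minimal reproduction of the spawn-instead-of-fork fix; the worker body is a placeholder for the actual pipeline call.

```python
import multiprocessing as mp
from pymongo import MongoClient

def run_pipeline_for_user(uuid):
    # Create the client inside the worker, after the process has started,
    # so nothing is inherited from the parent across a fork.
    client = MongoClient("localhost")
    # ... run the intake pipeline for this user using `client` ...
    client.close()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # avoids the "MongoClient opened before fork" warning
    with ctx.Pool(processes=4) as pool:
        pool.map(run_pipeline_for_user, ["uuid-1", "uuid-2"])
```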
i.e. from "test-phone-user-01" to the values at https://github.com/e-mission/e-mission-server/issues/530#issuecomment-351561385
The test mappings were for local testing. Actual mappings from: https://github.com/e-mission/e-mission-server/issues/530#issuecomment-351561385
See https://github.com/e-mission/e-mission-server/issues/530#issuecomment-353803676 Note that I remove all entries whose section entry is not valid and that have snuck over from elsewhere. Regression described at https://github.com/e-mission/e-mission-server/issues/530#issuecomment-353803676 now fixed (I ran it three times in a row without failing)
Although people won't see the ipv6 until they start to use it. Note that there are a bunch of manual steps to turn on IPv6 for this setup. This change merely automates the tedious work of setting up the routing tables and security groups. https://github.com/e-mission/e-mission-server/issues/530#issuecomment-354061649 At this point, I declare that I am done with tweaking the configuration and will use the configuration deployed from this template (including 75d19de, 7a32bb6...) as the setup for the default/reference e-mission server.
The server scalability had deteriorated to the point where we were unable to run the pipeline even once per day. While part of this is probably just the way we are using mongodb, part of it is also that the server resources are running out.
So I turned off the pipeline around a month ago (last run was on 2017-10-24 21:41:18).
Now, I want to re-provision with a better, split architecture, and reserved instances for lower costs.