
Split the server for greater scalability #292

Closed

shankari opened this issue Nov 21, 2017 · 94 comments

@shankari

The server scalability had deteriorated to the point where we were unable to run the pipeline even once per day. While part of this is probably just the way we are using mongodb, part of it is also that the server resources are running out.

So I turned off the pipeline around a month ago (last run was on 2017-10-24 21:41:18).

Now, I want to re-provision with a better, split architecture, and reserved instances for lower costs.

@shankari

Here are the current servers that e-mission is running.

  • aws-otp-server: m3.large
  • aws-nominatim: m3.large
  • habitica-server: m3.large
  • aws-webapp: m3.xlarge

The OTP and nominatim servers seem to be fine. The habitica server sometimes has registration issues (https://github.com/e-mission/e-mission-server/issues/522), but that doesn't seem to be related to performance.

The biggest issue is the webapp. The performance of the webapp + server (without the pipeline running) seems acceptable. So the real issue is the pipeline + the database running on the same server. To scale reasonably, we should probably split the server into three parts:

  • database
  • webapp
  • pipeline (backend)

Technically, the pipeline can later become a really small launcher for serverless computation if that's the architecture that we choose to go with.

For now, we want a memory optimized instance for the database, since mongodb caches most results in memory. The webapp and pipeline can probably remain as general-purpose instances, but a bit more powerful.

@shankari

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346752431, we probably want the following:

- aws-otp-server: m3.large/m4.large
- aws-nominatim: m3.large/m4.large
- habitica-server: m3.large/m4.large

- aws-em-webapp: m3.xlarge/m4.xlarge
- aws-em-analysis: m3.xlarge/m4.xlarge
- aws-em-mongodb: m4.2xlarge/r3.xlarge/r4.xlarge/r3.2xlarge/r4.2xlarge

@shankari

Looking at the configuration in greater detail:

  1. for the m3.large/m4.large decision: the m3* series comes with local SSD instance storage (1 x 32 GB for large), while the m4* series is EBS-only, so we would have to pay extra for storage with the m4* series. It would therefore be vastly preferable to use the m3 series, at least for the 3 standalone systems which have to include their own data.

| Instance Type | vCPU | Memory (GiB) | Storage (GB) | Networking Performance |
| --- | --- | --- | --- | --- |
| m4.large | 2 | 8 | EBS Only | Moderate |
| m4.xlarge | 4 | 16 | EBS Only | High |
| m3.large | 2 | 7.5 | 1 x 32 SSD | Moderate |
| m3.xlarge | 4 | 15 | 2 x 40 SSD | High |
  2. for the database, the difference between the r3* and r4* series seems similar - e.g.

| Instance | vCPU | RAM (GiB) | Network | Local storage (GB) |
| --- | --- | --- | --- | --- |
| r4.xlarge | 4 | 30.5 | Up to 10 Gigabit | EBS-Only |
| r4.2xlarge | 8 | 61 | Up to 10 Gigabit | EBS-Only |
| r3.xlarge | 4 | 30.5 | Moderate | 1 x 80 |
| r3.2xlarge | 8 | 61 | Moderate | 1 x 160 |

In this case, though, since the database is already on an EBS disk, the overhead should be low.

@shankari

EBS storage costs are apparently unpredictable, because we pay for both storage and I/O.
https://www.quora.com/Whats-cons-and-pros-for-EBS-based-AMIs-vs-instance-store-based-AMIs
Some people actively advise against using EBS. And of course, the instance-store-based servers also have a ton of ephemeral storage and mostly (except the habitica server) work off static datasets. So for the otp, habitica and nominatim servers, it is pretty much a no-brainer to use the m3 instances.

@shankari

shankari commented Nov 24, 2017

Unsure whether m3* instances are available for reserved pricing, though.
https://aws.amazon.com/ec2/pricing/reserved-instances/pricing/
And the IOPS pricing only applies to provisioned IOPS volumes.
https://aws.amazon.com/ebs/pricing/

General purpose (gp2) EBS storage is 10 cents/GB-month. So the additional cost for going from *3 -> *4 is:

  • m3.large -> m4.large: 32 * 0.1 = max $3.20/month
  • m3.xlarge -> m4.xlarge: 80 * 0.1 = max $8/month
  • r3.xlarge -> r4.xlarge: 80 * 0.1 = max $8/month
  • r3.2xlarge -> r4.2xlarge: 160 * 0.1 = max $16/month

So the additional cost is minimal.

Also, all the documentation says that instance storage is ephemeral, but I know for a fact that when I shut down and restart my m3 instances, the data in the root volume is retained.
I do see that apparently all AMIs are currently launched with EBS root volumes by default
https://stackoverflow.com/a/36688645/4040267
and this is consistent with what I see in the console.

(screenshot, 2017-11-24, 12:03 am)

and, except for the special database EBS volume, they are typically 8GB in size. Does this mean that m3 instances now include EBS storage by default? Am I paying for them? I guess so, but 8GB is so small (under $1 a month) that I probably don't notice.

(screenshot, 2017-11-24, 12:05 am)

Also, it looks like EBS-backed instances can also have ephemeral storage (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/RootDeviceStorage.html). So we should go with the *3* instances if there are reserved instances that support them - otherwise, we should go with *4* instances - the difference in both cost and functionality is negligible compared to the savings of the reserved instance.

@shankari

wrt ephemeral storage for instances, they can apparently be added at the time the instance is launched (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/add-instance-store-volumes.html)

You can specify the instance store volumes for your instance only when you launch an instance. You can't attach instance store volumes to an instance after you've launched it.
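
For reference, a minimal boto3 sketch of declaring an instance store volume at launch time; the AMI ID and device name are placeholders, not the actual values used for these servers.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical launch request: the ephemeral (instance store) volume must be
# declared in BlockDeviceMappings at launch; it cannot be attached afterwards.
ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # placeholder AMI
    InstanceType="m3.large",
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
    ],
)
```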

@shankari

From the earlier comment:

So we should go with the *3* instances if there are reserved instances that support them - otherwise, we should go with *4* instances - the difference in both cost and functionality is negligible compared to the savings of the reserved instance.

There are reserved instances that support every single kind of on-demand instance including *3*.
(screenshot, 2017-11-24, 12:16 am)

@shankari

I looked at one m3 instance and one m4 instance and they both seem to be identical - one block device, which is the root device and is EBS.

m4.large / m3.large: (screenshots of the block device listings, 2017-11-24, 7:11 am)

@shankari

shankari commented Nov 24, 2017

Asked a question on serverfault:
https://serverfault.com/questions/885042/m3-instances-have-root-ebs-volume-by-default-so-now-what-is-the-difference-betw

But empirically, it looks like there is ephemeral storage on m3 instances but not on m4. So the m3 instance has a 32 GB /dev/xvdb, but the m4 instance does not. So why would you use m4 instead of m3? More storage is always good, right?

m3

ubuntu@ip-10-157-135-115:~$ sudo fdisk -l

Disk /dev/xvda: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1   *       16065    16771859     8377897+  83  Linux

Disk /dev/xvdb: 32.2 GB, 32204390400 bytes
255 heads, 63 sectors/track, 3915 cylinders, total 62899200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

ubuntu@ip-10-157-135-115:~$ mount | grep ext4
/dev/xvda1 on / type ext4 (rw)

m4

$ sudo fdisk -l
Disk /dev/xvda: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xea059137

Device     Boot Start      End  Sectors Size Id Type
/dev/xvda1 *     2048 16777182 16775135   8G 83 Linux

ubuntu@ip-172-30-0-54:~$ mount | grep ext4
/dev/xvda1 on / type ext4 (rw,relatime,discard,data=ordered)

@shankari

I am going to create m3.* reserved instances instead of m4.* instances across the board.
For the r3.* versus r4.*, there is actually some question since the r4.* instance has better network, which is important for a database.

Note that the EBS volume that hosts the database is currently associated with 9216 IOPS.
Is that used or provisioned? Let's check. According to the docs:

baseline performance is 3 IOPS per GiB, with a minimum of 100 IOPS and a maximum of 10000 IOPS.

The volume uses 3072 GB, so this is 3072 * 3 = 9216 = the baseline performance.
Let us see the actual performance. No more than 2 IOPS. But of course, we weren't running the pipeline. I am tempted to go with r4.* for the database server, just to be on the safe side.
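
As a sanity check on the quoted baseline formula, a one-liner (the clamp values are the ones from the docs quoted above):

```python
def baseline_iops(size_gib):
    # gp2 baseline per the quote above: 3 IOPS per GiB, min 100, max 10000
    return min(max(3 * size_gib, 100), 10000)

print(baseline_iops(3072))  # 9216, matching the value shown for our volume
```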

@shankari

shankari commented Nov 24, 2017

Given those assumptions, the monthly budget for one installation is:

- aws-otp-server: m3.large ($50) so we have storage
- aws-nominatim: m3.large ($50)
- habitica-server: m3.large ($50)

- aws-em-webapp: m3.xlarge ($90)
- aws-em-analysis: m3.xlarge ($90)
- aws-em-mongodb: r4.2xlarge ($245)

Storage:

- 3072 GB * 0.1 /GB = $307 (biggest expense by far, likely to grow bigger going forward, need to check causes of growth, but may be unavoidable)
- 40 GB * 0.1 / GB = $4 (probably want to put the e-mission server configuration on persistent storage)
- logs can stay on ephemeral storage, which we will have access to given planned m3.* creation

So current total per month:

$150 shared infrastructure,
$425 compute
$310 storage, increasing every month

$885 per month, increasing as we get more storage

When I provision the servers for the eco-escort project, the costs will go up by

$425 compute
$310 storage, increasing every month

$735 per month, increasing as we get more storage

to $885 + $735 = $1620 per month.
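
A quick arithmetic check of the totals above, using the rounded per-item estimates from this comment (USD):

```python
# Monthly budget sanity check (all figures rounded, as quoted above).
shared = 3 * 50                  # otp + nominatim + habitica at $50 each
compute = 90 + 90 + 245          # webapp + analysis + mongodb
storage = 310                    # ~307 (data volume) + ~4 (config volume)
per_install = compute + storage  # 735: what each additional installation adds
print(shared + per_install)      # 885: current total per month
print(shared + 2 * per_install)  # 1620: after the eco-escort servers are added
```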

Storage details

Current mounts on the server:

From the UI, EBS block devices are

/dev/sda1
/dev/sdd
/dev/sdf
$ mount  | grep ext4
/dev/xvda1 on / type ext4 (rw,discard)
/dev/xvdd on /home/e-mission type ext4 (rw)
/dev/mapper/xvdb on /mnt type ext4 (rw)
/dev/mapper/xvdc on /mnt/logs type ext4 (rw)
/dev/mapper/xvdf on /mnt/e-mission-primary-db type ext4 (rw)

$ df -h
Filesystem        Size  Used Avail Use% Mounted on
/dev/xvda1        7.8G  5.2G  2.2G  71% /
/dev/xvdd         7.8G  326M  7.1G   5% /home/e-mission
/dev/mapper/xvdb   37G   14G   22G  39% /mnt
/dev/mapper/xvdc   37G   19G   17G  54% /mnt/logs
/dev/mapper/xvdf  3.0T  141G  2.7T   5% /mnt/e-mission-primary-db

$ sudo fdisk -l

Disk /dev/xvda: 8589 MB, 8589934592 bytes
    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1   *       16065    16771859     8377897+  83  Linux

Disk /dev/xvdb: 40.3 GB, 40256929792 bytes
Disk /dev/xvdb doesn't contain a valid partition table

Disk /dev/xvdc: 40.3 GB, 40256929792 bytes
Disk /dev/xvdc doesn't contain a valid partition table

Disk /dev/xvdd: 8589 MB, 8589934592 bytes
Disk /dev/xvdd doesn't contain a valid partition table

Disk /dev/xvdf: 3298.5 GB, 3298534883328 bytes
Disk /dev/xvdf doesn't contain a valid partition table

Disk /dev/mapper/xvdb: 40.3 GB, 40254832640 bytes
Disk /dev/mapper/xvdb doesn't contain a valid partition table

Disk /dev/mapper/xvdc: 40.3 GB, 40254832640 bytes
Disk /dev/mapper/xvdc doesn't contain a valid partition table

Disk /dev/mapper/xvdf: 3298.5 GB, 3298532786176 bytes
Disk /dev/mapper/xvdf doesn't contain a valid partition table

So it looks like we have 3 EBS devices:

  • / which primarily has the OS, and /tmp/
    2.4G    /home
    1.7G    /tmp
    974M    /usr
    391M    /var
    
    $ du -sh /home/*
    308M    /home/e-mission
    2.1G    /home/ubuntu
    
    $ du -sh /home/ubuntu/*
    1.6G    /home/ubuntu/anaconda
    393M    /home/ubuntu/Anaconda2-4.0.0-Linux-x86_64.sh
    4.0K    /home/ubuntu/gencert
    4.0K    /home/ubuntu/tmp
    
  • /home/e-mission which primarily has some logs
    $ du -sm /home/e-mission/*
    1       /home/e-mission/app_store_review_test.stdinoutlog
    1       /home/e-mission/Berkeley_sections.stdinout.log
    1       /home/e-mission/iphone_2_test.stdinoutlog
    1       /home/e-mission/lost+found
    1       /home/e-mission/migration.log
    2       /home/e-mission/moves_collect.stdinoutlog
    2       /home/e-mission/pipeline.stdinoutlog
    1       /home/e-mission/pipeline_with_perf.log
    1       /home/e-mission/precompute_results.stdinoutlog
    65      /home/e-mission/remotePush.stdinoutlog
    240     /home/e-mission/silent_ios_push.stdinoutlog
    
  • /mnt/e-mission-primary-db which has the database

And we have two ephemeral volumes:

  • /mnt, which has the e-mission server install
  • /mnt/logs which has the periodic logs

@shankari

shankari commented Dec 7, 2017

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346866854,

I am going to create m3.* reserved instances instead of m4.* instances across the board.
For the r3.* versus r4.*, there is actually some question since the r4.* instance has better network, which is important for a database.

It turns out that m4.* is actually cheaper than m3.* (https://serverfault.com/a/885060/437264). The difference for large is $24.09/month (m3.large = $69.35, m4.large = $45.26), which is more than enough to pay for the equivalent EBS storage (~$3/month).
https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346766952

and we can add ephemeral disks to m4* instances for free when we create them.
That settles it, going with m4*.

@shankari

shankari commented Dec 8, 2017

Creating a staging environment first. This can be the open data environment used by the test phones. Since this is an open data environment, we need an additional server that runs the public ipython notebook server. We can't re-use the analysis server since we need to have a read-only connection to the database.

There is now a new m5 series, so we can just get a head start by deploying to that. It's about the same price, but has much greater EBS bandwidth.

Turns out that we can't create ephemeral storage for these instances, though. I went to the Add Storage tab and tried to add a volume, and the only option was an EBS volume.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html

We also need to set up a VPC between the servers so that the database cannot be accessed from the general internet. It looks like the VPC is free as long as we don't need a VPN or a NAT. Theoretically, though, we can just configure the incoming security policy for mongodb, even without a VPC.
https://aws.amazon.com/vpc/pricing/

I have created:

  • aws-op-webapp: m5.xlarge, 40GB storage ($90) (54.196.134.233)
  • aws-op-analysis: m5.xlarge, 40GB storage ($90) (52.87.159.49)
  • aws-op-public: m5.xlarge, 40GB storage ($90) (52.87.159.49)
  • aws-op-database: r4.2xlarge, 3 TB storage ($245) (34.201.243.180)

@shankari

shankari commented Dec 10, 2017

After deploying the servers, we need to set them up. The first big issue in setup is securing the database server. We will use two methods to secure the server:

  • we will restrict network access to the database port to the associated servers
  • we will turn on authentication and access control

Restricting network access (at least naively) is pretty simple - we just need to set up the firewall correctly. Later, we should explore the creation of a VPC for greater security.
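
For reference, a minimal boto3 sketch of that firewall rule; the security group IDs below are placeholders, not the actual groups:

```python
import boto3

ec2 = boto3.client("ec2")

# Allow mongodb (port 27017) on the database security group only from the
# webapp and analysis security groups; the group IDs are placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-0database0000000",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 27017,
        "ToPort": 27017,
        "UserIdGroupPairs": [
            {"GroupId": "sg-0webapp00000000"},
            {"GroupId": "sg-0analysis000000"},
        ],
    }],
)
```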

wrt authentication, the viable options are:

  • SCRAM-SHA-1
  • MONGODB-CR
  • x.509

The first two are both username/password based authentication, which I am really reluctant to use. There is no classic public-key authentication mechanism.

@shankari

shankari commented Dec 10, 2017

I am reluctant to use the username/password based authentication because then I would need to store the password in a filesystem somewhere and make sure to copy/configure it every time. But in terms of attack vector, it seems around the same as public-key based authentication.

If the attacker gets access to the connecting hosts (webapp or analysis), it seems like she would have access to both the password and the private key.

The main differences are:

  • if the attacker gets access to the place where we have stored the passwords for the long-term, the password based solution is compromised, although the public key solution is not. We can avoid this by storing the password securely, just like the private key to the webapp.
  • if authentication happens over plaintext, the public-key solution is not compromised (a sniffer only sees the public key), but the password-based solution is (the password itself is visible). We can avoid this by encrypting connections between the database and the webapp. This may also allow us to use x.509 based authentication, which is pretty close to public key authentication.

@shankari

We can avoid this by encrypting connections between the database and the webapp. This may also allow us to use x.509 based authentication

We can do this, but we need to get SSL certificates for TLS-based encryption. I guess a self-signed certificate should be fine, since the mongodb is only going to be connected to the analysis and webapp hosts, which we control. But we can also probably avoid it if all communication is through an internal subnet on the VPC.
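
If we do go the self-signed certificate route, the client-side change is small. A sketch using pymongo 3.x option names; the hostname and CA path are placeholders:

```python
from pymongo import MongoClient

# Client-side TLS with a self-signed CA (pymongo 3.x option names);
# the hostname and certificate path are placeholders.
client = MongoClient(
    "aws-op-database.internal",
    ssl=True,
    ssl_ca_certs="/etc/e-mission/mongodb-ca.pem",
)
print(client.server_info()["version"])
```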

Basically, it seems like there are multiple levels of hardening possible:

  • configure incoming and outgoing connections in the firewall, no auth
    Ease of use: 6 (easy, simple security group UI)
    Security: 1 (weak, since data transfer flows over the public internet without encryption)

  • listen only to the private IP, all communication to/from the database is in the VPC, no auth
    Ease of use: 4 (can set up VPC via UI)
    Security: 5 (pretty good, since all unencrypted data flow is internal. The only attack vector is if the hacker somehow compromises any of the services. Once this is done, she can either connect to the database directly, or run a packet sniffer on the network)

  • listen only to the private IP, all communication to/from the database is in the VPC, SSL certificates used, no auth
    Ease of use: 1 (need to get SSL certificates and setup a bunch of configuration)
    Security: 7 (pretty close to optimal, since even packet sniffers can't see anything)

Adding authentication

If we use option 2+ above, adding authentication does not appear to provide very much additional protection from external hackers. Assuming no firewall bugs, if a hacker wants to access the database, they need to first hack into one of the service hosts to generate the appropriate source header. And if they do that, they can always just see the auth credentials in the config file.

However, it can prevent catastrophic issues if there really is a firewall or VPC bug, and a hacker is able to inject malicious packets that purportedly come from the service hosts. Unless there is an encryption bug, moving to option (3) will harden the option further.

Authentication seems most useful when it is combined with Role-Based Access Control (RBAC). RBAC can be used to separate read-only exploration (e.g. on a public server) from read-write computation. But it can go beyond that - we can make the webapp write to the timeseries and read-only from the aggregate, but make the analysis server read-only from the timeseries and write to the analysis database.
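
A sketch of what that RBAC split could look like if we ever turn auth on. The collection names follow the Stage_* convention seen elsewhere in this thread; the hostname, role/user names and password are placeholders:

```python
from pymongo import MongoClient

# Sketch only, assuming auth gets enabled at some point.
db = MongoClient("aws-op-database.internal")["Stage_database"]

db.command("createRole", "webappRole",
           privileges=[
               # webapp: read/write the raw timeseries, read-only analysis results
               {"resource": {"db": "Stage_database", "collection": "Stage_timeseries"},
                "actions": ["find", "insert"]},
               {"resource": {"db": "Stage_database", "collection": "Stage_analysis_timeseries"},
                "actions": ["find"]},
           ],
           roles=[])
db.command("createUser", "webapp", pwd="placeholder", roles=["webappRole"])
```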

@shankari

shankari commented Dec 11, 2017

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-350631885,
given the tradeoffs articulated, I have decided to go with option (2) with no auth.

Concrete proposal

listen only to the private IP, all communication to/from the database is in the VPC, no auth
Ease of use: 4 (can set up VPC via UI)
Security: 5 (pretty good, since all unencrypted data flow is internal.

@shankari

shankari commented Dec 11, 2017

It looks like all instances created in the past year are assigned to the same VPC and the same subnet in the VPC (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/default-vpc.html). In general, we don't want to share the subnet with other servers, because then if a hacker got access to one of the other servers on the subnet, they could packet sniff all the traffic and potentially reconstruct the data. For the open data servers, this may be OK since the data is open, and we have firewall restrictions on where we can get messages from.

But what about packet spoofing and potentially deleting data? Let's just make another (small) subnet.

@shankari

I can't seem to find a way to list all the instances in a particular subnet. Filed https://serverfault.com/questions/887552/aws-how-do-i-find-the-list-of-instances-associated-with-a-particular-subnet
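
One way to get this programmatically while waiting for an answer; the subnet ID below is a placeholder:

```python
import boto3

ec2 = boto3.client("ec2")

# Filter describe_instances by subnet-id (the subnet ID is a placeholder).
resp = ec2.describe_instances(
    Filters=[{"Name": "subnet-id", "Values": ["subnet-xxxxxxxx"]}])
for reservation in resp["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance.get("PrivateIpAddress"))
```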

@shankari

shankari commented Dec 11, 2017

Ok, just to experiment with this for the future, we will set up a small subnet that hosts only the database and the analysis server.

From https://aws.amazon.com/vpc/faqs/
The minimum size of a subnet is a /28 (or 14 IP addresses.) for IPv4. Subnets cannot be larger than the VPC in which they are created.

multi-tier website, with the web servers in a public subnet and the database servers in a private subnet.

So basically, this scenario:
http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html

Wait, the analysis server cannot be in the private subnet then, because it needs to talk to external systems such as habitica and the real-time bus feeds etc. We should really split the analysis server into external-facing and internal-facing parts too. But since that will require some additional software restructuring, let's just put it in the public subnet for now.

I won't provision a NAT gateway for now - will explore ipv6-only options which will not require a (paid) NAT gateway and can use the (free) egress-only-internet gateway. http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/egress-only-internet-gateway.html

@shankari

shankari commented Dec 11, 2017

Ok so followed the VPC wizard for scenario 2 and created

  • aws-op-vpc
  • aws-op-public-subnet, aws-op-private-subnet
  • a NAT gateway and an egress-only internet gateway, and
  • aws-op-public-route, aws-op-private-route

Only aws-op-private-subnet has IPv6 enabled.

aws-op-public-route was associated with aws-op-public-subnet, but aws-op-private-route was marked as main and not associated with any subnet. That is consistent with

In this scenario, the VPC wizard updates the main route table used with the private subnet, and creates a custom route table and associates it with the public subnet.

In this scenario, all traffic from each subnet that is bound for AWS (for example, to the Amazon EC2 or Amazon S3 endpoints) goes over the Internet gateway. The database servers in the private subnet can't receive traffic from the Internet directly because they don't have Elastic IP addresses. However, the database servers can send and receive Internet traffic through the NAT device in the public subnet.

Any additional subnets that you create use the main route table by default, which means that they are private subnets by default. If you want to make a subnet public, you can always change the route table that it's associated with.

@shankari

shankari commented Dec 11, 2017

The default wizard configuration turns off "Auto-assign Public IP" because the assumption appears to be that we will use elastic IPs. Testing this scenario by editing the network interface for our provisioned servers and then turning it on later or manually assigning IPs.

@shankari

shankari commented Dec 11, 2017

Service instances

Turns out you can't edit the network interface settings of an existing instance, but you can create a new instance and attach the volumes.

Before migration

IP: 54.196.134.233
Able to ssh in

Migrate

  • Create a new m5.xlarge instance
  • attach it to the aws-op-vpc, aws-op-public-subnet and override the assignment settings for public IP and ipv6.
  • create security groups for the different kinds of instances
    • webapp
      • incoming SSH from home and HTTPS from the eecs hostname redirect
      • all outgoing traffic to both 0.0.0.0/0 and ::/0 (seems like we can tighten this)
    • analysis
      • incoming SSH from home
      • all outgoing traffic to both 0.0.0.0/0 and ::/0 (seems like we can tighten this)
    • public
      • incoming ssh from home and ports 8888 - 9999 for ipython notebook
      • all outgoing traffic to both 0.0.0.0/0 and ::/0
    • database
      • incoming ssh from webapp and mongodb from webapp, analysis and public
      • outgoing traffic to all ports on webapp, analysis and public over ipv4 (seems like we should add routes for patches)

After migration

  • Can ssh directly to all three public-facing servers
  • attached non-root EBS volumes were also deleted! Good we figured this out now! Created new volumes and attached them

@shankari

Database instance

Migration

  • Recreating instance, putting it into the private subnet, no assigned ipv4 address. It looks like after the instance is created, I can add a new private ip address, but not a public one.

Ah!

You can only use the auto-assign public IPv4 feature for a single, new network interface with the device index of eth0. For more information, see Assigning a Public IPv4 Address During Instance Launch.

No matter - that is what I want.

Ensure that the security group allows ssh from the webserver.

Try to ssh from the webserver.
Works!

Try to ssh from the analysis server.
Doesn't work!

Try to ssh to the private address from outside
Obviously doesn't work.

Tighten up the outbound rules on all security groups to be consistent with
http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html

Couple of modifications needed for this to work.

  • outbound ssh rule from the webapp to the database server to allow us to log in

  • DNS resolution needed to be enabled for the VPC
    Looking at http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-dns.html,
    DNS resolution is supposed to be enabled for VPCs created through the wizard, but is off for our VPC although it was created using the wizard

    $ ping www.google.com
    PING www.google.com (172.217.13.228) 56(84) bytes of data.
    64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=1 ttl=45 time=1.61 ms
    64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=2 ttl=45 time=1.61 ms
    64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=3 ttl=45 time=1.59 ms
    64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=4 ttl=45 time=1.64 ms
    ^C
    
  • DNS servers only support ipv4, so if we want to access the internet from the private subnet, we need to continue using the NAT gateway that the wizard set up for us.

    [ec2-user@ip-192-168-1-100 ~]$ ping www.google.com
    PING www.google.com (172.217.8.4) 56(84) bytes of data.
    <HANGS>
    ^C
    --- www.google.com ping statistics ---
    5 packets transmitted, 0 received, 100% packet loss, time 4081ms
    

    This is because the incoming rules for the nat only supported the default security group. Changing it to the database security group caused everything to start working.

@shankari

Attaching the database volumes back, and then I think that setup is all done. I'm a bit unhappy about the NAT, but figuring out how to do DNS for ipv6 addresses is a later project, I think.

@shankari

Cannot attach the volumes because they are in a different availability zone.
Per http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumes.html
you need to migrate the volumes to a different zone using their snapshots.
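
The snapshot-based move is scriptable too; a boto3 sketch with a placeholder volume ID and target zone:

```python
import boto3

ec2 = boto3.client("ec2")

# Snapshot the volume, then recreate it in the target availability zone;
# the volume ID and zone below are placeholders.
snap = ec2.create_snapshot(VolumeId="vol-xxxxxxxx", Description="move to new AZ")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
new_vol = ec2.create_volume(SnapshotId=snap["SnapshotId"], AvailabilityZone="us-east-1a")
print(new_vol["VolumeId"])
```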

@shankari

And our volumes don't have snapshots. Creating snapshots to explore this option...
Can't create snapshot - selected it and nothing happened.
So it looks like provisioned IOPS volumes have their snapshots under "Snapshots", not linked to the volume.

Restoring....
That worked.
Attached the three volumes back to the database.

Getting started with code now...

@shankari

Main code changes required:

  • support database hostname as part of configuration. There's already a field for this, but we should actually use it. Or potentially split it out into its own conf file (see the sketch after this list).
  • split out all the public stuff since it was really kludgy and is going to be on a separate server anyway
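
A minimal sketch of what the split-out conf file could look like; the file path, JSON keys and default here are illustrative assumptions, not the actual e-mission configuration format:

```python
import json
from pymongo import MongoClient

# Hypothetical conf/storage/db.conf (path, keys and default are assumptions):
# { "timeseries": { "url": "aws-op-database.internal" } }
def get_db_host(path="conf/storage/db.conf", default="localhost"):
    try:
        with open(path) as f:
            return json.load(f)["timeseries"]["url"]
    except IOError:
        return default

client = MongoClient(get_db_host())
```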

@shankari

changes to server done (e-mission/e-mission-server#535)
now it is time to deploy!

@shankari

installing mongodb now...

@shankari

shankari commented Jan 3, 2018

Ok so now back to cleaning up data (based on https://github.com/e-mission/e-mission-server/issues/530#issuecomment-354711521).

  1. This time, we port the stats first, so that they will be deleted for the public phones, and we can delete the stats collections when we are done.
$ ./e-mission-py.bash bin/historical/migrations/stats_from_db_to_ts.py
  1. Convert all ms to secs before validation.
for e in edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}):
         edb.get_timeseries_db().update({"_id": e["_id"]},
                 {"$set": {"data.ts": float(e["data"]["ts"])/1000,
                   "metadata.write_ts": float(e["metadata"]["write_ts"])/1000}})   
  1. validate

    1. messed up entries have been fixed
    In [1039]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}).count()
    Out[1039]: 0
    
    1. Correlation between stats db and timeseries db exists and timestamp is valid
     In [1040]: entry_1 = edb.get_client_stats_db_backup().find_one()
    
     In [1041]: edb.get_timeseries_db().find_one({"metadata.key": "background/battery", "data.ts": float(entry_1["ts"])/1000})
     Out[1041]:
     {u'_id': ObjectId('5a4c7b5c88f663668630d290'),
     u'data': {u'battery_level_pct': 4.0, u'ts': 1413254944.995},
     u'metadata': {u'key': u'background/battery',
     u'platform': u'server',
     u'time_zone': u'America/Los_Angeles',
     u'write_fmt_time': u'2014-10-13T19:49:42.201671-07:00',
     u'write_ts': 1413254982.201671},
     u'user_id': UUID('f8fee20c-0f32-359d-ba75-bce97a7ac83b')}
    
    In [1044]: arrow.get(1413254944.995)
    Out[1044]: <Arrow [2014-10-14T02:49:04.995000+00:00]>
    
    
    1. Sorting in descending order works and timestamps are valid
    In [1042]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", -1).limit(1))
    Out[1042]:
    [{u'_id': ObjectId('59a0c57dcb17471ac08bc0c6'),
      u'data': {u'client_app_version': u'2.3.0',
       u'client_os_version': u'7.0',
       u'name': u'sync_duration',
       u'reading': 6.459,
       u'ts': 1503704747.497},
      u'metadata': {u'key': u'stats/client_time',
       u'platform': u'android',
       u'read_ts': 0,
       u'time_zone': u'America/Los_Angeles',
       u'type': u'message',
       u'write_fmt_time': u'2017-08-25T16:45:47.499000-07:00',
       u'write_ts': 1503704747.499},
      u'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]
    
    In [1045]: arrow.get(1503704747.497)
    Out[1045]: <Arrow [2017-08-25T23:45:47.497000+00:00]>
    
    In [1047]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", -1).limit(1))
    Out[1047]:
    [{u'_id': ObjectId('59efe782cb17471ac0cecb5a'),
      u'data': {u'client_app_version': u'2.4.0',
       u'client_os_version': u'10.3.3',
       u'name': u'sync_launched',
       u'reading': -1,
       u'ts': 1508894590.151824},
      u'metadata': {u'key': u'stats/client_nav_event',
       u'platform': u'ios',
       u'plugin': u'none',
       u'read_ts': 0,
       u'time_zone': u'America/Los_Angeles',
       u'type': u'message',
       u'write_fmt_time': u'2017-10-24T18:23:10.152289-07:00',
       u'write_local_dt': {u'day': 24,
        u'hour': 18,
        u'minute': 23,
        u'month': 10,
        u'second': 10,
        u'timezone': u'America/Los_Angeles',
        u'weekday': 1,
        u'year': 2017},
       u'write_ts': 1508894590.152289},
      u'user_id': UUID('7161343e-551e-4213-be75-3b82e1ce2448')}]
    
    In [1048]: arrow.get(1508894590.151824)
    Out[1048]: <Arrow [2017-10-25T01:23:10.151824+00:00]>
    
    
    1. Sorting in ascending order works but one of the timestamps is weird.
    In [1043]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", 1).limit(1))
    Out[1043]:
    [{u'_id': ObjectId('5a4c829288f663668638cb87'),
      u'data': {u'client_app_version': u'1.0.0',
       u'client_os_version': u'4.4.2',
       u'name': u'sync_duration',
       u'reading': 5638.0,
       u'ts': 1474414807.961},
      u'metadata': {u'key': u'stats/client_time',
       u'platform': u'server',
       u'time_zone': u'America/Los_Angeles',
       u'write_fmt_time': u'2016-09-20T18:40:45.947868-07:00',
       u'write_local_dt': {u'day': 20,
        u'hour': 18,
        u'minute': 40,
        u'month': 9,
        u'second': 45,
        u'timezone': u'America/Los_Angeles',
        u'weekday': 1,
        u'year': 2016},
       u'write_ts': 1474422045.947868},
      u'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]
    
    In [1046]: arrow.get(1474414807.961)
    Out[1046]: <Arrow [2016-09-20T23:40:07.961000+00:00]>
    
    In [1049]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", 1).limit(1))
    Out[1049]:
    [{u'_id': ObjectId('5a4c7e6b88f663668633e5dd'),
      u'data': {u'client_app_version': u'2.1',
       u'client_os_version': u'4.4.2',
       u'name': u'confirmlist_auth_not_done',
       u'reading': None,
       u'ts': 315965026.452},
      u'metadata': {u'key': u'stats/client_nav_event',
       u'platform': u'server',
       u'time_zone': u'America/Los_Angeles',
       u'write_fmt_time': u'2015-06-03T09:57:27.061417-07:00',
       u'write_local_dt': {u'day': 3,
        u'hour': 9,
        u'minute': 57,
        u'month': 6,
        u'second': 27,
        u'timezone': u'America/Los_Angeles',
        u'weekday': 2,
        u'year': 2015},
       u'write_ts': 1433350647.061417},
      u'user_id': UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')}]
    
    In [1050]: arrow.get(315965026.452)
    Out[1050]: <Arrow [1980-01-06T00:03:46.452000+00:00]>
    
    1. what is going on with this?

    This is not just an invalid conversion, though, because trying to convert it
    back to seconds does not work.

     ```
     In [1051]: arrow.get(315965026452)
     ValueError: year is out of range
     ```
    

    There are 13 such entries and they are all from user UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')

    In [1067]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$lt": ts_2000.timestamp}}).count()
    Out[1067]: 13
    
    In [1070]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$lt": ts_2000.timestamp}}).distinct("user_id")
    Out[1070]: [UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')]
    

    There are 151 entries for this user in the client stats DB

    In [1071]: edb.get_client_stats_db_backup().find({"user": UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')}).count()
    Out[1071]: 151
    

    Let's try to match based on the reported_ts. Bingo! The entry does indeed have an invalid ts.

    In [1073]: edb.get_client_stats_db_backup().find_one({"user": UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee'), "reported_ts":  1433350647.061417})
    Out[1073]:
    {u'_id': ObjectId('556f31f788f6636f49a1b05a'),
     u'client_app_version': u'2.1',
     u'client_os_version': u'4.4.2',
     u'reading': u'0.0',
     u'reported_ts': 1433350647.061417,
     u'stat': u'battery_level',
     u'ts': u'315965055221',
     u'user': UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')}
    
    In [1075]: arrow.get(315965055221)
    ValueError: year is out of range
    

    There are a bunch of other entries with the same user and reported_ts, but fewer than the entries reported in the timeseries.

    In [1074]: edb.get_client_stats_db_backup().find({"user": UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee'), "reported_ts":  1433350647.061417}).count()
    Out[1074]: 37
    

    I bet the others are battery_level, similar to the above.

    In [1076]: edb.get_timeseries_db().find({"data.ts": {"$lt": ts_2000.timestamp}}).count()
    Out[1076]: 762
    
    In [1077]: edb.get_timeseries_db().find({"data.ts": {"$lt": ts_2000.timestamp}}).distinct("user_id")
    Out[1077]:
    [UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee'),
     UUID('99b29da2-989d-4e54-a211-284bde1d362d'),
     UUID('693a42f1-6f00-497e-8b40-a8339fd5af8d'),
     UUID('3d37573b-2e74-496d-a09d-a0a3f05c2467'),
     UUID('ee6dadba-53b6-421f-94d1-27bc96e023cf'),
     UUID('0109c47b-e640-411e-8d19-e481c52d7130'),
     UUID('6560950c-4ddb-41fc-8801-b9197d30f54d'),
     UUID('b82804b8-4e49-43a0-99d1-9d1da20ec1d3'),
     UUID('9d906275-8072-42d4-8dd2-3670e63e0f6e'),
     UUID('6af5afdf-b1d9-4ea7-9f10-2bddb8a0ecb3'),
     UUID('6a415e67-9025-4f29-b520-f0c5a43c8bb6'),
     UUID('a61349c6-0cc9-4902-9f13-d4236a630ad5'),
     UUID('cfbf03dc-6e3e-40bd-90de-d19d14613e47'),
     UUID('be47f46a-ce3a-4ad8-b81f-d3daa7955e95'),
     UUID('5109e62d-2152-481b-8d26-2cb8d8cc1f23'),
     UUID('f14272fe-1433-4430-b1ec-3f37dfdde5bf'),
     UUID('abf4ed3a-a018-4f40-90c5-39b592b8569b'),
     UUID('de23cac9-1996-4af5-8554-4f6d017b3459'),
     UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'),
     UUID('08b31565-f990-4d15-a4a7-89b3ba6b1340'),
     UUID('5a6a2711-c574-42f0-9940-ea1fd0cc2f09'),
     UUID('29277dd4-dc78-40c0-806c-f88a4f902436'),
     UUID('6ed1b36d-08a9-403d-b247-e426228c0492'),
     UUID('d2b923b9-68b9-4e88-9b8c-29416694efb1'),
     UUID('e82b1c5a-7c07-46b7-afd7-b53ac9db1f42'),
     UUID('dcdb5f74-071a-4e5b-a954-e613c5b46e5d'),
     UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0'),
     UUID('cd6482fe-56a2-4bf8-b8a8-d74f6e3c22c8'),
     UUID('3ca88f7c-fb1a-467e-9e29-99909d92c904')]
    
    In [1078]: edb.get_timeseries_db().find({"data.ts": {"$lt": ts_2000.timestamp}}).distinct("metadata.key")
    Out[1078]:
    [u'stats/client_time',
     u'background/battery',
     u'stats/client_nav_event',
     u'statemachine/transition',
     u'background/location',
     u'background/filtered_location']
    
    In [1080]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["statemachine/transition", "background/location", "background/filtered_location"]}, "data.ts": {"$lt": ts_2000.timestamp}}).count()
    Out[1080]: 744
    
    In [1082]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["statemachine/transition", "background/location", "background/filtered_location"]}, "data.ts": {"$lt": ts_2000.timestamp}}).distinct("metadata.platform")
    Out[1082]: [u'android']
    

    Actually, no. There are a lot more, including a lot of non-stat entries, and
    they are all on android. Going to verify some of this manually and then move on.
    So they are an interesting mix of some kind of weird timestamp that is broken
    for both ts and write_ts and entries where the location apparently had ts = 0.

    >>> list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["statemachine/transition", "background/location", "background/filtered_location"]}, "data.ts": {"$lt": ts_2000.timestamp}}, {"user_id":1, "data.ts": 1, "data.fmt_time": 1, "metadata.write_ts": 1, "metadata.write_fmt_time": 1}).limit(10))
    
    [{u'_id': ObjectId('56c495afeaedff78c762a711'),
      u'data': {u'fmt_time': u'1970-03-12T00:27:13+08:00', u'ts': 6020833},
      u'metadata': {u'write_fmt_time': u'1970-03-12T00:27:13+08:00',
       u'write_ts': 6020833},
      u'user_id': UUID('99b29da2-989d-4e54-a211-284bde1d362d')},
     {u'_id': ObjectId('56c495afeaedff78c762a710'),
      u'data': {u'fmt_time': u'1970-03-11T14:25:57+08:00', u'ts': 5984757},
      u'metadata': {u'write_fmt_time': u'1970-03-11T14:25:57+08:00',
       u'write_ts': 5984757},
      u'user_id': UUID('99b29da2-989d-4e54-a211-284bde1d362d')},
     {u'_id': ObjectId('575a6423383999ecb7a5e183'),
      u'data': {u'fmt_time': u'1969-12-31T16:00:00-08:00', u'ts': 0},
      u'metadata': {u'write_fmt_time': u'2016-06-09T23:43:11.391000-07:00',
       u'write_ts': 1465540991.391},
      u'user_id': UUID('693a42f1-6f00-497e-8b40-a8339fd5af8d')},
     {u'_id': ObjectId('5760c4dc383999ecb7a98155'),
      u'data': {u'fmt_time': u'1969-12-31T16:00:00-08:00', u'ts': 0},
      u'metadata': {u'write_fmt_time': u'2016-06-14T16:23:03.521000-07:00',
       u'write_ts': 1465946583.521},
    

    At some point, I should go through and throw out all this data. But it is a
    small amount of data and can wait.

    And to complete the exploration, all the broken stats are from the same user.

    In [1089]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error", "background/battery"]}, "data.ts": {"$lt": ts_2000.timestamp}}).distinct("user_id")
    Out[1089]: [UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')]
    
  2. Next, remove all test phone data, including pipeline states. Do NOT remove USP open data because there is data outside the open range too, at least for one user (me!)

    1. State before
    In [1053]: edb.get_uuid_db().find().count()
    Out[1053]: 516
    
    In [1054]: edb.get_timeseries_db().find().count()
    Out[1054]: 54818879
    
    In [1055]: edb.get_analysis_timeseries_db().find().count()
    Out[1055]: 16573823
    
    In [1056]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
    Out[1056]: 487
    
    1. Running the script!
    $ ./e-mission-py.bash bin/debug/purge_multi_timeline_for_range.py --pipeline-purge /tmp/public_data/dump_
    INFO:root:Loading file or prefix /tmp/public_data/dump_
    INFO:root:Found 12 matching files for prefix /tmp/public_data/dump_
    INFO:root:files are ['/tmp/public_data/dump_fd7b4c2e-2c8b-3bfa-94f0-d1e3ecbd5fb7.gz', '/tmp/public_data/dump_6561431f-d4c1-4e0f-9489-ab1190341fb7.gz', '/tmp/public_data/dump_079e0f1a-c440-3d7c-b0e7-de160f748e35.gz', '/tmp/public_data/dump_92cf5840-af59-400c-ab72-bab3dcdf7818.gz', '/tmp/public_data/dump_3bc0f91f-7660-34a2-b005-5c399598a369.gz'] ... ['/tmp/public_data/dump_95e70727-a04e-3e33-b7fe-34ab19194f8b.gz', '/tmp/public_data/dump_c528bcd2-a88b-3e82-be62-ef4f2396967a.gz', '/tmp/public_data/dump_70968068-dba5-406c-8e26-09b548da0e4b.gz', '/tmp/public_data/dump_93e8a1cc-321f-4fa9-8c3c-46928668e45d.gz']
    INFO:root:==================================================
    INFO:root:Deleting data from file /tmp/public_data/dump_fd7b4c2e-2c8b-3bfa-94f0-d1e3ecbd5fb7.gz
    ...
    INFO:root:For uuid = e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the timeseries
    INFO:root:result = {u'ok': 1, u'n': 3923919}
    INFO:root:For uuid = e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the analysis_timeseries
    INFO:root:result = {u'ok': 1, u'n': 39361}
    INFO:root:For uuid e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the user_db
    INFO:root:result = {u'ok': 1, u'n': 1}
    INFO:root:For uuid e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the pipeline_state_db
    INFO:root:result = {u'ok': 1, u'n': 12}
    
    1. State after
    In [1057]: edb.get_uuid_db().find().count()
    Out[1057]: 504
    
    In [1058]: edb.get_timeseries_db().find().count()
    Out[1058]: 39797029
    
    In [1059]: edb.get_analysis_timeseries_db().find().count()
    Out[1059]: 16190498
    
    In [1060]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
    Out[1060]: 475
    
  3. Next, remove all config documents from the usercache

    In [1061]: edb.get_usercache_db().find({"metadata.type": "document", "metadata.key": {"$in": ['config/consent', 'config/sensor_config', 'config/sync_config']}}).count()
    Out[1061]: 239
    
    In [1062]: edb.get_usercache_db().remove({"metadata.type": "document", "metadata.key": {"$in": ['config/consent', 'config/sensor_config', 'config/sync_config']}})
    Out[1062]: {u'n': 239, u'ok': 1}
    
  4. Next, re-remove unused collections. This time, since we have migrated all stats, we can remove those databases as well.

    In [1090]: edb.get_alternatives_db().remove()
    Out[1090]: {u'n': 101146, u'ok': 1}
    
    In [1091]: edb.get_client_db().remove()
    Out[1091]: {u'n': 3, u'ok': 1}
    
    In [1092]: edb.get_common_place_db().remove()
    Out[1092]: {u'n': 0, u'ok': 1}
    
    In [1093]: edb.get_common_trip_db().remove()
    Out[1093]: {u'n': 0, u'ok': 1}
    
    In [1094]: edb.get_pending_signup_db().remove()
    Out[1094]: {u'n': 25, u'ok': 1}
    
    In [1095]: edb._get_current_db().Stage_place.remove()
    Out[1095]: {u'n': 0, u'ok': 1}
    
    In [1096]: edb.get_routeCluster_db().remove()
    Out[1096]: {u'n': 90, u'ok': 1}
    
    In [1097]: edb._get_current_db().Stage_routeDistanceMatrix.remove()
    Out[1097]: {u'n': 7, u'ok': 1}
    
    In [1098]: edb._get_current_db().Stage_section_new.remove()
    Out[1098]: {u'n': 0, u'ok': 1}
    
    In [1099]: edb._get_current_db().Stage_stop.remove()
    Out[1099]: {u'n': 0, u'ok': 1}
    
    In [1100]: edb._get_current_db().Stage_trip_new.remove()
    Out[1100]: {u'n': 0, u'ok': 1}
    
    In [1101]: edb._get_current_db().Stage_user_moves_access.remove()
    Out[1101]: {u'n': 118, u'ok': 1}
    
    In [1102]: edb._get_current_db().Stage_utility_models.remove()
    Out[1102]: {u'n': 36, u'ok': 1}
    
    In [1103]: edb._get_current_db().Stage_Worktime.remove()
    Out[1103]: {u'n': 2662, u'ok': 1}
    
    In [1104]: edb.get_client_stats_db_backup().remove()
    Out[1104]: {u'n': 650961, u'ok': 1}
    
    In [1105]: edb.get_server_stats_db_backup().remove()
    Out[1105]: {u'n': 449523, u'ok': 1}
    
  5. Ok! I think we are done! There's plenty of room on the transfer disk, so
    let's just create a new dump and keep the old dump as backup.

    /dev/xvdg         296G   53G  228G  19% /transfer
    
    $ mongodump --out /transfer/cleanedup-jan-3
    2018-01-03T18:44:29.016+0000    Test_database.Test_Set to /transfer/cleanedup-jan-3/Test_database/Test_Set.bson
    2018-01-03T18:44:29.018+0000             1 documents
    2018-01-03T18:44:29.019+0000    Metadata for Test_database.Test_Set to /transfer/cleanedup-jan-3/Test_database/Test_Set.metadata.json
    2018-01-03T18:44:29.019+0000 DATABASE: admin     to     /transfer/cleanedup-jan-3/admin
    

Dump is done!

Remaining steps:

  • attach volume to new server
  • mongorestore
  • re-run analysis pipeline
  • DONE!!!

@shankari

shankari commented Jan 4, 2018

Re-attach the volume to the new stack

  1. Unmount from current stack

    $ sudo umount /transfer
    
  2. Snapshot

  3. Create new volume in the correct region

  4. Attach new volume to the database from the new stack

  5. Mount the new volume

    $ sudo mkdir -p /transfer
    $ sudo chown ec2-user:ec2-user /transfer/
    $ sudo mount /dev/xvdi /transfer/
    $ ls /transfer
    cleanedup-jan-3  lost+found  odc-usp-2017  original-jan-1  public_phone_stats
    
  6. Restore

  7. Validate

    1. On old server

      In [521]: edb.get_uuid_db().find().count()
      Out[521]: 504
      
      In [522]: edb.get_timeseries_db().find().count()
      Out[522]: 39797029
      
      In [523]: edb.get_analysis_timeseries_db().find().count()
      Out[523]: 16190498
      
      In [524]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
      Out[524]: 475
      
      In [525]: edb.get_usercache_db().find().count()
      Out[525]: 10011104
      
    2. On new server

      In [2]: edb.get_uuid_db().find().count()
      Out[2]: 504
      
      In [3]: edb.get_timeseries_db().find().count()
      Out[3]: 39797135
      
      In [4]: edb.get_analysis_timeseries_db().find().count()
      Out[4]: 16190498
      
      In [5]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
      Out[5]: 475
      
      In [6]: edb.get_usercache_db().find().count()
      Out[6]: 10011104
      
    3. Check our favourite users

      1. Me

        In [7]: edb.get_uuid_db().find_one({"user_email": "[email protected]"})
        Out[7]:
        {'_id': ObjectId('54a6bdfd39e59673fd9fba5b'),
         'update_ts': datetime.datetime(2017, 8, 20, 2, 29, 50, 275000),
         'user_email': '<shankari_email>',
         'uuid': UUID('<shankari_uuid>')}
        
        In [10]: edb.get_uuid_db().find({"user_id": UUID('<shankari_uuid>')}).count()
        Out[10]: 0
        
        In [11]: edb.get_timeseries_db().find({"user_id": UUID('<shankari_uuid>')}).count()
        Out[11]: 735156
        
        In [12]: edb.get_analysis_timeseries_db().find({"user_id": UUID('<shankari_uuid>')}).count()
        Out[12]: 166571
        
        In [23]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
        Out[23]:
        [{'_id': ObjectId('5614ee7d88f663584fa03131'),
          'data': {'_id': ObjectId('5614ee7d88f663584fa03131'),
           'exit_fmt_time': '2015-08-21T18:06:16.905000-07:00',
           'exit_ts': 1440205576.905,
           'location': {'coordinates': [-122.4426899, 37.7280596], 'type': 'Point'},
           'starting_trip': ObjectId('5614ee7d88f663584fa03132'),
           'user_id': UUID('0763de67-f61e-3f5d-90e7-518e69793954')},
          'metadata': {'key': 'segmentation/raw_place',
           'platform': 'server',
           'time_zone': 'America/Los_Angeles',
           'write_fmt_time': '2016-04-25T06:32:01.332099-07:00',
           'write_ts': 1461591121.332099},
          'user_id': UUID('<shankari_uuid>')}]
        
        In [24]: list(edb.get_analysis_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
        Out[24]:
        [{'_id': ObjectId('57e962fa88f66347503059e7'),
          'data': {'exit_fmt_time': '2015-07-13T15:25:56.852000-07:00',
           'exit_ts': 1436826356.852,
           'location': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
           'source': 'DwellSegmentationTimeFilter',
           'starting_trip': ObjectId('57e962fa88f66347503059e8')},
          'metadata': {'key': 'segmentation/raw_place',
           'platform': 'server',
           'time_zone': 'America/Los_Angeles',
           'write_fmt_time': '2016-09-26T11:03:38.105793-07:00',
           'write_ts': 1474913018.105793},
          'user_id': UUID('<shankari_uuid>')}]
        
        In [25]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", -1).limit(1))
        Out[25]:
        [{'_id': ObjectId('581becd188f6630386d15ac5'),
          'data': {'battery_level_pct': 98.0, 'ts': 1478225037729.0},
          'metadata': {'key': 'background/battery',
           'platform': 'server',
           'time_zone': 'America/Los_Angeles',
           'write_ts': 1478225037729.0},
          'user_id': UUID('<shankari_uuid>')}]
        
        In [26]: list(edb.get_analysis_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", -1).limit(1))
        Out[26]:
        [{'_id': ObjectId('59eed4b188f66334694bfcb2'),
          'data': {'altitude': 1.0,
           'distance': 2.8965257248272045,
           'fmt_time': '2017-10-23T17:05:41-07:00',
           'heading': 31.494464698609224,
           'idx': 27,
           'latitude': 37.3909994,
           'loc': {'coordinates': [-122.0864596, 37.3909994], 'type': 'Point'},
           'longitude': -122.0864596,
           'mode': 0,
           'section': ObjectId('59eed4b188f66334694bfc96'),
           'speed': 0.12344985580418566,
           'ts': 1508803541.0},
          'metadata': {'key': 'analysis/recreated_location',
           'platform': 'server',
           'time_zone': 'America/Los_Angeles',
           'write_fmt_time': '2017-10-23T22:50:41.916285-07:00',
           'write_ts': 1508824241.916285},
          'user_id': UUID('<shankari_uuid>')}]
        
    4. Tom

       ```
       In [16]: tom_entry = edb.get_uuid_db().find_one({"user_email": "[email protected]"})
      
       In [17]: tom_entry
       Out[17]:
       {'_id': ObjectId('543c8a2239e59673fd9fb9dc'),
        'update_ts': datetime.datetime(2017, 5, 6, 21, 29, 3, 780000),
        'user_email': '<tom_email>',
        'uuid': UUID('<tom_uuid>')}
      
       In [18]: edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).count()
       Out[18]: 592479
      
       In [19]: edb.get_analysis_timeseries_db().find({"user_id": tom_entry["uuid"]}).count()
       Out[19]: 139926
      
       In [27]: list(edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", 1).limit(1))
       Out[27]:
       [{'_id': ObjectId('564ed10488f66311474836bd'),
         'data': {'_id': ObjectId('564ed10488f66311474836bd'),
          'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
          'exit_ts': 1437465390.414,
          'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},
          'starting_trip': ObjectId('564ed10488f66311474836be'),
          'user_id': UUID('b0d937d0-70ef-305e-9563-440369012b39')},
         'metadata': {'key': 'segmentation/raw_place',
          'platform': 'server',
          'time_zone': 'America/Los_Angeles',
          'write_fmt_time': '2016-04-25T06:32:01.391767-07:00',
          'write_ts': 1461591121.391767},
         'user_id': UUID('<tom_uuid>')}]
      
       In [41]: list(edb.get_analysis_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", 1).limit(1))
       Out[41]:
       [{'_id': ObjectId('57e9906b88f66347503240e7'),
         'data': {'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
          'exit_ts': 1437465390.414,
          'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},
          'source': 'DwellSegmentationTimeFilter',
          'starting_trip': ObjectId('57e9906b88f66347503240e8')},
         'metadata': {'key': 'segmentation/raw_place',
          'platform': 'server',
          'time_zone': 'America/Los_Angeles',
          'write_fmt_time': '2016-09-26T14:17:31.475597-07:00',
          'write_ts': 1474924651.475597},
         'user_id': UUID('<tom_uuid>')}]
      
      
       In [29]: list(edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", -1).limit(1))
       Out[29]:
       [{'_id': ObjectId('581b4c5588f6630386d0ea56'),
         'data': {'battery_level_pct': 85.0, 'ts': 1478184016437.0},
         'metadata': {'key': 'background/battery',
          'platform': 'server',
          'time_zone': 'America/Los_Angeles',
          'write_ts': 1478184016437.0},
         'user_id': UUID('<tom_uuid>')}]
      
       In [30]: list(edb.get_analysis_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", -1).limit(1))
       Out[30]:
       [{'_id': ObjectId('59a23ae188f6632233d07f2f'),
         'data': {'altitude': 0.0,
          'distance': 1.0747850121465141,
          'fmt_time': '2017-08-25T16:21:37.192000-07:00',
          'heading': 134.91688221066724,
          'idx': 81,
          'latitude': 37.3910415,
          'loc': {'coordinates': [-122.0864408, 37.3910415], 'type': 'Point'},
          'longitude': -122.0864408,
          'mode': 4,
          'section': ObjectId('59a23ae188f6632233d07edd'),
          'speed': 0.10598412720427686,
          'ts': 1503703297.192},
         'metadata': {'key': 'analysis/recreated_location',
          'platform': 'server',
          'time_zone': 'America/Los_Angeles',
          'write_fmt_time': '2017-08-26T20:22:09.988126-07:00',
          'write_ts': 1503804129.988126},
         'user_id': UUID('<tom_uuid>')}]
       ```
      

I can see a couple of things that we should clean up: one that is easy, and another that should be done later.

  1. First, we need to adjust the timestamps on the stats objects again. We fixed all the client entries but not the battery entries.

    In [33]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["background/battery"]},"data.ts": {"$gt": now.timestamp}}).count()
    Out[33]: 625387
    
    for e in edb.get_timeseries_db().find({"metadata.key": {"$in": ["background/battery"]}, "data.ts": {"$gt": now.timestamp}}):
             edb.get_timeseries_db().update({"_id": e["_id"]},
                     {"$set": {"data.ts": float(e["data"]["ts"])/1000,
                       "metadata.write_ts": float(e["metadata"]["write_ts"])/1000}})
    
    In [35]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["background/battery"]},"data.ts": {"$gt": now.timestamp}}).count()
    Out[35]: 0
    
  2. Now the most recent entries in the timeseries should be fixed.

    In [36]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", -1).limit(1))
    Out[36]:
    [{'_id': ObjectId('5a4aaf3088f6636e03141d7a'),
      'data': {'name': 'POST_/usercache/get',
       'reading': 0.45238590240478516,
       'ts': 1514843952.94657},
      'metadata': {'key': 'stats/server_api_time',
       'platform': 'server',
       'time_zone': 'America/Los_Angeles',
    'write_fmt_time': '2018-01-01T13:59:12.947125-08:00',
    'write_ts': 1514843952.947125},
    'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]
    
    In [37]: list(edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", -1).limit(1))
    Out[37]:
    [{'_id': ObjectId('5a499a4688f6636e031416a9'),
      'data': {'name': 'POST_/datastreams/find_entries/timestamp',
       'reading': 0.06373810768127441,
       'ts': 1514773062.554465},
      'metadata': {'key': 'stats/server_api_time',
       'platform': 'server',
       'time_zone': 'America/Los_Angeles',
       'write_fmt_time': '2017-12-31T18:17:42.554857-08:00',
       'write_ts': 1514773062.554857},
    'user_id': UUID('<tom_uuid>')}]
    
  3. Second, we need to move all the segmentation/raw entries from the timeseries
    to the analysis_timeseries. They are not doing anything bad there - we assume
    trips and sections are only in the analysis database, and read them only from
    there.

    "segmentation/raw_trip": self.analysis_timeseries_db,
    

However, it seems like a bad idea to have this stale data sitting around. Are
these entries actually duplicated in the analysis database? If so, can we just
delete them from the timeseries_db?

  1. Are they duplicated? For Tom, yes; for me, pretty close.

    • For me: the times are different, although only ~ 1 month apart. The locations are different.
    [{'_id': ObjectId('5614ee7d88f663584fa03131'),
      'data': {'_id': ObjectId('5614ee7d88f663584fa03131'),
       'exit_fmt_time': '2015-08-21T18:06:16.905000-07:00',
       'location': {'coordinates': [-122.4426899, 37.7280596], 'type': 'Point'},
    
    [{'_id': ObjectId('57e962fa88f66347503059e7'),
      'data': {'exit_fmt_time': '2015-07-13T15:25:56.852000-07:00',
       'location': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
    
    • For Tom: only the _ids are different. Everything else is identical.
    [{'_id': ObjectId('564ed10488f66311474836bd'),
      'data': {'_id': ObjectId('564ed10488f66311474836bd'),
       'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
       'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},
    
    [{'_id': ObjectId('57e9906b88f66347503240e7'),
      'data': {'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
       'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},
    
  2. How many of them are there?

    In [50]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).count()
    Out[50]: 29023
    
  3. Is there really overlap?

    • First timeseries entry

      In [66]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", 1), ("data.exit_ts", 1), ("data.start_ts", 1), ("data.end_ts", 1)]).limit(1))
      Out[66]:
      [{'_id': ObjectId('5674839788f66340b2fb12b9'),
        'data': {'_id': ObjectId('5674839788f66340b2fb12b9'),
         'duration': 269.63700008392334,
         'end_fmt_time': '2015-07-13T15:30:26.489000-07:00',
         'end_loc': {'coordinates': [-122.0824345, 37.3790636], 'type': 'Point'},
         'end_stop': ObjectId('5674839788f66340b2fb12bb'),
         'end_ts': 1436826626.489,
         'sensed_mode': 0,
         'source': 'SmoothedHighConfidenceMotion',
         'start_fmt_time': '2015-07-13T15:25:56.852000-07:00',
         'start_loc': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
         'start_ts': 1436826356.852,
         'trip_id': ObjectId('5674838188f66340b2fb0c9c'),
         'user_id': UUID('0763de67-f61e-3f5d-90e7-518e69793954')},
        'metadata': {'key': 'segmentation/raw_section',
         'platform': 'server',
         'time_zone': 'America/Los_Angeles',
         'write_fmt_time': '2016-04-25T06:34:22.352027-07:00',
         'write_ts': 1461591262.352027},
        'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]
      
    • Last timeseries entry

      In [65]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", -1), ("data.exit_ts", -1), ("data.start_ts", -1), ("data.end_ts", -1)]).limit(1))
      Out[65]:
      [{'_id': ObjectId('571dfacb88f66333657dafb5'),
        'data': {'_id': ObjectId('571dfacb88f66333657dafb5'),
         'ending_trip': ObjectId('571dfacb88f66333657dafb4'),
         'enter_fmt_time': '2016-04-25T01:54:55.062190-07:00',
         'enter_ts': 1461574495.06219,
         'location': {'coordinates': [-122.2528321669644, 37.86827700681786],
          'type': 'Point'},
         'source': 'DwellSegmentationDistFilter',
         'user_id': UUID('788f46af-9e6d-300b-93e1-981ba9b3390b')},
        'metadata': {'key': 'segmentation/raw_place',
         'platform': 'server',
         'time_zone': 'America/Los_Angeles',
         'write_fmt_time': '2016-04-25T06:32:12.557781-07:00',
         'write_ts': 1461591132.557781},
        'user_id': UUID('43f9361e-1cb6-4026-99ba-458be357d245')}]
      
      
    • First analysis entry

      In [63]: list(edb.get_analysis_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", 1), ("data.exit_ts", 1), ("data.start_ts", 1), ("data.end_ts", 1)]).limit(1))
      Out[63]:
      [{'_id': ObjectId('57e9648a88f6634750306dc5'),
        'data': {'duration': 269.63700008392334,
         'end_fmt_time': '2015-07-13T15:30:26.489000-07:00',
         'end_loc': {'coordinates': [-122.0824345, 37.3790636], 'type': 'Point'},
         'end_stop': ObjectId('57e9648a88f6634750306dc7'),
         'end_ts': 1436826626.489,
         'sensed_mode': 0,
         'source': 'SmoothedHighConfidenceMotion',
         'start_fmt_time': '2015-07-13T15:25:56.852000-07:00',
         'start_loc': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
         'start_ts': 1436826356.852,
         'trip_id': ObjectId('57e962fa88f66347503059e8')},
        'metadata': {'key': 'segmentation/raw_section',
         'platform': 'server',
         'time_zone': 'America/Los_Angeles',
         'write_fmt_time': '2016-09-26T11:10:18.444634-07:00',
         'write_ts': 1474913418.444634},
        'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]
      
    • Last analysis entry

      In [64]: list(edb.get_analysis_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", -1), ("data.exit_ts", -1), ("data.start_ts", -1), ("data.end_ts", -1)]).limit(1))
      Out[64]:
      [{'_id': ObjectId('59db353088f6636450bb268e'),
        'data': {'duration': -1490206.003000021,
         'ending_trip': ObjectId('59db353088f6636450bb268d'),
         'enter_fmt_time': '2017-10-25T22:36:49.223000-07:00',
         'enter_ts': 1508996209.223,
         'exit_fmt_time': '2017-10-08T16:40:03.220000-07:00',
         'exit_ts': 1507506003.22,
         'location': {'coordinates': [-122.2579113, 37.873973], 'type': 'Point'},
         'source': 'DwellSegmentationTimeFilter',
         'starting_trip': ObjectId('59db353088f6636450bb268f')},
        'metadata': {'key': 'segmentation/raw_place',
         'platform': 'server',
         'time_zone': 'America/Los_Angeles',
         'write_fmt_time': '2017-10-09T01:37:04.036013-07:00',
         'write_ts': 1507538224.036013},
        'user_id': UUID('06f82876-4090-482f-a7be-91345df47bb2')}]
      

So it looks like we ran the pipeline in April 2016, back when we were still
storing analysis results in the timeseries. Then we split the collections and
re-ran the pipeline, but did not delete the old entries. So the timeseries
entries span 2015-07-13T15:25:56.852000-07:00 to 2016-04-25T01:54:55.062190-07:00,
while the analysis timeseries entries span
2015-07-13T15:25:56.852000-07:00 to 2017-10-08T16:40:03.220000-07:00.
There is a clear overlap, so we can delete the entries from the timeseries.

Delete now or delete later?

Let's just delete now, while we still have backups sitting around.

In [67]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).count()
Out[67]: 29023

In [68]:  edb.get_timeseries_db().find({"metadata.key": {"$in": ["analysis/cleaned_place", "analysis/cleaned_trip", "analysis/cleaned_section", "analysis/cleaned_stop", "analysis/cleaned_untracked"]}}).count()
Out[68]: 0

In [69]:  edb.get_analysis_timeseries_db().find({"metadata.key": {"$in": ["analysis/cleaned_place", "analysis/cleaned_trip", "analysis/cleaned_section", "analysis/cleaned_stop", "analysis/cleaned_untracked"]}}).count()
Out[69]: 505455

It looks like the first run predates cleaned trips, so we only have to delete the raw_* entries.

In [76]: edb.get_timeseries_db().delete_many({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).raw_result
Out[76]: {'n': 29023, 'ok': 1.0}

In [77]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).count()
Out[77]: 0

Ok, so now the oldest entries in the timeseries should be different.

In [78]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
Out[78]:
[{'_id': ObjectId('564f7d6388f66343e476e832'),
  'data': {'deleted_points': [],
   'filtering_algo': 'SmoothZigzag',
   'outlier_algo': 'BoxplotOutlier',
   'section': ObjectId('564f7d3888f66343e476e518')},
  'metadata': {'key': 'analysis/smoothing',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2015-11-20T12:06:59.574516-08:00',
   'write_local_dt': datetime.datetime(2015, 11, 20, 20, 6, 59, 574000),
   'write_ts': 1448050019.574516},
  'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

Oops. The oldest entry is still a generated result, this time analysis/smoothing. Let's query and delete these as well.

In [79]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["analysis/smoothing"]}}).count()
Out[79]: 9033

In [82]: edb.get_timeseries_db().delete_many({"metadata.key": {"$in": ["analysis/smoothing"]}}).raw_result
Out[82]: {'n': 9033, 'ok': 1.0}

In [83]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["analysis/smoothing"]}}).count()
Out[83]: 0

Ok, so now the oldest entries in the timeseries should be different. Argh, we missed segmentation/raw_trip in the first query.

In [85]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_trip"]}}).count()
Out[85]: 7815

In [86]: edb.get_timeseries_db().delete_many({"metadata.key": {"$in": ["segmentation/raw_trip"]}}).raw_result
Out[86]: {'n': 7815, 'ok': 1.0}

In [87]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_trip"]}}).count()
Out[87]: 0
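
Rather than discovering leftover pipeline outputs one key at a time, a quicker one-off check is to list the distinct metadata keys still in the raw timeseries and flag anything with a pipeline-produced prefix. A minimal sketch (not code that exists in the repo; distinct is acceptable for a one-off admin check):

```
# List metadata keys still present in the raw timeseries and flag any that look
# like pipeline outputs (segmentation/* or analysis/*).
remaining_keys = edb.get_timeseries_db().distinct("metadata.key")
leftover = sorted(k for k in remaining_keys
                  if k.startswith("segmentation/") or k.startswith("analysis/"))
print(leftover)  # should be [] once the cleanup is complete
```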

Ok, so do we now have the correct oldest entries in the timeseries?
Yes, finally, although the oldest entry doesn't really have a ts.

In [88]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
Out[88]:
[{'_id': ObjectId('579fb85988f66357dde496ae'),
  'data': {'approval_date': '2016-07-14',
   'category': 'emSensorDataCollectionProtocol',
   'protocol_id': '2014-04-6267'},
  'metadata': {'key': 'config/consent',
   'platform': 'android',
   'read_ts': 0,
   'time_zone': 'Pacific/Honolulu',
   'type': 'rw-document',
   'write_fmt_time': '2016-08-01T06:02:09.845000-10:00',
   'write_ts': 1470067329.845},
  'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

In [89]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"], "data.ts": {"$exists": True}}).sort("data.ts", 1).limit(1))
Out[89]:
[{'_id': ObjectId('59822c3fcb17471ac0667b86'),
  'data': {'accuracy': 1086.116,
   'altitude': 0,
   'elapsedRealtimeNanos': 7479231065135,
   'filter': 'time',
   'fmt_time': '1969-12-31T16:00:00-08:00',
   'heading': 0,
   'latitude': -23.56214940547943,
   'loc': {'coordinates': [-46.72179579734802, -23.56214940547943],
    'type': 'Point'},
   'longitude': -46.72179579734802,
   'sensed_speed': 0,
   'ts': 0},
  'metadata': {'key': 'background/location',
   'platform': 'android',
   'read_ts': 0,
   'time_zone': 'America/Los_Angeles',
   'type': 'sensor-data',
   'write_fmt_time': '2017-08-02T12:01:48.909000-07:00',
   'write_ts': 1501700508.909},
  'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

@shankari
Copy link
Contributor Author

shankari commented Jan 4, 2018

Ok, so now that we believe that the database is fine, we can run the pipeline again for the first time in forever.

  1. unmount the transfer drive now that its job is done
    $ sudo umount /transfer
    $
    
  2. detach volume
  3. while that is running, turn on the pipeline after multiple months. After
    this runs successfully at least once, we can put it into a cronjob.

Note that we now have entries in the timeseries that are client stats only and
have no uuid entry or any real data. These are zombie entries from before the
uuid change. Are we ignoring these correctly?

A quick check shows that we just read from the UUID database.

def get_all_uuids():
    all_uuids = [e["uuid"] for e in edb.get_uuid_db().find()]
    return all_uuids

There are other methods that still use distinct, notably aggregate_timeseries.get_distinct_users and builtin_timeseries.get_uuid_list, but they don't seem to be used anywhere in the code. Let's remove them as part of cleanup so people are not tempted to use them. Alternatively, we can move zombie entries into a separate timeseries where they won't pollute anything.
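
As a one-off admin check (a sketch; distinct is acceptable here even though we want to remove it from the production code paths), we can also confirm how many zombie user ids are floating around by diffing the user ids in the two collections:

```
# "Zombie" users: user_ids that appear in the timeseries but have no entry in the
# UUID db, and are therefore skipped by get_all_uuids() above.
registered = {e["uuid"] for e in edb.get_uuid_db().find()}
ts_users = set(edb.get_timeseries_db().distinct("user_id"))
zombies = ts_users - registered
print(len(zombies), "zombie user ids with no UUID db entry")
```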

Ok, so let's run this script!

$ ./e-mission-ipy.bash bin/intake_multiprocess.py 4 > /log/intake.stdinout.log 2>&1
$ date
Thu Jan  4 08:17:24 UTC 2018

It's taking a long time to even just get started. I wonder if we are using distinct somewhere...

Looking at the launcher logs, it is still iterating through the users and querying for the number of entries in the usercache. I don't even think we use that functionality and can probably get rid of it in the next release.
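
For reference, the slow startup is consistent with doing one usercache count per registered user before any processing begins. This is an illustrative sketch of that pattern, not the actual launcher code (edb.get_usercache_db() is the assumed accessor):

```
import logging

# Illustrative only: one count query per registered user adds up when there are
# many users, even before any per-user processing starts.
for uuid in get_all_uuids():
    pending = edb.get_usercache_db().find({"user_id": uuid}).count()
    logging.info("user %s has %d pending usercache entries", uuid, pending)
```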

Ah, so now the processes have been launched. Hopefully this first run will be done by tomorrow morning.

  1. Now check back on the volume - it has been detached
  2. Delete both volumes and the snapshot. At this point, there is no unencrypted copy of the data. Will delete other server with encrypted data once everything is done.

@shankari
Copy link
Contributor Author

shankari commented Jan 4, 2018

We should also remove the filter_accuracy step from the real server since it is only applicable for test phones/open data.

@shankari
Copy link
Contributor Author

shankari commented Jan 4, 2018

  • Habitica API is at port 3000, so I have to open it as an outgoing port
  • Performance is significantly better. If filter accuracy is removed, then my data for 3 months is processed in ~ 15 mins. We might be able to get back to running every hour.
2018-01-04T12:02:11.764406+00:00**********UUID <shankari_uuid>: moving to long term**********
2018-01-04T12:04:29.687814+00:00**********UUID <shankari_uuid>: filter accuracy if needed**********
2018-01-04T12:26:01.563703+00:00**********UUID <shankari_uuid>: segmenting into trips**********
2018-01-04T12:33:06.101569+00:00**********UUID <shankari_uuid>: segmenting into sections**********
2018-01-04T12:33:33.701075+00:00**********UUID <shankari_uuid>: smoothing sections**********
2018-01-04T12:33:54.020444+00:00**********UUID <shankari_uuid>: cleaning and resampling timeline**********
2018-01-04T12:40:12.439267+00:00**********UUID <shankari_uuid>: checking active mode trips to autocheck habits**********
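
As a rough check of the "~ 15 mins if filter accuracy is removed" estimate, the stage timestamps above can be turned into per-stage durations. A quick sketch using the values from this run:

```
from datetime import datetime

# Per-stage durations from the log timestamps above (fractional seconds dropped).
fmt = "%Y-%m-%dT%H:%M:%S"
start = datetime.strptime("2018-01-04T12:02:11", fmt)
filter_start = datetime.strptime("2018-01-04T12:04:29", fmt)
segment_start = datetime.strptime("2018-01-04T12:26:01", fmt)
end = datetime.strptime("2018-01-04T12:40:12", fmt)

total_min = (end - start).total_seconds() / 60                    # ~38 min
filter_min = (segment_start - filter_start).total_seconds() / 60  # ~22 min
print(round(total_min), round(filter_min), round(total_min - filter_min))  # 38 22 16
```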

I'm also seeing some errors with saving data, need to make a pass through the errors.

Got error None while saving entry AttrDict({'_id': ObjectId('59f11f15cb17471ac0cfc059'), 'metadata': {'write_ts': 1508533079.587702, 'plugin': 'none', 'time_zone': 'America/Montreal', 'platform': 'ios', 'key': 'statemachine/transition', 'read_ts': 0, 'type': 'message'}, 'user_id': UUID('e95bfd0b-1cfc-4ea3-af01-c41dc2fad0ed'), 'data': {'transition': None, 'ts': 1508533079.587526, 'currState': 'STATE_ONGOING_TRIP'}}) -> None

@shankari
Copy link
Contributor Author

shankari commented Jul 6, 2018

AMPLab is using too many resources, so I have to trim my consumption.
Let's finally finish cleaning up the old servers

@shankari shankari reopened this Jul 6, 2018
@shankari
Copy link
Contributor Author

shankari commented Jul 6, 2018

Copied dump_team_trajectories.py off the old server.
There shouldn't be anything else.
One caveat is that the original data is 150 GB

143G    /mnt/e-mission-primary-db/mongodb

But the new dataset is only 20GB.

/dev/mapper/xvdf  3.0T   20G  3.0T   1% /data
/dev/mapper/xvdg   25G  633M   25G   3% /journal
/dev/mapper/xvdh   10G  638M  9.4G   7% /log

So what is missing?

@shankari
Copy link
Contributor Author

shankari commented Jul 6, 2018

It doesn't appear to be the data. These are the only differences in collections between the old and new databases. There is nothing missing except for system.indexes, which should be an auto-generated collection.

[screenshot comparing the collection lists of the old and new databases, 2018-07-06]

@shankari
Copy link
Contributor Author

shankari commented Jul 6, 2018

Yup! system.indexes is now deprecated.
https://docs.mongodb.com/manual/reference/system-collections/#%3Cdatabase%3E.system.indexes

Deprecated since version 3.0: Access this data using listIndexes.

And listIndexes does have the data.

> db.Stage_timeseries.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "user_id" : "hashed"
                },
                "name" : "user_id_hashed",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "metadata.key" : "hashed"
                },
                "name" : "metadata.key_hashed",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "metadata.write_ts" : -1
                },
                "name" : "metadata.write_ts_-1",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "data.ts" : -1
                },
                "name" : "data.ts_-1",
                "ns" : "Stage_database.Stage_timeseries",
                "sparse" : true
        },
...

@shankari
Copy link
Contributor Author

shankari commented Jul 6, 2018

Shutting down the old server now. RIP! You were a faithful friend and will be missed.

@shankari
Copy link
Contributor Author

shankari commented Jul 9, 2018

In the past 4 days, the compute has increased by $50. The storage has increased by $200. We need to turn off some storage. Wah! Wah! What if I lose something important?! I guess you just have to deal with it...

@shankari
Copy link
Contributor Author

Deleted all related storage.

@shankari
Copy link
Contributor Author

Even with all the deletions, we spent ~ $50/day. This is a problem, because we will then end up spending an additional $1040 for the rest of the month, and we have already spent ~ $1500. This also means that we won't be under $1000 for next month.

Since our reserved instances already cost $507 and the m3/m4 legacy servers cost ~ $639, we have to keep our storage budget under $500 to stay at my preferred 50% of my available budget.

The storage cost is mostly going towards the provisioned IOPS storage. I don't think I actually need 3TB.

Current storage is

Filesystem        Size  Used Avail Use% Mounted on
devtmpfs           30G   84K   30G   1% /dev
tmpfs              30G     0   30G   0% /dev/shm
/dev/xvda1        7.8G  2.3G  5.5G  30% /
/dev/mapper/xvdf  3.0T   21G  3.0T   1% /data
/dev/mapper/xvdg   25G  633M   25G   3% /journal
/dev/mapper/xvdh   10G  638M  9.4G   7% /log
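
Before shrinking anything, a quick sanity check is to ask mongod how much space it actually uses. A sketch, assuming the default localhost connection and the Stage_database name from the index namespaces above:

```
import pymongo

# Confirm that the actual data footprint fits comfortably in a 200G volume.
client = pymongo.MongoClient("localhost")
stats = client.Stage_database.command("dbStats")
print("dataSize: %.1f GiB, storageSize: %.1f GiB" %
      (stats["dataSize"] / 2**30, stats["storageSize"] / 2**30))
```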

We should be able to drop to:

  • /data: 200G

If we still need to reduce after that, we can change to:

  • /journal: 10G
  • /log: 5G

Let's see how easy it is to resize EBS volumes

@shankari
Copy link
Contributor Author

shankari commented Jul 10, 2018

Let's see how easy it is to resize EBS volumes

Not too hard, you just have to copy the data around.
https://matt.berther.io/2015/02/03/how-to-resize-aws-ec2-ebs-volumes/
Let's turn off everything and start copying the data tomorrow morning

@shankari
Copy link
Contributor Author

shankari commented Jul 11, 2018

This was a bit trickier than one would expect because xfs does not support resize2fs and, in fact, does not support shrinking the filesystem at all. So we had to follow the instructions to use a temporary instance, as in the link above.

Note that our data is also on an encrypted filesystem, so our steps were:

  • unmount + turn encryption off for disk
  • detach
  • create new volume
  • create new instance
  • attach both volumes
  • mount old_data
  • crypt setup new volume
  • mount new data
  • copy data over using xfsdump and xfsrestore. Note that xfsdump does not support a / at the end of the mounted filename
  • chown new_data to the same uid/gid as old_data

At this point, the only diff between them is

--- /tmp/old_data_list  2018-07-11 10:07:36.038063450 +0000
+++ /tmp/new_data_list  2018-07-11 10:07:24.266143934 +0000
@@ -140,7 +140,7 @@
 -rw-r--r-- 1  498  497 4.0K Jul 11 09:23 index-94-2297001609533616747.wt
 -rw-r--r-- 1  498  497 4.0K Jul 11 09:23 index-96-2297001609533616747.wt
 -rw-r--r-- 1  498  497 4.0K Jul 11 09:23 index-98-2297001609533616747.wt
-lrwxrwxrwx 1 root root    8 Jan  1  2018 journal -> /journal
+lrwxrwxrwx 1 498 497    8 Jul 11 09:53 journal -> /journal
 -rw-r--r-- 1  498  497  48K Jul 11 09:23 _mdb_catalog.wt
 -rw-r--r-- 1  498  497    0 Jul 11 09:23 mongod.lock
 -rw-r--r-- 1  498  497  36K Jul 11 09:23 sizeStorer.wt

which makes sense

So now it's time to reverse the steps and attach new_data back to the server

@shankari
Copy link
Contributor Author

Reversed steps, restarted server. No errors so far.
Deleting old disk and migration instance.

@shankari
Copy link
Contributor Author

Done. Closing this issue for now.

@shankari
Copy link
Contributor Author

Burn rate is now $33/day
1419.11 - 1385.76 = 33.35

Should go down after we turn off air quality server
but 33 * 15 ~ $500
so we are on track for $2000 for the month, not $1500 as originally planned

@shankari
Copy link
Contributor Author

Burn rate is now roughly $13/day (1497 - 1419 = 78 over 6 days = $13/day)
So this month should be $1497 + $156 = $1653
Next month should be 524 (reserved instances) + 13 * 30 = 524 + 390 = 914 (< $1000)

valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
This fixes
https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352197949

Also add a new test case that checks for this.
Also fix a small bug in the extraction script
valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
… in the query

This fixes
https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352206464

Basically, if two sections are back to back, then the last point of the first
section will overlap with the first point of the second section. So a query
based on the start and end time of the first section will return the first
point of the second section as well, which causes a mismatch between the
re-retrieved and stored speeds and distances.

We detect and drop the last point in this case.
valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
This fixes
https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352219808

dealing with using pymongo in a multi-process environment

```
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
  "MongoClient opened before fork. Create MongoClient "
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
  "MongoClient opened before fork. Create MongoClient "
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
  "MongoClient opened before fork. Create MongoClient "
```

Spawning instead of forking ensures that the subprocesses don't inherit the MongoClient object from the parent and instead create their own.

```
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
debug not configured, falling back to sample, default configuration
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
```
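
For context, a spawn-based launch looks roughly like the sketch below; run_intake_pipeline_for_user and the worker count are illustrative placeholders, not the actual entry points:

```
import multiprocessing as mp

def run_intake_pipeline_for_user(uuid):
    # placeholder for the real per-user intake pipeline entry point
    print("would run the intake pipeline for", uuid)

def launch_all(uuids, nworkers=4):
    # "spawn" children re-import this module, so each worker builds its own
    # MongoClient instead of inheriting the parent's pre-fork connection.
    ctx = mp.get_context("spawn")
    with ctx.Pool(nworkers) as pool:
        pool.map(run_intake_pipeline_for_user, uuids)

if __name__ == "__main__":  # required with spawn so children don't re-launch
    launch_all(["<uuid_1>", "<uuid_2>"])
```
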
valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
See https://github.com/e-mission/e-mission-server/issues/530#issuecomment-353803676
Note that I remove all entries whose section entry is not valid and which have
snuck over from elsewhere.

Regression described at https://github.com/e-mission/e-mission-server/issues/530#issuecomment-353803676 now fixed (I ran thrice in a row without failing)
valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
Although people won't see the ipv6 until they start to use it.  Note that there
are a bunch of manual steps to turn on IPv6 for this setup.  This change merely
automates the tedious work of setting up the routing tables and security
groups.
https://github.com/e-mission/e-mission-server/issues/530#issuecomment-354061649

At this point, I declare that I am done with tweaking the configuration and
will use the configuration deployed from this template (including
75d19de,
7a32bb6...) as the setup for the
default/reference e-mission server.
@shankari shankari transferred this issue from e-mission/e-mission-server Feb 11, 2019