
Split the server for greater scalability #292

Closed

shankari opened this issue Nov 21, 2017 · 94 comments

@shankari

The server scalability had deteriorated to the point where we were unable to run the pipeline even once per day. While part of this is probably just the way we are using mongodb, part of it is also that the server resources are running out.

So I turned off the pipeline around a month ago (last run was on 2017-10-24 21:41:18).

Now, I want to re-provision with a better, split architecture, and reserved instances for lower costs.

@shankari

Here are the current servers that e-mission is running.

  • aws-otp-server: m3.large
  • aws-nominatim: m3.large
  • habitica-server: m3.large
  • aws-webapp: m3.xlarge

The OTP and nominatim servers seem to be fine. The habitica server sometimes has registration issues (https://github.com/e-mission/e-mission-server/issues/522), but that doesn't seem to be related to performance.

The biggest issue is the webapp. The performance of the webapp + server (without the pipeline running) seems acceptable. So the real issue is the pipeline + the database running on the same server. To scale reasonably, we should probably split the server into three parts:

  • database
  • webapp
  • pipeline (backend)

Technically, the pipeline can later become a really small launcher for serverless computation if that's the architecture that we choose to go with.

For now, we want a memory optimized instance for the database, since mongodb caches most results in memory. The webapp and pipeline can probably remain as general-purpose instances, but a bit more powerful.

@shankari

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346752431, we probably want the following:

- aws-otp-server: m3.large/m4.large
- aws-nominatim: m3.large/m4.large
- habitica-server: m3.large/m4.large

- aws-em-webapp: m3.xlarge/m4.xlarge
- aws-em-analysis: m3.xlarge/m4.xlarge
- aws-em-mongodb: m4.2xlarge/r3.xlarge/r4.xlarge/r3.2xlarge/r4.2xlarge

@shankari

Looking at the configuration in greater detail:

  1. for the m3.large/m4.large decision: the m3* series comes with local SSD instance storage (1 x 32 GB for large), while the m4* series is EBS-only, so we would have to pay extra for storage with the m4* series. It would therefore be vastly preferable to use the m3 series, at least for the 3 standalone systems which have to include their own data.

| Instance Type | vCPU | Memory (GiB) | Storage (GB) | Networking Performance |
| --- | --- | --- | --- | --- |
| m4.large | 2 | 8 | EBS Only | Moderate |
| m4.xlarge | 4 | 16 | EBS Only | High |
| m3.large | 2 | 7.5 | 1 x 32 SSD | Moderate |
| m3.xlarge | 4 | 15 | 2 x 40 SSD | High |
  2. for the database, the difference between the r3* and r4* series seems similar - e.g.

| Instance | vCPU | RAM (GiB) | Network | Local storage (GB) |
| --- | --- | --- | --- | --- |
| r4.xlarge | 4 | 30.5 | Up to 10 Gigabit | EBS-Only |
| r4.2xlarge | 8 | 61 | Up to 10 Gigabit | EBS-Only |
| r3.xlarge | 4 | 30.5 | Moderate | 1 x 80 |
| r3.2xlarge | 8 | 61 | Moderate | 1 x 160 |

In this case, though, since the database is already on an EBS disk, the overhead should be low.

@shankari

EBS storage costs are apparently unpredictable, because we pay for both storage and I/O.
https://www.quora.com/Whats-cons-and-pros-for-EBS-based-AMIs-vs-instance-store-based-AMIs
Some people actively advise against using EBS. And of course, the instance-store-based servers also have a ton of ephemeral storage and mostly (except the habitica server) work off static datasets. So for the otp, habitica and nominatim servers, it is pretty much a no-brainer to use the m3 instances.

@shankari

shankari commented Nov 24, 2017

Unsure whether m3* instances are available for reserved pricing, though.
https://aws.amazon.com/ec2/pricing/reserved-instances/pricing/
And the IOPS pricing only applies to provisioned IOPS volumes.
https://aws.amazon.com/ebs/pricing/

General purpose (gp2) EBS storage is 10 cents/GB-month. So the additional cost for going from *3 -> *4 is:

  • m3.large -> m4.large: 32 * 0.1 = max $3.20/month
  • m3.xlarge -> m4.xlarge: 80 * 0.1 = max $8/month
  • r3.xlarge -> r4.xlarge: 80 * 0.1 = max $8/month
  • r3.2xlarge -> r4.2xlarge: 160 * 0.1 = max $16/month

So the additional cost is minimal.

Also, all the documentation says that instance storage is ephemeral, but I know for a fact that when I shut down and restart my m3 instances, the data in the root volume is retained.
I do see that apparently all AMIs are currently launched with EBS root volumes by default
https://stackoverflow.com/a/36688645/4040267
and this is consistent with what I see in the console.

(screenshot, 2017-11-24, 12:03 am)

and, except for the special database EBS volume, they are typically 8GB in size. Does this mean that m3 instances now include EBS storage by default? Am I paying for them? I guess so, but 8GB is so small (under $1 a month) that I probably don't notice.

(screenshot, 2017-11-24, 12:05 am)

Also, it looks like EBS-backed instances can also have ephemeral storage (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/RootDeviceStorage.html). So we should go with the *3* instances if there are reserved instances that support them - otherwise, we should go with *4* instances - the difference in both cost and functionality is negligible compared to the savings of the reserved instance.

@shankari

wrt ephemeral storage for instances, they can apparently be added at the time the instance is launched (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/add-instance-store-volumes.html)

You can specify the instance store volumes for your instance only when you launch an instance. You can't attach instance store volumes to an instance after you've launched it.
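
For reference, a minimal boto3 sketch of declaring an instance store volume at launch time; the AMI ID and device name are placeholders, not the actual values used for these servers.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical launch request: the ephemeral (instance store) volume must be
# declared in BlockDeviceMappings at launch; it cannot be attached afterwards.
ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # placeholder AMI
    InstanceType="m3.large",
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
    ],
)
```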

@shankari

From the earlier comment:

So we should go with the *3* instances if there are reserved instances that support them - otherwise, we should go with *4* instances - the difference in both cost and functionality is negligible compared to the savings of the reserved instance.

There are reserved instances that support every single kind of on-demand instance including *3*.
(screenshot, 2017-11-24, 12:16 am)

@shankari

I looked at one m3 instance and one m4 instance and they both seem to be identical - one block device, which is the root device and is EBS.

m4.large / m3.large: (screenshots of the block device listings, 2017-11-24, 7:11 am)

@shankari

shankari commented Nov 24, 2017

Asked a question on serverfault:
https://serverfault.com/questions/885042/m3-instances-have-root-ebs-volume-by-default-so-now-what-is-the-difference-betw

But empirically, it looks like there is ephemeral storage on m3 instances but not on m4. So the m3 instance has a 32 GB /dev/xvdb, but the m4 instance does not. So why would you use m4 instead of m3? More storage is always good, right?

m3

ubuntu@ip-10-157-135-115:~$ sudo fdisk -l

Disk /dev/xvda: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1   *       16065    16771859     8377897+  83  Linux

Disk /dev/xvdb: 32.2 GB, 32204390400 bytes
255 heads, 63 sectors/track, 3915 cylinders, total 62899200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

ubuntu@ip-10-157-135-115:~$ mount | grep ext4
/dev/xvda1 on / type ext4 (rw)

m4

$ sudo fdisk -l
Disk /dev/xvda: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xea059137

Device     Boot Start      End  Sectors Size Id Type
/dev/xvda1 *     2048 16777182 16775135   8G 83 Linux

ubuntu@ip-172-30-0-54:~$ mount | grep ext4
/dev/xvda1 on / type ext4 (rw,relatime,discard,data=ordered)

@shankari

I am going to create m3.* reserved instances instead of m4.* instances across the board.
For the r3.* versus r4.*, there is actually some question since the r4.* instance has better network, which is important for a database.

Note that the EBS volume that hosts the database is currently associated with 9216 IOPS.
Is that used or provisioned? Let's check. According to the docs:

baseline performance is 3 IOPS per GiB, with a minimum of 100 IOPS and a maximum of 10000 IOPS.

The volume uses 3072 GB, so this is 3072 * 3 = 9216 = the baseline performance.
Let us see the actual performance. No more than 2 IOPS. But of course, we weren't running the pipeline. I am tempted to go with r4.* for the database server, just to be on the safe side.
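
As a sanity check on the quoted baseline formula, a one-liner (the clamp values are the ones from the docs quoted above):

```python
def baseline_iops(size_gib):
    # gp2 baseline per the quote above: 3 IOPS per GiB, min 100, max 10000
    return min(max(3 * size_gib, 100), 10000)

print(baseline_iops(3072))  # 9216, matching the value shown for our volume
```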

@shankari

shankari commented Nov 24, 2017

Given those assumptions, the monthly budget for one installation is:

- aws-otp-server: m3.large ($50) so we have storage
- aws-nominatim: m3.large ($50)
- habitica-server: m3.large ($50)

- aws-em-webapp: m3.xlarge ($90)
- aws-em-analysis: m3.xlarge ($90)
- aws-em-mongodb: r4.2xlarge ($245)

Storage:

- 3072 GB * 0.1 /GB = $307 (biggest expense by far, likely to grow bigger going forward, need to check causes of growth, but may be unavoidable)
- 40 GB * 0.1 / GB = $4 (probably want to put the e-mission server configuration on persistent storage)
- logs can stay on ephemeral storage, which we will have access to given planned m3.* creation

So current total per month:

$150 shared infrastructure,
$425 compute
$310 storage, increasing every month

$885 per month, increasing as we get more storage

When I provision the servers for the eco-escort project, the costs will go up by

$425 compute
$310 storage, increasing every month

$735 per month, increasing as we get more storage

to $885 + $735 = $1620 per month.
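
A quick arithmetic check of the totals above, using the rounded per-item estimates from this comment (USD):

```python
# Monthly budget sanity check (all figures rounded, as quoted above).
shared = 3 * 50                  # otp + nominatim + habitica at $50 each
compute = 90 + 90 + 245          # webapp + analysis + mongodb
storage = 310                    # ~307 (data volume) + ~4 (config volume)
per_install = compute + storage  # 735: what each additional installation adds
print(shared + per_install)      # 885: current total per month
print(shared + 2 * per_install)  # 1620: after the eco-escort servers are added
```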

Storage details

Current mounts on the server:

From the UI, EBS block devices are

/dev/sda1
/dev/sdd
/dev/sdf
$ mount  | grep ext4
/dev/xvda1 on / type ext4 (rw,discard)
/dev/xvdd on /home/e-mission type ext4 (rw)
/dev/mapper/xvdb on /mnt type ext4 (rw)
/dev/mapper/xvdc on /mnt/logs type ext4 (rw)
/dev/mapper/xvdf on /mnt/e-mission-primary-db type ext4 (rw)

$ df -h
Filesystem        Size  Used Avail Use% Mounted on
/dev/xvda1        7.8G  5.2G  2.2G  71% /
/dev/xvdd         7.8G  326M  7.1G   5% /home/e-mission
/dev/mapper/xvdb   37G   14G   22G  39% /mnt
/dev/mapper/xvdc   37G   19G   17G  54% /mnt/logs
/dev/mapper/xvdf  3.0T  141G  2.7T   5% /mnt/e-mission-primary-db

$ sudo fdisk -l

Disk /dev/xvda: 8589 MB, 8589934592 bytes
    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1   *       16065    16771859     8377897+  83  Linux

Disk /dev/xvdb: 40.3 GB, 40256929792 bytes
Disk /dev/xvdb doesn't contain a valid partition table

Disk /dev/xvdc: 40.3 GB, 40256929792 bytes
Disk /dev/xvdc doesn't contain a valid partition table

Disk /dev/xvdd: 8589 MB, 8589934592 bytes
Disk /dev/xvdd doesn't contain a valid partition table

Disk /dev/xvdf: 3298.5 GB, 3298534883328 bytes
Disk /dev/xvdf doesn't contain a valid partition table

Disk /dev/mapper/xvdb: 40.3 GB, 40254832640 bytes
Disk /dev/mapper/xvdb doesn't contain a valid partition table

Disk /dev/mapper/xvdc: 40.3 GB, 40254832640 bytes
Disk /dev/mapper/xvdc doesn't contain a valid partition table

Disk /dev/mapper/xvdf: 3298.5 GB, 3298532786176 bytes
Disk /dev/mapper/xvdf doesn't contain a valid partition table

So it looks like we have 3 EBS devices:

  • / which primarily has the OS, and /tmp/
    2.4G    /home
    1.7G    /tmp
    974M    /usr
    391M    /var
    
    $ du -sh /home/*
    308M    /home/e-mission
    2.1G    /home/ubuntu
    
    $ du -sh /home/ubuntu/*
    1.6G    /home/ubuntu/anaconda
    393M    /home/ubuntu/Anaconda2-4.0.0-Linux-x86_64.sh
    4.0K    /home/ubuntu/gencert
    4.0K    /home/ubuntu/tmp
    
  • /home/e-mission which primarily has some logs
    $ du -sm /home/e-mission/*
    1       /home/e-mission/app_store_review_test.stdinoutlog
    1       /home/e-mission/Berkeley_sections.stdinout.log
    1       /home/e-mission/iphone_2_test.stdinoutlog
    1       /home/e-mission/lost+found
    1       /home/e-mission/migration.log
    2       /home/e-mission/moves_collect.stdinoutlog
    2       /home/e-mission/pipeline.stdinoutlog
    1       /home/e-mission/pipeline_with_perf.log
    1       /home/e-mission/precompute_results.stdinoutlog
    65      /home/e-mission/remotePush.stdinoutlog
    240     /home/e-mission/silent_ios_push.stdinoutlog
    
  • /mnt/e-mission-primary-db which has the database

And we have two ephemeral volumes:

  • /mnt, which has the e-mission server install
  • /mnt/logs which has the periodic logs

@shankari

shankari commented Dec 7, 2017

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346866854,

I am going to create m3.* reserved instances instead of m4.* instances across the board.
For the r3.* versus r4.*, there is actually some question since the r4.* instance has better network, which is important for a database.

It turns out that m4.* is actually cheaper than m3.* (https://serverfault.com/a/885060/437264). The difference for large is $24.09/month (m3.large = $69.35, m4.large = $45.26), which is more than enough to pay for the equivalent EBS storage (~$3/month).
https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346766952

and we can add ephemeral disks to m4* instances for free when we create them.
That settles it, going with m4*.

@shankari

shankari commented Dec 8, 2017

Creating a staging environment first. This can be the open data environment used by the test phones. Since this is an open data environment, we need an additional server that runs the public ipython notebook server. We can't re-use the analysis server since we need to have a read-only connection to the database.

There is now a new m5 series, so we can just get a head start by deploying to that. It's about the same price, but has much greater EBS bandwidth.

Turns out that we can't create ephemeral storage for these instances, though. I went to the Add Storage tab and tried to add a volume, and the only option was an EBS volume.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html

We also need to set up a VPC between the servers so that the database cannot be accessed from the general internet. It looks like the VPC is free as long as we don't need a VPN or a NAT. Theoretically, though, we can just configure the incoming security policy for mongodb, even without a VPC.
https://aws.amazon.com/vpc/pricing/

I have created:

  • aws-op-webapp: m5.xlarge, 40GB storage ($90) (54.196.134.233)
  • aws-op-analysis: m5.xlarge, 40GB storage ($90) (52.87.159.49)
  • aws-op-public: m5.xlarge, 40GB storage ($90) (52.87.159.49)
  • aws-op-database: r4.2xlarge, 3 TB storage ($245) (34.201.243.180)

@shankari

shankari commented Dec 10, 2017

After deploying the servers, we need to set them up. The first big issue in setup is securing the database server. We will use two methods to secure the server:

  • we will restrict network access to the database port to the associated servers
  • we will turn on authentication and access control

Restricting network access (at least naively) is pretty simple - we just need to set up the firewall correctly. Later, we should explore the creation of a VPC for greater security.
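
For reference, a minimal boto3 sketch of that firewall rule; the security group IDs below are placeholders, not the actual groups:

```python
import boto3

ec2 = boto3.client("ec2")

# Allow mongodb (port 27017) on the database security group only from the
# webapp and analysis security groups; the group IDs are placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-0database0000000",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 27017,
        "ToPort": 27017,
        "UserIdGroupPairs": [
            {"GroupId": "sg-0webapp00000000"},
            {"GroupId": "sg-0analysis000000"},
        ],
    }],
)
```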

wrt authentication, the viable options are:

  • SCRAM-SHA-1
  • MONGODB-CR
  • x.509

The first two are both username/password based authentication, which I am really reluctant to use. There is no classic public-key authentication mechanism.

@shankari

shankari commented Dec 10, 2017

I am reluctant to use the username/password based authentication because then I would need to store the password in a filesystem somewhere and make sure to copy/configure it every time. But in terms of attack vector, it seems around the same as public-key based authentication.

If the attacker gets access to the connecting hosts (webapp or analysis), it seems like she would have access to both the password and the private key.

The main differences are:

  • if the attacker gets access to the place where we have stored the passwords for the long-term, the password based solution is compromised, although the public key solution is not. We can avoid this by storing the password securely, just like the private key to the webapp.
  • if authentication happens over plaintext, the public-key solution is not compromised (a sniffer only sees the public key), but the password-based solution is (the password itself is visible). We can avoid this by encrypting connections between the database and the webapp. This may also allow us to use x.509 based authentication, which is pretty close to public key authentication.

@shankari

We can avoid this by encrypting connections between the database and the webapp. This may also allow us to use x.509 based authentication

We can do this, but we need to get SSL certificates for TLS-based encryption. I guess a self-signed certificate should be fine, since the mongodb is only going to be connected to the analysis and webapp hosts, which we control. But we can also probably avoid it if all communication is through an internal subnet on the VPC.
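
If we do go the self-signed certificate route, the client-side change is small. A sketch using pymongo 3.x option names; the hostname and CA path are placeholders:

```python
from pymongo import MongoClient

# Client-side TLS with a self-signed CA (pymongo 3.x option names);
# the hostname and certificate path are placeholders.
client = MongoClient(
    "aws-op-database.internal",
    ssl=True,
    ssl_ca_certs="/etc/e-mission/mongodb-ca.pem",
)
print(client.server_info()["version"])
```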

Basically, it seems like there are multiple levels of hardening possible:

  • configure incoming and outgoing connections in the firewall, no auth
    Ease of use: 6 (easy, simple security group UI)
    Security: 1 (weak, since data transfer flows over the public internet without encryption)

  • listen only to the private IP, all communication to/from the database is in the VPC, no auth
    Ease of use: 4 (can set up VPC via UI)
    Security: 5 (pretty good, since all unencrypted data flow is internal. The only attack vector is if the hacker somehow compromises any of the services. Once this is done, she can either connect to the database directly, or run a packet sniffer on the network)

  • listen only to the private IP, all communication to/from the database is in the VPC, SSL certificates used, no auth
    Ease of use: 1 (need to get SSL certificates and setup a bunch of configuration)
    Security: 7 (pretty close to optimal, since even packet sniffers can't see anything)

Adding authentication

If we use option 2+ above, adding authentication does not appear to provide very much additional protection from external hackers. Assuming no firewall bugs, if a hacker wants to access the database, they need to first hack into one of the service hosts to generate the appropriate source header. And if they do that, they can always just see the auth credentials in the config file.

However, it can prevent catastrophic issues if there really is a firewall or VPC bug, and a hacker is able to inject malicious packets that purportedly come from the service hosts. Unless there is an encryption bug, moving to option (3) will harden the option further.

Authentication seems most useful when it is combined with Role-Based Access Control (RBAC). RBAC can be used to separate read-only exploration (e.g. on a public server) from read-write computation. But it can go beyond that - we can make the webapp write to the timeseries and read-only from the aggregate, but make the analysis server read-only from the timeseries and write to the analysis database.
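
A sketch of what that RBAC split could look like if we ever turn auth on. The collection names follow the Stage_* convention seen elsewhere in this thread; the hostname, role/user names and password are placeholders:

```python
from pymongo import MongoClient

# Sketch only, assuming auth gets enabled at some point.
db = MongoClient("aws-op-database.internal")["Stage_database"]

db.command("createRole", "webappRole",
           privileges=[
               # webapp: read/write the raw timeseries, read-only analysis results
               {"resource": {"db": "Stage_database", "collection": "Stage_timeseries"},
                "actions": ["find", "insert"]},
               {"resource": {"db": "Stage_database", "collection": "Stage_analysis_timeseries"},
                "actions": ["find"]},
           ],
           roles=[])
db.command("createUser", "webapp", pwd="placeholder", roles=["webappRole"])
```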

@shankari

shankari commented Dec 11, 2017

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-350631885,
given the tradeoffs articulated, I have decided to go with option (2) with no auth.

Concrete proposal

listen only to the private IP, all communication to/from the database is in the VPC, no auth
Ease of use: 4 (can set up VPC via UI)
Security: 5 (pretty good, since all unencrypted data flow is internal.

@shankari

shankari commented Dec 11, 2017

It looks like all instances created in the past year are assigned to the same VPC and the same subnet in the VPC (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/default-vpc.html). In general, we don't want to share the subnet with other servers, because then if a hacker got access to one of the other servers on the subnet, they could packet sniff all the traffic and potentially reconstruct the data. For the open data servers, this may be OK since the data is open, and we have firewall restrictions on where we can get messages from.

But what about packet spoofing and potentially deleting data? Let's just make another (small) subnet.

@shankari

I can't seem to find a way to list all the instances in a particular subnet. Filed https://serverfault.com/questions/887552/aws-how-do-i-find-the-list-of-instances-associated-with-a-particular-subnet
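
One way to get this programmatically while waiting for an answer; the subnet ID below is a placeholder:

```python
import boto3

ec2 = boto3.client("ec2")

# Filter describe_instances by subnet-id (the subnet ID is a placeholder).
resp = ec2.describe_instances(
    Filters=[{"Name": "subnet-id", "Values": ["subnet-xxxxxxxx"]}])
for reservation in resp["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance.get("PrivateIpAddress"))
```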

@shankari

shankari commented Dec 11, 2017

Ok, just to experiment with this for the future, we will set up a small subnet that hosts only the database and the analysis server.

From https://aws.amazon.com/vpc/faqs/
The minimum size of a subnet is a /28 (or 14 IP addresses.) for IPv4. Subnets cannot be larger than the VPC in which they are created.

multi-tier website, with the web servers in a public subnet and the database servers in a private subnet.

So basically, this scenario:
http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html

Wait, the analysis server cannot be in the private subnet then, because it needs to talk to external systems such as habitica and the real-time bus feeds etc. We should really split the analysis server into external-facing and internal-facing parts too. But since that will require some additional software restructuring, let's just put it in the public subnet for now.

I won't provision a NAT gateway for now - will explore ipv6-only options which will not require a (paid) NAT gateway and can use the (free) egress-only-internet gateway. http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/egress-only-internet-gateway.html

@shankari

shankari commented Dec 11, 2017

Ok so followed the VPC wizard for scenario 2 and created

  • aws-op-vpc
  • aws-op-public-subnet, aws-op-private-subnet
  • a NAT gateway and an egress-only internet gateway, and
  • aws-op-public-route, aws-op-private-route

Only aws-op-private-subnet has IPv6 enabled.

aws-op-public-route was associated with aws-op-public-subnet, but aws-op-private-route was marked as main and not associated with any subnet. That is consistent with

In this scenario, the VPC wizard updates the main route table used with the private subnet, and creates a custom route table and associates it with the public subnet.

In this scenario, all traffic from each subnet that is bound for AWS (for example, to the Amazon EC2 or Amazon S3 endpoints) goes over the Internet gateway. The database servers in the private subnet can't receive traffic from the Internet directly because they don't have Elastic IP addresses. However, the database servers can send and receive Internet traffic through the NAT device in the public subnet.

Any additional subnets that you create use the main route table by default, which means that they are private subnets by default. If you want to make a subnet public, you can always change the route table that it's associated with.

@shankari

shankari commented Dec 11, 2017

The default wizard configuration turns off "Auto-assign Public IP" because the assumption appears to be that we will use elastic IPs. Testing this scenario by editing the network interface for our provisioned servers and then turning it on later or manually assigning IPs.

@shankari

shankari commented Dec 11, 2017

Service instances

Turns out you can't edit the network interface settings of an existing instance, but you can create a new instance and attach the volumes.

Before migration

IP: 54.196.134.233
Able to ssh in

Migrate

  • Create a new m5.xlarge instance
  • attach it to the aws-op-vpc, aws-op-public-subnet and override the assignment settings for public IP and ipv6.
  • create security groups for the different kinds of instances
    • webapp
      • incoming SSH from home and HTTPS from the eecs hostname redirect
      • all outgoing traffic to both 0.0.0.0/0 and ::/0 (seems like we can tighten this)
    • analysis
      • incoming SSH from home
      • all outgoing traffic to both 0.0.0.0/0 and ::/0 (seems like we can tighten this)
    • public
      • incoming ssh from home and ports 8888 - 9999 for ipython notebook
      • all outgoing traffic to both 0.0.0.0/0 and ::/0
    • database
      • incoming ssh from webapp and mongodb from webapp, analysis and public
      • outgoing traffic to all ports on webapp, analysis and public over ipv4 (seems like we should add routes for patches)

After migration

  • Can ssh directly to all three public-facing servers
  • attached non-root EBS volumes were also deleted! Good we figured this out now! Created new volumes and attached them

@shankari

Database instance

Migration

  • Recreating instance, putting it into the private subnet, no assigned ipv4 address. It looks like after the instance is created, I can add a new private ip address, but not a public one.

Ah!

You can only use the auto-assign public IPv4 feature for a single, new network interface with the device index of eth0. For more information, see Assigning a Public IPv4 Address During Instance Launch.

No matter - that is what I want.

Ensure that the security group allows ssh from the webserver.

Try to ssh from the webserver.
Works!

Try to ssh from the analysis server.
Doesn't work!

Try to ssh to the private address from outside
Obviously doesn't work.

Tighten up the outbound rules on all security groups to be consistent with
http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html

Couple of modifications needed for this to work.

  • outbound ssh rule from the webapp to the database server to allow us to log in

  • DNS resolution needed to be enabled for the VPC
    Looking at http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-dns.html,
    DNS resolution is supposed to be enabled for VPCs created through the wizard, but is off for our VPC although it was created using the wizard

    $ ping www.google.com
    PING www.google.com (172.217.13.228) 56(84) bytes of data.
    64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=1 ttl=45 time=1.61 ms
    64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=2 ttl=45 time=1.61 ms
    64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=3 ttl=45 time=1.59 ms
    64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=4 ttl=45 time=1.64 ms
    ^C
    
  • DNS servers only support ipv4, so if we want to access the internet from the private subnet, we need to continue using the NAT gateway that the wizard set up for us.

    [ec2-user@ip-192-168-1-100 ~]$ ping www.google.com
    PING www.google.com (172.217.8.4) 56(84) bytes of data.
    <HANGS>
    ^C
    --- www.google.com ping statistics ---
    5 packets transmitted, 0 received, 100% packet loss, time 4081ms
    

    This is because the incoming rules for the nat only supported the default security group. Changing it to the database security group caused everything to start working.

@shankari

Attaching the database volumes back, and then I think that setup is all done. I'm a bit unhappy about the NAT, but figuring out how to do DNS for ipv6 addresses is a later project, I think.

@shankari

Cannot attach the volumes because they are in a different availability zone.
Per http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumes.html
you need to migrate the volumes to a different zone using their snapshots.
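
The snapshot-based move is scriptable too; a boto3 sketch with a placeholder volume ID and target zone:

```python
import boto3

ec2 = boto3.client("ec2")

# Snapshot the volume, then recreate it in the target availability zone;
# the volume ID and zone below are placeholders.
snap = ec2.create_snapshot(VolumeId="vol-xxxxxxxx", Description="move to new AZ")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
new_vol = ec2.create_volume(SnapshotId=snap["SnapshotId"], AvailabilityZone="us-east-1a")
print(new_vol["VolumeId"])
```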

@shankari

And our volumes don't have snapshots. Creating snapshots to explore this option...
Can't create snapshot - selected it and nothing happened.
So it looks like provisioned IOPS volumes have their snapshots under "Snapshots", not linked to the volume.

Restoring....
That worked.
Attached the three volumes back to the database.

Getting started with code now...

@shankari

Main code changes required:

  • support database hostname as part of configuration. There's already a field for this, but we should actually use it. Or potentially split it out into its own conf file (see the sketch after this list).
  • split out all the public stuff since it was really kludgy and is going to be on a separate server anyway
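
A minimal sketch of what the split-out conf file could look like; the file path, JSON keys and default here are illustrative assumptions, not the actual e-mission configuration format:

```python
import json
from pymongo import MongoClient

# Hypothetical conf/storage/db.conf (path, keys and default are assumptions):
# { "timeseries": { "url": "aws-op-database.internal" } }
def get_db_host(path="conf/storage/db.conf", default="localhost"):
    try:
        with open(path) as f:
            return json.load(f)["timeseries"]["url"]
    except IOError:
        return default

client = MongoClient(get_db_host())
```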

@shankari

changes to server done (e-mission/e-mission-server#535)
now it is time to deploy!

@shankari

installing mongodb now...

@shankari

shankari commented Jan 3, 2018

Ok so now back to cleaning up data (based on https://github.com/e-mission/e-mission-server/issues/530#issuecomment-354711521).

  1. This time, we port the stats first, so that they will be deleted for the public phones, and we can delete the stats collections when we are done.
$ ./e-mission-py.bash bin/historical/migrations/stats_from_db_to_ts.py
  1. Convert all ms to secs before validation.
for e in edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}):
         edb.get_timeseries_db().update({"_id": e["_id"]},
                 {"$set": {"data.ts": float(e["data"]["ts"])/1000,
                   "metadata.write_ts": float(e["metadata"]["write_ts"])/1000}})   
  1. validate

    1. messed up entries have been fixed
    In [1039]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}).count()
    Out[1039]: 0
    
    1. Correlation between stats db and timeseries db exists and timestamp is valid
     In [1040]: entry_1 = edb.get_client_stats_db_backup().find_one()
    
     In [1041]: edb.get_timeseries_db().find_one({"metadata.key": "background/battery", "data.ts": float(entry_1["ts"])/1000})
     Out[1041]:
     {u'_id': ObjectId('5a4c7b5c88f663668630d290'),
     u'data': {u'battery_level_pct': 4.0, u'ts': 1413254944.995},
     u'metadata': {u'key': u'background/battery',
     u'platform': u'server',
     u'time_zone': u'America/Los_Angeles',
     u'write_fmt_time': u'2014-10-13T19:49:42.201671-07:00',
     u'write_ts': 1413254982.201671},
     u'user_id': UUID('f8fee20c-0f32-359d-ba75-bce97a7ac83b')}
    
    In [1044]: arrow.get(1413254944.995)
    Out[1044]: <Arrow [2014-10-14T02:49:04.995000+00:00]>
    
    
    1. Sorting in descending order works and timestamps are valid
    In [1042]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", -1).limit(1))
    Out[1042]:
    [{u'_id': ObjectId('59a0c57dcb17471ac08bc0c6'),
      u'data': {u'client_app_version': u'2.3.0',
       u'client_os_version': u'7.0',
       u'name': u'sync_duration',
       u'reading': 6.459,
       u'ts': 1503704747.497},
      u'metadata': {u'key': u'stats/client_time',
       u'platform': u'android',
       u'read_ts': 0,
       u'time_zone': u'America/Los_Angeles',
       u'type': u'message',
       u'write_fmt_time': u'2017-08-25T16:45:47.499000-07:00',
       u'write_ts': 1503704747.499},
      u'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]
    
    In [1045]: arrow.get(1503704747.497)
    Out[1045]: <Arrow [2017-08-25T23:45:47.497000+00:00]>
    
    In [1047]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", -1).limit(1))
    Out[1047]:
    [{u'_id': ObjectId('59efe782cb17471ac0cecb5a'),
      u'data': {u'client_app_version': u'2.4.0',
       u'client_os_version': u'10.3.3',
       u'name': u'sync_launched',
       u'reading': -1,
       u'ts': 1508894590.151824},
      u'metadata': {u'key': u'stats/client_nav_event',
       u'platform': u'ios',
       u'plugin': u'none',
       u'read_ts': 0,
       u'time_zone': u'America/Los_Angeles',
       u'type': u'message',
       u'write_fmt_time': u'2017-10-24T18:23:10.152289-07:00',
       u'write_local_dt': {u'day': 24,
        u'hour': 18,
        u'minute': 23,
        u'month': 10,
        u'second': 10,
        u'timezone': u'America/Los_Angeles',
        u'weekday': 1,
        u'year': 2017},
       u'write_ts': 1508894590.152289},
      u'user_id': UUID('7161343e-551e-4213-be75-3b82e1ce2448')}]
    
    In [1048]: arrow.get(1508894590.151824)
    Out[1048]: <Arrow [2017-10-25T01:23:10.151824+00:00]>
    
    
    1. Sorting in ascending order works but one of the timestamps is weird.
    In [1043]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", 1).limit(1))
    Out[1043]:
    [{u'_id': ObjectId('5a4c829288f663668638cb87'),
      u'data': {u'client_app_version': u'1.0.0',
       u'client_os_version': u'4.4.2',
       u'name': u'sync_duration',
       u'reading': 5638.0,
       u'ts': 1474414807.961},
      u'metadata': {u'key': u'stats/client_time',
       u'platform': u'server',
       u'time_zone': u'America/Los_Angeles',
       u'write_fmt_time': u'2016-09-20T18:40:45.947868-07:00',
       u'write_local_dt': {u'day': 20,
        u'hour': 18,
        u'minute': 40,
        u'month': 9,
        u'second': 45,
        u'timezone': u'America/Los_Angeles',
        u'weekday': 1,
        u'year': 2016},
       u'write_ts': 1474422045.947868},
      u'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]
    
    In [1046]: arrow.get(1474414807.961)
    Out[1046]: <Arrow [2016-09-20T23:40:07.961000+00:00]>
    
    In [1049]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", 1).limit(1))
    Out[1049]:
    [{u'_id': ObjectId('5a4c7e6b88f663668633e5dd'),
      u'data': {u'client_app_version': u'2.1',
       u'client_os_version': u'4.4.2',
       u'name': u'confirmlist_auth_not_done',
       u'reading': None,
       u'ts': 315965026.452},
      u'metadata': {u'key': u'stats/client_nav_event',
       u'platform': u'server',
       u'time_zone': u'America/Los_Angeles',
       u'write_fmt_time': u'2015-06-03T09:57:27.061417-07:00',
       u'write_local_dt': {u'day': 3,
        u'hour': 9,
        u'minute': 57,
        u'month': 6,
        u'second': 27,
        u'timezone': u'America/Los_Angeles',
        u'weekday': 2,
        u'year': 2015},
       u'write_ts': 1433350647.061417},
      u'user_id': UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')}]
    
    In [1050]: arrow.get(315965026.452)
    Out[1050]: <Arrow [1980-01-06T00:03:46.452000+00:00]>
    
    1. what is going on with this?

    This is not just an invalid conversion, though, because trying to convert it
    back to seconds does not work.

     ```
     In [1051]: arrow.get(315965026452)
     ValueError: year is out of range
     ```
    

    There are 13 such entries and they are all from user UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')

    In [1067]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$lt": ts_2000.timestamp}}).count()
    Out[1067]: 13
    
    In [1070]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$lt": ts_2000.timestamp}}).distinct("user_id")
    Out[1070]: [UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')]
    

    There are 151 entries for this user in the client stats DB

    In [1071]: edb.get_client_stats_db_backup().find({"user": UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')}).count()
    Out[1071]: 151
    

    Let's try to match based on the reported_ts. Bingo! The entry does indeed have an invalid ts.

    In [1073]: edb.get_client_stats_db_backup().find_one({"user": UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee'), "reported_ts":  1433350647.061417})
    Out[1073]:
    {u'_id': ObjectId('556f31f788f6636f49a1b05a'),
     u'client_app_version': u'2.1',
     u'client_os_version': u'4.4.2',
     u'reading': u'0.0',
     u'reported_ts': 1433350647.061417,
     u'stat': u'battery_level',
     u'ts': u'315965055221',
     u'user': UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')}
    
    In [1075]: arrow.get(315965055221)
    ValueError: year is out of range
    

    There are a bunch of other entries with the same user and reported_ts, but fewer than the entries reported in the timeseries.

    In [1074]: edb.get_client_stats_db_backup().find({"user": UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee'), "reported_ts":  1433350647.061417}).count()
    Out[1074]: 37
    

    I bet the others are battery_level, similar to the above.

    In [1076]: edb.get_timeseries_db().find({"data.ts": {"$lt": ts_2000.timestamp}}).count()
    Out[1076]: 762
    
    In [1077]: edb.get_timeseries_db().find({"data.ts": {"$lt": ts_2000.timestamp}}).distinct("user_id")
    Out[1077]:
    [UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee'),
     UUID('99b29da2-989d-4e54-a211-284bde1d362d'),
     UUID('693a42f1-6f00-497e-8b40-a8339fd5af8d'),
     UUID('3d37573b-2e74-496d-a09d-a0a3f05c2467'),
     UUID('ee6dadba-53b6-421f-94d1-27bc96e023cf'),
     UUID('0109c47b-e640-411e-8d19-e481c52d7130'),
     UUID('6560950c-4ddb-41fc-8801-b9197d30f54d'),
     UUID('b82804b8-4e49-43a0-99d1-9d1da20ec1d3'),
     UUID('9d906275-8072-42d4-8dd2-3670e63e0f6e'),
     UUID('6af5afdf-b1d9-4ea7-9f10-2bddb8a0ecb3'),
     UUID('6a415e67-9025-4f29-b520-f0c5a43c8bb6'),
     UUID('a61349c6-0cc9-4902-9f13-d4236a630ad5'),
     UUID('cfbf03dc-6e3e-40bd-90de-d19d14613e47'),
     UUID('be47f46a-ce3a-4ad8-b81f-d3daa7955e95'),
     UUID('5109e62d-2152-481b-8d26-2cb8d8cc1f23'),
     UUID('f14272fe-1433-4430-b1ec-3f37dfdde5bf'),
     UUID('abf4ed3a-a018-4f40-90c5-39b592b8569b'),
     UUID('de23cac9-1996-4af5-8554-4f6d017b3459'),
     UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'),
     UUID('08b31565-f990-4d15-a4a7-89b3ba6b1340'),
     UUID('5a6a2711-c574-42f0-9940-ea1fd0cc2f09'),
     UUID('29277dd4-dc78-40c0-806c-f88a4f902436'),
     UUID('6ed1b36d-08a9-403d-b247-e426228c0492'),
     UUID('d2b923b9-68b9-4e88-9b8c-29416694efb1'),
     UUID('e82b1c5a-7c07-46b7-afd7-b53ac9db1f42'),
     UUID('dcdb5f74-071a-4e5b-a954-e613c5b46e5d'),
     UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0'),
     UUID('cd6482fe-56a2-4bf8-b8a8-d74f6e3c22c8'),
     UUID('3ca88f7c-fb1a-467e-9e29-99909d92c904')]
    
    In [1078]: edb.get_timeseries_db().find({"data.ts": {"$lt": ts_2000.timestamp}}).distinct("metadata.key")
    Out[1078]:
    [u'stats/client_time',
     u'background/battery',
     u'stats/client_nav_event',
     u'statemachine/transition',
     u'background/location',
     u'background/filtered_location']
    
    In [1080]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["statemachine/transition", "background/location", "background/filtered_location"]}, "data.ts": {"$lt": ts_2000.timestamp}}).count()
    Out[1080]: 744
    
    In [1082]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["statemachine/transition", "background/location", "background/filtered_location"]}, "data.ts": {"$lt": ts_2000.timestamp}}).distinct("metadata.platform")
    Out[1082]: [u'android']
    

    Actually, no. There are a lot more, including a lot of non-stat entries, and
    they are all on android. Going to verify some of this manually and then move on.
    So they are an interesting mix of some kind of weird timestamp that is broken
    for both ts and write_ts and entries where the location apparently had ts = 0.

    >>> list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["statemachine/transition", "background/location", "background/filtered_location"]}, "data.ts": {"$lt": ts_2000.timestamp}}, {"user_id":1, "data.ts": 1, "data.fmt_time": 1, "metadata.write_ts": 1, "metadata.write_fmt_time": 1}).limit(10))
    
    [{u'_id': ObjectId('56c495afeaedff78c762a711'),
      u'data': {u'fmt_time': u'1970-03-12T00:27:13+08:00', u'ts': 6020833},
      u'metadata': {u'write_fmt_time': u'1970-03-12T00:27:13+08:00',
       u'write_ts': 6020833},
      u'user_id': UUID('99b29da2-989d-4e54-a211-284bde1d362d')},
     {u'_id': ObjectId('56c495afeaedff78c762a710'),
      u'data': {u'fmt_time': u'1970-03-11T14:25:57+08:00', u'ts': 5984757},
      u'metadata': {u'write_fmt_time': u'1970-03-11T14:25:57+08:00',
       u'write_ts': 5984757},
      u'user_id': UUID('99b29da2-989d-4e54-a211-284bde1d362d')},
     {u'_id': ObjectId('575a6423383999ecb7a5e183'),
      u'data': {u'fmt_time': u'1969-12-31T16:00:00-08:00', u'ts': 0},
      u'metadata': {u'write_fmt_time': u'2016-06-09T23:43:11.391000-07:00',
       u'write_ts': 1465540991.391},
      u'user_id': UUID('693a42f1-6f00-497e-8b40-a8339fd5af8d')},
     {u'_id': ObjectId('5760c4dc383999ecb7a98155'),
      u'data': {u'fmt_time': u'1969-12-31T16:00:00-08:00', u'ts': 0},
      u'metadata': {u'write_fmt_time': u'2016-06-14T16:23:03.521000-07:00',
       u'write_ts': 1465946583.521},
    

    At some point, I should go through and throw out all this data. But it is a
    small amount of data and can wait.

    And to complete the exploration, all the broken stats are from the same user.

    In [1089]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error", "background/battery"]}, "data.ts": {"$lt": ts_2000.timestamp}}).distinct("user_id")
    Out[1089]: [UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')]
    
  2. Next, remove all test phone data, including pipeline states. Do NOT remove USP open data because there is data outside the open range too, at least for one user (me!)

    1. State before
    In [1053]: edb.get_uuid_db().find().count()
    Out[1053]: 516
    
    In [1054]: edb.get_timeseries_db().find().count()
    Out[1054]: 54818879
    
    In [1055]: edb.get_analysis_timeseries_db().find().count()
    Out[1055]: 16573823
    
    In [1056]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
    Out[1056]: 487
    
    1. Running the script!
    $ ./e-mission-py.bash bin/debug/purge_multi_timeline_for_range.py --pipeline-purge /tmp/public_data/dump_
    INFO:root:Loading file or prefix /tmp/public_data/dump_
    INFO:root:Found 12 matching files for prefix /tmp/public_data/dump_
    INFO:root:files are ['/tmp/public_data/dump_fd7b4c2e-2c8b-3bfa-94f0-d1e3ecbd5fb7.gz', '/tmp/public_data/dump_6561431f-d4c1-4e0f-9489-ab1190341fb7.gz', '/tmp/public_data/dump_079e0f1a-c440-3d7c-b0e7-de160f748e35.gz', '/tmp/public_data/dump_92cf5840-af59-400c-ab72-bab3dcdf7818.gz', '/tmp/public_data/dump_3bc0f91f-7660-34a2-b005-5c399598a369.gz'] ... ['/tmp/public_data/dump_95e70727-a04e-3e33-b7fe-34ab19194f8b.gz', '/tmp/public_data/dump_c528bcd2-a88b-3e82-be62-ef4f2396967a.gz', '/tmp/public_data/dump_70968068-dba5-406c-8e26-09b548da0e4b.gz', '/tmp/public_data/dump_93e8a1cc-321f-4fa9-8c3c-46928668e45d.gz']
    INFO:root:==================================================
    INFO:root:Deleting data from file /tmp/public_data/dump_fd7b4c2e-2c8b-3bfa-94f0-d1e3ecbd5fb7.gz
    ...
    INFO:root:For uuid = e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the timeseries
    INFO:root:result = {u'ok': 1, u'n': 3923919}
    INFO:root:For uuid = e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the analysis_timeseries
    INFO:root:result = {u'ok': 1, u'n': 39361}
    INFO:root:For uuid e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the user_db
    INFO:root:result = {u'ok': 1, u'n': 1}
    INFO:root:For uuid e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the pipeline_state_db
    INFO:root:result = {u'ok': 1, u'n': 12}
    
    1. State after
    In [1057]: edb.get_uuid_db().find().count()
    Out[1057]: 504
    
    In [1058]: edb.get_timeseries_db().find().count()
    Out[1058]: 39797029
    
    In [1059]: edb.get_analysis_timeseries_db().find().count()
    Out[1059]: 16190498
    
    In [1060]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
    Out[1060]: 475
    
  3. Next, remove all config documents from the usercache

    In [1061]: edb.get_usercache_db().find({"metadata.type": "document", "metadata.key": {"$in": ['config/consent', 'config/sensor_config', 'config/sync_config']}}).count()
    Out[1061]: 239
    
    In [1062]: edb.get_usercache_db().remove({"metadata.type": "document", "metadata.key": {"$in": ['config/consent', 'config/sensor_config', 'config/sync_config']}})
    Out[1062]: {u'n': 239, u'ok': 1}
    
  4. Next, re-remove unused collections. This time, since we have migrated all stats, we can remove those databases as well.

    In [1090]: edb.get_alternatives_db().remove()
    Out[1090]: {u'n': 101146, u'ok': 1}
    
    In [1091]: edb.get_client_db().remove()
    Out[1091]: {u'n': 3, u'ok': 1}
    
    In [1092]: edb.get_common_place_db().remove()
    Out[1092]: {u'n': 0, u'ok': 1}
    
    In [1093]: edb.get_common_trip_db().remove()
    Out[1093]: {u'n': 0, u'ok': 1}
    
    In [1094]: edb.get_pending_signup_db().remove()
    Out[1094]: {u'n': 25, u'ok': 1}
    
    In [1095]: edb._get_current_db().Stage_place.remove()
    Out[1095]: {u'n': 0, u'ok': 1}
    
    In [1096]: edb.get_routeCluster_db().remove()
    Out[1096]: {u'n': 90, u'ok': 1}
    
    In [1097]: edb._get_current_db().Stage_routeDistanceMatrix.remove()
    Out[1097]: {u'n': 7, u'ok': 1}
    
    In [1098]: edb._get_current_db().Stage_section_new.remove()
    Out[1098]: {u'n': 0, u'ok': 1}
    
    In [1099]: edb._get_current_db().Stage_stop.remove()
    Out[1099]: {u'n': 0, u'ok': 1}
    
    In [1100]: edb._get_current_db().Stage_trip_new.remove()
    Out[1100]: {u'n': 0, u'ok': 1}
    
    In [1101]: edb._get_current_db().Stage_user_moves_access.remove()
    Out[1101]: {u'n': 118, u'ok': 1}
    
    In [1102]: edb._get_current_db().Stage_utility_models.remove()
    Out[1102]: {u'n': 36, u'ok': 1}
    
    In [1103]: edb._get_current_db().Stage_Worktime.remove()
    Out[1103]: {u'n': 2662, u'ok': 1}
    
    In [1104]: edb.get_client_stats_db_backup().remove()
    Out[1104]: {u'n': 650961, u'ok': 1}
    
    In [1105]: edb.get_server_stats_db_backup().remove()
    Out[1105]: {u'n': 449523, u'ok': 1}
    
  5. Ok! I think we are done! There's plenty of room on the transfer disk, so
    let's just create a new dump and keep the old dump as backup.

    /dev/xvdg         296G   53G  228G  19% /transfer
    
    $ mongodump --out /transfer/cleanedup-jan-3
    2018-01-03T18:44:29.016+0000    Test_database.Test_Set to /transfer/cleanedup-jan-3/Test_database/Test_Set.bson
    2018-01-03T18:44:29.018+0000             1 documents
    2018-01-03T18:44:29.019+0000    Metadata for Test_database.Test_Set to /transfer/cleanedup-jan-3/Test_database/Test_Set.metadata.json
    2018-01-03T18:44:29.019+0000 DATABASE: admin     to     /transfer/cleanedup-jan-3/admin
    

Dump is done!

Remaining steps:

  • attach volume to new server
  • mongorestore
  • re-run analysis pipeline
  • DONE!!!

@shankari

shankari commented Jan 4, 2018

Re-attach the volume to the new stack

  1. Unmount from current stack

    $ sudo umount /transfer
    
  2. Snapshot

  3. Create new volume in the correct region

  4. Attach new volume to the database from the new stack

  5. Mount the new volume

    $ sudo mkdir -p /transfer
    $ sudo chown ec2-user:ec2-user /transfer/
    $ sudo mount /dev/xvdi /transfer/
    $ ls /transfer
    cleanedup-jan-3  lost+found  odc-usp-2017  original-jan-1  public_phone_stats
    
  6. Restore

  7. Validate

    1. On old server

      In [521]: edb.get_uuid_db().find().count()
      Out[521]: 504
      
      In [522]: edb.get_timeseries_db().find().count()
      Out[522]: 39797029
      
      In [523]: edb.get_analysis_timeseries_db().find().count()
      Out[523]: 16190498
      
      In [524]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
      Out[524]: 475
      
      In [525]: edb.get_usercache_db().find().count()
      Out[525]: 10011104
      
    2. On new server

      In [2]: edb.get_uuid_db().find().count()
      Out[2]: 504
      
      In [3]: edb.get_timeseries_db().find().count()
      Out[3]: 39797135
      
      In [4]: edb.get_analysis_timeseries_db().find().count()
      Out[4]: 16190498
      
      In [5]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
      Out[5]: 475
      
      In [6]: edb.get_usercache_db().find().count()
      Out[6]: 10011104
      
    3. Check our favourite users

      1. Me

        In [7]: edb.get_uuid_db().find_one({"user_email": "[email protected]"})
        Out[7]:
        {'_id': ObjectId('54a6bdfd39e59673fd9fba5b'),
         'update_ts': datetime.datetime(2017, 8, 20, 2, 29, 50, 275000),
         'user_email': '<shankari_email>',
         'uuid': UUID('<shankari_uuid>')}
        
        In [10]: edb.get_uuid_db().find({"user_id": UUID('<shankari_uuid>')}).count()
        Out[10]: 0
        
        In [11]: edb.get_timeseries_db().find({"user_id": UUID('<shankari_uuid>')}).count()
        Out[11]: 735156
        
        In [12]: edb.get_analysis_timeseries_db().find({"user_id": UUID('<shankari_uuid>')}).count()
        Out[12]: 166571
        
        In [23]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
        Out[23]:
        [{'_id': ObjectId('5614ee7d88f663584fa03131'),
          'data': {'_id': ObjectId('5614ee7d88f663584fa03131'),
           'exit_fmt_time': '2015-08-21T18:06:16.905000-07:00',
           'exit_ts': 1440205576.905,
           'location': {'coordinates': [-122.4426899, 37.7280596], 'type': 'Point'},
           'starting_trip': ObjectId('5614ee7d88f663584fa03132'),
           'user_id': UUID('0763de67-f61e-3f5d-90e7-518e69793954')},
          'metadata': {'key': 'segmentation/raw_place',
           'platform': 'server',
           'time_zone': 'America/Los_Angeles',
           'write_fmt_time': '2016-04-25T06:32:01.332099-07:00',
           'write_ts': 1461591121.332099},
          'user_id': UUID('<shankari_uuid>')}]
        
        In [24]: list(edb.get_analysis_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
        Out[24]:
        [{'_id': ObjectId('57e962fa88f66347503059e7'),
          'data': {'exit_fmt_time': '2015-07-13T15:25:56.852000-07:00',
           'exit_ts': 1436826356.852,
           'location': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
           'source': 'DwellSegmentationTimeFilter',
           'starting_trip': ObjectId('57e962fa88f66347503059e8')},
          'metadata': {'key': 'segmentation/raw_place',
           'platform': 'server',
           'time_zone': 'America/Los_Angeles',
           'write_fmt_time': '2016-09-26T11:03:38.105793-07:00',
           'write_ts': 1474913018.105793},
          'user_id': UUID('<shankari_uuid>')}]
        
        In [25]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", -1).limit(1))
        Out[25]:
        [{'_id': ObjectId('581becd188f6630386d15ac5'),
          'data': {'battery_level_pct': 98.0, 'ts': 1478225037729.0},
          'metadata': {'key': 'background/battery',
           'platform': 'server',
           'time_zone': 'America/Los_Angeles',
           'write_ts': 1478225037729.0},
          'user_id': UUID('<shankari_uuid>')}]
        
        In [26]: list(edb.get_analysis_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", -1).limit(1))
        Out[26]:
        [{'_id': ObjectId('59eed4b188f66334694bfcb2'),
          'data': {'altitude': 1.0,
           'distance': 2.8965257248272045,
           'fmt_time': '2017-10-23T17:05:41-07:00',
           'heading': 31.494464698609224,
           'idx': 27,
           'latitude': 37.3909994,
           'loc': {'coordinates': [-122.0864596, 37.3909994], 'type': 'Point'},
           'longitude': -122.0864596,
           'mode': 0,
           'section': ObjectId('59eed4b188f66334694bfc96'),
           'speed': 0.12344985580418566,
           'ts': 1508803541.0},
          'metadata': {'key': 'analysis/recreated_location',
           'platform': 'server',
           'time_zone': 'America/Los_Angeles',
           'write_fmt_time': '2017-10-23T22:50:41.916285-07:00',
           'write_ts': 1508824241.916285},
          'user_id': UUID('<shankari_uuid>')}]
        
    4. Tom

       ```
       In [16]: tom_entry = edb.get_uuid_db().find_one({"user_email": "[email protected]"})
      
       In [17]: tom_entry
       Out[17]:
       {'_id': ObjectId('543c8a2239e59673fd9fb9dc'),
        'update_ts': datetime.datetime(2017, 5, 6, 21, 29, 3, 780000),
        'user_email': '<tom_email>',
        'uuid': UUID('<tom_uuid>')}
      
       In [18]: edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).count()
       Out[18]: 592479
      
       In [19]: edb.get_analysis_timeseries_db().find({"user_id": tom_entry["uuid"]}).count()
       Out[19]: 139926
      
       In [27]: list(edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", 1).limit(1))
       Out[27]:
       [{'_id': ObjectId('564ed10488f66311474836bd'),
         'data': {'_id': ObjectId('564ed10488f66311474836bd'),
          'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
          'exit_ts': 1437465390.414,
          'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},
          'starting_trip': ObjectId('564ed10488f66311474836be'),
          'user_id': UUID('b0d937d0-70ef-305e-9563-440369012b39')},
         'metadata': {'key': 'segmentation/raw_place',
          'platform': 'server',
          'time_zone': 'America/Los_Angeles',
          'write_fmt_time': '2016-04-25T06:32:01.391767-07:00',
          'write_ts': 1461591121.391767},
         'user_id': UUID('<tom_uuid>')}]
      
       In [41]: list(edb.get_analysis_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", 1).limit(1))
       Out[41]:
       [{'_id': ObjectId('57e9906b88f66347503240e7'),
         'data': {'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
          'exit_ts': 1437465390.414,
          'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},
          'source': 'DwellSegmentationTimeFilter',
          'starting_trip': ObjectId('57e9906b88f66347503240e8')},
         'metadata': {'key': 'segmentation/raw_place',
          'platform': 'server',
          'time_zone': 'America/Los_Angeles',
          'write_fmt_time': '2016-09-26T14:17:31.475597-07:00',
          'write_ts': 1474924651.475597},
         'user_id': UUID('<tom_uuid>')}]
      
      
       In [29]: list(edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", -1).limit(1))
       Out[29]:
       [{'_id': ObjectId('581b4c5588f6630386d0ea56'),
         'data': {'battery_level_pct': 85.0, 'ts': 1478184016437.0},
         'metadata': {'key': 'background/battery',
          'platform': 'server',
          'time_zone': 'America/Los_Angeles',
          'write_ts': 1478184016437.0},
         'user_id': UUID('<tom_uuid>')}]
      
       In [30]: list(edb.get_analysis_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", -1).limit(1))
       Out[30]:
       [{'_id': ObjectId('59a23ae188f6632233d07f2f'),
         'data': {'altitude': 0.0,
          'distance': 1.0747850121465141,
          'fmt_time': '2017-08-25T16:21:37.192000-07:00',
          'heading': 134.91688221066724,
          'idx': 81,
          'latitude': 37.3910415,
          'loc': {'coordinates': [-122.0864408, 37.3910415], 'type': 'Point'},
          'longitude': -122.0864408,
          'mode': 4,
          'section': ObjectId('59a23ae188f6632233d07edd'),
          'speed': 0.10598412720427686,
          'ts': 1503703297.192},
         'metadata': {'key': 'analysis/recreated_location',
          'platform': 'server',
          'time_zone': 'America/Los_Angeles',
          'write_fmt_time': '2017-08-26T20:22:09.988126-07:00',
          'write_ts': 1503804129.988126},
         'user_id': UUID('<tom_uuid>')}]
       ```
      

I can see a couple of things that we should clean up: one that is easy, and another that should be done later.

  1. First, we need to adjust the timestamps on the stats objects again. We fixed all the client entries but not the battery entries.

    In [33]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["background/battery"]},"data.ts": {"$gt": now.timestamp}}).count()
    Out[33]: 625387
    
    for e in edb.get_timeseries_db().find({"metadata.key": {"$in": ["background/battery"]}, "data.ts": {"$gt": now.timestamp}}):
             edb.get_timeseries_db().update({"_id": e["_id"]},
                     {"$set": {"data.ts": float(e["data"]["ts"])/1000,
                       "metadata.write_ts": float(e["metadata"]["write_ts"])/1000}})
    
    In [35]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["background/battery"]},"data.ts": {"$gt": now.timestamp}}).count()
    Out[35]: 0
    
  2. Now the most recent entries in the timeseries should be fixed.

    In [36]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", -1).limit(1))
    Out[36]:
    [{'_id': ObjectId('5a4aaf3088f6636e03141d7a'),
      'data': {'name': 'POST_/usercache/get',
       'reading': 0.45238590240478516,
       'ts': 1514843952.94657},
      'metadata': {'key': 'stats/server_api_time',
       'platform': 'server',
       'time_zone': 'America/Los_Angeles',
    'write_fmt_time': '2018-01-01T13:59:12.947125-08:00',
    'write_ts': 1514843952.947125},
    'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]
    
    In [37]: list(edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", -1).limit(1))
    Out[37]:
    [{'_id': ObjectId('5a499a4688f6636e031416a9'),
      'data': {'name': 'POST_/datastreams/find_entries/timestamp',
       'reading': 0.06373810768127441,
       'ts': 1514773062.554465},
      'metadata': {'key': 'stats/server_api_time',
       'platform': 'server',
       'time_zone': 'America/Los_Angeles',
       'write_fmt_time': '2017-12-31T18:17:42.554857-08:00',
       'write_ts': 1514773062.554857},
    'user_id': UUID('<tom_uuid>')}]
    
  3. Second, we need to move all the segmentation/raw entries from the timeseries
    to the analysis_timeseries. They are not doing anything bad there - we assume
    trips and sections are only in the analysis database, and read them only from
    there.

    "segmentation/raw_trip": self.analysis_timeseries_db,
    

However, it seems like a bad idea to have this stale data sitting around. Are
these entries actually duplicated in the analysis database? If so, can we just
delete them from the timeseries_db?

  1. Are they duplicated? For Tom, yes; for me, pretty close.

    • For me: the times are different, although only ~ 1 month apart. The locations are different.
    [{'_id': ObjectId('5614ee7d88f663584fa03131'),
      'data': {'_id': ObjectId('5614ee7d88f663584fa03131'),
       'exit_fmt_time': '2015-08-21T18:06:16.905000-07:00',
       'location': {'coordinates': [-122.4426899, 37.7280596], 'type': 'Point'},
    
    [{'_id': ObjectId('57e962fa88f66347503059e7'),
      'data': {'exit_fmt_time': '2015-07-13T15:25:56.852000-07:00',
       'location': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
    
    • For Tom: only the _ids are different. Everything else is identical.
    [{'_id': ObjectId('564ed10488f66311474836bd'),
      'data': {'_id': ObjectId('564ed10488f66311474836bd'),
       'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
       'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},
    
    [{'_id': ObjectId('57e9906b88f66347503240e7'),
      'data': {'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
       'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},
    
  2. How many of them are there?

    In [50]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).count()
    Out[50]: 29023
    
  3. Is there really overlap?

    • First timeseries entry

      In [66]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", 1), ("data.exit_ts", 1), ("data.start_ts", 1), ("data.end_ts", 1)]).limit(1))
      Out[66]:
      [{'_id': ObjectId('5674839788f66340b2fb12b9'),
        'data': {'_id': ObjectId('5674839788f66340b2fb12b9'),
         'duration': 269.63700008392334,
         'end_fmt_time': '2015-07-13T15:30:26.489000-07:00',
         'end_loc': {'coordinates': [-122.0824345, 37.3790636], 'type': 'Point'},
         'end_stop': ObjectId('5674839788f66340b2fb12bb'),
         'end_ts': 1436826626.489,
         'sensed_mode': 0,
         'source': 'SmoothedHighConfidenceMotion',
         'start_fmt_time': '2015-07-13T15:25:56.852000-07:00',
         'start_loc': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
         'start_ts': 1436826356.852,
         'trip_id': ObjectId('5674838188f66340b2fb0c9c'),
         'user_id': UUID('0763de67-f61e-3f5d-90e7-518e69793954')},
        'metadata': {'key': 'segmentation/raw_section',
         'platform': 'server',
         'time_zone': 'America/Los_Angeles',
         'write_fmt_time': '2016-04-25T06:34:22.352027-07:00',
         'write_ts': 1461591262.352027},
        'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]
      
    • Last timeseries entry

      In [65]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", -1), ("data.exit_ts", -1), ("data.start_ts", -1), ("data.end_ts", -1)]).limit(1))
      Out[65]:
      [{'_id': ObjectId('571dfacb88f66333657dafb5'),
        'data': {'_id': ObjectId('571dfacb88f66333657dafb5'),
         'ending_trip': ObjectId('571dfacb88f66333657dafb4'),
         'enter_fmt_time': '2016-04-25T01:54:55.062190-07:00',
         'enter_ts': 1461574495.06219,
         'location': {'coordinates': [-122.2528321669644, 37.86827700681786],
          'type': 'Point'},
         'source': 'DwellSegmentationDistFilter',
         'user_id': UUID('788f46af-9e6d-300b-93e1-981ba9b3390b')},
        'metadata': {'key': 'segmentation/raw_place',
         'platform': 'server',
         'time_zone': 'America/Los_Angeles',
         'write_fmt_time': '2016-04-25T06:32:12.557781-07:00',
         'write_ts': 1461591132.557781},
        'user_id': UUID('43f9361e-1cb6-4026-99ba-458be357d245')}]
      
      
    • First analysis entry

      In [63]: list(edb.get_analysis_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", 1), ("data.exit_ts", 1), ("data.start_ts", 1), ("data.end_ts", 1)]).limit(1))
      Out[63]:
      [{'_id': ObjectId('57e9648a88f6634750306dc5'),
        'data': {'duration': 269.63700008392334,
         'end_fmt_time': '2015-07-13T15:30:26.489000-07:00',
         'end_loc': {'coordinates': [-122.0824345, 37.3790636], 'type': 'Point'},
         'end_stop': ObjectId('57e9648a88f6634750306dc7'),
         'end_ts': 1436826626.489,
         'sensed_mode': 0,
         'source': 'SmoothedHighConfidenceMotion',
         'start_fmt_time': '2015-07-13T15:25:56.852000-07:00',
         'start_loc': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
         'start_ts': 1436826356.852,
         'trip_id': ObjectId('57e962fa88f66347503059e8')},
        'metadata': {'key': 'segmentation/raw_section',
         'platform': 'server',
         'time_zone': 'America/Los_Angeles',
         'write_fmt_time': '2016-09-26T11:10:18.444634-07:00',
         'write_ts': 1474913418.444634},
        'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]
      
    • Last analysis entry

      In [64]: list(edb.get_analysis_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", -1), ("data.exit_ts", -1), ("data.start_ts", -1), ("data.end_ts", -1)]).limit(1))
      Out[64]:
      [{'_id': ObjectId('59db353088f6636450bb268e'),
        'data': {'duration': -1490206.003000021,
         'ending_trip': ObjectId('59db353088f6636450bb268d'),
         'enter_fmt_time': '2017-10-25T22:36:49.223000-07:00',
         'enter_ts': 1508996209.223,
         'exit_fmt_time': '2017-10-08T16:40:03.220000-07:00',
         'exit_ts': 1507506003.22,
         'location': {'coordinates': [-122.2579113, 37.873973], 'type': 'Point'},
         'source': 'DwellSegmentationTimeFilter',
         'starting_trip': ObjectId('59db353088f6636450bb268f')},
        'metadata': {'key': 'segmentation/raw_place',
         'platform': 'server',
         'time_zone': 'America/Los_Angeles',
         'write_fmt_time': '2017-10-09T01:37:04.036013-07:00',
         'write_ts': 1507538224.036013},
        'user_id': UUID('06f82876-4090-482f-a7be-91345df47bb2')}]
      

So it looks like we ran the pipeline in April 2016, back when we were still
storing analysis results in the timeseries. Then we split the collections and
re-ran the pipeline, but did not delete the old entries. So the timeseries
entries span 2015-07-13T15:25:56.852000-07:00 to 2016-04-25T01:54:55.062190-07:00,
while the analysis timeseries entries span
2015-07-13T15:25:56.852000-07:00 to 2017-10-08T16:40:03.220000-07:00.
There is a clear overlap, so we can delete the entries from the timeseries.

Delete now or delete later?

Let's just delete now, while we still have backups sitting around.

In [67]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).count()
Out[67]: 29023

In [68]:  edb.get_timeseries_db().find({"metadata.key": {"$in": ["analysis/cleaned_place", "analysis/cleaned_trip", "analysis/cleaned_section", "analysis/cleaned_stop", "analysis/cleaned_untracked"]}}).count()
Out[68]: 0

In [69]:  edb.get_analysis_timeseries_db().find({"metadata.key": {"$in": ["analysis/cleaned_place", "analysis/cleaned_trip", "analysis/cleaned_section", "analysis/cleaned_stop", "analysis/cleaned_untracked"]}}).count()
Out[69]: 505455

It looks like the first run predates cleaned trips, so we only have to delete the raw_* entries.

In [76]: edb.get_timeseries_db().delete_many({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).raw_result
Out[76]: {'n': 29023, 'ok': 1.0}

In [77]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).count()
Out[77]: 0

Ok, so now the oldest entries in the timeseries should be different.

In [78]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
Out[78]:
[{'_id': ObjectId('564f7d6388f66343e476e832'),
  'data': {'deleted_points': [],
   'filtering_algo': 'SmoothZigzag',
   'outlier_algo': 'BoxplotOutlier',
   'section': ObjectId('564f7d3888f66343e476e518')},
  'metadata': {'key': 'analysis/smoothing',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2015-11-20T12:06:59.574516-08:00',
   'write_local_dt': datetime.datetime(2015, 11, 20, 20, 6, 59, 574000),
   'write_ts': 1448050019.574516},
  'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

Oops. The oldest entry is still a generated result, this time analysis/smoothing. Let's query and delete these as well.

In [79]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["analysis/smoothing"]}}).count()
Out[79]: 9033

In [82]: edb.get_timeseries_db().delete_many({"metadata.key": {"$in": ["analysis/smoothing"]}}).raw_result
Out[82]: {'n': 9033, 'ok': 1.0}

In [83]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["analysis/smoothing"]}}).count()
Out[83]: 0

Ok, so now the oldest entries in the timeseries should be different. Argh, we missed segmentation/raw_trip in the first query.

In [85]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_trip"]}}).count()
Out[85]: 7815

In [86]: edb.get_timeseries_db().delete_many({"metadata.key": {"$in": ["segmentation/raw_trip"]}}).raw_result
Out[86]: {'n': 7815, 'ok': 1.0}

In [87]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_trip"]}}).count()
Out[87]: 0
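
Rather than discovering leftover pipeline outputs one key at a time, a quicker one-off check is to list the distinct metadata keys still in the raw timeseries and flag anything with a pipeline-produced prefix. A minimal sketch (not code that exists in the repo; distinct is acceptable for a one-off admin check):

```
# List metadata keys still present in the raw timeseries and flag any that look
# like pipeline outputs (segmentation/* or analysis/*).
remaining_keys = edb.get_timeseries_db().distinct("metadata.key")
leftover = sorted(k for k in remaining_keys
                  if k.startswith("segmentation/") or k.startswith("analysis/"))
print(leftover)  # should be [] once the cleanup is complete
```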

Ok, so do we now have the correct oldest entries in the timeseries?
Yes, finally, although the oldest entry doesn't really have a ts.

In [88]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
Out[88]:
[{'_id': ObjectId('579fb85988f66357dde496ae'),
  'data': {'approval_date': '2016-07-14',
   'category': 'emSensorDataCollectionProtocol',
   'protocol_id': '2014-04-6267'},
  'metadata': {'key': 'config/consent',
   'platform': 'android',
   'read_ts': 0,
   'time_zone': 'Pacific/Honolulu',
   'type': 'rw-document',
   'write_fmt_time': '2016-08-01T06:02:09.845000-10:00',
   'write_ts': 1470067329.845},
  'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

In [89]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"], "data.ts": {"$exists": True}}).sort("data.ts", 1).limit(1))
Out[89]:
[{'_id': ObjectId('59822c3fcb17471ac0667b86'),
  'data': {'accuracy': 1086.116,
   'altitude': 0,
   'elapsedRealtimeNanos': 7479231065135,
   'filter': 'time',
   'fmt_time': '1969-12-31T16:00:00-08:00',
   'heading': 0,
   'latitude': -23.56214940547943,
   'loc': {'coordinates': [-46.72179579734802, -23.56214940547943],
    'type': 'Point'},
   'longitude': -46.72179579734802,
   'sensed_speed': 0,
   'ts': 0},
  'metadata': {'key': 'background/location',
   'platform': 'android',
   'read_ts': 0,
   'time_zone': 'America/Los_Angeles',
   'type': 'sensor-data',
   'write_fmt_time': '2017-08-02T12:01:48.909000-07:00',
   'write_ts': 1501700508.909},
  'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

@shankari
Copy link
Contributor Author

shankari commented Jan 4, 2018

Ok, so now that we believe that the database is fine, we can run the pipeline again for the first time in forever.

  1. unmount the transfer drive now that its job is done
    $ sudo umount /transfer
    $
    
  2. detach volume
  3. while that is running, turn on the pipeline after multiple months. After
    this runs successfully at least once, we can put it into a cronjob.

Note that we now have entries in the timeseries that are client stats only and
have no uuid entry or any real data. These are zombie entries from before the
uuid change. Are we ignoring these correctly?

A quick check shows that we just read from the UUID database.

def get_all_uuids():
    all_uuids = [e["uuid"] for e in edb.get_uuid_db().find()]
    return all_uuids

There are other methods that still use distinct, notably aggregate_timeseries.get_distinct_users and builtin_timeseries.get_uuid_list, but they don't seem to be used anywhere in the code. Let's remove them as part of cleanup so people are not tempted to use them. Alternatively, we can move zombie entries into a separate timeseries where they won't pollute anything.
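
As a one-off admin check (a sketch; distinct is acceptable here even though we want to remove it from the production code paths), we can also confirm how many zombie user ids are floating around by diffing the user ids in the two collections:

```
# "Zombie" users: user_ids that appear in the timeseries but have no entry in the
# UUID db, and are therefore skipped by get_all_uuids() above.
registered = {e["uuid"] for e in edb.get_uuid_db().find()}
ts_users = set(edb.get_timeseries_db().distinct("user_id"))
zombies = ts_users - registered
print(len(zombies), "zombie user ids with no UUID db entry")
```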

Ok, so let's run this script!

$ ./e-mission-ipy.bash bin/intake_multiprocess.py 4 > /log/intake.stdinout.log 2>&1
$ date
Thu Jan  4 08:17:24 UTC 2018

It's taking a long time to even just get started. I wonder if we are using distinct somewhere...

Looking at the launcher logs, it is still iterating through the users and querying for the number of entries in the usercache. I don't even think we use that functionality and can probably get rid of it in the next release.
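
For reference, the slow startup is consistent with doing one usercache count per registered user before any processing begins. This is an illustrative sketch of that pattern, not the actual launcher code (edb.get_usercache_db() is the assumed accessor):

```
import logging

# Illustrative only: one count query per registered user adds up when there are
# many users, even before any per-user processing starts.
for uuid in get_all_uuids():
    pending = edb.get_usercache_db().find({"user_id": uuid}).count()
    logging.info("user %s has %d pending usercache entries", uuid, pending)
```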

Ah, so now the processes have been launched. Hopefully this first run will be done by tomorrow morning.

  1. Now check back on the volume - it has been detached
  2. Delete both volumes and the snapshot. At this point, there is no unencrypted copy of the data. Will delete other server with encrypted data once everything is done.

@shankari
Copy link
Contributor Author

shankari commented Jan 4, 2018

We should also remove the filter_accuracy step from the real server since it is only applicable for test phones/open data.

@shankari
Copy link
Contributor Author

shankari commented Jan 4, 2018

  • Habitica API is at port 3000, so I have to open it as an outgoing port
  • Performance is significantly better. If filter accuracy is removed, then my data for 3 months is processed in ~ 15 mins. We might be able to get back to running every hour.
2018-01-04T12:02:11.764406+00:00**********UUID <shankari_uuid>: moving to long term**********
2018-01-04T12:04:29.687814+00:00**********UUID <shankari_uuid>: filter accuracy if needed**********
2018-01-04T12:26:01.563703+00:00**********UUID <shankari_uuid>: segmenting into trips**********
2018-01-04T12:33:06.101569+00:00**********UUID <shankari_uuid>: segmenting into sections**********
2018-01-04T12:33:33.701075+00:00**********UUID <shankari_uuid>: smoothing sections**********
2018-01-04T12:33:54.020444+00:00**********UUID <shankari_uuid>: cleaning and resampling timeline**********
2018-01-04T12:40:12.439267+00:00**********UUID <shankari_uuid>: checking active mode trips to autocheck habits**********
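
As a rough check of the "~ 15 mins if filter accuracy is removed" estimate, the stage timestamps above can be turned into per-stage durations. A quick sketch using the values from this run:

```
from datetime import datetime

# Per-stage durations from the log timestamps above (fractional seconds dropped).
fmt = "%Y-%m-%dT%H:%M:%S"
start = datetime.strptime("2018-01-04T12:02:11", fmt)
filter_start = datetime.strptime("2018-01-04T12:04:29", fmt)
segment_start = datetime.strptime("2018-01-04T12:26:01", fmt)
end = datetime.strptime("2018-01-04T12:40:12", fmt)

total_min = (end - start).total_seconds() / 60                    # ~38 min
filter_min = (segment_start - filter_start).total_seconds() / 60  # ~22 min
print(round(total_min), round(filter_min), round(total_min - filter_min))  # 38 22 16
```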

I'm also seeing some errors with saving data, need to make a pass through the errors.

Got error None while saving entry AttrDict({'_id': ObjectId('59f11f15cb17471ac0cfc059'), 'metadata': {'write_ts': 1508533079.587702, 'plugin': 'none', 'time_zone': 'America/Montreal', 'platform': 'ios', 'key': 'statemachine/transition', 'read_ts': 0, 'type': 'message'}, 'user_id': UUID('e95bfd0b-1cfc-4ea3-af01-c41dc2fad0ed'), 'data': {'transition': None, 'ts': 1508533079.587526, 'currState': 'STATE_ONGOING_TRIP'}}) -> None

@shankari
Copy link
Contributor Author

shankari commented Jul 6, 2018

AMPLab is using too many resources, so I have to trim my consumption.
Let's finally finish cleaning up the old servers

@shankari shankari reopened this Jul 6, 2018
@shankari
Copy link
Contributor Author

shankari commented Jul 6, 2018

Copied dump_team_trajectories.py off the old server.
There shouldn't be anything else.
One caveat is that the original data is 150 GB

143G    /mnt/e-mission-primary-db/mongodb

But the new dataset is only 20GB.

/dev/mapper/xvdf  3.0T   20G  3.0T   1% /data
/dev/mapper/xvdg   25G  633M   25G   3% /journal
/dev/mapper/xvdh   10G  638M  9.4G   7% /log

So what is missing?

@shankari
Copy link
Contributor Author

shankari commented Jul 6, 2018

It doesn't appear to be the data. These are the only differences in collections between the old and new databases. There is nothing missing except for system.indexes, which should be an auto-generated collection.

[screenshot comparing the collection lists of the old and new databases, 2018-07-06]

@shankari
Copy link
Contributor Author

shankari commented Jul 6, 2018

Yup! system.indexes is now deprecated.
https://docs.mongodb.com/manual/reference/system-collections/#%3Cdatabase%3E.system.indexes

Deprecated since version 3.0: Access this data using listIndexes.

And listIndexes does have the data.

> db.Stage_timeseries.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "user_id" : "hashed"
                },
                "name" : "user_id_hashed",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "metadata.key" : "hashed"
                },
                "name" : "metadata.key_hashed",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "metadata.write_ts" : -1
                },
                "name" : "metadata.write_ts_-1",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "data.ts" : -1
                },
                "name" : "data.ts_-1",
                "ns" : "Stage_database.Stage_timeseries",
                "sparse" : true
        },
...

@shankari
Copy link
Contributor Author

shankari commented Jul 6, 2018

Shutting down the old server now. RIP! You were a faithful friend and will be missed.

@shankari
Copy link
Contributor Author

shankari commented Jul 9, 2018

In the past 4 days, the compute has increased by $50. The storage has increased by $200. We need to turn off some storage. Wah! Wah! What if I lose something important?! I guess you just have to deal with it...

@shankari
Copy link
Contributor Author

Deleted all related storage.

@shankari
Copy link
Contributor Author

Even with all the deletions, we spent ~ $50/day. This is a problem, because we will then end up spending an additional $1040 for the rest of the month, and we have already spent ~ $1500. This also means that we won't be under $1000 for next month.

Since our reserved instances already cost $507 and the m3/m4 legacy servers cost ~ $639, we have to keep our storage budget under $500 to stay at my preferred 50% of my available budget.

The storage cost is mostly going towards the provisioned IOPS storage. I don't think I actually need 3TB.

Current storage is

Filesystem        Size  Used Avail Use% Mounted on
devtmpfs           30G   84K   30G   1% /dev
tmpfs              30G     0   30G   0% /dev/shm
/dev/xvda1        7.8G  2.3G  5.5G  30% /
/dev/mapper/xvdf  3.0T   21G  3.0T   1% /data
/dev/mapper/xvdg   25G  633M   25G   3% /journal
/dev/mapper/xvdh   10G  638M  9.4G   7% /log
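
Before shrinking anything, a quick sanity check is to ask mongod how much space it actually uses. A sketch, assuming the default localhost connection and the Stage_database name from the index namespaces above:

```
import pymongo

# Confirm that the actual data footprint fits comfortably in a 200G volume.
client = pymongo.MongoClient("localhost")
stats = client.Stage_database.command("dbStats")
print("dataSize: %.1f GiB, storageSize: %.1f GiB" %
      (stats["dataSize"] / 2**30, stats["storageSize"] / 2**30))
```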

We should be able to drop to:

  • /data: 200G

If we still need to reduce after that, we can change to:

  • /journal: 10G
  • /log: 5G

Let's see how easy it is to resize EBS volumes

@shankari
Copy link
Contributor Author

shankari commented Jul 10, 2018

Let's see how easy it is to resize EBS volumes

Not too hard, you just have to copy the data around.
https://matt.berther.io/2015/02/03/how-to-resize-aws-ec2-ebs-volumes/
Let's turn off everything and start copying the data tomorrow morning

@shankari
Copy link
Contributor Author

shankari commented Jul 11, 2018

This was a bit trickier than one would expect because xfs does not support resize2fs and, in fact, does not support shrinking the filesystem at all. So we had to follow the instructions to use a temporary instance, as in the link above.

Note that our data is also on an encrypted filesystem, so our steps were:

  • unmount + turn encryption off for disk
  • detach
  • create new volume
  • create new instance
  • attach both volumes
  • mount old_data
  • crypt setup new volume
  • mount new data
  • copy data over using xfsdump and xfsrestore. Note that xfsdump does not support a / at the end of the mounted filename
  • chown new_data to the same uid/gid as old_data

At this point, the only diff between them is

--- /tmp/old_data_list  2018-07-11 10:07:36.038063450 +0000
+++ /tmp/new_data_list  2018-07-11 10:07:24.266143934 +0000
@@ -140,7 +140,7 @@
 -rw-r--r-- 1  498  497 4.0K Jul 11 09:23 index-94-2297001609533616747.wt
 -rw-r--r-- 1  498  497 4.0K Jul 11 09:23 index-96-2297001609533616747.wt
 -rw-r--r-- 1  498  497 4.0K Jul 11 09:23 index-98-2297001609533616747.wt
-lrwxrwxrwx 1 root root    8 Jan  1  2018 journal -> /journal
+lrwxrwxrwx 1 498 497    8 Jul 11 09:53 journal -> /journal
 -rw-r--r-- 1  498  497  48K Jul 11 09:23 _mdb_catalog.wt
 -rw-r--r-- 1  498  497    0 Jul 11 09:23 mongod.lock
 -rw-r--r-- 1  498  497  36K Jul 11 09:23 sizeStorer.wt

which makes sense

So now it's time to reverse the steps and attach new_data back to the server

@shankari
Copy link
Contributor Author

Reversed steps, restarted server. No errors so far.
Deleting old disk and migration instance.

@shankari
Copy link
Contributor Author

Done. Closing this issue for now.

@shankari
Copy link
Contributor Author

Burn rate is now $33/day
1419.11 - 1385.76 = 33.35

Should go down after we turn off air quality server
but 33 * 15 ~ $500
so we are on track for $2000 for the month, not $1500 as originally planned

@shankari
Copy link
Contributor Author

Burn rate is now roughly $13/day (1497 - 1419 = 78 over 6 days = $13/day)
So this month should be $1497 + $156 = $1653
Next month should be 524 (reserved instances) + 13 * 30 = 524 + 390 = 914 (< $1000)

valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
This fixes
https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352197949

Also add a new test case that checks for this.
Also fix a small bug in the extraction script
valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
… in the query

This fixes
https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352206464

Basically, if two sections are back to back, then the last point of the first
section will overlap with the first point of the second section. So a query
based on the start and end time of the first section will return the first
point of the second section as well, which causes a mismatch between the
re-retrieved and stored speeds and distances.

We detect and drop the last point in this case.
valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
This fixes
https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352219808

dealing with using pymongo in a multi-process environment

```
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
  "MongoClient opened before fork. Create MongoClient "
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
  "MongoClient opened before fork. Create MongoClient "
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
  "MongoClient opened before fork. Create MongoClient "
```

Spawning instead of forking ensures that the subprocesses don't inherit the MongoClient object from the parent and instead create their own.

```
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
debug not configured, falling back to sample, default configuration
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
```
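
For context, a spawn-based launch looks roughly like the sketch below; run_intake_pipeline_for_user and the worker count are illustrative placeholders, not the actual entry points:

```
import multiprocessing as mp

def run_intake_pipeline_for_user(uuid):
    # placeholder for the real per-user intake pipeline entry point
    print("would run the intake pipeline for", uuid)

def launch_all(uuids, nworkers=4):
    # "spawn" children re-import this module, so each worker builds its own
    # MongoClient instead of inheriting the parent's pre-fork connection.
    ctx = mp.get_context("spawn")
    with ctx.Pool(nworkers) as pool:
        pool.map(run_intake_pipeline_for_user, uuids)

if __name__ == "__main__":  # required with spawn so children don't re-launch
    launch_all(["<uuid_1>", "<uuid_2>"])
```
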
valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
See https://github.com/e-mission/e-mission-server/issues/530#issuecomment-353803676
Note that I remove all entries whose section entry is not valid and which have
snuck over from elsewhere.

Regression described at https://github.com/e-mission/e-mission-server/issues/530#issuecomment-353803676 now fixed (I ran thrice in a row without failing)
valin1 referenced this issue in valin1/e-mission-server Oct 25, 2018
Although people won't see the ipv6 until they start to use it.  Note that there
are a bunch of manual steps to turn on IPv6 for this setup.  This change merely
automates the tedious work of setting up the routing tables and security
groups.
https://github.com/e-mission/e-mission-server/issues/530#issuecomment-354061649

At this point, I declare that I am done with tweaking the configuration and
will use the configuration deployed from this template (including
75d19de,
7a32bb6...) as the setup for the
default/reference e-mission server.
@shankari shankari transferred this issue from e-mission/e-mission-server Feb 11, 2019