Setting up: CockroachDB
The federal spending data was migrated from PostgreSQL to CockroachDB, a distributed SQL ("NewSQL") database built on a transactional, strongly-consistent key-value store. Compared with PostgreSQL, CockroachDB stands out for its built-in horizontal scalability, at the cost of higher latency on complex queries. In our case the workload consists of simple filtering and summing operations on a large (and growing) database, so the tradeoff is acceptable.
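For a sense of what "simple filtering and summing" means here, a query of the following shape is representative. The table and column names are hypothetical placeholders, not the actual FedSpend schema:

```shell
# Hypothetical example of the filter-and-aggregate workload described above;
# the table and column names are placeholders, not the real schema.
cockroach sql --certs-dir=certs --host=<load balancer IP> \
  -e "SELECT recipient_state, SUM(award_amount)
      FROM awards
      WHERE fiscal_year = 2019
      GROUP BY recipient_state;"
```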
The following cluster deployment largely follows the guidance from Cockroach Labs, with adjustments for this particular architecture.
- Download the CockroachDB archive and extract the binary.
curl https://binaries.cockroachdb.com/cockroach-v19.2.2.darwin-10.9-amd64.tgz | \
tar -xz
- Copy the binary into `PATH` to execute `cockroach` commands from any shell.
cp -i cockroach-v19.2.2.darwin-10.9-amd64/cockroach /usr/local/bin
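To confirm the binary resolves from the shell and the expected version is installed:

```shell
# Confirm the binary is on the PATH and report its version
which cockroach
cockroach version
```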
Launch 4 EC2 instances, each:
- r5a.xlarge (4 vCPUs and 32GB memory)
- running Ubuntu
- with 750GB root volume
- with an Elastic IP
Note that writes are also CPU-intensive, so a compute-optimized instance may be worth considering.
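The launch can also be scripted with the AWS CLI. In the sketch below, the AMI ID, key pair, subnet, and security group values are placeholders for your own environment:

```shell
# Sketch of launching the four nodes with the AWS CLI; the AMI ID, key pair,
# subnet, and security group values are placeholders for your own environment.
aws ec2 run-instances \
  --image-id <ubuntu AMI ID> \
  --instance-type r5a.xlarge \
  --count 4 \
  --key-name <your key pair> \
  --subnet-id <subnet ID> \
  --security-group-ids <sg-xxx> \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=750}'
```

Elastic IPs are associated separately (`aws ec2 allocate-address`, then `aws ec2 associate-address`).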
Create a security group for the CockroachDB cluster so that the nodes can communicate with each other.

Inbound rules:

| Type | Protocol | Port Range | Source | Description |
|---|---|---|---|---|
| Custom TCP Rule | TCP | 8080 | < your IP > | access CockroachDB Admin UI |
| Custom TCP Rule | TCP | 8080 | IP range of VPC in CIDR notation | load balancer health check communication |
| Custom TCP Rule | TCP | 26257 | private IP of instance running FedSpend app UI | communication with the application UI |
| Custom TCP Rule | TCP | 26257 | sg-xxx (security group ID) | for node X to talk to node Y |
| Custom TCP Rule | TCP | 7077 | < your IP > | cockroachDB port |
| SSH | TCP | 22 | < your IP > | SSH into machine |

Outbound rules:

| Type | Protocol | Port Range | Destination | Description |
|---|---|---|---|---|
| All traffic | All | All | sg-xxx (security group ID) | for node X to receive data from node Y |
| All traffic | All | All | 0.0.0.0/0 | downloads from external sites |
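The main inbound rules can also be created with the AWS CLI; the group ID, IP, and CIDR values below are placeholders:

```shell
# Placeholder values throughout; substitute your own group ID, IPs, and CIDRs.
aws ec2 create-security-group \
  --group-name cockroachdb-cluster \
  --description "CockroachDB cluster" \
  --vpc-id <vpc ID>

# Admin UI from your IP, and health checks from within the VPC
aws ec2 authorize-security-group-ingress --group-id <sg-xxx> \
  --protocol tcp --port 8080 --cidr <your IP>/32
aws ec2 authorize-security-group-ingress --group-id <sg-xxx> \
  --protocol tcp --port 8080 --cidr <VPC CIDR>

# SQL traffic from the app, and between nodes (source = the group itself)
aws ec2 authorize-security-group-ingress --group-id <sg-xxx> \
  --protocol tcp --port 26257 --cidr <app private IP>/32
aws ec2 authorize-security-group-ingress --group-id <sg-xxx> \
  --protocol tcp --port 26257 --source-group <sg-xxx>

# SSH from your IP
aws ec2 authorize-security-group-ingress --group-id <sg-xxx> \
  --protocol tcp --port 22 --cidr <your IP>/32
```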
It's necessary for all nodes to be working from the same clock. Amazon Time Sync is a service that facilitates this and needs to be set up on all nodes.
sudo apt update
sudo apt install chrony
# here we'll open a file...
sudo nano /etc/chrony/chrony.conf
# ...and paste the following text above any pool or server statements
# -> server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4
sudo /etc/init.d/chrony restart
# verify that chrony is using the 169.254.169.123 IP address
# to synchronize the time
chronyc sources -v
# verify the time synchronization metrics that are reported by chrony
chronyc tracking
We'll be using the AWS load balancing service to distribute client traffic across nodes.
- Open the EC2 console and select Load balancers.
- Click the button Create Load Balancer on the left.
- Create a Network Load Balancer.
- Select a name for the balancer, and add a TCP listener on port 26257.
- Under the 'Availability Zones' section, select the same VPC and subnet as the one containing the instances.
- In Step 3: Configure Routing, choose any name for the target group and set it to receive traffic on port 26257. The health check protocol should be HTTP on port 8080 with path `/health?ready=1`.
- In Step 4: Register Targets, register the instances with the target group just created, specifying port 26257.
Make a note of the private IP of the load balancer.
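The same setup can be sketched with the AWS CLI's `elbv2` commands; all names, IDs, and ARNs below are placeholders:

```shell
# Sketch of the NLB setup with the AWS CLI; all IDs and names are placeholders.
aws elbv2 create-load-balancer \
  --name cockroachdb-nlb \
  --type network \
  --scheme internal \
  --subnets <subnet ID>

# Target group on 26257, using the CockroachDB readiness endpoint
# as the HTTP health check
aws elbv2 create-target-group \
  --name cockroachdb-targets \
  --protocol TCP --port 26257 \
  --vpc-id <vpc ID> \
  --health-check-protocol HTTP \
  --health-check-port 8080 \
  --health-check-path "/health?ready=1"

# Register the four instances and wire a TCP listener to the target group
aws elbv2 register-targets --target-group-arn <target group ARN> \
  --targets Id=<instance ID 1> Id=<instance ID 2> Id=<instance ID 3> Id=<instance ID 4>
aws elbv2 create-listener --load-balancer-arn <load balancer ARN> \
  --protocol TCP --port 26257 \
  --default-actions Type=forward,TargetGroupArn=<target group ARN>
```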
Certificates are generated locally, then the files are transferred to each instance. Below is an outline; the full certificate process is available elsewhere.
- Generate the CA certificate and key.
- Create the certificate and key for the first node.
- Upload the CA certificate, node certificate, key to the first node.
- Delete the local copy of the node certificate and key.
- Repeat steps 2-4 for all other nodes.
- Create a client certificate and key for the `root` user. Transfer them to the machine you'll run workloads from.
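The outline above maps onto the `cockroach cert` subcommands. A sketch, with the `my-safe-directory` key location following the Cockroach Labs convention:

```shell
# Generate the CA certificate and key (step 1)
mkdir certs my-safe-directory
cockroach cert create-ca \
  --certs-dir=certs --ca-key=my-safe-directory/ca.key

# Certificate and key for a node, listing the addresses clients will use
# to reach it (steps 2-4; repeat per node, deleting local copies after upload)
cockroach cert create-node \
  <node private IP> <node public IP> <load balancer IP> localhost 127.0.0.1 \
  --certs-dir=certs --ca-key=my-safe-directory/ca.key

# Client certificate and key for the root user (final step)
cockroach cert create-client root \
  --certs-dir=certs --ca-key=my-safe-directory/ca.key
```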
Below is an outline, the full startup process is available elsewhere. For each node...
- SSH into the node and install CockroachDB. Copy the binary into the `PATH`.
wget -qO- https://binaries.cockroachdb.com/cockroach-v19.2.2.linux-amd64.tgz | tar xvz
sudo cp -i cockroach-v19.2.2.linux-amd64/cockroach /usr/local/bin/
- Run the `cockroach start` command to prime the node to start.
- Run the `cockroach init` command from within any node to complete the startup process and have them join together as a cluster.
Note: although the guidance says to run this command locally, that did not work for me!
cockroach init --certs-dir=certs --host=<private IP of node>
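For reference, the `cockroach start` invocation run on each node before `init` typically looks like the following (per the Cockroach Labs secure-cluster guidance); the IPs are placeholders:

```shell
# Run on each node before `cockroach init`; all IPs are placeholders.
cockroach start \
  --certs-dir=certs \
  --advertise-addr=<this node's private IP> \
  --join=<node1 private IP>,<node2 private IP>,<node3 private IP> \
  --cache=.25 \
  --max-sql-memory=.25 \
  --background
```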
For secure clusters, only authorized users can access and view the Admin UI. The Admin UI can be accessed from any node in the cluster, via < node public IP >:8080.
- Launch the CockroachDB SQL client from within the node with the `root` user certificates.
cockroach sql --certs-dir=certs --host=<load balancer IP>
- Create a username and password for an account that will access the Admin UI. (Note that `admin` is a reserved role name in v19.2, so pick a different username.)
CREATE USER <username> WITH PASSWORD '<your-password>';
- Open the Admin UI with a web browser and log in!
📬 Create an issue if you have one.
(📝 instructions accurate as of Feb 2020)