Skip to content

Commit

Permalink
Integrate docs from ooni/docs into devops repo (#133)
Browse files Browse the repository at this point in the history
Integrate legacy docs into main docs to work on updating it
  • Loading branch information
hellais authored Jan 6, 2025
1 parent b621087 commit b757ecd
Show file tree
Hide file tree
Showing 12 changed files with 3,079 additions and 175 deletions.
2 changes: 0 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,5 @@
# OONI Devops

This document outlines some of the best practices we follow when developing and
deploying OONI services.

## Infrastructure Tiers

Expand Down
30 changes: 30 additions & 0 deletions docs/DebianPackages.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# Debian packages

**NOTE** The direction we are going with the new backend is that of dropping debian packaging of all backend API components and move to a dockerized deployment approach.

This section lists the Debian packages used to deploy backend
components. They are built by [GitHub CI workflows](#github-ci-workflows) 💡
and deployed using [The deployer tool](#the-deployer-tool) 🔧. See
[Debian package build and publish](#debian-package-build-and-publish) 💡.


#### ooni-api package
Debian package for the [API](#api) 


#### fastpath package
Debian package for the [Fastpath](#fastpath) 


#### detector package
Debian package for the
[Social media blocking event detector](#social-media-blocking-event-detector) 


#### analysis package
The `analysis` Debian package contains various tools and runs various of
systemd timers, see [Systemd timers](#systemd-timers) 💡.


#### Analysis deployment
See [Backend component deployment](#backend-component-deployment) 📒
141 changes: 141 additions & 0 deletions docs/DeprecatedDocs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
## Test helper rotation runbook
This runbook provides hints to troubleshoot the rotation of test
helpers. In this scenario test helpers are not being rotated as expected
and their TLS certificates might be at risk of expiring.

Steps:

1. Review [Test helpers](#comp:test_helpers), [Test helper rotation](#comp:test_helper_rotation) and [Test helpers notebook](#test-helpers-notebook) 📔

2. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊.
Look at different timespans:

a. The uptime of the test helpers should be staggered by a week
depending on [Test helper rotation](#test-helper-rotation) ⚙.

3. A summary of the live and last rotated test helper can be obtained
with:

```sql
SELECT rdn, dns_zone, name, region, draining_at FROM test_helper_instances ORDER BY name DESC LIMIT 8
```

4. The rotation tool can be started manually. It will always pick the
oldest host for rotation. ⚠️ Due to the propagation time of changes
in the DNS rotating many test helpers too quickly can impact the
probes.

a. Log on [backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥

b. Check the last run using
`sudo systemctl status ooni-rotation.timer`

c. Review the logs using `sudo journalctl -u ooni-rotation`

d. Run `sudo systemctl restart ooni-rotation` and monitor the logs.

5. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊
during and after the rotation.


### Test helpers failure runbook
This runbook presents a scenario where a test helper is causing probes
to fail their tests sporadically. It describes how to identify the
affected host and mitigate the issue but can also be used to investigate
other issues affecting the test helpers.

It has been chosen because such kind of incidents can impact the quality
of measurements and can be relatively difficult to troubleshoot.

For investigating glitches in the
[test helper rotation](#test-helper-rotation) ⚙ see
[test helper rotation runbook](#test-helper-rotation-runbook) 📒.

In this scenario either an alert has been sent to the
[#ooni-bots](#topic:oonibots) [Slack](#slack) 🔧 channel by
the [test helper failure rate notebook](#test-helper-failure-rate-notebook) 📔 or something
else caused the investigation.
See [Alerting](#alerting) 💡 for details.

Steps:

1. Review [Test helpers](#test-helpers) 

2. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊.
Look at different timespans:

a. The uptime of the test helpers should be staggered by a week
depending on [Test helper rotation](#test-helper-rotation) ⚙.

b. The in-flight requests and requests per second should be
consistent across hosts, except for `0.th.ooni.org`. See
[Test helpers list](#test-helpers-list) 🐝 for details.

c. Review CPU load, memory usage and run duration percentiles.

3. Review [Test helper failure rate notebook](#test-helper-failure-rate-notebook) 📔

4. For more detailed investigation there is also a [test helper notebook](https://jupyter.ooni.org/notebooks/notebooks/2023%20%5Bfederico%5D%20test%20helper%20metadata%20in%20fastpath.ipynb)

5. Log on the hosts using
`ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -Snone [email protected]`

6. Run `journalctl --since '1 hour ago'` or review logs using the query
below.

7. Run `top`, `strace`, `tcpdump` as needed.

8. The rotation tool can be started at any time to rotate away failing
test helpers. The rotation script will always pick the oldest host
for rotation. ⚠️ Due to the propagation time of changes in the DNS
rotating many test helpers too quickly can impact the probes.

a. Log on [backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥

b. Check the last run using
`sudo systemctl status ooni-rotation.timer`

c. Review the logs using `sudo journalctl -u ooni-rotation`

d. Run `sudo systemctl restart ooni-rotation` and monitor the logs.

9. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊
during and after the rotation.

10. Summarize traffic hitting a test helper using the following commands:

Top 10 miniooni probe IP addresses (Warning: this is sensitive data)

`tail -n 100000 /var/log/nginx/access.log | grep miniooni | cut -d' ' -f1|sort|uniq -c|sort -nr|head`

Similar, with anonimized IP addresses:

`grep POST /var/log/nginx/access.log | grep miniooni | cut -d'.' -f1-3 | head -n 10000 |sort|uniq -c|sort -nr|head`

Number of requests from miniooni probe in 10-minutes buckets:

`grep POST /var/log/nginx/access.log | grep miniooni | cut -d' ' -f4 | cut -c1-17 | uniq -c`

Number of requests from miniooni probe in 1-minute buckets:

`grep POST /var/log/nginx/access.log | grep miniooni | cut -d' ' -f4 | cut -c1-18 | uniq -c`

Number of requests grouped by hour, cache HIT/MISS/etc, software name and version

`head -n 100000 /var/log/nginx/access.log | awk '{print $4, $6, $13}' | cut -c1-15,22- | sort | uniq -c | sort -n`

To extract data from the centralized log database
on [monitoring.ooni.org](#monitoring.ooni.org) 🖥 you can use:

``` sql
SELECT message FROM logs
WHERE SYSLOG_IDENTIFIER = 'oohelperd'
ORDER BY __REALTIME_TIMESTAMP DESC
LIMIT 10
```

> **note**
> The table is indexed by `__REALTIME_TIMESTAMP`. Limiting the range by time can significantly increase query performance.

See [Selecting test helper for rotation](#selecting-test-helper-for-rotation) 🐞
Loading

0 comments on commit b757ecd

Please sign in to comment.