A distributed system status and metric checker.
A visual dashboard for this service can be found at forgestatus-dashboard.
- A server queues up tasks for checking system metrics (memory, CPU, etc.).
- Several workers move tasks from this queue into their own queues.
- Each task describes a system metric that a worker needs to query all other workers for.
- The result is queued up once a worker has aggregated all the necessary metrics.
- Workers are also responsible for responding to all other workers' requests for info.
- A client can query for metric results at any time.
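To make that flow concrete, here is a rough sketch of the task and result shapes that move through the queues. The type, field, and queue names are illustrative only, not the identifiers actually used in this repo.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Task asks a worker to collect one metric from every other worker.
type Task struct {
	ID     string `json:"id"`
	Metric string `json:"metric"` // e.g. "memory", "cpu"
}

// Result is queued once a worker has aggregated the metric from all peers.
type Result struct {
	TaskID string             `json:"task_id"`
	Values map[string]float64 `json:"values"` // peer worker -> metric value
}

func main() {
	// Server side: serialize a task and push it onto the shared queue
	// (in the real service this would be a push to a Redis list).
	t := Task{ID: "42", Metric: "memory"}
	payload, _ := json.Marshal(t)
	fmt.Println("server enqueues:", string(payload))

	// Worker side: pop the task, query every peer for the metric, then
	// enqueue the aggregated result for clients to read.
	r := Result{TaskID: t.ID, Values: map[string]float64{"worker-1": 512, "worker-2": 498}}
	out, _ := json.Marshal(r)
	fmt.Println("worker enqueues result:", string(out))
}
```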
- Automated CI/CD using CircleCI (including gates and efficient artifact caching).
- Full Kubernetes (GKE) deployment for two environments.
- Stackdriver alerts/monitoring of system and site health.
- Slack notifications for releases and alerts.
These are fully automated and monitored instances, used for debugging while work is done on forgestatus-dashboard.
This is for demo purposes.
1 server, 3 workers, and 1 Redis instance will be started.
Output for task scheduling and processing is printed to the console.
$ docker-compose up
This is for local development purposes.
This requires a working Go environment. A locally running instance of Redis is also required.
$ make
$ REDIS_URL=localhost:6379 server
$ REDIS_URL=localhost:6379 worker
Make a pull request and CircleCI will automatically build, test, and deploy your app to dev.forgestatus.com.
There are both unit and integration tests.
$ make unit-test
$ make integration-test
There is a CI gate that prevents code with failing tests from being merged to master.
Failing tests also trigger a Slack alert at forgestatus.slack.com.
As I learn more about the language, more tests will be added.
An intentional shortcoming is that each worker is uniquely identified with a k8s deployment/service. This was required because each worker needs the capability to query all other workers. In a real system, there would only be one deployment/service for a single type of worker.
The reliable queue pattern from redis.io was followed. The only missing piece is a process that goes through the pending queues and re-queues stale work.
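Below is a hedged sketch of that pattern and the missing re-queue step, using the go-redis client. The queue names (`tasks`, `pending:worker-1`) and the client choice are assumptions; the repo's actual identifiers may differ. The key point is that RPOPLPUSH atomically moves a task from the shared queue into a worker-owned pending list, so a crash never loses work.

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// requeueStale moves anything left in a worker's pending list back onto the
// shared task queue; a fuller implementation would only move entries older
// than some timeout. This is the "missing piece" mentioned above.
func requeueStale(ctx context.Context, rdb *redis.Client, pending string) {
	for {
		// redis.Nil is returned once the pending list is empty.
		if err := rdb.RPopLPush(ctx, pending, "tasks").Err(); err != nil {
			return
		}
	}
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Recover work abandoned by a previous crash of this worker.
	requeueStale(ctx, rdb, "pending:worker-1")

	// Atomically claim a task: RPOPLPUSH moves it from the shared queue into
	// this worker's pending list, so it is never lost mid-flight.
	task, err := rdb.RPopLPush(ctx, "tasks", "pending:worker-1").Result()
	if err != nil {
		return // redis.Nil means there is currently nothing to do
	}

	fmt.Println("processing", task)
	// ... query peer workers, aggregate the metric, enqueue the result ...

	// Acknowledge: remove the task from the pending list once it is done.
	rdb.LRem(ctx, "pending:worker-1", 1, task)
}
```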
Workers only queue up one task at a time. Workers could potentially grab a handful of tasks and execute them in parallel. Also, each worker sequentially queries each other worker when given a command (which could easily be done asynchronously).
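As a sketch of the asynchronous variant, the loop below fans out one goroutine per peer and collects results behind a mutex. `queryPeer` and the peer names are placeholders, not functions from this repo.

```go
package main

import (
	"fmt"
	"sync"
)

// queryPeer stands in for the real worker-to-worker call that asks a peer
// for one metric value.
func queryPeer(peer, metric string) float64 {
	return 42.0 // placeholder value
}

func main() {
	peers := []string{"worker-1", "worker-2", "worker-3"}
	metric := "memory"

	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]float64, len(peers))
	)

	// Fan out: one goroutine per peer instead of a sequential loop.
	for _, p := range peers {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			v := queryPeer(p, metric)
			mu.Lock()
			results[p] = v
			mu.Unlock()
		}(p)
	}
	wg.Wait()

	fmt.Println("aggregated:", results)
}
```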
There definitely need to be more tests. Significant time was spent playing with Go, automation, and deployment.
Automation of infrastructure resources could have been done through Terraform. Currently, all resource commands are saved as script files in infra.
This could be scaled by removing the unique worker k8s deployments/services.
The addition of Istio as the Ingress would make service management cleaner.
There could also be a hierarchy of task queues. This would allow a subset of tasks to be completed in parallel. Concurrency would be managed by parent processes/tasks. Something like an event aggregator could be used for state management.
This is already supported by having the server only schedule one task at a time. If another process needed to manage the task result, then a field for identifying dependent tasks could be added. The process could look up completed tasks and only work on the next task when all dependencies have been met.
Tasks are currently identified with a unique ID. If the server needs a task to be done in parallel, then multiple tasks could be sent with associated IDs. If the tasks needed to be aggregated, then something like a part number or aggregate ID could be added to each task.
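A hedged sketch of what those extra fields might look like on a task, covering both the dependency idea and the aggregate/part-number idea. All names here are illustrative, not fields that exist in the current code.

```go
package main

import "fmt"

type Task struct {
	ID          string
	Metric      string
	DependsOn   []string // IDs of tasks that must complete before this one runs
	AggregateID string   // groups tasks whose results are combined later
	Part        int      // part number within the aggregate
}

// ready reports whether all of a task's dependencies are in the completed set.
func ready(t Task, completed map[string]bool) bool {
	for _, dep := range t.DependsOn {
		if !completed[dep] {
			return false
		}
	}
	return true
}

func main() {
	completed := map[string]bool{"1": true}
	t := Task{ID: "2", Metric: "cpu", DependsOn: []string{"1"}, AggregateID: "batch-7", Part: 1}
	fmt.Println("schedulable:", ready(t, completed))
}
```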