Networking dashboard and discovery tool refactor (#1020)
* wip

* wip

* wip

* wip

* wip

* discovery

* single discovery

* page token

* batch requests

* remove plugin name

* streamline

* streamline

* dynamic routes

* dynamic routes

* forwarding rules and addresses

* batch requests

* metrics

* notes

* notes

* streamline

* fixes, dump

* streamline

* remove globals

* wip metrics

* subnet time series

* networks per project plugin

* firewall rules timeseries

* use names in metric labels

* firewall policies timeseries

* wip

* instances per network timeseries

* routes timeseries

* custom quota

* simpler quota, network peering timeseries

* peering timeseries

* timeseries names

* wip descriptors

* metric descriptors

* fixes

* wip

* Use partial for all cf init functions

* Add requirements.txt

* fix org key mismatch

* Fix folder short cli name

* Fix instance_networks when iterable is empty

* more readability and fixing some strings

* replace() -> removeprefix and remove unneeded quoting

* setdefault in init()s

* Fix next hop type

* Remove unneeded fstring

* create descriptors

* create descriptors log

* rename descriptor requests function

* non-working metrics implementation (duplicate timeseries batched)

* timeseries

* fixes

* write timeseries

* fix timeseries plugins

* start documenting code

* docstrings and comments

* docstrings comments and small fixes

* rename cf to src

* discover nodes instead of just projects

* discovery node can be a folder or org

* cf entrypoint and fixes

* cf deployment

* remove old paths

* cloud function deploy readme

* diagrams

* resource ids in example

* discovery tool readme

* top-level README

* Some documentation fixes

* Add secondary ranges

* Update README.md

* add legend to scope diagram

* improve description of discovery configuration variable

* add comment in example for custom quotas file

* rename op_project to monitoring_project

* dashboard metric rename wip

* Update discover-cai-compute.py

* deploy sample dashboard

Co-authored-by: Julio Castillo <[email protected]>
Co-authored-by: Aurélien Legrand <[email protected]>
3 people authored Dec 18, 2022
1 parent bc7dc8c commit 93361d7
Showing 49 changed files with 2,485 additions and 3,734 deletions.
178 changes: 82 additions & 96 deletions blueprints/cloud-operations/network-dashboard/README.md
@@ -1,103 +1,89 @@
# Networking Dashboard
# Network Dashboard and Discovery Tool

This repository provides an end-to-end solution to gather some GCP Networking quotas and limits (that cannot be seen in the GCP console today) and display them in a dashboard.
The goal is to allow for better visibility of these limits, facilitating capacity planning and avoiding hitting these limits.
This repository provides an end-to-end solution to gather some GCP networking quotas, limits, and their corresponding usage, and store them in Cloud Operations timeseries, which can be displayed in one or more dashboards or wired to alerts.

Here is an example of a dashboard you can get with this solution:
The goal is to allow for better visibility of these limits, some of which cannot be seen in the GCP console today, facilitating capacity planning and enabling notifications when actual usage approaches them.

The tool tracks several distinct usage types across a variety of resources: projects, policies, networks, subnetworks, peering groups, etc. For each usage type, three distinct metrics are created, tracking usage count, limit, and utilization ratio.

The screenshot below is an example of a simple dashboard provided with this blueprint, showing utilization for a specific metric (number of instances per VPC) for multiple VPCs and projects:

<img src="metric.png" width="640px">

Here you see utilization (usage compared to the limit) for a specific metric (number of instances per VPC) for multiple VPCs and projects.

Three metric descriptors are created for each monitored resource: usage, limit and utilization. You can track each of these and create alerting policies that fire when a threshold is reached.
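
As an example, the following Terraform sketch creates an alerting policy on one of the utilization metrics. The metric type used in the filter is an assumption (the descriptors are created as custom metrics in the monitoring project); check the exact descriptor names in Metrics Explorer and adjust the filter accordingly.

```hcl
# Sketch of an alerting policy on a utilization metric written by this tool.
# The metric type below is an assumption: verify the exact descriptor name in
# your monitoring project and adjust the filter as needed.
resource "google_monitoring_alert_policy" "instances_per_vpc" {
  project      = "<YOUR-MONITORING-PROJECT>"
  display_name = "Instances per VPC - utilization above 80%"
  combiner     = "OR"
  conditions {
    display_name = "instances_used_ratio > 0.8"
    condition_threshold {
      filter          = "metric.type = \"custom.googleapis.com/network/instances_used_ratio\" AND resource.type = \"global\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.8
      duration        = "0s"
      aggregations {
        alignment_period   = "3600s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }
}
```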

## Usage

Clone this repository, then go through the following steps to create resources:
- Create a terraform.tfvars file with the following content:
```tfvars
organization_id = "<YOUR-ORG-ID>"
billing_account = "<YOUR-BILLING-ACCOUNT>"
monitoring_project_id = "<YOUR-MONITORING-PROJECT>"
# Monitoring project where the dashboard will be created and the solution deployed; a project named "mon-network-dashboard" will be created if left blank
monitored_projects_list = ["project-1", "project-2"]
# Projects to be monitored by the solution
monitored_folders_list = ["folder_id"]
# Folders to be monitored by the solution
prefix = "<YOUR-PREFIX>"
# Monitoring project name prefix; the monitoring project name is <YOUR-PREFIX>-network-dashboard, ignored if monitoring_project_id is provided
cf_version = "V1"
# Set to "V2" to use the 2nd gen Cloud Functions environment
```
- `terraform init`
- `terraform apply`
Note: organization-level viewing permission is required for some metrics, such as firewall policies.
Once the resources are deployed, go to the Cloud Monitoring dashboards page (https://console.cloud.google.com/monitoring/dashboards?project=<YOUR-MONITORING-PROJECT>), where a dashboard called "quotas-utilization" should have been created.
The Cloud Function runs every 10 minutes by default, so you should start getting data points after a few minutes. You can change this frequency by modifying the "schedule_cron" variable in variables.tf (a sample override is shown below).
You can use the Metrics Explorer to view the data points for the different custom metrics created: https://console.cloud.google.com/monitoring/metrics-explorer?project=<YOUR-MONITORING-PROJECT>.
Note that some charts in the dashboard align values over 1h, so you might need to wait up to an hour for them to populate.
Once done testing, you can clean up resources by running `terraform destroy`.
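
For example, this is a minimal `terraform.tfvars` override that runs the Cloud Function hourly instead of every 10 minutes (the `schedule_cron` variable is documented in the Variables table below):

```tfvars
# run the discovery Cloud Function at the top of every hour instead of every 10 minutes
schedule_cron = "0 * * * *"
```
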
## Supported limits and quotas
The Cloud Function currently tracks usage, limit and utilization of:
- active VPC peerings per VPC
- VPC peerings per VPC
- instances per VPC
- instances per VPC peering group
- subnet IP ranges per VPC peering group
- internal forwarding rules for internal L4 load balancers per VPC
- internal forwarding rules for internal L7 load balancers per VPC
- internal forwarding rules for internal L4 load balancers per VPC peering group
- internal forwarding rules for internal L7 load balancers per VPC peering group
- dynamic routes per VPC
- dynamic routes per VPC peering group
- static routes per project (VPC drill down is available for usage)
- static routes per VPC peering group
- IP utilization per subnet (% of IP addresses used in a subnet)
- VPC firewall rules per project (VPC drill down is available for usage)
- tuples per firewall policy
It writes these values to custom metrics in Cloud Monitoring and creates a dashboard to visualize their current utilization.
Note that metrics are created in the cloud-function/metrics.yaml file. You can also edit default limits for a specific network in that file. See the example for `vpc_peering_per_network`.
One other example is the IP utilization information per subnet, allowing you to monitor the percentage of used IP addresses in your GCP subnets.

More complex scenarios are possible by leveraging and combining the 50 different timeseries created by this tool, and connecting them to Cloud Operations dashboards and alerts.
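
As a sketch of what this looks like in Terraform, the following resource charts one of the timeseries in a minimal custom dashboard; the metric type used in the filter is an assumption, so adjust it to match the descriptors actually created in your monitoring project.

```hcl
# Minimal custom dashboard charting one of the utilization timeseries.
# The metric type is an assumption: adjust it to the descriptors created
# in your monitoring project.
resource "google_monitoring_dashboard" "subnet_addresses" {
  project = "<YOUR-MONITORING-PROJECT>"
  dashboard_json = jsonencode({
    displayName = "Subnet IP utilization (custom)"
    mosaicLayout = {
      columns = 12
      tiles = [{
        width  = 12
        height = 8
        widget = {
          title = "Addresses used ratio per subnetwork"
          xyChart = {
            dataSets = [{
              plotType = "LINE"
              timeSeriesQuery = {
                timeSeriesFilter = {
                  filter = "metric.type=\"custom.googleapis.com/subnetwork/addresses_used_ratio\""
                  aggregation = {
                    alignmentPeriod  = "3600s"
                    perSeriesAligner = "ALIGN_MEAN"
                  }
                }
              }
            }]
          }
        }
      }]
    }
  })
}
```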

Refer to the [Cloud Function deployment instructions](./deploy-cloud-function/) for a high-level overview and an end-to-end deployment example, and to the [discovery tool documentation](./src/) to try it as a standalone program or to package it in alternative ways.

## Metrics created

- `firewall_policy/tuples_available`
- `firewall_policy/tuples_used`
- `firewall_policy/tuples_used_ratio`
- `network/firewall_rules_used`
- `network/forwarding_rules_l4_available`
- `network/forwarding_rules_l4_used`
- `network/forwarding_rules_l4_used_ratio`
- `network/forwarding_rules_l7_available`
- `network/forwarding_rules_l7_used`
- `network/forwarding_rules_l7_used_ratio`
- `network/instances_available`
- `network/instances_used`
- `network/instances_used_ratio`
- `network/peerings_active_available`
- `network/peerings_active_used`
- `network/peerings_active_used_ratio`
- `network/peerings_total_available`
- `network/peerings_total_used`
- `network/peerings_total_used_ratio`
- `network/routes_dynamic_available`
- `network/routes_dynamic_used`
- `network/routes_dynamic_used_ratio`
- `network/routes_static_used`
- `network/subnets_available`
- `network/subnets_used`
- `network/subnets_used_ratio`
- `peering_group/forwarding_rules_l4_available`
- `peering_group/forwarding_rules_l4_used`
- `peering_group/forwarding_rules_l4_used_ratio`
- `peering_group/forwarding_rules_l7_available`
- `peering_group/forwarding_rules_l7_used`
- `peering_group/forwarding_rules_l7_used_ratio`
- `peering_group/instances_available`
- `peering_group/instances_used`
- `peering_group/instances_used_ratio`
- `peering_group/routes_dynamic_available`
- `peering_group/routes_dynamic_used`
- `peering_group/routes_dynamic_used_ratio`
- `peering_group/routes_static_available`
- `peering_group/routes_static_used`
- `peering_group/routes_static_used_ratio`
- `project/firewall_rules_available`
- `project/firewall_rules_used`
- `project/firewall_rules_used_ratio`
- `project/routes_static_available`
- `project/routes_static_used`
- `project/routes_static_used_ratio`
- `subnetwork/addresses_available`
- `subnetwork/addresses_used`
- `subnetwork/addresses_used_ratio`

## Assumptions and limitations
- The CF assumes that all VPCs in peering groups are within the same organization, except for PSA peerings.
- The CF will only fetch subnet utilization data from the PSA peerings (not the VMs, ILB or routes usage).
- The CF assumes global routing is ON; this impacts dynamic routes usage calculation.
- The CF assumes custom routes importing/exporting is ON; this impacts static and dynamic routes usage calculation.
- The CF assumes all networks in peering groups have the same global routing and custom routes sharing configuration.
## Next steps and ideas
In a future release, we could support:
- Google managed VPCs that are peered with PSA (such as Cloud SQL or Memorystore)
- Dynamic routes calculation for VPCs/PPGs with "global routing" set to OFF
- Static routes calculation for projects/PPGs with "custom routes importing/exporting" set to OFF
- Calculations for cross Organization peering groups
- Support different scopes (reduced and fine-grained)
If you are interested in this and/or would like to contribute, please contact [email protected].
<!-- BEGIN TFDOC -->
## Variables
| name | description | type | required | default |
|---|---|:---:|:---:|:---:|
| [billing_account](variables.tf#L17) | The ID of the billing account to associate this project with. | <code></code> | ✓ | |
| [monitored_projects_list](variables.tf#L36) | ID of the projects to be monitored (where limits and quotas data will be pulled). | <code>list&#40;string&#41;</code> | ✓ | |
| [organization_id](variables.tf#L46) | The organization id for the associated services. | <code></code> | ✓ | |
| [prefix](variables.tf#L50) | Prefix used for resource names. | <code>string</code> | ✓ | |
| [cf_version](variables.tf#L21) | Cloud Function generation, 1st gen or 2nd gen. Possible options: 'V1' or 'V2'. Use 'V2' if your Cloud Function times out after 9 minutes. Defaults to 'V1'. | <code></code> | | <code>V1</code> |
| [monitored_folders_list](variables.tf#L30) | ID of the folders to be monitored (where limits and quotas data will be pulled). | <code>list&#40;string&#41;</code> | | <code>&#91;&#93;</code> |
| [monitoring_project_id](variables.tf#L41) | Monitoring project where the dashboard will be created and the solution deployed; a project will be created if set to empty string. | <code></code> | | |
| [project_monitoring_services](variables.tf#L59) | Service APIs enabled in the monitoring project if it will be created. | <code></code> | | <code title="&#91;&#10; &#34;artifactregistry.googleapis.com&#34;,&#10; &#34;cloudasset.googleapis.com&#34;,&#10; &#34;cloudbilling.googleapis.com&#34;,&#10; &#34;cloudbuild.googleapis.com&#34;,&#10; &#34;cloudfunctions.googleapis.com&#34;,&#10; &#34;cloudresourcemanager.googleapis.com&#34;,&#10; &#34;cloudscheduler.googleapis.com&#34;,&#10; &#34;compute.googleapis.com&#34;,&#10; &#34;iam.googleapis.com&#34;,&#10; &#34;iamcredentials.googleapis.com&#34;,&#10; &#34;logging.googleapis.com&#34;,&#10; &#34;monitoring.googleapis.com&#34;,&#10; &#34;pubsub.googleapis.com&#34;,&#10; &#34;run.googleapis.com&#34;,&#10; &#34;servicenetworking.googleapis.com&#34;,&#10; &#34;serviceusage.googleapis.com&#34;,&#10; &#34;storage-component.googleapis.com&#34;&#10;&#93;">&#91;&#8230;&#93;</code> |
| [region](variables.tf#L81) | Region used to deploy the cloud functions and scheduler. | <code></code> | | <code>europe-west1</code> |
| [schedule_cron](variables.tf#L86) | Cron format schedule to run the Cloud Function. Default is every 10 minutes. | <code></code> | | <code>&#42;&#47;10 &#42; &#42; &#42; &#42;</code> |
<!-- END TFDOC -->

- The tool assumes all VPCs in peering groups are within the same organization, except for PSA peerings.
- The tool will only fetch subnet utilization data from the PSA peerings (not the VMs, ILB or routes usage).
- The tool assumes global routing is ON; this impacts dynamic routes usage calculation.
- The tool assumes custom routes importing/exporting is ON; this impacts static and dynamic routes usage calculation.
- The tool assumes all networks in peering groups have the same global routing and custom routes sharing configuration.

## TODO

These are some of our ideas for additional features:

- support PSA-peered Google VPCs (Cloud SQL, Memorystore, etc.)
- dynamic routes for VPCs/peering groups with "global routing" turned off
- static routes calculation for projects/peering groups with custom routes import/export turned off
- cross-organization peering groups

If you are interested in this and/or would like to contribute, please open an issue in this repository or send us a PR.