Provisioning tool is failing to create new monitoring dashboard. #89

Closed
soumyapani opened this issue Mar 9, 2023 · 3 comments
Labels
bug Something isn't working OpsAgent Ops agent related setup and development.

Comments

@soumyapani
Collaborator

Seeing this error

module.aiinfra-mig.google_compute_instance_group_manager.mig: Creation complete after 2m9s [id=projects/soumyapani-testing/zones/us-central1-a/instanceGroupManagers/spani-develop-mig]
╷
│ Error: Error creating Dashboard: googleapi: Error 400: Field gridLayout.widgets[11].xyChart.dataSets[0].timeSeriesQuery.timeSeriesQueryLanguage has an invalid value: Could not find a metric named 'workload.googleapis.com/dcgm.gpu.nvlink_traffic_rate'.
│ Field gridLayout.widgets[13].xyChart.dataSets[0].timeSeriesQuery.timeSeriesQueryLanguage has an invalid value: Could not find a metric named 'workload.googleapis.com/dcgm.gpu.pcie_traffic_rate'.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.DebugInfo",
│     "detail": "[ORIGINAL ERROR] generic::invalid_argument: com.google.apps.framework.request.BadRequestException: Field gridLayout.widgets[11].xyChart.dataSets[0].timeSeriesQuery.timeSeriesQueryLanguage has an invalid value: Could not find a metric named 'workload.googleapis.com/dcgm.gpu.nvlink_traffic_rate'.\nField gridLayout.widgets[13].xyChart.dataSets[0].timeSeriesQuery.timeSeriesQueryLanguage has an invalid value: Could not find a metric named 'workload.googleapis.com/dcgm.gpu.pcie_traffic_rate'. [google.rpc.error_details_ext] { message: \"Field gridLayout.widgets[11].xyChart.dataSets[0].timeSeriesQuery.timeSeriesQueryLanguage has an invalid value: Could not find a metric named \\'workload.googleapis.com/dcgm.gpu.nvlink_traffic_rate\\'.\\nField gridLayout.widgets[13].xyChart.dataSets[0].timeSeriesQuery.timeSeriesQueryLanguage has an invalid value: Could not find a metric named \\'workload.googleapis.com/dcgm.gpu.pcie_traffic_rate\\'.\" }"
│   }
│ ]
│ 
│   with module.aiinfra-default-dashboard.google_monitoring_dashboard.dashboard,
│   on .terraform/modules/aiinfra-default-dashboard/modules/monitoring/dashboard/main.tf line 22, in resource "google_monitoring_dashboard" "dashboard":
│   22: resource "google_monitoring_dashboard" "dashboard" {
│ 
╵
Terraform apply failed with error 1.
Listed 0 items.
@soumyapani soumyapani added the bug Something isn't working label Mar 9, 2023
@soumyapani
Collaborator Author

Until we have a proper fix for this issue, can we add a simple 30-second sleep and see if that resolves it?
https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep
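
A minimal sketch of what that workaround could look like, assuming the `hashicorp/time` provider is available. The resource name `wait_for_metrics` and the `dashboard.json` path are hypothetical; the real dashboard lives inside the `aiinfra-default-dashboard` module, so the `depends_on` would need to be wired in there instead.

```hcl
# Hypothetical sketch: delay dashboard creation so the Ops Agent has time to
# report the DCGM metrics. Names below are illustrative, not from the repo.
resource "time_sleep" "wait_for_metrics" {
  # Start the timer only after the managed instance group module has finished.
  depends_on = [module.aiinfra-mig]

  create_duration = "30s"
}

resource "google_monitoring_dashboard" "dashboard" {
  # Assumed file path; the actual module builds its own dashboard JSON.
  dashboard_json = file("${path.module}/dashboard.json")

  # Create the dashboard only once the sleep has elapsed.
  depends_on = [time_sleep.wait_for_metrics]
}
```

Note that 30 seconds is a guess: the sleep only helps if the startup scripts have installed the agent and the first data points have been written within that window.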

@soumyapani soumyapani added the OpsAgent Ops agent related setup and development. label Mar 15, 2023
@stevenBorisko
Collaborator

wip #101

@stevenBorisko
Collaborator

  • Adding a simple depends_on = [aiinfra-compute] does not work, since the startup scripts run after the MIG is created.
  • Adding a sleep of n seconds, whatever n is, just pawns this problem off on future us.

I'm going to see if creating the metricDescriptors before terraform apply will "solve" this issue (until we add more metrics to the dashboard), but I think that #104 might become a high priority.
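
One way the metricDescriptor idea might be expressed, sketched here inside Terraform rather than as a step before `terraform apply`. It uses the Google provider's `google_monitoring_metric_descriptor` resource for the two metrics named in the error. The `metric_kind`, `value_type`, and `unit` values are assumptions and would need to match what the Ops Agent actually reports; whether descriptors under `workload.googleapis.com` can be pre-created this way is also untested here.

```hcl
# Hypothetical sketch: pre-create the descriptors the dashboard queries refer to.
resource "google_monitoring_metric_descriptor" "dcgm_traffic_rate" {
  # The two metrics reported as missing in the Error 400 above.
  for_each = toset(["nvlink_traffic_rate", "pcie_traffic_rate"])

  type         = "workload.googleapis.com/dcgm.gpu.${each.key}"
  display_name = "dcgm.gpu.${each.key}"
  description  = "Placeholder descriptor so dashboard queries can resolve this metric."

  # Assumed values; they must agree with the data the Ops Agent writes later.
  metric_kind = "GAUGE"
  value_type  = "INT64"
  unit        = "By/s"
}
```

The dashboard resource would then list `google_monitoring_metric_descriptor.dcgm_traffic_rate` in its `depends_on`. As noted above, this only covers the metrics known today; any new metric added to the dashboard needs a matching descriptor.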
