- Overview
- What alerting is available?
- What configurability is available
- Onboarding and FAQ
- Availability
- Client Error
- Create prevalidation and regression
- Performance
- How do I onboard?
- Time zone based alerting
- Overwrite default TSG with extension's TSG
- Route alerts to another team in IcM
- FAQ
The framework provides an out-of-the-box alerting infrastructure. You are onboarded to some of these alerts automatically; others you must onboard to manually. They offer a way to constantly assess your product and ensure you are within SLAs and not regressing, in real time. Each alert type has either configurable or fixed alerting criteria, which are assessed over varying windows of time; based on the results, the alert is triggered or not. Once an alert is triggered, it automatically opens an IcM incident on the owning team.
There are a number of framework-provided alerts:
- Extension SDK age
  - Sev3 IcM incident for an extension when its SDK is older than 60 days
  - Sev2 for over 90 days
  - Note: You can set the time of day at which an SDK age alert is triggered. Learn more in the time zone based alerting section.
- Telemetry Throttled
  - Sev4 IcM incident for an extension when its ExtTelemetry logs exceed 500 events every 60 seconds per user per browser tab. ExtTelemetry events stop being logged until the next 60-second window starts.
- Availability
- Client Error
- Create Prevalidation and Regression
- Performance
Some of these alert types are configurable per extension. The following alert types currently support extension configuration:
To onboard to the configurable alerts, please see the relevant subsection below.
As with the non-configurable alerts, once the thresholds for any of the configured alerts are met or surpassed, an IcM alert containing details is opened against the owning team.
Besides the configuration for various alert types, extension partners can configure when they'd like to receive an alert (for the SDK age alert) and who they'd like to assign an alert to, and can overwrite the default TSG links in IcM with TSG links owned and provided by the extension.
The alerts cover extension, blade and part load availability in different environments, including sovereign and air-gapped clouds.
Every extension in Azure Portal is opted in automatically by default. No action is needed from the extension partner.
- User Failed At Least Once (FALO): when users have at least one load failure.
- User Failed Always (FA): when users do not have any successful loads after they try to load at least once.
The percentage-based availability alert is not configurable. The same set of alert triggering criteria is used for extensions, blades and parts.
Currently the extension, blade and part availability alerts run every 5, 10 and 15 minutes respectively, assessing the previous 1, 2, 4, 8, 12 and 24 hours of data.
The two tables below show the criteria for the different alert types and severities that apply to any extension, blade and part load in Azure Portal. The numbers are per Portal domain name and safe deployment stage.
Alert Type | Cloud | Min Total User Count | Min Affected User Percentage |
---|---|---|---|
Failed At Least Once | All Clouds* | 10 | 7% |
Failed Always | All Clouds* | 10 | 5% |
Alert Type | Cloud | Min Total User Count | Min Affected User Percentage |
---|---|---|---|
Failed At Least Once | Public | 50 | 50% |
Failed At Least Once | Non-Public Clouds** | 25 | 50% |
Failed Always | Public | 50 | 25% |
Failed Always | Non-Public Clouds** | 25 | 25% |
*All Clouds are Public, Fairfax, Mooncake and air-gapped clouds.
**Non-public clouds are Fairfax, Mooncake and air-gapped clouds.
For any given monitor window (1, 2, 4, 8, 12 and 24 hours) per Portal domain name and safe deployment stage the following three conditions must be met to fire an alert.
- total user count >= Min Total User Count
- affected user percentage >= Min Affected User Percentage
- affected user count >= 3 for 1, 2, 4, 8, 12-hour period and >=4 for 24-hour period
When the alert firing conditions are true for both User Failed At Least Once and User Failed Always, only User Failed Always is fired.
When the alert firing conditions are true for several monitor windows, only the alert on the smallest window is fired. For example, if the conditions are true for both the 1-hour and the 2-hour lookback windows, the alert fires only on the 1-hour lookback window.
When the alert firing conditions are true for several safe deployment stages, only the alert for the latest stage is fired. For example, if the conditions are true for stage 4 and stage 5, the alert fires only for stage 5.
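The three firing conditions and the smallest-window rule can be sketched in a few lines of Python. This is only an illustration of the rules described above, not the alerting service's implementation; the `WindowStats` type and the function names are hypothetical, and the thresholds are passed in rather than looked up from the tables.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Hypothetical per-window counts for one domain name and deployment stage."""
    window_hours: int      # 1, 2, 4, 8, 12 or 24
    total_users: int
    affected_users: int


def should_fire(stats, min_total_user_count, min_affected_user_percentage):
    """Return True when all three documented conditions hold for one window."""
    if stats.total_users <= 0:
        return False
    affected_pct = 100.0 * stats.affected_users / stats.total_users
    # The affected user floor is 4 for the 24-hour window and 3 otherwise.
    min_affected_users = 4 if stats.window_hours == 24 else 3
    return (
        stats.total_users >= min_total_user_count
        and affected_pct >= min_affected_user_percentage
        and stats.affected_users >= min_affected_users
    )


def smallest_firing_window(all_stats, min_total_user_count, min_affected_user_percentage):
    """Return the smallest window (in hours) whose conditions hold, or None."""
    firing = [
        s.window_hours
        for s in all_stats
        if should_fire(s, min_total_user_count, min_affected_user_percentage)
    ]
    return min(firing) if firing else None
```

With thresholds of 10 users and 5%, a 1-hour window with 10 total and 3 affected users satisfies all three conditions, while a 24-hour window with only 3 affected users does not, because the 24-hour floor is 4.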
There are two high-level types of client error alerts, error percentage and error message, available in different environments including national clouds.
- Error percentage alerts fire when the percentage of users experiencing any error(s) is above the defined threshold.
- Error message alerts fire on specified error messages.
At a high level you define:
- An environment for the alerts to run against. See definition below
- The error configuration for the alerts within that environment
{
"extensionName": "Your_Extension_Name",
"enabled": true,
"environments": [
{
"environment": ["portal.azure.com", "portal.azure.cn"],
"availability": [...],
"clientError": [
{
"type": "message",
"enabled": true,
"criteria": [
...
]
},
{
"type": "percentage",
"enabled": true,
"criteria": [
...
]
}
],
"create": [...],
"performance": [...],
},
{
"environment": ["ms.portal.azure.com"],
...
"clientError": [
{
...
}
...
],
...
}
...
]
...
}
"environments" property is an array. Each of its elements represents a set of alerting criteria for an environment.
"environment" property is an array. Supported values are portal.azure.com, ms.portal.azure.com, portal.azure.cn, canary.portal.azure.com or any other legitimate portal domain name; national cloud and air-gapped cloud domain names are supported too. Multiple values can be set for an "environment" property.
"enabled" property is used to enable (when "enabled" is true) or disable (when "enabled" is false) alerts at various levels, depending on where it appears in the customization JSON. For details, see the "enabled" property in the JSON snippet.
You can configure the "message" type, the "percentage" type, or both. For each type, you can define a set of criteria like the ones below. You can define any number of criteria.
An example of a percentage error alert criteria
You can specify up to 3 messages in "exclusion" property. "type" property's supported value is "and" or "or".
[
{
"type": "percentage",
"enabled": true,
"criteria": [
{
"severity": 3,
"enabled": true,
"minAffectedUserCount": 2,
"minAffectedUserPercentage": 10.0,
"exclusion": {
"type": "or",
"message1":"eastus2stage",
"message2":"eastus2(stage)"
},
"safeDeploymentStage": ["3"],
"datacenterCode": ["AM"]
},
...
]
},
...
]
An example of a message error alert criteria.
You can specify up to 3 messages in one criterion and up to 3 messages in "exclusion" property. "type" property's supported value is "and" or "or".
[
{
"type": "message",
"criteria": [
{
"severity": 4,
"enabled": true,
"checkAllNullRefs": true,
"message1": "Cannot read property",
"message2": "of null",
"minAffectedUserCount": 1,
"exclusion": {
"type": "or",
"message1":"eastus2stage",
"message2":"eastus2(stage)"
},
"safeDeploymentStage": ["3"],
"datacenterCode": ["AM"]
},
...
]
},
...
]
"severity" property is the severity value that an IcM alert has when the alert is fired.
"minAffectedUserPercentage" property is the minimum percentage of users affected by any client error.
"minAffectedUserCount" property is the minimum number of users affected by any client error.
"checkAllNullRefs" property is optional. When it is true, the alert checks all null reference client errors. You can still specify message1, message2, etc.; they are additional conditions.
"message1", "message2", etc. are the error strings that error message alerts look for in client error logs, specifically in the [message] column of the (Client|Ext)Events log table. They are combined with a logical AND: to count as an error, all the messages specified in a criteria element have to be present in a client error message. You can specify up to 3 messages in one criterion.
"exclusion" property specifies the condition(s) under which alerts do not count an entry as a client error. You can specify it for both error percentage and error message alerts.
The messages inside "exclusion" are the error string(s) that, when present in a client error message ([message] column of the (Client|Ext)Events log table), cause alerts not to count it as a client error. You can specify up to 3 messages in the "exclusion" property.
"type" inside "exclusion" is the logical operator for the exclusion messages. Its supported value is "and" or "or". "and" means a client error message is not counted when all the messages specified in "exclusion" are present in it. "or" means it is not counted when any of the messages specified in "exclusion" is present in it.
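The matching rules above (AND across the criterion's messages, "and"/"or" across the exclusion messages) can be illustrated with a short Python sketch. It is not the alerting service's code: the function names are hypothetical, plain case-sensitive substring matching is an assumption, and "checkAllNullRefs" is not modelled.

```python
def _messages(section):
    """Collect the message1, message2, ... values of a criterion or exclusion."""
    return [value for key, value in section.items() if key.startswith("message")]


def is_excluded(error_message, exclusion):
    """Apply the exclusion block; a missing or empty block excludes nothing."""
    if not exclusion:
        return False
    needles = _messages(exclusion)
    if not needles:
        return False
    hits = [needle in error_message for needle in needles]
    # "and": every exclusion message must be present; "or": any one suffices.
    return all(hits) if exclusion.get("type") == "and" else any(hits)


def matches_criterion(error_message, criterion):
    """True when the error counts toward the alert for this criterion."""
    if is_excluded(error_message, criterion.get("exclusion")):
        return False
    # All criterion messages must be present (logical AND).
    return all(needle in error_message for needle in _messages(criterion))
```

For the message criterion shown earlier, an error such as "TypeError: Cannot read property 'x' of null" counts, while the same error with "eastus2stage" in its text is excluded.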
Safe deployment stage can be "0", "1", "2", or "3". Each stage has a batch of regions. It does not support asterisk ("*") sign. Safe deployment stage is optional. If you do not specify the safe deployment stage property in criteria, when alerting calculates affectedUserCount, affectedUserPercentage, it does not take safe deployment stage into consideration. So, you will not have affectedUserCount or affectedUserPercentage per safe deployment stage. For such a case, minAffectedUserCount or minAffectedUserPercentage specified in criteria are for all (combined, overall) the safe deployment stages.
Datacenter code can be "*", "AM", "BY", etc. "*" represents all Azure Portal Production regions.
Datacenter code is optional. If you do not specify the datacenterCode property in criteria, when alerting calculates affectedUserCount or affectedUserPercentage, it does not take datacenter into consideration. So, you will not have affectedUserCount or affectedUserPercentage per datacenter. For such a case, minAffectedUserCount or minAffectedUserPercentage specified in criteria are for all (combined, overall) the datacenters.
For the complete list of datacenter code names, go to datacenter code list
Currently error percentage alerts run every 15 minutes and error message alerts run every 5 minutes assessing the previous 60 minutes of data.
The create prevalidation and regression alerts can be configured for an extension's create blades in different environments. The prevalidation alert is supported only in the public cloud, whereas the regression alert is supported in all clouds.
At a high level you define:
- Any number of environments within the "environments" property, like the example below.
- The create prevalidation or regression configuration for the alerts within that environment
{
"extensionName": "Your_Extension_Name",
"enabled": true,
"environments": [
{
"environment": ["portal.azure.com", "ms.portal.azure.com"], // prevalidation type is only supported in public clouds
"availability": [...], // Optional
"clientError": [...], // Optional.
"create": [
{
"type": "regression", // "regression" or "prevalidation" are supported types
"enabled": true,
"criteria": [
...
]
}
],
"performance": [...], // Optional.
},
{
"environment": ["ms.portal.azure.com"],
"create": [
{
...
}
...
]
...
}
...
]
...
}
"environments" property is an array. Each of its elements represents a set of alerting criteria for an environment.
"environment" property is an array. Its supported value for this alert is portal.azure.com or ms.portal.azure.com. For regression alerts, portal.azure.cn, canary.portal.azure.com or any other legitimate portal domain name is supported as well, including national cloud and air-gapped cloud domain names. Prevalidation alerts are only available in the ms.portal.azure.com and portal.azure.com portal domains. Multiple values can be set for an "environment" property.
"enabled" property is used to enable (when "enabled" is true) or disable (when "enabled" is false) alerts at various levels, depending on where it is located in the customization JSON. For details, see the "enabled" property in the JSON snippet.
You can define any number of criteria like the one below.
{
"severity": 3,
"enabled": true,
"bladeName": ["CreateBlade"],
"minSuccessRateOverPast24Hours":94.0,
"minSuccessRateOverPastHour":94.0,
"minTotalCountOverPast24Hours":50,
"minTotalCountOverPastHour":3,
"errorCodesToExclude": [""] // Optional and only supported in prevalidation alerts
}
"severity" property is the severity value that an IcM alert has when the alert is fired.
"bladeName" property is the list of create blade names.
"minSuccessRateOverPast24Hours" property is the minimum create blade success rate or create prevalidation success rate over the past 24 hours.
"minSuccessRateOverPastHour" property is the minimum create blade success rate or create prevalidation success rate over the past hour.
"minTotalCountOverPast24Hours" property is the minimum number of creates or create prevalidations that get kicked off over the past 24 hours.
"minTotalCountOverPastHour" property is the minimum number of creates or create prevalidations that get kicked off over the past hour.
"errorCodesToExclude" property is the optional list of create prevalidation error codes to exclude while evaluating the alert. It is only supported for prevalidation alerts.
- For regressions, the alert is available in all clouds including national clouds.
- For prevalidations, the alert is only available in public cloud.
Every 60 minutes, we get create or prevalidation success rate and create or prevalidation total count for the last 60 minutes and the last 24 hours. Alerts will only trigger when the following criteria are met.
- Hourly create or prevalidation success rate is below {minSuccessRateOverPastHour} and hourly create or prevalidation total count is above {minTotalCountOverPastHour}
- 24-hour create or prevalidation success rate is below {minSuccessRateOverPast24Hours} and 24-hour create or prevalidation total count is above {minTotalCountOverPast24Hours}
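As a rough illustration, the evaluation above can be written as a small Python function. The function name is hypothetical, and the sketch assumes that both the hourly and the 24-hour conditions must hold for an alert to fire, which is one reading of "the following criteria are met"; treat that as an assumption rather than confirmed behaviour.

```python
def create_alert_fires(hourly_rate, hourly_total, daily_rate, daily_total, criteria):
    """Evaluate one create/prevalidation criterion against measured values.

    Rates are percentages (for example 94.0); totals are counts.
    """
    hourly_breach = (
        hourly_rate < criteria["minSuccessRateOverPastHour"]
        and hourly_total > criteria["minTotalCountOverPastHour"]
    )
    daily_breach = (
        daily_rate < criteria["minSuccessRateOverPast24Hours"]
        and daily_total > criteria["minTotalCountOverPast24Hours"]
    )
    # Assumption: both windows must breach before the alert fires.
    return hourly_breach and daily_breach
```

Using the example criterion (94.0% success rate for both windows, minimum counts of 3 and 50), an hourly rate of 90% over 5 creates combined with a 24-hour rate of 92% over 60 creates would fire under this reading.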
The alerts can be configured for extension performance, blade performance and part performance in different environments, including national clouds.
At a high level you define:
- Any number of environments within the "environments" property, like the example below.
- The performance configuration for the alerts within that environment
{
"extensionName": "Your_Extension_Name",
"enabled": true,
"environments": [
{
"environment": ["portal.azure.com", "portal.azure.cn"],
"availability": [...],
"clientError": [...],
"create": [...],
"performance": [
{
"type": "extension",
"enabled": true,
"criteria": [
...
]
},
{
"type": "blade",
"enabled": true,
"criteria": [
...
]
}
...
]
},
{
"environment": ["ms.portal.azure.com"],
"performance": [
{
...
}
...
]
...
}
...
]
...
}
For each of those types, you can define a set of criteria like the ones below.
Only the blade and part types require a namePath property; they can optionally have an exclusion property.
"environments" property is an array. Each of its elements represents a set of alerting criteria for an environment.
"environment" property is an array. Supported values are portal.azure.com, ms.portal.azure.com, portal.azure.cn, canary.portal.azure.com or any other legitimate portal domain name; national cloud and air-gapped cloud domain names are supported too. Multiple values can be set for an "environment" property.
"enabled" property is used to enable (when "enabled" is true) or disable ("enabled" is false) alerts on various level depending on where it is located in customization json. For details, see "enabled" property in json snippet.
You can define N number of criteria like the below.
{
"severity": 3,
"enabled": true,
"percentile": 95,
"percentileDurationThresholdInMilliseconds": 4000,
"minAffectedUserCount": 10,
"bottomMinAffectedUserCount": 2,
"namePath": ["*"],
"exclusion": [
"Extension/Your_Extension_Name/Blade/BladeNameA",
"Extension/Your_Extension_Name/Blade/BladeNameB"],
"safeDeploymentStage": ["3"],
"datacenterCode": ["AM"]
}
"severity" property is the severity value that an IcM alert has when the alert is fired.
"percentile" property is the percentile at which you want to measure performance. Today the only options are 80 or 95.
"percentileDurationThresholdInMilliseconds" property is the duration threshold (in milliseconds) that the load duration at the {percentile} percentile must exceed.
"minAffectedUserCount" property is the minimum number of users whose load duration is above {percentileDurationThresholdInMilliseconds}.
"bottomMinAffectedUserCount" property is the user count threshold used to trigger an alert when the {percentile} load duration is greater than or equal to double the {percentileDurationThresholdInMilliseconds} defined. It defaults to 20% of {minAffectedUserCount} and is used to catch unusual spikes on weekends or during low-traffic periods.
"namePath" property only applies to blades or parts and defines which blades or parts to alert on. You can either use an asterisk ("*") to include all the blades or parts within your extension or specify a list of full blade or part names. The percentileDurationThresholdInMilliseconds, minAffectedUserCount and bottomMinAffectedUserCount specified in the criteria apply to individual blades or parts.
"exclusion" property only applies to blades or parts and defines which blades or parts you wish to exclude.
Safe deployment stage can be "0", "1", "2", or "3". Each stage has a batch of regions. It does not support asterisk ("*") sign. Safe deployment stage is optional. If you do not specify the safe deployment stage property in criteria, when alerting calculates percentileDuration and affectedUserCount, it does not take safe deployment stage into consideration. So, you won't have percentileDuration and affectedUserCount per safe deployment stage. For such a case, percentileDurationThresholdInMilliseconds, minAffectedUserCount and bottomMinAffectedUserCount specified in criteria are for all (combined, overall) the safe deployment stages.
Datacenter code can be "*", "AM", "BY", etc. "*" represents all Azure Portal Production regions.
Datacenter code is optional. If you do not specify the datacenterCode property in criteria, when alerting calculates percentileDuration and affectedUserCount, it does not take datacenter into consideration. So you will not have percentileDuration and affectedUserCount per datacenter. For such a case percentileDurationThresholdInMilliseconds, minAffectedUserCount and bottomMinAffectedUserCount specified in criteria are for all (combined, overall) the datacenters.
For the complete list of datacenter code names, go to datacenter code list
Every 10 minutes, we get percentile load duration for the last 90 minutes. We get the most recent 6 sample points and calculate a weighted percentile load duration based on the following formula.
Weighted duration = 8/24 * {most recent percentile load duration} + 6/24 * {2nd most recent percentile load duration} + 4/24 * {3rd…} + 3/24 * {4th …} + 2/24 * {5th …} + 1/24 * {6th …}
Alerts will only trigger when one of the following criteria is met.
- Weighted duration is above {percentileDurationThresholdInMilliseconds} and affected user count is above {minAffectedUserCount}
- Weighted duration is above 2 * {percentileDurationThresholdInMilliseconds} and affected user count is above {bottomMinAffectedUserCount}
Currently performance alerts run every 10 minutes assessing the previous 90 minutes of data.
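The weighted duration formula and the two trigger conditions can be checked with a short Python sketch. The function names are hypothetical and this is not the alerting service's code; the 20% default for bottomMinAffectedUserCount follows the property description above.

```python
# Weights for the 6 most recent sample points, most recent first (sum is 24).
WEIGHTS = (8, 6, 4, 3, 2, 1)


def weighted_duration(samples_most_recent_first):
    """Weighted percentile load duration in milliseconds from 6 samples."""
    if len(samples_most_recent_first) != len(WEIGHTS):
        raise ValueError("exactly 6 sample points are expected")
    total = sum(w * d for w, d in zip(WEIGHTS, samples_most_recent_first))
    return total / 24


def performance_alert_fires(weighted_ms, affected_users, criteria):
    """Return True when either documented trigger condition is met."""
    threshold = criteria["percentileDurationThresholdInMilliseconds"]
    min_users = criteria["minAffectedUserCount"]
    # Defaults to 20% of minAffectedUserCount when not configured.
    bottom_users = criteria.get("bottomMinAffectedUserCount", 0.2 * min_users)
    normal = weighted_ms > threshold and affected_users > min_users
    spike = weighted_ms > 2 * threshold and affected_users > bottom_users
    return normal or spike
```

Because the weights favour recent samples, a single slow sample of 2400 ms followed by five samples of 0 ms yields a weighted duration of 800 ms (8/24 of 2400).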
- Submit and complete a Pull Request in Azure Portal Alerting Repo a.k.a. Alerting Repo.
For availability alert refer to Availability Opted in
For non-create alerts, the customization JSON should be located at products/{YourServiceNameInIcM}/{ExtensionName}.alerting.json. It is recommended to have an owners.txt in the same folder as the customization JSON file. The owners.txt has AAD-enabled email aliases and/or individual MSFT aliases. Anyone from owners.txt can approve the Pull Request for any changes within that folder or its subfolders. The PR becomes eligible to complete once it gets an approval from owners.txt and the merge validation passes.
For create alerts, the customization JSON should be located at products/IbizaFx/Create/{ExtensionName}.create.alerting.json. To be eligible to complete, the PR needs additional approval from Gauge Team Code Reviews or Create PMs.
For public and national clouds, the configuration updates take effect immediately after the PR is completed.
For air-gapped clouds, configuration updates are deployed once every month.
- Set up correlation rules in IcM
Field | Value |
---|---|
Routing ID | 'AIMS://AZUREPORTAL\Portal\{ExtensionName}' |
Correlation ID | use table below to map |
Mode | Hit count (recommended) |
Match DC/Region | Checked |
Match Slice | Checked |
Match Severity | Checked |
Match Role | Checked |
Match Instance/Cluster | Checked |
Depending on the alert you are correlating you will need to use the corresponding correlation id
If you'd like alerts to correlate on different safe deployment stages, do not check Match DC/Region for availability alert.
Alert | Correlation ID |
---|---|
Availability | PercentageBasedAvailability |
Create - Regression | CreateBladeSuccessRate |
Error - AffectedUserPercentage | ErrorAffectedUserPercentage |
Error - Message | ErrorMessage |
Extension SDK Age* | ExtensionAge |
Performance - Extension | ExtensionLoadPerformance |
Performance - Blade | BladeLoadPerformance |
Performance - Part | PartLoadPerformance |
*It's required when extension team's tenant in IcM owns multiple extensions in Azure Portal. Without it the extension age alerts fired for different extensions would be correlated into one IcM per cloud.
As Ibiza has extension developers spread across the world, we have a mechanism to trigger alerts in the business hours of the extension team. Currently, the alerts supported for time zone based alerting are -
- SDK Age alerts.
For all other alerts, the extension owner cannot pick a time zone and will be alerted as soon as the alert trigger conditions are met.
To configure time zone based alerting, you need to specify a businessHourStartTimeUtc property in the alerting config. The value is an integer from 0 to 23, given as a string. It represents the UTC hour at which business hours start in the extension team's region.
When an alert is triggered, the Ibiza team guarantees that you will receive it within 6 hours of the hour configured as businessHourStartTimeUtc.
Examples -
- If your region is 6 hours ahead of UTC (UTC +6), and you want to receive an alert between 10 AM and 4 PM, you can set businessHourStartTimeUtc to "4", as 10 AM in your region is 4 AM in UTC.
- If your region is 8 hours behind UTC (UTC -8), and you want to receive an alert between 10 AM and 4 PM, you can set businessHourStartTimeUtc to "18", as 10 AM in your region is 6 PM in UTC.
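The conversion used in these examples is simple modular arithmetic. The helper below is a hypothetical Python sketch for working out the value to put in the config; it assumes a whole-hour UTC offset and does not account for daylight saving time.

```python
def business_hour_start_utc(local_start_hour, utc_offset_hours):
    """Convert a local start hour (0-23) to the UTC hour string for the config.

    utc_offset_hours is the region's offset from UTC, for example 6 for
    UTC +6 or -8 for UTC -8. Whole-hour offsets only (an assumption).
    """
    if not 0 <= local_start_hour <= 23:
        raise ValueError("local_start_hour must be between 0 and 23")
    return str((local_start_hour - utc_offset_hours) % 24)
```

For the two examples above, `business_hour_start_utc(10, 6)` gives "4" and `business_hour_start_utc(10, -8)` gives "18".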
Here is an example of how to specify businessHourStartTimeUtc in the config for a team that wants to receive alerts between 4 AM and 10 AM UTC.
{
"extensionName": "Your_Extension_Name",
"businessHourStartTimeUtc": "4",
"enabled": true,
"environments": [
...
]
...
}
If no value is specified for businessHourStartTimeUtc, alerts are triggered in Pacific Time business hours by default.
Extension teams can overwrite the default TSG links in IcM that are set by the Azure Portal team by specifying their own TSG links in the extension's customization JSON. The TSG link can be any valid URL that points to a TSG owned by the extension team, although a URL in Engineering Hub is preferred because it is accessible in air-gapped clouds without extra work. Engineering Hub translates it to a URL in air-gapped clouds behind the scenes.
Here is an example of how to specify the tsgLinks in the customization JSON config.
{
"extensionName": "Your_Extension_Name",
"enabled": true,
"tsgLinks": {
"availability":"https://extension_availability_TSG_link_preferably_a_url_in_engineering_hub",
"clientError":"any_valid_url",
"create":"https://aka.ms/your_extension_name_portalfx_create_alert_TSG_link",
"extensionAgeOverdue":"https://extension_age_overdue_alert_TSG_link",
"performance":"https://your_extension_name_performance_alert",
"telemetryThrottled":"https://aka.ms/telemetry_throttled_TSG"
}
...
}
You don't need to specify the TSG links for all alert types. For alert types that are not specified in tsgLinks, IcM uses the default TSG link that is owned by the Portal framework team.
By default, all alerts are fired against the Azure Portal (IbizaFx) team, and the IbizaFx team maintains an IcM routing table by which alerts are routed to different services and teams. Since IcM does not support secondary routing, once extension partners receive an IcM incident, they cannot route it to another service or team in IcM even if they have their own IcM routing table. The workaround is to fire alerts directly to you (the extension partner) in IcM, which requires your team to create a custom connector in IcM, onboard a certificate to it and add the connector ID to the customization JSON.
[If your extension is made up of sub-teams] - With this setup you can also route these alerts to your team's IcM and then set up your own routing rules, which can check the blade name or other properties and route to the appropriate sub-team.
Onboard a custom IcM connector per cloud instance following the IcM doc Onboard a connector for a Service. The alerting service uses the custom connector owned by your team to inject IcM incidents directly into your service in IcM. Only the service admin has the rights to onboard a new connector or update an existing one.
The certificate is used by the IcM service to authenticate the alerting service that sends incidents to it. We use an Azure Key Vault managed certificate. You will need to add DSTS.PHMS.AZUREWEBSITES.NET as one of the "Certificates SAN(s)" on the connector edit or onboarding page.
Submit and complete a PR to add the IcM connector info to the alerting customization JSON so that the alerting service knows which connector to use when sending incidents for that extension. The supported cloud values are Public, Fairfax, Mooncake and air-gapped clouds.
If one or more clouds are not specified in the customization JSON, the IcM incidents for those cloud instances are created and sent to the Azure Portal (IbizaFx) team through IbizaFx's custom connector, and IbizaFx's IcM routing rule auto-routes the incidents to the corresponding service and team in IcM.
{
"extensionName":"Your_Extension_Name",
"enabled":true,
"icmConnectors":[
{
"connectorId":"12345678-abcd-abcd-abcd-123456789012",
"cloud":"Public"
},
{
"connectorId":"87654321-dcba-dcba-dcba-210987654321",
"cloud":"Mooncake"
},
...
],
"environments":[
...
]
}
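The per-cloud fallback described before this example can be pictured as a simple lookup. The Python sketch below is hypothetical and only mirrors the documented behaviour: it returns the extension's connector ID for a cloud when one is configured, and None to signal that the incident would go through IbizaFx's connector instead.

```python
def pick_connector_id(customization, cloud):
    """Return the extension's connectorId for a cloud, or None if not configured.

    None stands for the documented fallback: the incident is sent through
    IbizaFx's custom connector and routed by IbizaFx's IcM routing rule.
    """
    for connector in customization.get("icmConnectors", []):
        if connector.get("cloud") == cloud:
            return connector.get("connectorId")
    return None
```

With the example above, "Public" and "Mooncake" resolve to the extension's own connectors, while "Fairfax" returns None and falls back to IbizaFx.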
The alerting service sends IcM incidents through the customized connectors once step 2 is complete.
The last step is to create routing rules on the IcM site to route different IcM incidents to different teams. Please reach out to the IcM support team to opt in to the Export/Import feature to bulk update IcM routing rules.
A: Alerting runs off customization JSONs that live in the Alerting Repo. All non-create alert customization JSONs are located at products/{YourServiceNameInIcM}/{ExtensionName}.alerting.json. All create alert customization JSONs are located at products/IbizaFx/Create/{ExtensionName}.create.alerting.json.
A: Submit and complete a Pull Request on your extension's customization JSON in the Alerting Repo. The update is 'live' once the Pull Request is complete.
For each extension there is an owners.txt in the same or parent folder as the JSON. The owners.txt has AAD-enabled email aliases and/or individual MSFT aliases. Anyone from owners.txt can approve the Pull Request. The owners.txt is created and maintained by the extension team.
A: Azure Portal partner team's IcM info is collected during partner onboarding process and is stored in the extension config. An IcM routing rule is added under Azure Portal (Ibiza) service in IcM to route incidents to corresponding partners.
The IcM routing rule is in format 'AIMS://AZUREPORTAL\Portal{ExtensionName}'.