[Sponsorships] Setup the secondary Azure subscription to consume the sponsor credits #3818

Closed
dduportal opened this issue Nov 14, 2023 · 29 comments

@dduportal
Contributor

We've been given $40,000 of Azure sponsorship credits: we should start using them as soon as possible to decrease the current Azure bill paid by the CDF for us.

This issue tracks the associated work for this.


Ideas on "how to consume these credits":

  • Run ci.jenkins.io existing Azure ephemeral workloads:

    • ACI (Windows containers)
    • buildPlugin with Docker CE (Azure VMs)
    • ATH with Docker CE (Azure VMs) (separate from buildPlugin to ensure we can keep track of costs)
    • BOM builds (VM first, and eventually an AKS cluster in addition to EKS)
  • Run other controller ephemeral workloads: infra.ci ?

@dduportal dduportal added this to the infra-team-sync-2023-11-21 milestone Nov 14, 2023
@dduportal
Contributor Author

dduportal commented Nov 16, 2023

Next steps:

@dduportal
Contributor Author

dduportal commented Nov 22, 2023

Update:

dduportal added a commit to jenkins-infra/azure that referenced this issue Nov 22, 2023
…r agents management in Azure (#516)

This PR is a (mandatory) preparatory step for
jenkins-infra/helpdesk#3818 .

It splits the controller/azurevm-agents/aci-agent scopes into 3 terraform
modules. The goal is to allow instantiating the (*)agents module for the
new sponsorship subscription without repeating code.


To avoid any breakage on the principal branch (which uses the latest
reference of the `main` branch on
https://github.com/jenkins-infra/shared-tools/), I've created 3 brand
new modules in
jenkins-infra/shared-tools@c7ec5b0
. It should make the PR here autonomous to merge (but the aforementioned
commit is also to be reviewed and we can update modules).


💡 A few notes on the introduced changes:

- The ci.jenkins.io's Network Security Group rule
`allow_outbound_ssh_from_ci_controller_to_s390x` is removed. It is now
integrated into the new controller module as one of the agent IP
prefixes passed as argument.
- Same for the trusted.ci.jenkins.io's
`allow_outbound_ssh_from_controller_to_permanent_agent`
- The 3 Network Security Group rules
`allow_inbound_ssh_from_controller_to_ephemeral_agents` (1 for each
controller) are changed from a single `source_address_prefix` to
`source_address_prefixes` collection
- For ci.jenkins.io, it also adds the Public IP of the controller VM to
this collection (along with the private VM IP) to cover cases where the
requests are routed through the Internet instead of the internal network
peerings
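
To illustrate the last two notes, here is a minimal Python sketch (not the Terraform code from this PR) of how a `source_address_prefixes` collection holding both controller addresses would match incoming SSH sources. The function names and the IP addresses (`10.0.0.4`, `203.0.113.10`) are hypothetical placeholders, not the real controller IPs.

```python
import ipaddress
from typing import List, Optional


def build_source_prefixes(private_ip: str, public_ip: Optional[str] = None) -> List[str]:
    """Return the CIDR prefixes an inbound SSH rule would accept (assumed shape)."""
    candidates = [private_ip]
    if public_ip:
        # Covers requests routed through the Internet instead of the peerings.
        candidates.append(public_ip)
    prefixes: List[str] = []
    for ip in candidates:
        # A bare address is normalised to a single-host prefix (/32 for IPv4).
        cidr = str(ipaddress.ip_network(ip, strict=False))
        if cidr not in prefixes:
            prefixes.append(cidr)
    return prefixes


def is_allowed(source_ip: str, prefixes: List[str]) -> bool:
    """True when source_ip falls inside at least one allowed prefix."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(p) for p in prefixes)


# Hypothetical controller addresses, for illustration only.
prefixes = build_source_prefixes("10.0.0.4", "203.0.113.10")
print(prefixes)
print(is_allowed("203.0.113.10", prefixes))
```

With a single `source_address_prefix`, only one of the two addresses could be listed, so SSH connections arriving from the other one would be rejected.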

---------

Signed-off-by: Damien Duportal <[email protected]>
dduportal added a commit to jenkins-infra/azure that referenced this issue Nov 22, 2023
…sorship subscription (#519)

Related to jenkins-infra/helpdesk#3818

This PR adds resources for ci.jenkins.io in the "sponsorship"
subscription to allow spinning up azure-vm and aci agents

Signed-off-by: Damien Duportal <[email protected]>
dduportal added a commit to jenkins-infra/azure that referenced this issue Nov 22, 2023
…n to store agent NSG (#520)

Related to jenkins-infra/helpdesk#3818

Fixup of #519 to correct the error

```
 │ Error: creating/updating Network Security Group: (Name "ci.jenkins.io-ephemeralagents" / Resource Group "ci-jenkins-io-controller"): network.SecurityGroupsClient#CreateOrUpdate: Failure sending request: StatusCode=404 -- Original Error: Code="ResourceGroupNotFound" Message="Resource group 'ci-jenkins-io-controller' could not be found."
```


----

Note that the SP's permissions have been increased to correct the
following errors seen on the main branch:

```
│ Error: authorization.RoleAssignmentsClient#Create: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client '<redacted>' with object id '<redacted>' does not have authorization to perform action 'Microsoft.Authorization/roleAssignments/write' over scope '/subscriptions/<redacted>/resourceGroups/ci-jenkins-io-ephemeral-agents/providers/Microsoft.Authorization/roleAssignments/e6e75982-06dc-57fd-1743-3a2648e0546f' or the scope is invalid. If access was recently granted, please refresh your credentials."
```

Signed-off-by: Damien Duportal <[email protected]>
@dduportal
Contributor Author

Update:

WiP:

@dduportal
Contributor Author

Update:

  • Credentials on ci.jenkins.io:

    • Added azure-jenkins-sponsorship-credential, same as the current azure-credential except for the subscription ID of course
    • Updated definition of both credentials
    • Used the "Check Service Principal" on both credentials with success
  • Deployed the new JCasc configuration adding the secondary Azure VM cloud with success:

WiP: validation in progress on ci.jenkins.io (the top-level cloud is valid, but the network specification on each template needs to be updated: incoming puppet code changes)

dduportal added a commit to jenkins-infra/azure that referenced this issue Nov 23, 2023
…g the agent vnet (#522)

Related to jenkins-infra/helpdesk#3818

This PR adds missing permissions allowing the ci.jenkins.io's SP to read
the vnet in which it will spawn the agents for the new subscription.

Please note there might be improvements to be made to have this setup in
the terraform module for controller in the long term.


Tested and applied locally: I'll self-merge this PR and watch the build
on the main branch.

Signed-off-by: Damien Duportal <[email protected]>
@dduportal
Contributor Author

Update (wip):

@dduportal
Contributor Author

Update:

  • Quota increased by Microsoft for normal vCPUs, but they refused for spot vCPUs as they have too many spot requests these days
  • Initial manual testing showed an error around the disk encryption. We had to enable the `Microsoft.Compute/EncryptionAtHost` feature at the subscription level to avoid the following error:
{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"InvalidParameter","message":"The property 'securityProfile.encryptionAtHost' is not valid because the 'Microsoft.Compute/EncryptionAtHost' feature is not enabled for this subscription."}]}
  • Manual test was successful after manually disabling the spot mode (otherwise quota blocks us).

  • WiP:

    • Puppet configuration to allow disabling spot at subscription level
    • Check the initial costs (as it is my credit card in the new subscription)
  • TODO:

    • Add resources for ACI agents
    • Add resources for trusted.ci azure vm agents
    • Add resources for cert.ci azure vm agents
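
The deployment error quoted in this update nests the actual cause inside its `details` array. A minimal Python sketch for pulling out the inner code and message, assuming other failures use the same payload shape (the helper name `inner_errors` is hypothetical):

```python
import json
from typing import List, Tuple

# Error payload copied from the update above.
PAYLOAD = '''{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"InvalidParameter","message":"The property 'securityProfile.encryptionAtHost' is not valid because the 'Microsoft.Compute/EncryptionAtHost' feature is not enabled for this subscription."}]}'''


def inner_errors(payload: str) -> List[Tuple[str, str]]:
    """Return (code, message) pairs from the nested `details` entries."""
    error = json.loads(payload)
    return [(d.get("code", ""), d.get("message", "")) for d in error.get("details", [])]


for code, message in inner_errors(PAYLOAD):
    print(f"{code}: {message}")
```

Here the outer `DeploymentFailed` code is generic, while the inner entry names the subscription feature that had to be enabled.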

@dduportal
Contributor Author

Update:

  • Confirmed with @MarkEWaite that we started to consume credits in the new subscription (VM agents spun up by ci.jenkins.io), billing being automatic

  • WiP: ACI (ci.jenkins.io)

    • Check for (secondary) controller SP permissions or create another one
    • Add the ACI JCasc config in puppet (jenkins-infra/jenkins-infra)
    • Validate manually
  • Next steps:

    • trusted.ci.jenkins.io (same as ci.jenkins.io)
      • [ ] Create network resources in jenkins-infra/azure-net (vnet, subnet, peering with trusted)
      • [ ] Create agent-related resources in jenkins-infra/azure (ephemeral agents module, additional network rules/Azure AD permissions)
      • Add secondary controller SP as a credential in trusted.ci controller UI
      • Add JCasc agent configuration in jenkins-infra/jenkins-infra
      • Validate manually an agent

@dduportal
Contributor Author

Adding figures about packer billing (which justify the effort to move it to the sponsored subscription):

In the past 6 months, it cost us between $200 and $300 each month:

Screenshot 2023-12-19 at 10 22 12

See the past few days:

Screenshot 2023-12-19 at 07 56 36

Of course, the main cost center is the resource group with the production images gallery: it means we should garbage collect the old VM templates (or decrease the current garbage collector retention time) to decrease this cost.
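
As a rough illustration of the retention idea, here is a minimal Python sketch that selects image versions older than a retention window while always keeping the newest ones. It is not the existing garbage collector: the function name, the `keep_latest` safeguard, the image names and the dates are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def select_expired(
    images: List[Tuple[str, datetime]],
    now: datetime,
    retention_days: int,
    keep_latest: int = 1,
) -> List[str]:
    """Return names of images older than the retention window, newest first.

    The `keep_latest` most recent images are never selected, so a controller
    always has at least one image available.
    """
    cutoff = now - timedelta(days=retention_days)
    newest_first = sorted(images, key=lambda item: item[1], reverse=True)
    protected = {name for name, _ in newest_first[:keep_latest]}
    return [
        name
        for name, created_at in newest_first
        if created_at < cutoff and name not in protected
    ]


# Hypothetical image versions and creation dates.
images = [
    ("image-a", datetime(2023, 9, 1)),
    ("image-b", datetime(2023, 11, 1)),
    ("image-c", datetime(2023, 12, 10)),
]
now = datetime(2023, 12, 19)
print(select_expired(images, now, retention_days=90))
print(select_expired(images, now, retention_days=30))
```

Shortening the retention window from 90 to 30 days makes one more image eligible for removal in this made-up data set, which is the lever mentioned above for reducing gallery storage costs.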

@dduportal
Contributor Author

dduportal commented Dec 20, 2023

To use the new subscription for packer:

* [x]   migrate packer-builds resource groups (3 RGs) [azure] [feat(packer): move packer-builds to the new provider azure sponsored azure#556](https://github.com/jenkins-infra/azure/pull/556)

* [x]  create a new azure SP credential within infra.ci for the packer image job on the new subscription

* [ ]  update packer-image jenkins pipeline with the new credential

* [ ]  temporarily disable spot instance usage within packer (sources.pkr.hcl) / add quotas for packer machines within the new subscription (and try spot instances again)

* [ ]  split var.azure_subscription_id to provide the legacy subscription for the image gallery destination

* [ ]  allow infra.ci agents to reach packer VMs with the WinRM protocol (SSH is already allowed):
  
  * [ ]   change packer-image to use the brand new private network

While @smerle33 leads the plan above, I'll lead the migration of the packer images resource groups:

  • Create the new resource groups *packer-images and associated resources in the new subscription (terraform project jenkins-infra/azure)
    • Create RGs
    • Check whether shared gallery names must be unique across all of Azure or only per RG/subscription (and either create new ones with the same name or migrate the existing ones)
  • Add read permissions on the new production images on all controllers using Azure VMs
  • Ensure there is at least one production image (either from migration or release a new one after @smerle33 work is finished)
  • Migrate controllers to the new image
  • Improve GC on the "current subscription" to decrease image removal
  • Remove old RGs
  • Migrate GC to new subscription

dduportal added a commit to jenkins-infra/azure that referenced this issue Dec 21, 2023
Ref.
jenkins-infra/helpdesk#3818 (comment)

This PR creates the shared gallery in the new subscription:

- 3 resource groups (dev, staging and prod) with one gallery each
- 4 images on each gallery

IMPORTANT: this PR sets the ground to move everything to US East 2
(faster packer builds, and we have not used East US for agents for
1.5 years). It cannot do all "eastus" -> "eastus**2**" changes yet though,
as changing location marks a resource group/gallery to be deleted, while
we only want to create new resources (terraform forgets the old resource
when only changing provider).

IMPORTANT (2): I've removed the 4 role assignments which are required
for the 4 controllers (ci, trusted, cert and infra) to read the shared
gallery to spin up agents. The build >= 3 for this PR should only mark 3
resources to delete (the role assignment of the packer_sp itself):

```
terraform state rm 'module.cert_ci_jenkins_io.azurerm_role_assignment.controller_read_packer_prod_images[0]'
terraform state rm 'module.trusted_ci_jenkins_io.azurerm_role_assignment.controller_read_packer_prod_images[0]'
terraform state rm 'module.ci_jenkins_io.azurerm_role_assignment.controller_read_packer_prod_images[0]'
terraform state rm 'azurerm_role_assignment.infra_ci_jenkins_io_allow_packer'
```
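
For reference, a minimal Python sketch that generates the same kind of `terraform state rm` commands from a list of controller modules, assuming the resource addresses follow the pattern shown above. The function name `state_rm_commands` is hypothetical; the sketch only builds the command strings and does not run Terraform.

```python
import shlex
from typing import Iterable, List

# Module names taken from the commands above.
CONTROLLER_MODULES = ["cert_ci_jenkins_io", "trusted_ci_jenkins_io", "ci_jenkins_io"]
# Address declared outside of a module in the commands above.
EXTRA_ADDRESSES = ["azurerm_role_assignment.infra_ci_jenkins_io_allow_packer"]


def state_rm_commands(modules: Iterable[str], extra_addresses: Iterable[str]) -> List[str]:
    """Build one `terraform state rm` command per resource address."""
    addresses = [
        f"module.{module}.azurerm_role_assignment.controller_read_packer_prod_images[0]"
        for module in modules
    ]
    addresses.extend(extra_addresses)
    # shlex.quote protects the `[0]` index from shell globbing.
    return [f"terraform state rm {shlex.quote(address)}" for address in addresses]


for command in state_rm_commands(CONTROLLER_MODULES, EXTRA_ADDRESSES):
    print(command)
```

Quoting matters here because an unquoted `[0]` can be interpreted as a glob pattern by the shell. `shlex.quote` leaves the last address unquoted since it contains no special characters, which is equivalent for the shell.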

Signed-off-by: Damien Duportal <[email protected]>
@dduportal
Contributor Author

Update: @smerle33 and I are working in parallel (and pair) on both plans.

While working on jenkins-infra/packer-images#959, he was stuck on location issues.

As such, we are moving all new resources (in the new subscription) to "US East 2" as we only use this location (see jenkins-infra/azure#560 and jenkins-infra/azure#561)

dduportal added a commit to jenkins-infra/azure that referenced this issue Dec 21, 2023
Follow up of #560 

Ref.
jenkins-infra/helpdesk#3818 (comment)

This PR ensures that all the packer resources defined in the new
subscription (and only these) are migrated to US East 2 to solve errors
found in jenkins-infra/packer-images#959


Expecting 23 resources to be re-created:
- 4 of the 6 RGs, which are in US East today
- 4 role assignments (as the 4 RGs changed)
- 3 galleries
- 12 images (4 per gallery)
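
As a quick sanity check of the expected count, here is a small Python sketch that adds up the figures listed above (the dictionary keys are just labels chosen for this example):

```python
# Figures taken from the list above.
GALLERIES = 3
IMAGES_PER_GALLERY = 4

expected_recreations = {
    "resource_groups": 4,
    "role_assignments": 4,
    "galleries": GALLERIES,
    "images": GALLERIES * IMAGES_PER_GALLERY,
}

total = sum(expected_recreations.values())
print(total)
```

A different number in the terraform plan would suggest that resources outside the new subscription are affected.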

Signed-off-by: Damien Duportal <[email protected]>
@dduportal
Contributor Author

Update: most of the work done by @smerle33 in jenkins-infra/packer-images#959:

  • Changed instance size + disabled spot (to ensure we stay in allocated quotas)
  • Switched to the new credential for the subscription + allowed overriding the subscription for the gallery to cover all cases
  • Restrict to US East 2 region only
  • Moved to using a private network (no public IP, finer control on machines access)

Wip:

  • Switching to a private network causes timeouts when reaching machines through SSH or WinRM: we have to specify custom NSG rules to cross subnets from infra.ci VM agents (VM/kube) to the packer net.

dduportal added a commit to jenkins-infra/azure that referenced this issue Dec 21, 2023
#563)

Ref.
jenkins-infra/helpdesk#3818 (comment)

This PR adds network security rules to allow packer processes running in
the infra.ci *Azure VM* agents to reach packer VMs (in their own subnet)
through SSH or WinRM

Signed-off-by: Damien Duportal <[email protected]>
dduportal added a commit to dduportal/jenkins-infra that referenced this issue Dec 23, 2023
dduportal added a commit to jenkins-infra/jenkins-infra that referenced this issue Dec 23, 2023
@dduportal
Contributor Author

Update:

  • Packer builds are fully utilizing the new subscription and the release 1.43.0 has been done in the new subscription

WiP:

  • Bump packer image version to 1.43.0 on all controllers (which requires switching gallery configuration to new subscription)
  • Fix errors related to the new subscription

@dduportal
Contributor Author

Update:

  • All controllers are using the new image in the new subscription
  • Removed the packer resources from the former subscription

@dduportal
Contributor Author

Reopening: #3875 (comment)

The crawler issue is most probably caused by the trusted ephemeral agent CIDR changes

@dduportal
Contributor Author

Update:

  • Costs check:
    • We can clearly view the resource migration. The bill decreased by 2% to 6% depending on the month
Screenshot 2024-01-02 at 11 05 24 Screenshot 2024-01-02 at 11 09 07
  • The new (sponsored) subscription is now used (almost $1k consumed). Check below the top-level consumption per resource kind. Please note that we cannot use Spot instances in it for now (not enough quota when we asked last month):
Screenshot 2024-01-02 at 11 06 01 Screenshot 2024-01-02 at 11 06 31

@dduportal
Contributor Author

Closing as #3875 is NOT caused by the new networks.
