Added support for Azure Bastion. #1300

Merged
merged 1 commit into from
Jun 3, 2021

Conversation

whites11
Contributor

/kind feature

What this PR does / why we need it:

Towards: #165

This PR adds support for the first of 3 bastion host implementations for CAPZ: the one using Azure Bastion.

The PR adds a new field to the AzureCluster CR that looks like this:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
kind: AzureCluster
metadata:
  name: d3m3v
  namespace: default
spec:
  bastionSpec:
    azureBastion:
      name: "azurebastion1"
      subnet: *SubnetSpec
      publicIP:
        name: "azurebastion1-publicIP"

By default the bastionSpec field is empty, so the azureBastion field is null and the feature is disabled.
To enable the feature, a user only needs to set the azureBastion field; its other fields are optional and are defaulted.

Once the feature is enabled, removing it from a cluster is unsupported (see slack thread).

It is possible to enable the azureBastion feature on an existing v1alpha4 cluster though.
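
As an illustration of the enable/default behaviour described above, here is a minimal Go sketch. The type names mirror the YAML fields but are simplified, hypothetical stand-ins rather than the actual CAPZ API types, and the default name formats are assumptions made for the example (only the subnet name AzureBastionSubnet reflects a real Azure requirement).

```go
package main

import "fmt"

// SubnetSpec and PublicIPSpec are trimmed-down, hypothetical stand-ins
// for the corresponding CAPZ types.
type SubnetSpec struct {
	Name string
}

type PublicIPSpec struct {
	Name string
}

// AzureBastion mirrors the azureBastion block from the YAML above.
type AzureBastion struct {
	Name     string
	Subnet   SubnetSpec
	PublicIP *PublicIPSpec
}

// BastionSpec mirrors bastionSpec. A nil AzureBastion means the feature is disabled.
type BastionSpec struct {
	AzureBastion *AzureBastion
}

// defaultAzureBastion fills in optional fields only when the feature is enabled.
// The name formats used here are assumptions, not the PR's actual defaults.
func defaultAzureBastion(clusterName string, spec *BastionSpec) {
	if spec.AzureBastion == nil {
		return // feature disabled: nothing to default
	}
	b := spec.AzureBastion
	if b.Name == "" {
		b.Name = clusterName + "-azure-bastion"
	}
	if b.Subnet.Name == "" {
		// Azure requires the bastion subnet to carry this exact name.
		b.Subnet.Name = "AzureBastionSubnet"
	}
	if b.PublicIP == nil {
		b.PublicIP = &PublicIPSpec{}
	}
	if b.PublicIP.Name == "" {
		b.PublicIP.Name = b.Name + "-publicIP"
	}
}

func main() {
	disabled := BastionSpec{}
	defaultAzureBastion("d3m3v", &disabled)
	fmt.Println("enabled:", disabled.AzureBastion != nil)

	enabled := BastionSpec{AzureBastion: &AzureBastion{}}
	defaultAzureBastion("d3m3v", &enabled)
	fmt.Println(enabled.AzureBastion.Name, enabled.AzureBastion.Subnet.Name, enabled.AzureBastion.PublicIP.Name)
}
```

Using a pointer for AzureBastion is what lets "unset" and "set but empty" be told apart, which matches the "null means disabled" semantics in the description.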

Special notes for your reviewer:

Not sure where to put documentation about the feature, guidance would be appreciated.

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

No image version changes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Added support for using [Azure Bastion](https://azure.microsoft.com/en-us/services/azure-bastion/) to get console access to virtual machines in the cluster through the Azure Portal.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 12, 2021
@k8s-ci-robot
Contributor

Hi @whites11. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 12, 2021
@k8s-ci-robot k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Apr 12, 2021
@CecileRobertMichon
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 12, 2021
@shysank
Contributor

shysank commented Apr 12, 2021

@whites11 can you add the copyright info to any new files you created? I think pull-cluster-api-provider-azure-verify is failing because of that.

@whites11
Contributor Author

@whites11 can you add the copyright info to any new files you created? I think pull-cluster-api-provider-azure-verify is failing because of that.

Fixed, thx

@whites11
Contributor Author

@whites11 can you add the copyright info to any new files you created? I think pull-cluster-api-provider-azure-verify is failing because of that.

Not sure how to fix the apidiff job. Any hints?

@shysank
Contributor

shysank commented Apr 12, 2021

Not sure how to fix the apidiff job. Any hints?

It doesn't have to be fixed. They are just warnings when there are changes in api.

Contributor

@devigned devigned left a comment

I think the conversion code can be slimmed down a bit. I don't think the extra bastion annotation is needed, as the entire converted structure should be available via the utilconversion annotation.

api/v1alpha3/azurecluster_conversion.go (three outdated, resolved review threads)
Contributor

@devigned devigned left a comment

One small nit, but overall the PR looks great!

I do have a concern about not having an e2e test to ensure the functionality doesn't regress. Would you consider adding a bastion to the existing e2e cluster deployment (perhaps, ./templates/test/ci/prow) and a test verifying successful deployment of the bastion?

I wonder if the bastion shouldn't be part of the default template... @CecileRobertMichon wdyt?

@@ -58,6 +58,8 @@ func (src *AzureCluster) ConvertTo(dstRaw conversion.Hub) error { // nolint
dst.Spec.NetworkSpec.APIServerLB.FrontendIPsCount = restored.Spec.NetworkSpec.APIServerLB.FrontendIPsCount
dst.Spec.NetworkSpec.NodeOutboundLB = restored.Spec.NetworkSpec.NodeOutboundLB

Contributor

nit: please remove the blank line between lines 59 and 61.

Contributor Author

Blank line removed.
Regarding the e2e test, I will work on that.

@CecileRobertMichon
Contributor

I promise to take a closer look tomorrow but in the meantime, would love to see some user-facing docs as part of this PR.

re: testing/flavors, I'm worried we're adding too many different flavors and tests, one for each feature; we should start combining some. In what use case does it make the most sense for a user to turn this on? A private API server with no SSH, maybe? Or is it useful even when SSH access is possible? I wouldn't add it to the "default" flavor template if it's going to incur extra cost for every user (especially quick start users), but it might make sense to add it to the default-prow template (or just the private cluster template, if that fits the use case better).

@whites11
Contributor Author

During office hours earlier today we agreed to enable the Azure Bastion feature in the "internal networking" e2e test and add checks to that test, rather than have a dedicated e2e test just for this feature.

@whites11
Contributor Author

Can anybody please help me a little with the e2e tests?
What do I need to test? I expected to test that, when the Azure Bastion feature is enabled in a cluster, the Azure Bastion cloud object gets created and becomes healthy.
The problem is that I don't see any example of this kind of test in the code.
Can anybody please provide some guidance?

@devigned
Contributor

@whites11 I don't think we are checking the Azure resources are created, but rather the functionality is working. For example, if we apply a machine deployment, we would then check if we can run workloads on those nodes. Perhaps, the same strategy could be used in the bastion. One example that comes to mind is: "Can you ssh into the bastion and then tunnel to one of the private nodes?". wdyt?

@whites11
Contributor Author

@whites11 I don't think we are checking the Azure resources are created, but rather the functionality is working. For example, if we apply a machine deployment, we would then check if we can run workloads on those nodes. Perhaps, the same strategy could be used in the bastion. One example that comes to mind is: "Can you ssh into the bastion and then tunnel to one of the private nodes?". wdyt?

Thanks for the reply. Azure Bastion is a web-based feature, so I doubt I can test that it works with a command-line tool. I could use something like Cypress to log in to the Azure Portal and click through the Azure Bastion feature, but that feels like overkill. WDYT?

@devigned
Contributor

devigned commented Apr 20, 2021

Hmm... here's an example of setting up an Azure client and calling out to Azure Resource Manager (ARM) to get event logs. Perhaps, use a similar pattern where you create a client, request the bastion resource from ARM, and verify the properties you expect are set.

func (acp *AzureClusterProxy) collectActivityLogs(ctx context.Context, aboveMachinesPath string) {
	timeoutctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	settings, err := auth.GetSettingsFromEnvironment()
	Expect(err).NotTo(HaveOccurred())
	subscriptionID := settings.GetSubscriptionID()
	authorizer, err := settings.GetAuthorizer()
	Expect(err).NotTo(HaveOccurred())
	activityLogsClient := insights.NewActivityLogsClient(subscriptionID)
	activityLogsClient.Authorizer = authorizer
	groupName := os.Getenv(AzureResourceGroup)
	start := time.Now().Add(-2 * time.Hour).UTC().Format(time.RFC3339)
	end := time.Now().UTC().Format(time.RFC3339)
	itr, err := activityLogsClient.ListComplete(timeoutctx, fmt.Sprintf("eventTimestamp ge '%s' and eventTimestamp le '%s' and resourceGroupName eq '%s'", start, end, groupName), "")
	if err != nil {
		// Failing to fetch logs should not cause the test to fail
		Byf("Error fetching activity logs for resource group %s: %v", groupName, err)
		return
	}
	logFile := path.Join(aboveMachinesPath, activitylog, groupName+".log")
	Expect(os.MkdirAll(filepath.Dir(logFile), 0755)).To(Succeed())
	f, err := os.OpenFile(logFile, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		// Failing to fetch logs should not cause the test to fail
		Byf("Error opening file to write activity logs: %v", err)
		return
	}
	defer f.Close()
	out := bufio.NewWriter(f)
	defer out.Flush()
	for ; itr.NotDone(); err = itr.NextWithContext(timeoutctx) {
		if err != nil {
			Byf("Got error while iterating over activity logs for resource group %s: %v", groupName, err)
			return
		}
		event := itr.Value()
		if to.String(event.Category.Value) != "Policy" {
			b, err := json.MarshalIndent(myEventData(event), "", " ")
			if err != nil {
				Byf("Got error converting activity logs data to json: %v", err)
			}
			if _, err = out.WriteString(string(b) + "\n"); err != nil {
				Byf("Got error while writing activity logs for resource group %s: %v", groupName, err)
			}
		}
	}
}

I was thinking about a VM bastion scenario, not the Azure Bastion. My fault.
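
The suggestion above (create a client, request the bastion resource from ARM, verify the expected properties) could be sketched roughly as follows. This is only an outline under stated assumptions: BastionGetter, BastionHost, fakeGetter and checkWithFake are hypothetical names introduced for the example, and a real e2e test would back the interface with an Azure SDK client authenticated as in the snippet above. The "Succeeded" value is the usual ARM provisioning state for a healthy resource.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// BastionHost is a hypothetical, simplified view of the ARM bastion resource.
type BastionHost struct {
	Name              string
	ProvisioningState string
}

// BastionGetter is a hypothetical interface; a real test would implement it
// with an Azure SDK client instead of the fake below.
type BastionGetter interface {
	Get(ctx context.Context, resourceGroup, name string) (BastionHost, error)
}

// verifyBastion fetches the bastion host and checks that it finished provisioning.
func verifyBastion(ctx context.Context, c BastionGetter, resourceGroup, name string) error {
	host, err := c.Get(ctx, resourceGroup, name)
	if err != nil {
		return fmt.Errorf("getting bastion host %q: %w", name, err)
	}
	if host.ProvisioningState != "Succeeded" {
		return fmt.Errorf("bastion host %q is in state %q, want Succeeded", name, host.ProvisioningState)
	}
	return nil
}

// fakeGetter is an in-memory stand-in keyed by bastion host name.
type fakeGetter map[string]BastionHost

func (f fakeGetter) Get(_ context.Context, _ string, name string) (BastionHost, error) {
	host, ok := f[name]
	if !ok {
		return BastionHost{}, errors.New("not found")
	}
	return host, nil
}

// checkWithFake runs verifyBastion against a fake holding one host in the given state.
func checkWithFake(lookupName, state string) error {
	fake := fakeGetter{"azurebastion1": {Name: "azurebastion1", ProvisioningState: state}}
	return verifyBastion(context.Background(), fake, "example-rg", lookupName)
}

func main() {
	fmt.Println(checkWithFake("azurebastion1", "Succeeded"))
	fmt.Println(checkWithFake("azurebastion1", "Failed"))
	fmt.Println(checkWithFake("missing", "Succeeded"))
}
```

Hiding the client behind a small interface keeps the assertion logic testable without Azure credentials.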

@whites11
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 3, 2021
@whites11
Contributor Author

whites11 commented May 4, 2021

I don't get why pull-cluster-api-provider-azure-verify fails: running make generate locally doesn't change any files.

@devigned
Contributor

devigned commented May 4, 2021

@whites11 perhaps, wipe out your ./hack/tools/bin directory. Maybe it's using an older version of a bin than the CI server.

@whites11
Contributor Author

whites11 commented May 4, 2021

@whites11 perhaps, wipe out your ./hack/tools/bin directory. Maybe it's using an older version of a bin than the CI server.

Unfortunately, that didn't work.
Could it be related to this error I get in the output (both locally and in CI)?

E0504 12:44:05.503496    1728 conversion.go:193] Rename function sigs.k8s.io/cluster-api-provider-azure/api/v1alpha3 Convert_v1alpha3_ManagedDisk_To_v1alpha4_ManagedDiskOptions -> Convert_v1alpha3_ManagedDisk_To_v1alpha4_ManagedDiskParameters to match expected conversion signature
E0504 12:44:05.503732    1728 conversion.go:193] Rename function sigs.k8s.io/cluster-api-provider-azure/api/v1alpha3 Convert_v1alpha4_ManagedDiskOptions_To_v1alpha3_ManagedDisk -> Convert_v1alpha4_ManagedDiskParameters_To_v1alpha3_ManagedDisk to match expected conversion signature

Contributor

@shysank shysank left a comment

A couple of nits and clarifications. I'm fine with opening a follow-up PR for the nits, as this has been around for a while and I don't want the lgtm to get removed. I just want to be sure about the subnet and reconcileDelete clarifications before approving.

api/v1alpha4/azurecluster_types.go (outdated, resolved)
azure/scope/cluster.go (outdated, resolved)
c.Spec.BastionSpec.AzureBastion.Name = generateAzureBastionName(c.ObjectMeta.Name)
}
// Ensure defaults for the Subnet settings.
{
Contributor

should we default security group and role here?

Contributor Author

The default is no security group and no role.
The fields are only there because they belong to the SubnetSpec type.

azure/services/bastionhosts/azurebastion.go (outdated, resolved)
@@ -254,6 +254,20 @@ func (s *ClusterScope) SubnetSpecs() []azure.SubnetSpec {
Role: s.NodeSubnet().Role,
},
}

if s.AzureCluster.Spec.BastionSpec.AzureBastion != nil {
azureBastionSubnet := s.AzureCluster.Spec.BastionSpec.AzureBastion.Subnet
Contributor

IIUC this means azureBastionSubnet will be reconciled in subnetsSvc as well as in bastionSvc. Is this intentional?

Contributor Author

This is actually a very good point! Not sure how the PR ended up managing subnet and public IP that way, but I removed my handlers and let the other services reconcile the subnet and public IP.
It feels much better now, thanks @shysank

Contributor

Can we leave reconciliation of the bastion subnet and bastion public IP to subnetSvc and publicIpSvc respectively, and have bastionSvc reconcile only the bastion host? Since subnets and public IPs are reconciled before the bastion, you should not have to check for their existence, and it will make the code much simpler as well. wdyt?

Contributor

@whites11 Just checking to see if you need any help or is this something you feel is not needed?
cc: @CecileRobertMichon for more 👀

Contributor

It makes sense for the bastion subnets and bastion public IP to be reconciled by subnetSvc and publicIpSvc.
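
The ordering argument in this thread can be illustrated with a small Go sketch. Everything here is hypothetical (the service struct, buildServices, the resource keys); it is not the CAPZ reconciler, only a model of "subnets and public IPs first, bastion host last, stop at the first error". The prerequisite check inside the bastion step exists purely to make the ordering observable in the demo.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// service is a hypothetical stand-in for one CAPZ service reconciler.
type service struct {
	name      string
	reconcile func(ctx context.Context) error
}

// reconcileInOrder runs services in sequence and stops at the first error,
// so a later service can rely on the earlier ones having succeeded.
func reconcileInOrder(ctx context.Context, services []service) error {
	for _, s := range services {
		if err := s.reconcile(ctx); err != nil {
			return fmt.Errorf("reconciling %s: %w", s.name, err)
		}
	}
	return nil
}

// buildServices returns subnets, public IPs and bastion host, in that order.
// The created map records which (hypothetical) resources exist so far.
func buildServices(created map[string]bool) []service {
	return []service{
		{name: "subnets", reconcile: func(context.Context) error {
			created["bastion-subnet"] = true
			return nil
		}},
		{name: "publicips", reconcile: func(context.Context) error {
			created["bastion-public-ip"] = true
			return nil
		}},
		{name: "bastionhosts", reconcile: func(context.Context) error {
			// Demo-only check: with the correct ordering this never fires.
			if !created["bastion-subnet"] || !created["bastion-public-ip"] {
				return errors.New("bastion prerequisites missing")
			}
			created["bastion-host"] = true
			return nil
		}},
	}
}

// runOrdered reconciles in the intended order.
func runOrdered() (map[string]bool, error) {
	created := map[string]bool{}
	return created, reconcileInOrder(context.Background(), buildServices(created))
}

// runBastionFirst deliberately reconciles the bastion host before its prerequisites.
func runBastionFirst() (map[string]bool, error) {
	created := map[string]bool{}
	s := buildServices(created)
	return created, reconcileInOrder(context.Background(), []service{s[2], s[0], s[1]})
}

func main() {
	created, err := runOrdered()
	fmt.Println(created["bastion-host"], err)

	created, err = runBastionFirst()
	fmt.Println(created["bastion-host"], err)
}
```

With a fixed order, the bastion step only has to reference the subnet and public IP, which is the simplification the reviewers are asking for.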

Contributor

@CecileRobertMichon CecileRobertMichon left a comment

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 19, 2021
@CecileRobertMichon
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 20, 2021
@whites11
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 27, 2021
@nader-ziada
Contributor

Tested the flow manually and confirmed the e2e test works as expected. I don't see any outstanding comments.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 31, 2021
@nader-ziada
Contributor

@whites11 you are going to have to generate again to get the verify to pass after the update to capi PR

@nader-ziada
Contributor

@whites11 you are going to have to generate again to get the verify to pass after the update to capi PR

can you also add something to the docs about the bastion and how to use it?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 1, 2021
@whites11
Contributor Author

whites11 commented Jun 1, 2021

@whites11 you are going to have to generate again to get the verify to pass after the update to capi PR

can you also add something to the docs about the bastion and how to use it?

Sure, mind suggesting where I should put those docs?

@nader-ziada
Contributor

You can either add a section to the private cluster page, https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/docs/book/src/topics/api-server-endpoint.md, or add a new topic under https://github.com/kubernetes-sigs/cluster-api-provider-azure/tree/master/docs/book/src/topics

@k8s-ci-robot
Contributor

@whites11: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-cluster-api-provider-azure-apidiff f737685 link /test pull-cluster-api-provider-azure-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@whites11
Contributor Author

whites11 commented Jun 3, 2021

OK, added a new topic here: https://github.com/whites11/cluster-api-provider-azure/blob/azure-bastion/docs/book/src/topics/ssh-access.md

@nader-ziada
Contributor

thanks for adding the docs @whites11

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 3, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nader-ziada

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 66e4073 into kubernetes-sigs:master Jun 3, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.5.0 milestone Jun 3, 2021
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

6 participants