Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure Batch worker nodes can use a Managed Identity to stage and upload data #5232

Open
adamrtalbot opened this issue Aug 16, 2024 · 9 comments · May be fixed by #5670
Open

Azure Batch worker nodes can use a Managed Identity to stage and upload data #5232

adamrtalbot opened this issue Aug 16, 2024 · 9 comments · May be fixed by #5670

Comments

@adamrtalbot
Copy link
Collaborator

adamrtalbot commented Aug 16, 2024

Originally posted by @adamrtalbot in #3314 (comment)

Now Managed Identities are supported by Nextflow (#4897), we should be able to modify the bash wrapper to adopt the Managed Identities. This is reasonably straightforward, as we can modify the bash wrapper to remove the SAS key and follow the instructions below. We might have to include an option to the Nextflow config:

azure {
    batch {
        pools {
            pool1 {
                managedIdentity {
                    system = true
                    // or
                    clientId = "$managedIdentityClientId"
                }
            }
        }
    }
}

and to be a perfect solution we should not generate a SAS key in the first place.

OK I've managed this, it seems pretty straightforward.

Firstly, I made an Azure Batch pool which was the same as the normal Nextflow pools, but with two changes:

  • The identity was a user assigned identity, which I selected from a set. I just used the UI so far and selected from a list. Docs: https://learn.microsoft.com/en-us/azure/batch/managed-identity-pools. Would be worth someone who actually understands Azure identities to check this out.
  • Modified the startTask to download a more recent version of AzCopy. Not a big deal. The script was this:
bash -c "tar -xzvf azcopy.tar.gz && chmod +x azcopy*/azcopy && mkdir $AZ_BATCH_NODE_SHARED_DIR/bin/ && cp azcopy*/azcopy $AZ_BATCH_NODE_SHARED_DIR/bin/"

and resource file was used this URL: https://aka.ms/downloadazcopy-v10-linux to file azcopy.tar.gz.

After this, you need to set some env variables to tell azcopy to authenticate automatically. This is unique to azcopy but I imagine there would be something similar if we used this system to access the Azure Key Vault to enable secrets. This was pretty straightforward with the env directive (and nothing too secret here). Docs here: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-authorize-azure-active-directory#authorize-by-using-a-system-wide-managed-identity.

I then ran the following Nextflow pipeline. I used Fusion to try and remove any secret azcopy authentication that may have occured. I also ran it on a different batch pool where it failed with this error message: Failed to perform Auto-login: ManagedIdentityCredential: no default identity is assigned to this resource.

main.nf:

process AZCOPY {
    container 'quay.io/nf-core/ubuntu:20.04'

    output:
        path("nf-1JnhkGMxTKrBuU.log"), emit: myFile

    """
    \$AZ_BATCH_NODE_SHARED_DIR/bin/azcopy copy 'https://account.blob.core.windows.net/path/nf-1JnhkGMxTKrBuU.log' nf-1JnhkGMxTKrBuU.log
    cat nf-1JnhkGMxTKrBuU.log
    """
}

workflow {
    AZCOPY()
}

nextflow.config:

workDir = "$AZURE_BATCH_WORK_DIR"

tower {
    enabled = true
}

wave {
    enabled = true
}

fusion {
    enabled = true
}

env {
    AZCOPY_JOB_PLAN_LOCATION = "."
    AZCOPY_AUTO_LOGIN_TYPE   = "MSI"
    AZCOPY_LOG_LOCATION      = "."
}

process {
    executor = 'azurebatch'
    queue = 'managed-identity'
}

azure {
    storage {
        accountName   = "$AZURE_STORAGE_ACCOUNT_NAME"
    }
    batch {
        location      = "$AZURE_BATCH_ACCOUNT_REGION"
        accountName   = "$AZURE_BATCH_ACCOUNT_NAME"
    }
}

All-in-all, it's just a couple of API call changes and updating azcopy, so nothing too troubling. Users will have to create a managed identity with the correct permissions so docs may take a bit longer because it's a bit fiddly. @vsmalladi would you or someone from msft be able to help?

Will try Azure Key Vault now.

@adamrtalbot adamrtalbot changed the title Azure Batch worker nodes can use Managed Identity to share data Azure Batch worker nodes can use Managed Identity to stage and upload data Aug 16, 2024
@adamrtalbot adamrtalbot changed the title Azure Batch worker nodes can use Managed Identity to stage and upload data Azure Batch worker nodes can use a Managed Identity to stage and upload data Aug 16, 2024
@adamrtalbot
Copy link
Collaborator Author

Looking at the Azure Java docs, we should be able to attach the managed identity in the case where Nextflow is creating the Azure node pools (e.g. when using the autopools feature). This will make deployment much simpler but will need the role to be increased.

https://learn.microsoft.com/en-en/java/api/com.microsoft.azure.batch.protocol.models.userassignedidentity

@pditommaso
Copy link
Member

I would expect it needs to be provided in the node pool creation

final poolParams = new BatchPoolCreateContent(spec.poolId, spec.vmType.name)
.setVirtualMachineConfiguration(poolVmConfig(spec.opts))
// same as the number of cores
// https://docs.microsoft.com/en-us/azure/batch/batch-parallel-node-tasks
.setTaskSlotsPerNode(spec.vmType.numberOfCores)
final startTask = createStartTask(spec.opts.startTask)
if( startTask ) {
poolParams .setStartTask(startTask)
}
// resource labels
if( spec.metadata ) {
final metadata = spec.metadata.collect { name, value ->
new MetadataItem(name, value)
}
poolParams.setMetadata(metadata)
}
// virtual network
if( spec.opts.virtualNetwork )
poolParams.setNetworkConfiguration( new NetworkConfiguration().setSubnetId(spec.opts.virtualNetwork) )
// scheduling policy
if( spec.opts.schedulePolicy ) {
final pol = BatchNodeFillType.fromString(spec.opts.schedulePolicy)
if( !pol ) throw new IllegalArgumentException("Unknown Azure Batch scheduling policy: ${spec.opts.schedulePolicy}")
poolParams.setTaskSchedulingPolicy( new BatchTaskSchedulingPolicy(pol) )
}
// mount points
if ( config.storage().fileShares ) {
List<MountConfiguration> mountConfigs = new ArrayList(config.storage().fileShares.size())
config.storage().fileShares.each {
if (it.key) {
final String accountName = config.storage().accountName
final endpoint = "https://${config.storage().accountName}.file.core.windows.net/${it.key}" as String
final accountKey = config.storage().accountKey
final shareConfig = new AzureFileShareConfiguration( accountName, endpoint, accountKey, it.key )
.setMountOptions(it.value.mountOptions)
mountConfigs << new MountConfiguration().setAzureFileShareConfiguration(shareConfig)
} else {
throw new IllegalArgumentException("Cannot mount a null File Share")
}
}
poolParams.setMountConfiguration(mountConfigs)
}
// autoscale
if( spec.opts.autoScale ) {
log.debug "Creating autoscale pool with id: ${spec.poolId}; vmCount=${spec.opts.vmCount}; maxVmCount=${spec.opts.maxVmCount}; interval=${spec.opts.scaleInterval}"
final interval = spec.opts.scaleInterval.seconds as int
poolParams
.setEnableAutoScale(true)
.setAutoScaleEvaluationInterval( Duration.of(interval, ChronoUnit.SECONDS) )
.setAutoScaleFormula(scaleFormula(spec.opts))
}
else if( spec.opts.lowPriority ) {
log.debug "Creating low-priority pool with id: ${spec.poolId}; vmCount=${spec.opts.vmCount};"
poolParams
.setTargetLowPriorityNodes(spec.opts.vmCount)
}
else {
log.debug "Creating fixed pool with id: ${spec.poolId}; vmCount=${spec.opts.vmCount};"
poolParams
.setTargetDedicatedNodes(spec.opts.vmCount)
}
apply(() -> client.createPool(poolParams))

@adamrtalbot
Copy link
Collaborator Author

adamrtalbot commented Nov 26, 2024

OK, here's a to-do list for this feature:

  • azcopy needs to be updated to a version which supports managed identities
  • The .command.run needs to updated to use a managed identity instead of a SAS when uploading/downloading files with azcopy
  • When using Fusion, Nextflow needs to tell the task that it needs to use a managed identity instead of a SAS
  • When Nextflow creates a node pool, it needs to be able to attach a managed identity
  • Update docs to explain additional requirements (i.e. increased permissions required)
  • Nextflow should check if a managed identity is attached to an existing node pool before it submits work to it (?)
  • A node pool should be able to use a managed identity for accessing private container registries

From my reading of the SDK docs, all of this is possible right now.

@pditommaso
Copy link
Member

Let's focus on Fusion only support for managed identity

@adamrtalbot
Copy link
Collaborator Author

I don't think that will work, there are too many moving parts and it will become a confusing mess of dependencies. For Fusion we will still need to do 3-6 which is all the hard stuff.

@zjupNN
Copy link

zjupNN commented Dec 13, 2024

The issue is similar to the fusion issue. The pipeline tests whether we can access data from two different containers test1 and test2 within the same storage account. This previously failed with fusion and it is also failing when using AZCOPY. We use system managed identity.

We have used the most recent version of nextflow (and also tested most recent edge version). We have tested the most recent version of azcopy as well.

@zjupNN
Copy link

zjupNN commented Dec 13, 2024

I've upload here minimal example prepared by my colleagues (Malthe and Kristian):
multi_minimal.zip

@adamrtalbot
Copy link
Collaborator Author

Thanks @zjupNN.

I think it's pretty clear what we need to do now:

  1. Give Nextflow some ability of knowing the worker nodes have a managed identity, e.g. azure.batch.pools.<pool name>.managedIdentityId, equivalent of aws.batch.executionRole = 'EXECUTION ROLE ARN'
  2. Pass this variable to Fusion (probably as an env)
  3. Fusion accepts the managed identity ID

For the first implementation, this would require users to manually add the managed ID to each pool which isn't fun but would work. Later extensions would include:

  1. Nextflow can attach each managed identity to pools as part of it's autopool feature
  2. Support Managed Identity for worker nodes without Fusion (azcopy)

@zjupNN
Copy link

zjupNN commented Dec 13, 2024

Answer from Kristian (I'm just a proxy here, as my coleagues had blocked option to write commnets on github in their NN accounts):
"As fusion cannot handle 503 ingress/egress errors we are interested in finding a solution for azcopy"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants