
Support for HDInsight Hadoop cluster external SQL database #5230

Closed
lotaezhao opened this issue Dec 20, 2019 · 7 comments · Fixed by #6145 or #6969

Comments

@lotaezhao

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Description

HDInsight supports custom external metastores, which are recommended for production clusters. However, the current Terraform HDInsight Hadoop cluster resource doesn't support specifying an existing external Azure SQL database to host the metastore. Please could an attribute be added to specify an external Azure SQL database for the metastore?

This configuration is supported when creating an HDInsight cluster through the Azure portal.

New or Affected Resource(s)

azurerm_hdinsight_hadoop_cluster

Potential Terraform Configuration

resource "azurerm_hdinsight_hadoop_cluster" "example" {
  name                = "example-hdicluster"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location
  cluster_version     = "3.6"
  tier                = "Standard"
  sql_database        = azurerm_sql_database.example.name # proposed new attribute

  component_version {
    hadoop = "2.7"
  }

  gateway {
    enabled  = true
    username = "acctestusrgw"
    password = "TerrAform123!"
  }

  storage_account {
    storage_container_id = azurerm_storage_container.example.id
    storage_account_key  = azurerm_storage_account.example.primary_access_key
    is_default           = true
  }

  roles {
    head_node {
      vm_size  = "Standard_D3_V2"
      username = "acctestusrvm"
      password = "AccTestvdSC4daf986!"
    }

    worker_node {
      vm_size               = "Standard_D4_V2"
      username              = "acctestusrvm"
      password              = "AccTestvdSC4daf986!"
      target_instance_count = 3
    }

    zookeeper_node {
      vm_size  = "Standard_D3_V2"
      username = "acctestusrvm"
      password = "AccTestvdSC4daf986!"
    }
  }
}
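
For completeness, the azurerm_sql_database referenced above could be declared along the following lines. This is only a sketch with illustrative names and values, using the classic (pre-azurerm_mssql_*) resources that were current at the time:

resource "azurerm_sql_server" "example" {
  name                         = "examplesqlserver"
  resource_group_name          = azurerm_resource_group.example.name
  location                     = azurerm_resource_group.example.location
  version                      = "12.0"
  administrator_login          = "mssqladmin"
  administrator_login_password = "ExamplePassw0rd!"
}

resource "azurerm_sql_database" "example" {
  name                = "examplemetastore"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location
  server_name         = azurerm_sql_server.example.name
}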

References

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-use-external-metadata-stores

@apetresc

apetresc commented Jan 28, 2020

I'm taking a look at adding the necessary support for this right now.

An important clarification that I think should be made: while you're right that Azure's documentation, sample ARM templates, and Portal UI for HDInsight do treat "existing Hive metastore" as some first-class recommended entity, there is no actual support for this in the HDInsight Cluster REST API directly.

Instead, the HDInsight API exposes Hadoop-specific configuration files through special properties.clusterDefinition.configurations blocks, specifically hive-site and hive-env. All of the logic that maps existing SQL database instances to the default Hive metastore happens "client-side", so to speak, as in Azure's own quickstart templates. The "existingHiveMetastore" parameters only exist as ARM params; they have no direct analogue on the HDInsight API side.

Unfortunately, at the moment, the AzureRM Terraform Provider only supports the gateway configuration block, not any of the others that this use case requires.

This presents two options:

  • We follow the Azure user-facing convention, provide parameters like existingHiveMetastoreServerName on the azurerm_hdinsight_hadoop_cluster resource, and reimplement the same sort of logic that the quickstart template uses to populate the correct configuration fields automatically. That is roughly what @lotaezhao is proposing in the "Potential Terraform Configuration" example.
  • We simply expose the hive-site and hive-env configuration blocks directly and let each module do the mapping itself (see the HCL sketch below). This is probably the more flexible approach; for example, Azure's own quickstart templates implement the mapping incorrectly for non-MSSQL backends. If we just let users do the mapping, they can work around any specific requirements they have themselves.

Thoughts? I can implement this either way, just want to get some feedback before I do.
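
For illustration, option 2 could look something like the following in HCL. This is only a sketch: the hive_site and hive_env attribute names are hypothetical, but the javax.jdo.option.* keys are the standard Hive metastore JDBC properties that the quickstart templates populate under properties.clusterDefinition.configurations:

resource "azurerm_hdinsight_hadoop_cluster" "example" {
  # ... name, gateway, storage_account, roles as in the example above ...

  # Hypothetical attribute mapping directly onto configurations["hive-site"]
  hive_site = {
    "javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    "javax.jdo.option.ConnectionURL"        = "jdbc:sqlserver://example.database.windows.net;database=examplemetastore;encrypt=true"
    "javax.jdo.option.ConnectionUserName"   = "metastoreuser"
    "javax.jdo.option.ConnectionPassword"   = "ExamplePassw0rd!"
  }

  # Hypothetical attribute mapping directly onto configurations["hive-env"]
  hive_env = {
    "hive_database"      = "Existing MSSQL Server database with SQL authentication"
    "hive_database_name" = "examplemetastore"
  }
}

Passing the raw keys through keeps the provider out of the mapping logic entirely, which is what makes this option flexible.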

@kosinsky
Contributor

Support for other databases as a metastore is an interesting feature.
It raises a few questions:

  1. What else is available in hive-site/hive-env?
  2. How will people use these extra settings? Could it cause confusion if some hive-site properties can't be controlled via ARM but the Terraform syntax allows them?
  3. What about core-site/mapred-site/yarn-site, etc.? Sounds like scope creep to me.

@lotaezhao
Author

lotaezhao commented Feb 26, 2020


I'd be happy for you to do it either way, as long as it's possible to configure an external database for the metastore via Terraform; that will resolve my issue. Option 2 sounds better and more flexible, as you say. Thanks in advance :-)

@kosinsky
Contributor

Looks like only Azure SQL DB is supported; I got the following from the HDInsight team:

Besides Azure SQL DB, we don't support other DBs.

From https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-use-external-metadata-stores#custom-metastore:

Custom metastore

HDInsight also supports custom metastores, which are recommended for production clusters:
• You specify your own Azure SQL Database as the metastore.
• A custom metastore lets you attach multiple clusters and cluster types to that metastore. For example, a single metastore can be shared across Interactive Query, Hive, and Spark clusters in HDInsight.
• You pay for the cost of a metastore (Azure SQL DB) according to the performance level you choose.
• You can scale up the metastore as needed.
• The cluster and the external metastore must be hosted in the same region.

As a result, adding a simple sql_database attribute makes a lot of sense.

@kosinsky
Contributor

I've submitted a PR that implements this feature the Azure-SQL-only way.
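
For anyone finding this thread later: what eventually shipped is a nested metastores block rather than the flat sql_database attribute sketched above. Roughly like this, assuming the released block and attribute names (check the provider documentation for the authoritative schema):

resource "azurerm_hdinsight_hadoop_cluster" "example" {
  # ... other configuration as in the example above ...

  # Assumed shape of the released feature; verify against the azurerm docs.
  metastores {
    hive {
      server        = azurerm_sql_server.example.fully_qualified_domain_name
      database_name = azurerm_sql_database.example.name
      username      = "mssqladmin"
      password      = "ExamplePassw0rd!"
    }
  }
}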

@ghost

ghost commented Jun 3, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators Jun 3, 2020
@jackofallops jackofallops added this to the v2.18.0 milestone Jul 8, 2020
@ghost

ghost commented Jul 10, 2020

This has been released in version 2.18.0 of the provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. As an example:

provider "azurerm" {
    version = "~> 2.18.0"
}
# ... other configuration ...

@ghost ghost unlocked this conversation Jul 10, 2020
@ghost ghost locked and limited conversation to collaborators Jul 10, 2020