Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Redis] Internal Network API Bug #1347

Closed
tombuildsstuff opened this issue Jun 26, 2017 · 22 comments · Fixed by #2581
Closed

[Redis] Internal Network API Bug #1347

tombuildsstuff opened this issue Jun 26, 2017 · 22 comments · Fixed by #2581
Assignees
Labels
Redis Cache Service Attention Workflow: This issue is responsible by Azure service team.

Comments

@tombuildsstuff
Copy link
Contributor

tombuildsstuff commented Jun 26, 2017

👋 howdy!

I've been trying to add support for Redis on the Internal Subnets to Terraform.

Creating a Redis instance on an Internal Subnet works great, however there's a bug in the API when deleting it:

DELETE /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/acctestRG-21/providers/Microsoft.Cache/Redis/acctestRedis21?api-version=2016-04-01 HTTP/1.1
Host: management.azure.com
User-Agent: HashiCorp-Terraform-v0.9.8
Accept-Encoding: gzip

which returns:

HTTP/1.1 200 OK
Cache-Control: no-cache
Date: Mon, 26 Jun 2017 09:48:09 GMT
Expires: -1
Pragma: no-cache
Server: Microsoft-HTTPAPI/2.0
Strict-Transport-Security: max-age=31536000; includeSubDomains
Content-Length: 0

Querying for the Redis Cache at this point then returns a 404:

GET /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/acctestRG-21/providers/Microsoft.Cache/Redis/acctestRedis21?api-version=2016-04-01 HTTP/1.1
Host: management.azure.com
User-Agent: HashiCorp-Terraform-v0.9.8
Accept-Encoding: gzip

[DEBUG] AzureRM Response for https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/acctestRG-21/providers/Microsoft.Cache/Redis/acctestRedis21?api-version=2016-04-01:
HTTP/1.1 404 Not Found
Content-Length: 152
Cache-Control: no-cache
Content-Type: application/json; charset=utf-8
Date: Mon, 26 Jun 2017 09:48:10 GMT
Expires: -1
Pragma: no-cache

{"error":{"code":"ResourceNotFound","message":"The Resource 'Microsoft.Cache/Redis/acctestRedis21' under resource group 'acctestRG-21' was not found."}}

Thus, it appears safe to delete the Subnet:

DELETE /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/acctestRG-21/providers/Microsoft.Network/virtualNetworks/acctestnw-21/subnets/testsubnet?api-version=2017-03-01 HTTP/1.1
Host: management.azure.com
User-Agent: HashiCorp-Terraform-v0.9.8
Accept-Encoding: gzip

However attempting to delete the Subnet at this point returns an error:

2017/06/26 10:48:10 [DEBUG] AzureRM Response for https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/acctestRG-21/providers/Microsoft.Network/virtualNetworks/acctestnw-21/subnets/testsubnet?api-version=2017-03-01:
HTTP/1.1 400 Bad Request
Content-Length: 285
Cache-Control: no-cache
Content-Type: application/json; charset=utf-8
Date: Mon, 26 Jun 2017 09:48:10 GMT
Expires: -1
Pragma: no-cache
Server: Microsoft-HTTPAPI/2.0

{
  "error": {
    "code": "InUseSubnetCannotBeDeleted",
    "message": "Subnet testsubnet is in use by /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/acctestRG-21/providers/Microsoft.Cache/Redis/acctestRedis21 and cannot be deleted.",
    "details": []
  }
}

From what I can see this is a bug in the Redis API which is exposing the Redis Cache as deleted when it's really still kicking around. Would it be possible to fix this bug in the API so that the Redis Cache is returned in a Deleting state until it's deleted like the other resources do?

Thanks!

@tombuildsstuff
Copy link
Contributor Author

Hey @sarangan12

Just an FYI that this looks similar to a previous bug with Virtual Network Gateway's, I'm wondering it's related? #1233

Thanks!

@sarangan12
Copy link
Member

This is a service side issue. I am assigning this issue to @TimLovellSmith and @hrishi18pathak to look in to the issue

@tombuildsstuff
Copy link
Contributor Author

👋 hey @TimLovellSmith / @hrishi18pathak

Is there any update on this issue, or a rough timeframe for when this is likely to be fixed?

Thanks!

@TimLovellSmith
Copy link
Member

@tombuildsstuff
Yes, there's a bug where there is a delay between the redis cache returning Deleted OK and the subnet actually being deletable. A fix may not happen for a while yet, since it is a behavioral change to return 202 instead of 200 for delete and there is a risk of confusing existing clients by making that change (if they are not already coded to handle it), we will likely only fix it in a future api version. A workaround is just to retry delete subnet until it actually succeeds, it should normally succeed within 10-15 minutes.

@tombuildsstuff
Copy link
Contributor Author

@TimLovellSmith thanks for the update - is there a timeframe for when a new API version will be released? Based on your description, I'm assuming that would be duplicating the API version and updating the behaviour, rather than including additional functionality - and as such should be relatively quick?

Unfortunately the workaround specified above isn't ideal since it opens us up to other issues where the deletion failing is valid (such as permissions, Locking or the HyperV networking/locking errors) and we'll have to keep track of which error are retry-able and which aren't - and as such I'd prefer to avoid that if at all possible.

Thanks!

@TimLovellSmith
Copy link
Member

@tombuildsstuff Sorry, still no definite timeframe on that change.. I'm not familiar with the hyper-v networking/locking errors you mention, what are those like? For permissions, it would fail with 401 or 403 status code, rather than 400. You can most definitely distinguish this particular scenario via the 'InUseSubnetCannotBeDeleted' status in the error response. Which as you say, is a lot better than retrying blindly - although I understand its undesirable to do the work of writing such a special case.

/cc @JonCole

@TimLovellSmith
Copy link
Member

Suggest closing as 'probably won't fix' - note unfortunately if we change the behavior to be more transparent it is still going to take just as long to delete the subnet, and its not that hard to workaround, so it is just not a very high priority to fix it.

@rem-aj
Copy link

rem-aj commented Jan 30, 2018

Hi @TimLovellSmith, so just to be clear we will be able to use terraform to create the azure redis service passing in it's own subnet and static IP, we just won't be able to delete the said subnet with terraform? I just want to confirm that at least this part is functional and if so is that in the azure provider version 1.0.1 or a future to be released version?

cc; @tombuildsstuff

Thanks!

@TimLovellSmith
Copy link
Member

TimLovellSmith commented Jan 30, 2018

Hi @bostonmoto I am not familiar with terraform, but I imagine this as most likely manifesting as a bug that deleting a subnet that holds a redis cache would appear to fail, and then need to be retried after a suitable delay, at which time it would succeed.

@rem-aj
Copy link

rem-aj commented Jan 30, 2018

Not a problem @TimLovellSmith I will wait for @tombuildsstuff to respond. Just trying to confirm that the creation part works with static IP and subnet. If delete doesn't work we'll just do that manually after as needed.

@TimLovellSmith
Copy link
Member

TimLovellSmith commented Feb 8, 2018

@tombuildsstuff I came up with one other idea for allowing workarounds in code that could possibly avoid the need for a new api version:

Could we address this by returning a Azure-AsyncOperation header as part of the DELETE 200 response?

And then you could poll on the URL in that header to wait for the delete work to really be completed. We could probably roll this out faster than a new api version. But would it work for you?

/cc @bostonmoto

@tombuildsstuff
Copy link
Contributor Author

@TimLovellSmith I have a feeling that would break the Azure SDK - @marstr / @mcardosos / @jhendrixMSFT / @joshgav should be able to comment further

@sarangan12 this functionality (internal subnets) currently isn't supported in Terraform since we're waiting for the API to be fixed - once it is we should be able to add support for this relatively quickly (there's a branch with support - but we can't ship it until this issue is resolved)

Thanks!

@mcardosos
Copy link
Contributor

Defining the operation as long running operation would introduce breaking changes in the go SDK, yes.

@jhendrixMSFT
Copy link
Member

But maybe it's the correct thing to do? This came up a few days ago talking with @johanste

@mcardosos
Copy link
Contributor

I think it is the correct thing to do. But yes, breaking changes.

@johanste
Copy link
Member

johanste commented Feb 9, 2018

Since the delete operation is already marked as a long running operation in swagger (from what I can tell), there wouldn't be a breaking change in the Go SDK. I don't know if any other management libraries would be tricked into doing something functionally incorrect by a 200 response w. an AzureAsync-Operation header - but if the operation will take a significant time to complete, usages where the client was fine with fire & forget (doesn't need the resource to actually be deleted), we risk making a call that previously was instantaneous (relatively speaking :)) into a 15 minute call (based on @TimLovellSmith 's earlier comments in the thread). Which would be sub-optimal.

I believe that the long term solution would be to:

  1. Make the redis cache instance stay around with provisioning state == deleting until the resource is actually gone (and all locks on other resources have been released). This will allow clients to figure out which resource is holding the lock on the subnet.
  2. Ensure that the redis cache operation is following the DELETE flow for resource as specified in the resource provider contract specification by including an operation location - which based on my understanding is not quite the case as of today.

In addition, I suspect that 409 for the subnet delete call failure would be better understood by most clients. And possibly (not standard for a 409) include a retry-after header telling the client that it may have better luck in the not-too-distant future.

@tombuildsstuff
Copy link
Contributor Author

tombuildsstuff commented Mar 16, 2018

@mcardosos could we leave this open until the fix is deployed and verified? At this point the new API is still broken since the fix isn't deployed, based on my reading of this comment: #2581 (comment)

api spec is done here, but server fixes are not yet rolled out

Thanks!

@mcardosos mcardosos reopened this Mar 16, 2018
@TimLovellSmith
Copy link
Member

TimLovellSmith commented Mar 16, 2018

@tombuildsstuff It is deployed. That comment was several days old.
/cc @mcardosos

@TimLovellSmith
Copy link
Member

@mcardosos Can we close it again?

@mcardosos
Copy link
Contributor

@tombuildsstuff feel free to reopen if not fixed yet

@rem-aj
Copy link

rem-aj commented Mar 21, 2018

Hi @tombuildsstuff so we can now not only create azure redis with subnet via terraform but we can also delete it properly now. Is that confirmed? Thanks!

@tombuildsstuff
Copy link
Contributor Author

@bostonmoto we still need to upgrade to the new version of the Azure SDK for Go, and then add rebase the branch with support for this functionality to be able to confirm that it, so it isn't at the moment, but should be soon. I'd suggest subscribing to this issue for updates: hashicorp/terraform-provider-azurerm#82 :)

@bsiegel bsiegel added the Service Attention Workflow: This issue is responsible by Azure service team. label Sep 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Redis Cache Service Attention Workflow: This issue is responsible by Azure service team.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants