Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

approle: invalid secret id / status 500 on lookup #4955

Closed
chludwig-haufe opened this issue Jul 19, 2018 · 34 comments
Closed

approle: invalid secret id / status 500 on lookup #4955

chludwig-haufe opened this issue Jul 19, 2018 · 34 comments
Assignees
Milestone

Comments

@chludwig-haufe
Copy link
Contributor

Describe the bug
When our Vault instance has been up for some time (by now about a day), it starts to fail to look up (some) approle secret ids. Applications that try to log in using an affected secret id get the message "invalid secret id". If someone tries to look up the secret id metadata using the corresponding secret id accessor, the response status is 500 with error message "failed to find accessor entry for secret_id_accessor". This also affects secret ids that were issued moments before.

After a restart of the vault server, freshly issued secret ids can be accessed again.

This issue might be related to #4396.

To Reproduce

$ vault read auth/approle/role/ch-service-chinteg
WARNING! The following warnings were returned from Vault:

  * The "bound_cidr_list" parameter is deprecated and will be removed in favor
  of "secret_id_bound_cidrs".

Key                      Value
---                      -----
bind_secret_id           true
bound_cidr_list          <nil>
local_secret_ids         false
period                   0
policies                 [default read-all_contenthub-chinteg read-write-all_contenthub-chinteg-certs read-write_contenthub-pki-chinteg-gateway_issue_all_roles read-write_contenthub-pki-chinteg-services_issue_all_roles]
secret_id_bound_cidrs    <nil>
secret_id_num_uses       0
secret_id_ttl            0
token_bound_cidrs        <nil>
token_max_ttl            0
token_num_uses           0
token_ttl                1200
$ vault write -f auth/approle/role/ch-service-chinteg/secret-id
Key                   Value
---                   -----
secret_id             xxxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxx
secret_id_accessor    34769af9-80c5-8421-064e-c1fc452f97b3
$ vault write auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup secret_id_accessor=34769af9-80c5-8421-064e-c1fc452f97b3
Error writing data to auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup: Error making API request.

URL: PUT https://vault.contenthub.haufe.io:8200/v1/auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup
Code: 500. Errors:

* 1 error occurred:

* failed to find accessor entry for secret_id_accessor: "34769af9-80c5-8421-064e-c1fc452f97b3"

The server logs do not reveal any extra information.

After a restart of our vault:


$ vault write -f auth/approle/role/ch-service-chinteg/secret-id
Key                   Value
---                   -----
secret_id             xxxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxx
secret_id_accessor    b27944e6-ae59-326f-6112-67a2f1a3972a
$ vault write auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup secret_id_accessor=b27944e6-ae59-326f-6112-67a2f1a3972a
Key                   Value
---                   -----
cidr_list             []
creation_time         2018-07-19T07:09:04.674578783Z
expiration_time       0001-01-01T00:00:00Z
last_updated_time     2018-07-19T07:09:04.674578783Z
metadata              map[]
secret_id_accessor    b27944e6-ae59-326f-6112-67a2f1a3972a
secret_id_num_uses    0
secret_id_ttl         0

Expected behavior
The secret id access should always be successful like in the example above after the vault restart.

Environment:

  • Vault Server Version (retrieve with vault status):
Key             Value
---             -----
Seal Type       shamir
Sealed          false
Total Shares    1
Threshold       1
Version         0.10.3
Cluster Name    vault-cluster-d58cc98f
Cluster ID      9c7510f8-8e9a-24b7-6d95-18372b96f8de
HA Enabled      false
  • Vault CLI Version (retrieve with vault version):
    • The example above was created with Vault v0.10.3 ('533003e27840d9646cb4e7d23b3a113895da1dd0').
    • Our applications access vault via org.springframework.cloud:spring-cloud-vault-config:1.1.1.RELEASE.
  • Server Operating System/Architecture: CentOS Linux release 7.5.1804 on an Azure VM Standard DS2 v2 Promo (2 vcpus, 7 GB memory)

Vault server configuration file(s):

backend "azure" {
  container = "vaultcontainer"
  accountName = "vaultstorage"
  accountKey = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}

ui = true

listener "tcp" {
  address = "10.1.0.4:8200"
  cluster_address = "10.1.0.4:8201"
  tls_disable = "false"
  tls_min_version = "tls12"
  tls_cert_file = "/etc/letsencrypt/live/xxxx.xxxx.xxxx.xx/fullchain.pem"
  tls_key_file = "/etc/letsencrypt/live/xxxx.xxxx.xxxx.xx/privkey.pem"
}

Additional context

  • The server has been upgraded several times over the last two years. The upgrade to 0.10.3 (from 0.9.3) took place on June 21, 2018.
  • The role name was initially created as ch-service-CHinteg - back then, the role names were still case sensitive. Since we found the behavior with respect to existing roles confusing after the role names became case insensitive, we deleted the role and re-created it again. We also tried using the role name ch-service-CHinteg when issuing and accessing the secret id - to no avail.
  • We observed the issue the first time roughly at the same time we upgraded to 0.10.3. Over the last few days, it happened much more frequent, though.
@jefferai jefferai added this to the 0.10.4 milestone Jul 19, 2018
@jefferai
Copy link
Member

Hi there,

The AppRole backend does not cache values. There is a global storage cache for Vault, but it's the same one that's been in there since the beginning and is the same across storage backends, so I don't suspect an issue. However, you could try turning off the cache via disable_cache = true in your config file.

What I do suspect is Azure storage. The code path producing this error is a straight pass through to storage, and the only way it'd be nil (triggering this error) is if the underlying storage says it's nil (as opposed to returning an error).

My guess is that you're either hitting an eventual consistency issue within Azure's storage, some kind of caching that is performed by the Azure SDK, some bug in the SDK, or some bug in the (community-supported) Azure storage plugin for Vault. When Vault is restarted, all of the last three are reinitialized; and the intervening time might allow time for Azure to catch up if it's the first option. (Disabling Vault's cache massively slows it down so even if it's an Azure consistency issue disabling the cache might also seem to work).

@jefferai
Copy link
Member

Just to walk you through my analysis a bit, the message you're seeing is output when this function returns nil but not an error: https://github.com/hashicorp/vault/blob/master/builtin/credential/approle/validation.go#L262

There is only one place where that happens and that's if the storage Get call fails. The key here is that it works after a reboot; if there was something wrong with, say, the calculation of the salt, it would be the same across reboots. So it suggests that it's not that things are being stored improperly, but rather than the underlying storage is acting unreliably.

@chludwig-haufe
Copy link
Contributor Author

Thanks, Jeff, for looking into this.

I will turn off the global Vault cache - if only to see whether this makes indeed a difference. If I understood you correctly, then the performance hit might be too heavy to keep it this way in the long run, even if it makes the failures disappear.

I will also try to trace what's going on in the Azure storage plugin and in the Azure SDK. I would not be surprised if writes into the Azure storage sometimes are eventually visible only.

I am not clear yet how this fits with the behavior I observed (reproducibly) today:

  1. I create a new secret-id for the affected approle.
  2. Looking up the secret-id metadata by its accessor works (even multiple times).
  3. I look up secret-id metadata using an accessor that failed before. It fails again.
  4. I look up secret-id metadata using the accessor used in the 2nd step. Now it fails as well and the accessor stays "broken" even after a vault restart.

So something must be overwritten inside the Azure storage when we hit this failure scenario - otherwise everything should be fine again after a vault restart.

In the worst case, we will need to migrate to another backend. I am not keen to do this because migrating the pki backends will require more planning. But we'll see. Maybe I can find the cause in the Azure storage plugin.

@jefferai
Copy link
Member

So you have two separate issues now? One that is fixed after a restart, and one that isn't?

@jefferai
Copy link
Member

Also, did you get the same exact errors or were they different? Can you paste the errors and any logs you have?

@jefferai
Copy link
Member

Can you also provide examples of the steps, such as role configuration and secret-ID fetching? Is it possible that the secret ID that "stayed broken" has limited uses or limited TTL and is simply expiring? Output from vault auth list -detailed would be good too.

@chludwig-haufe
Copy link
Contributor Author

To be honest, I am not sure whether it is the same issue or not. This morning again, I had to restart vault; only then the deployment of our service succeeded. (The deployment script issues a new secret-id every time.) Whether it's related to cold caches, I don't know. Only this afternoon I realized / observed the behavior I described in my last comment.

Our secret ids have no explicit TTL. The "broken" secret ids were all issued for the role ch-service-chinteg (used for services in the integration environment); luckily, we did not observe any failures with secret ids issued for services in the production environment yet. You can see the settings of ch-service-chinteg at the top in my initial post.

While collecting the commands and output below, I found that the newly issued secret ids "broke" even without access to another secret id in the meantime. I therefore suspect that it is in fact a caching issue - something (in the Azure storage plugin??) is not properly persisted and as soon as the cache is updated from the storage, the secret id becomes inaccessible. The access to the "already broken" secret id was a probably red hering, it simply incurred the delay or forced the cache update that triggered the issue.

Anyway, here's the output I see (still with disable_cache = false):

$ vault list auth/approle/role/ch-service-chinteg/secret-id
Keys
----
1259fe56-3c68-e94e-8db2-96d596c7ba54
13db1506-d809-0a49-e13c-aa8d095a7797
28165715-21a9-89bb-d1c3-ec6be19b4178
34769af9-80c5-8421-064e-c1fc452f97b3
5397a608-0d03-32dd-2a55-9b2b7c01b51a
60e4cfc7-50cc-a5e4-c7fa-1c019c84fc45
635acef5-a184-134a-de58-d682d32dd47b
65fb2212-07e5-02eb-9ed4-709de4160a73
86b8f393-c2c0-5367-71db-16f4a3378459
95352e5d-0f65-6f69-08db-ac1a173ce3d1
9ed48f21-67bd-1126-f473-80eb9ffcf8a2
ac66a465-c2f1-547f-384e-abb43a4e7273
b27944e6-ae59-326f-6112-67a2f1a3972a
bfd9ae12-5b3b-f4a9-d4ff-d72054741204
$ vault write auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup secret_id_accessor=b27944e6-ae59-326f-6112-67a2f1a3972a
Error writing data to auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup: Error making API request.

URL: PUT https://vault.contenthub.haufe.io:8200/v1/auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup
Code: 500. Errors:

* 1 error occurred:

* failed to find accessor entry for secret_id_accessor: "b27944e6-ae59-326f-6112-67a2f1a3972a"
$ vault write -f auth/approle/role/ch-service-chinteg/secret-id
Key                   Value
---                   -----
secret_id             xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
secret_id_accessor    5a5f4733-4137-391b-b5ba-7288e0e51b35
$ echo "Timestamp: $(date -u '+%C%y-%m-%dT%H:%M:%SZ')" ; vault write auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup secret_id_accessor=5a5f4733-4137-391b-b5ba-7288e0e51b35
Timestamp: 2018-07-20T18:06:55Z
Key                   Value
---                   -----
cidr_list             []
creation_time         2018-07-20T18:06:18.885411211Z
expiration_time       0001-01-01T00:00:00Z
last_updated_time     2018-07-20T18:06:18.885411211Z
metadata              map[]
secret_id_accessor    5a5f4733-4137-391b-b5ba-7288e0e51b35
secret_id_num_uses    0
secret_id_ttl         0
$ echo "Timestamp: $(date -u '+%C%y-%m-%dT%H:%M:%SZ')" ; vault write auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup secret_id_accessor=5a5f4733-4137-391b-b5ba-7288e0e51b35
Timestamp: 2018-07-20T18:07:27Z
Error writing data to auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup: Error making API request.

URL: PUT https://vault.contenthub.haufe.io:8200/v1/auth/approle/role/ch-service-chinteg/secret-id-accessor/lookup
Code: 500. Errors:

* 1 error occurred:

* failed to find accessor entry for secret_id_accessor: "5a5f4733-4137-391b-b5ba-7288e0e51b35"

So as shown by the secret id's creation_time field t and the timestamp I echoed when submitting the lookup, at some point between 37 and 70 seconds after the secret id's creation the secret id became inaccessible.

I guess the expiration_time 0001-01-01T00:00:00Z and secret_id_ttl 0 means there is no TTL set on the secret id itself, doesn't it?

The attached vault-messages-2018-07-20.txt contains the vault logs from the relevant time period between creation of the secret id with accessor 5a5f4733-4137-391b-b5ba-7288e0e51b35 and when it became inaccessible. I don't see anything noteworthy in there. Here's the accessor's audit hash:

$ vault write /sys/audit-hash/syslog input=5a5f4733-4137-391b-b5ba-7288e0e51b35
Key     Value
---     -----
hash    hmac-sha256:7234165b9900e5cbcbbe7924f66d5f1ac3efba729844523467d8a0b710ea1961

You asked for the auth plugin details:

$ vault auth list -detailed
Path         Type        Accessor                  Plugin    Default TTL    Max TTL    Replication    Seal Wrap    Options    Description
----         ----        --------                  ------    -----------    -------    -----------    ---------    -------    -----------
approle/     approle     auth_approle_7e29a194     n/a       system         system     replicated     false        map[]      n/a
token/       token       auth_token_38cc257c       n/a       system         system     replicated     false        map[]      token based credentials
userpass/    userpass    auth_userpass_bf7f2a3b    n/a       system         system     replicated     false        map[]      n/a

@chludwig-haufe
Copy link
Contributor Author

chludwig-haufe commented Jul 20, 2018

In the meantime, I restarted vault with disable_cache = true. Unfortunately, this setting does not mitigate the issue. After about 10 minutes, a newly created secret id became inaccessible again.

@chludwig-haufe
Copy link
Contributor Author

I added log statements to the functions Put, Get, Delete, and List in physical/azure/azure.go that output the relevant entry key (or the relevant prefix in case of List).

  • I see PUT is called at 2018-07-22T13:06:25.386Z for the accessor when I create a new secret id.
  • I see the entry inside the blob storage container via the Azure portal. So it is in fact written.
  • A lookup of the accessor is successful at 2018-07-22T13:09:39.15019296Z.
  • Another lookup at 2018-07-22T13:13:52.095879889Z fails. At this time, when I look into the storage container via the Azure portal, the entry is in fact gone.
  • The only invocation of Delete with a key starting with auth/ comes at 2018-07-22T13:14:00.618Z (i.e., well after the relevant entry was gone) and it refers to a different accessor.

I am baffled - even if there was a bug in either the Azure storage backend or the Azure SDK, why did the the issue become so easily reproducible only in the past week? Why does it seem to affect the role ch-service-chinteg only? I'd expect to see the same effect for all roles if we hit some weird behavior of the Azure SDK.

@jefferai
Copy link
Member

I'm baffled too but based on your own investigation it does seem like Vault isn't triggering the deletion, so as long as it's not writing the entry with some kind of expiration I don't see how it wouldn't be an issue on the Azure side. But I agree it's odd that you only see it for this role.

Is there any way to get logs from Azure of access to that entry? It would be really nice to see if any other entity somehow was accessing that value. Maybe some automated process set up at some point is ended up being triggered by the name of the role. Grasping at straws but without logs from Azure it's all speculation.

@chludwig-haufe
Copy link
Contributor Author

I turned on access logging for the relevant storage account. There's always an explicit DELETE request for the accessor entry from the vault server's TCP address, so it's not a bug in the implementation of the Azure blob storage.

I made two further observations:

  1. After a vault restart, it takes "long" (> 10 minutes) for the accessor to be deleted. However, around the time this happens, there is a gap in the logs produced by vault. In one instance, the gap lasts from 16:30:47.697Z to 16:31:10.340Z. In the access log, I can see the Azure storage requests corresponding to the log entries directly before and after the gap; there are 347 access log entries in between (all from vault's TCP address). One of them is the DELETE request for the accessor I used for my test at that time.
    I don't know what causes the loss of log messages; but it explains why I couldn't find an invocation of the Azure storage backend's Delete function for the accessors in the logs.
  2. Once the vault process was running for some time, the accessors are deleted much more quickly (< 2 minutes). Then I also find my trace of the Delete function in the logs.

I made the Delete function print its stack trace:

Jul 22 17:12:40 panama-vault vault: 2018-07-22T17:12:40.018Z [WARN ] storage.azure: Azure Backend: Going to delete entry: key=auth/51b00338-bbcc-5e20-ae52-6f03293ed2da/accessor/9b9dd75e02cd0f7bab4ca9c369568821c5ae6ed3cd2b3c077963bf58464308a6 container=panamavaultcontainer
Jul 22 17:12:40 panama-vault vault: goroutine 29410 [running]:
Jul 22 17:12:40 panama-vault vault: runtime/debug.Stack(0xc42065ec00, 0xc400000004, 0x24eae8f)
Jul 22 17:12:40 panama-vault vault: /usr/local/Cellar/go/1.10.3/libexec/src/runtime/debug/stack.go:24 +0xa7
Jul 22 17:12:40 panama-vault vault: runtime/debug.PrintStack()
Jul 22 17:12:40 panama-vault vault: /usr/local/Cellar/go/1.10.3/libexec/src/runtime/debug/stack.go:16 +0x22
Jul 22 17:12:40 panama-vault vault: github.com/hashicorp/vault/physical/azure.(*AzureBackend).Delete(0xc42053a6a0, 0x271aa80, 0xc42003a018, 0xc424500580, 0x73, 0x0, 0x0)
Jul 22 17:12:40 panama-vault vault: /Users/ludwigc/go/src/github.com/hashicorp/vault/physical/azure/azure.go:179 +0x343
Jul 22 17:12:40 panama-vault vault: github.com/hashicorp/vault/vault.(*sealUnwrapper).Delete(0xc42069da40, 0x271aa80, 0xc42003a018, 0xc424500580, 0x73, 0x0, 0x0)
Jul 22 17:12:40 panama-vault vault: /Users/ludwigc/go/src/github.com/hashicorp/vault/vault/sealunwrapper.go:146 +0x10f
Jul 22 17:12:40 panama-vault vault: github.com/hashicorp/vault/physical.(*Cache).Delete(0xc4207bd7c0, 0x271aa80, 0xc42003a018, 0xc424500580, 0x73, 0x0, 0x0)
Jul 22 17:12:40 panama-vault vault: /Users/ludwigc/go/src/github.com/hashicorp/vault/physical/cache.go:171 +0x128
Jul 22 17:12:40 panama-vault vault: github.com/hashicorp/vault/vault.(*AESGCMBarrier).Delete(0xc4201bd260, 0x271aa80, 0xc42003a018, 0xc424500580, 0x73, 0x0, 0x0)
Jul 22 17:12:40 panama-vault vault: /Users/ludwigc/go/src/github.com/hashicorp/vault/vault/barrier_aes_gcm.go:700 +0x1ad
Jul 22 17:12:40 panama-vault vault: github.com/hashicorp/vault/vault.(*BarrierView).Delete(0xc420cb1d60, 0x271aa80, 0xc42003a018, 0xc4244fe410, 0x49, 0xc4244fe410, 0x49)
Jul 22 17:12:40 panama-vault vault: /Users/ludwigc/go/src/github.com/hashicorp/vault/vault/barrier_view.go:128 +0x117
Jul 22 17:12:40 panama-vault vault: github.com/hashicorp/vault/builtin/credential/approle.(*backend).tidySecretID.func1.1(0x24a594a, 0xa, 0x24a2f1e, 0x9, 0xc420974780, 0x0)
Jul 22 17:12:40 panama-vault vault: /Users/ludwigc/go/src/github.com/hashicorp/vault/builtin/credential/approle/path_tidy_user_id.go:150 +0x636
Jul 22 17:12:40 panama-vault vault: github.com/hashicorp/vault/builtin/credential/approle.(*backend).tidySecretID.func1(0xc420c14fc0, 0xc4222cbb90, 0x271b600, 0xc420cb1d60)
Jul 22 17:12:40 panama-vault vault: /Users/ludwigc/go/src/github.com/hashicorp/vault/builtin/credential/approle/path_tidy_user_id.go:159 +0x18e
Jul 22 17:12:40 panama-vault vault: created by github.com/hashicorp/vault/builtin/credential/approle.(*backend).tidySecretID
Jul 22 17:12:40 panama-vault vault: /Users/ludwigc/go/src/github.com/hashicorp/vault/builtin/credential/approle/path_tidy_user_id.go:39 +0xce

This means tidySecretID considers the accessor dangling - even though the secret id still exists in the storage.

Just by reading the code, I saw only one explanation how this could happen:

a) The anonymous go function inside tidySecretId defines a local variable result of type error.
b) The nested inner function secretIDCleanupFunc reads the secret id entry from the storage. If s.Get returns nil, nil, then the result variable of the enclosing go function is updated (line 77 of builtin/credential/approle/path_tidy_user_id.go). secretIDCleanupFunc returns without error. In particular, the accessor is not deleted from accessorMap and will be considered dangling.
c) The go function's result variable is never accessed except for the mentioned update in secretIDCleanupFunc. The error message added in line 77 is therefore never logged.

If the s.Get invocation for reading the secret id entry inside secretIDCleanupFunc reached the physical backend, then I'd expect to see both the trace log statement in the vault logs that I added to Get in physical/azure/azure.go and an access log entry in the Azure storage logs. There is neither. I cannot say yet, though, why the cache might return a nil entry. (Speculation: Might it be caused by the fact that the role used to exist with the case sensitive name ch-service-CHinteg?)

@jefferai
Copy link
Member

Thanks for doing that tracing, that's super helpful. I'll take a look at it soon.

When you say the role used to exist with that name, do you mean that it was created with that name but you now access it via lowercase, or do you mean that you had it as that name, then removed the role, and then recreated the role entirely as lowercase?

@chludwig-haufe
Copy link
Contributor Author

No problem - I have a vested interest in this issue being resolved. :-)

The role was created (a long time ago with vault 0.6.x) as ch-service-CHinteg. At some point (after vault started to treat role names in a case insensitive manner) we realized that it is possible to issue secret ids for such "legacy" roles by referencing the roles in lower case only - however, login with such secret ids failed. It took us a considerable amount of time to figure out the cause of these login failures and we found this behavior to be confusing. We therefore deleted ch-service-CHinteg and created a new role ch-service-chinteg.

@jefferai
Copy link
Member

OK so any secret IDs ought to have been from the new role.

Can you see if ch-service-CHinteg still exists in your Azure storage?

@chludwig-haufe
Copy link
Contributor Author

You mean inside <blobcontainer>/auth/<auth-uuid>/role/? There I only see ch-service-chinteg (all lower-case) besides the other roles with totally different names.

@jefferai
Copy link
Member

Yep, that's what I was asking. Just trying to think through possibilities.

@jefferai
Copy link
Member

Outside of the error shadowing, the problem as you identified earlier is that the Get function is coming back with nil for the entry. When you turned off Vault's cache, you took that out of the equation so it's not the cache.

Any chance you can add a check in the azure storage code in Get where you look for a key with that prefix but that comes back with nil? It would be great to understand whether it's Azure returning nil or whether something is happening along the way in Vault.

@chludwig-haufe
Copy link
Contributor Author

My Get function already reads:

func (a *AzureBackend) Get(ctx context.Context, key string) (*physical.Entry, error) {
	defer metrics.MeasureSince([]string{"azure", "get"}, time.Now())

	a.permitPool.Acquire()
	defer a.permitPool.Release()

	blob := &storage.Blob{
		Container: a.container,
		Name:      key,
	}
	exists, err := blob.Exists()
	if err != nil {
		return nil, err
	}
	if !exists {
		a.logger.Info("Azure Backend: key not found", "key", key, "container", a.container.Name)
		return nil, nil
	}

	reader, err := blob.Get(nil)
	if err != nil {
		return nil, err
	}
	defer reader.Close()
	data, err := ioutil.ReadAll(reader)

	ent := &physical.Entry{
		Key:   key,
		Value: data,
	}

	a.logger.Info("Azure backend: Successfulley got entry", "key", key, "container", a.container.Name)
	return ent, err
}

If a nil entry was returned without an error, then I should have seen the key not found message. So far, I observed it only before the secret id was written - as in:

Jul 22 17:11:53 panama-vault vault[16230]: {"time":"2018-07-22T17:11:53.298810583Z","type":"request","auth":{"client_token":"hmac-sha256:55961d014523c53419a191b4c2659b7eab779db431759d3bd04ee4268e24aec1","accessor":"hmac-sha256:5b11bb
05c96f3adf49c035e58d3bc73d1dd4c685aa32be738506c809b2f5c2ee","display_name":"root","policies":["root"],"token_policies":["root"],"metadata":null,"entity_id":""},"request":{"id":"0a3ef1b0-622e-0a4d-b4c1-43f8f0f666b8","operation":"updat
e","client_token":"hmac-sha256:55961d014523c53419a191b4c2659b7eab779db431759d3bd04ee4268e24aec1","client_token_accessor":"hmac-sha256:5b11bb05c96f3adf49c035e58d3bc73d1dd4c685aa32be738506c809b2f5c2ee","path":"auth/approle/role/ch-serv
ice-chinteg/secret-id","data":null,"policy_override":false,"remote_address":"84.160.207.22","wrap_ttl":0,"headers":{}},"error":""}
Jul 22 17:11:53 panama-vault vault: 2018-07-22T17:11:53.327Z [INFO ] storage.azure: Azure Backend: key not found: key=auth/51b00338-bbcc-5e20-ae52-6f03293ed2da/secret_id/a2c0c3ff0ed0899ecdba396818c24becd005c9aa0c2901555af7c574365c13a
5/8b202fed12b2f7fdc82268fa3c1ad4bbd54285c78dc43759eb72eed877a0a702 container=panamavaultcontainer
Jul 22 17:11:53 panama-vault vault: 2018-07-22T17:11:53.333Z [INFO ] storage.azure: Azure Backend: key not found: key=auth/51b00338-bbcc-5e20-ae52-6f03293ed2da/secret_id/a2c0c3ff0ed0899ecdba396818c24becd005c9aa0c2901555af7c574365c13a
5/8b202fed12b2f7fdc82268fa3c1ad4bbd54285c78dc43759eb72eed877a0a702 container=panamavaultcontainer
Jul 22 17:11:53 panama-vault vault: 2018-07-22T17:11:53.333Z [INFO ] storage.azure: Azure Backend: writing entry: key=auth/51b00338-bbcc-5e20-ae52-6f03293ed2da/accessor/9b9dd75e02cd0f7bab4ca9c369568821c5ae6ed3cd2b3c077963bf58464308a6 container=panamavaultcontainer
Jul 22 17:11:53 panama-vault vault: 2018-07-22T17:11:53.381Z [INFO ] storage.azure: Azure Backend: writing entry: key=auth/51b00338-bbcc-5e20-ae52-6f03293ed2da/secret_id/a2c0c3ff0ed0899ecdba396818c24becd005c9aa0c2901555af7c574365c13a5/8b202fed12b2f7fdc82268fa3c1ad4bbd54285c78dc43759eb72eed877a0a702 container=panamavaultcontainer
Jul 22 17:11:53 panama-vault vault[16230]: {"time":"2018-07-22T17:11:53.40891896Z","type":"response","auth":{"client_token":"hmac-sha256:55961d014523c53419a191b4c2659b7eab779db431759d3bd04ee4268e24aec1","accessor":"hmac-sha256:5b11bb05c96f3adf49c035e58d3bc73d1dd4c685aa32be738506c809b2f5c2ee","display_name":"root","policies":["root"],"token_policies":["root"],"metadata":null,"entity_id":""},"request":{"id":"0a3ef1b0-622e-0a4d-b4c1-43f8f0f666b8","operation":"update","client_token":"hmac-sha256:55961d014523c53419a191b4c2659b7eab779db431759d3bd04ee4268e24aec1","client_token_accessor":"hmac-sha256:5b11bb05c96f3adf49c035e58d3bc73d1dd4c685aa32be738506c809b2f5c2ee","path":"auth/approle/role/ch-service-chinteg/secret-id","data":null,"policy_override":false,"remote_address":"84.160.207.22","wrap_ttl":0,"headers":{}},"response":{"data":{"secret_id":"hmac-sha256:48341f2fe245011c9cd6e81b12f272c5bff13493e6f0ad2ee5a287268a257f2c","secret_id_accessor":"hmac-sha256:76b124ec680e4a905f2dcfce3e283f23a09c559a676aed5a2944105abd63e4e5"}},"error":""}

I searched for the secret id entry mentioned in the excerpt above in the rest of yesterday's logs; it is not mentioned anymore.

I can add log statements to Gets error cases directly after blob.Exists() and blob.Get(nil). As far as I can see, these are the function's only exit points not covered yet.

If you want me to dump any specific fields of secret ids, accessors, roles etc., just let me know in which functions.

Just to be sure, I can also repeat my experiment with the disabled cache. It probably has to wait until tomorrow, though.

@jefferai
Copy link
Member

Looking into this agian. I put in a commit into master that changes that error to instead log to the logger when there is a nil value encountered, which should at least confirm what you posited before that that error case is being hit but no error is returned.

@jefferai
Copy link
Member

I'm not seeing anything else obvious -- with that commit in there, can you look for instances where you see that logged message and compare the logged hmac'd secret ID with the logs you have in the Get function? If you can verify that Azure is returning a non-nil entry, and then it's getting turned to nil somewhere along the way, that would be great. (I mean, not great, but at least would help verify that the next step is to add debugging all throughout to figure out where that's changing.)

@jefferai
Copy link
Member

I have an idea, but it will take me a while (probably tomorrow) to get a build to you with changes. Hopefully once I do you can test it soon after so we can try to get a fix into 0.10.4.

@jefferai
Copy link
Member

I think I have it sorted -- can you run the dangling-accessor-tidy-fixup branch and let me know how it goes?

@chludwig-haufe
Copy link
Contributor Author

Thanks a lot! Unfortunately, deploying the branch dangling-accessor-tidy-fixup did not fix the issue.

Nevertheless, it gave me a clue:

  • When I created a secret id, it was written into the blob auth/51b00338-bbcc-5e20-ae52-6f03293ed2da/secret_id/a2c0c3ff0ed0899ecdba396818c24becd005c9aa0c2901555af7c574365c13a5/e370d0bca21a5c25787a55884dc1bb9a0b537f73fa5c533368a4e9fa364ee7ea. Note the role HMAC prefix a2c0c3ff0ed0899ecdba396818c24becd005c9aa0c2901555af7c574365c13a5. I see the corresponding entry in the Azure blob container.
  • In line 57 of path_tidy_user_id.go, roleNameHMACs is initialized with the result of listing all blobs with the prefix secret_id/. In barrier_view.go, this prefix is expanded to auth/51b00338-bbcc-5e20-ae52-6f03293ed2da/secret_id/. Therefore, I'd expect a2c0c3ff0ed0899ecdba396818c24becd005c9aa0c2901555af7c574365c13a5/ in roleNameHMACs.
  • The loop you added starting in line 194 of path_tidy_user_id.go calls s.List for all members of roleNameHMACs. Since I added log statements (covering both success and error case) to the List function in azure.go, I'd expect to see a log entry mentioning auth/51b00338-bbcc-5e20-ae52-6f03293ed2da/secret_id/a2c0c3ff0ed0899ecdba396818c24becd005c9aa0c2901555af7c574365c13a5/. However, the role HMAC a2c0c3ff0ed0899ecdba396818c24becd005c9aa0c2901555af7c574365c13a5/ does not appear in the logs anymore after the secret-id was written.

I therefore suspect that the initialization of roleNameHMACs misses some entries. Of course, this causes the corresponding accessors to appear dangling. It would also explain why not all approles are affected by this issue.

I am going to add log statements to the Azure backend's List function that show what the Azure SDK returned and what the List function passes back to its caller. May this will shed some light on the issue.

@chludwig-haufe
Copy link
Contributor Author

I found it! The bug is in the List function of physical/azure/azure.go:

The call to a.container.ListBlobs(storage.ListBlobsParameters{Prefix: prefix}) matches more than maxresults blob container entries (default is 5000, according to https://docs.microsoft.com/en-us/rest/api/storageservices/list-blobs). List fails to inspect the response field list.NextMarker and to fetch the next result page. Therefore it misses a part of the entries.

I guess since some days the role HMAC a2c0c3ff0ed0899ecdba396818c24becd005c9aa0c2901555af7c574365c13a5 appears no longer on the first result page, therefore it is not included in List's result and does not become a member of roleNameHMAC inside tidySecretID.

@chludwig-haufe
Copy link
Contributor Author

If I prepare a pull request, do you require a Contributor License Agreement or similar? I need to obtain the "ok" from my department head anyway, but I expect this will be a mere formality.

Or do you prefer to write the fix yourself?

@jefferai
Copy link
Member

Awesome that you found it! Glad my branch helped -- the good news is that what I found was a very real race condition so that line of investigation still bore a lot of fruit.

One of our devs is happy to code it up -- we are trying to release 0.10.4 very soon so would like to see this fix in.

@jefferai
Copy link
Member

jefferai commented Jul 24, 2018

(removed, mistaken post)

@jefferai
Copy link
Member

Haha. I somehow switched tabs in the middle of typing. Redacting the above (it was meant to go onto #4981, where it is still public).

@chludwig-haufe
Copy link
Contributor Author

Ok, that's fine with me.

I have a local version with a fix, but it's a simple loop so your colleague will probably code it in no time either. For me, the most time consuming part was to google how you write a do-while-loop in Golang. :-)

@chrishoffman
Copy link
Contributor

@chludwig-haufe Can you test out https://github.com/hashicorp/vault/pull/4983/files and see if it helps out?

@chludwig-haufe
Copy link
Contributor Author

@chrishoffman With a build of my local patch deployed, the accessor for a ch-service-chinteg service id I created an hour ago does still exists - I am therefore confident that it fixes the issues.

Since your change to the Azure physical backend's List function looks almost like what I came up with, I am pretty sure that your patch fixes the issue as well. I am going to deploy a build of your code and let it run over night. I will let you know the result tomorrow morning.

@jefferai
Copy link
Member

Closing for now, please tell us how it goes!

@chludwig-haufe
Copy link
Contributor Author

@chrishoffman The accessor of a secret id for ch-service-chinteg I created yesterday before leaving the office is still available. So your fix seems to do it.

Thanks a lot to you and @jefferai for looking into this issue so quickly!

@jefferai
Copy link
Member

Happy to help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants