Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleaning of stale leases #2452

Merged
merged 47 commits into from
May 5, 2017
Merged
Show file tree
Hide file tree
Changes from 41 commits
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
1e5d6e3
Added sys/tidy-leases endpoint
vishalnayak Mar 7, 2017
239bd1c
Add locking where possible while doing auth/token/tidy
vishalnayak Mar 7, 2017
14aaa0a
Merge branch 'oss' into clean-stale-leases
vishalnayak Apr 26, 2017
3477038
Added atomic lock to ensure a single tidy operation is in progress
vishalnayak Apr 26, 2017
e52625d
Revoke lease that has empty token; added logs
vishalnayak Apr 26, 2017
dca0d70
Added logger to token store and logs to tidy function
vishalnayak Apr 26, 2017
de1a2a0
Added caching of looked up tokens
vishalnayak Apr 26, 2017
65c63b4
Fix the log statements
vishalnayak Apr 26, 2017
b036478
Fix logging levels
vishalnayak Apr 26, 2017
711153d
Fix logging suggestions; put the policyStore nil check back in
vishalnayak Apr 27, 2017
0d629ff
Cache only valid tokens
vishalnayak Apr 27, 2017
0c65cd4
Some more logging updates
vishalnayak Apr 27, 2017
785177a
Merge branch 'oss' into sys-tidy-leases
vishalnayak Apr 27, 2017
3fdf38a
Distinguish valid and invalid tokens using bool value in cache
vishalnayak Apr 27, 2017
98cdb68
Use an atomic lock for tidy operation in token store
vishalnayak Apr 27, 2017
2ef62fe
refactor lock handling in token tidy function
vishalnayak Apr 27, 2017
a8ef2c0
Refactor locking code in lease tidy; add ending debug statements
vishalnayak Apr 27, 2017
0892102
Merge branch 'oss' into sys-tidy-leases
vishalnayak May 1, 2017
8c7b175
Skip checking the validity of an empty client token
vishalnayak May 2, 2017
853233a
Added a test for tidying of empty token
vishalnayak May 2, 2017
d07d3cb
Added steps to check if invalid token is properly cleaned up
vishalnayak May 2, 2017
79fc0d8
Check if multiple leases with same invalid token is getting cleaned up
vishalnayak May 2, 2017
497bebe
Do not duplicate log lines for invalid leases
vishalnayak May 2, 2017
aa08e5c
Added test to check the atomicity of the lease tidy operation
vishalnayak May 2, 2017
a3c2a42
Test to check that leases with valid tokens are not being cleaned up
vishalnayak May 2, 2017
b6843ec
Added summary logs to help better understand the consequence
vishalnayak May 3, 2017
0c02540
Less scary debugging
jefferai May 3, 2017
b3c6a56
change some logging output
jefferai May 3, 2017
415b0a2
Two things:
jefferai May 3, 2017
0bda5a7
Adhere to tainted status in salted accessor lookup
jefferai May 3, 2017
8d35f92
consistent logging
vishalnayak May 3, 2017
a2e431b
Added logs when deletion fails so we can rely on server logs
vishalnayak May 3, 2017
2d21bf6
logging updates
vishalnayak May 3, 2017
5bc47b0
Add taint flag for looking up by accessor
jefferai May 3, 2017
b0c4a7e
Add more cleanup if a lease fails to register and revoke tokens if re…
jefferai May 3, 2017
265b4cd
Merge remote-tracking branch 'oss/master' into sys-tidy-leases
jefferai May 3, 2017
2f6e924
Fix substitution of index/child in delete call
jefferai May 3, 2017
f1d2fc3
Merge branch 'master-oss' into sys-tidy-leases
jefferai May 4, 2017
5320da0
Move tidy-leases to leases/tidy
jefferai May 4, 2017
cacf072
Update commenting
jefferai May 4, 2017
4de09fb
Update Tidy function comment
vishalnayak May 4, 2017
106f08a
Fix up the tests
vishalnayak May 4, 2017
7829107
Update comments
jefferai May 4, 2017
1378dd5
Move client token check in exp register to top
jefferai May 4, 2017
5dde45d
Address feedback
jefferai May 5, 2017
e61298e
Update debugging around tidy
jefferai May 5, 2017
0e10477
Merge branch 'master-oss' into sys-tidy-leases
jefferai May 5, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
152 changes: 149 additions & 3 deletions vault/expiration.go
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,18 @@ import (
"path"
"strings"
"sync"
"sync/atomic"
"time"

"github.com/armon/go-metrics"
log "github.com/mgutz/logxi/v1"

"github.com/hashicorp/errwrap"
multierror "github.com/hashicorp/go-multierror"
"github.com/hashicorp/go-uuid"
"github.com/hashicorp/vault/helper/consts"
"github.com/hashicorp/vault/helper/jsonutil"
"github.com/hashicorp/vault/helper/locksutil"
"github.com/hashicorp/vault/logical"
)

Expand Down Expand Up @@ -57,6 +61,8 @@ type ExpirationManager struct {

pending map[string]*time.Timer
pendingLock sync.Mutex

tidyLock int64
}

// NewExpirationManager creates a new ExpirationManager that is backed
Expand Down Expand Up @@ -114,6 +120,112 @@ func (c *Core) stopExpiration() error {
return nil
}

// Tidy cleans up the dangling storage entries for leases. It scans the storage
// view to find all the available leases, checks if the token embedded in it is
// either empty or invalid and in both the cases, it revokes them. It also uses
// a token cache to avoid multiple lookups of the same token ID. It is normally
// not required to use the API that invokes this. This is only intended to
// clean up the corrupt storage due to bugs.
func (m *ExpirationManager) Tidy() error {
var tidyErrors *multierror.Error

if !atomic.CompareAndSwapInt64(&m.tidyLock, 0, 1) {
m.logger.Debug("expiration: tidy operation on leases is already in progress")
return fmt.Errorf("tidy operation on leases is already in progress")
}

defer atomic.CompareAndSwapInt64(&m.tidyLock, 1, 0)

m.logger.Debug("expiration: beginning tidy operation on leases")

// Create a cache to keep track of looked up tokens
tokenCache := make(map[string]bool)
countLease := 0
deletedCountEmptyToken := 0
deletedCountInvalidToken := 0

tidyFunc := func(leaseID string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some progress logging so people can be aware things are still happening.

countLease++
if countLease%500 == 0 {
m.logger.Debug("expiration: tidying leases", "progress", countLease)
}

le, err := m.loadEntry(leaseID)
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth adding a memoizing cache (a map would work, as a set) so that if we already know a token does still exist we don't have to look it up again.

tidyErrors = multierror.Append(tidyErrors, fmt.Errorf("failed to load the lease ID %q: %v", leaseID, err))
return
}

if le == nil {
tidyErrors = multierror.Append(tidyErrors, fmt.Errorf("nil entry for lease ID %q: %v", leaseID, err))
return
}

var isValid, ok bool
revokeLease := false
if le.ClientToken == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we should probably revoke the lease as well.

m.logger.Debug("expiration: revoking lease which has an empty token", "lease_id", leaseID)
revokeLease = true
deletedCountEmptyToken++
goto REVOKE_CHECK
}

isValid, ok = tokenCache[le.ClientToken]
if !ok {
saltedID := m.tokenStore.SaltID(le.ClientToken)
lock := locksutil.LockForKey(m.tokenStore.tokenLocks, le.ClientToken)
lock.RLock()
te, err := m.tokenStore.lookupSalted(saltedID, true)
lock.RUnlock()

if err != nil {
tidyErrors = multierror.Append(tidyErrors, fmt.Errorf("failed to lookup token: %v", err))
return
}

if te == nil {
m.logger.Debug("expiration: revoking lease which holds an invalid token", "lease_id", leaseID)
revokeLease = true
deletedCountInvalidToken++
tokenCache[le.ClientToken] = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary, but I think you might as well add a goto REVOKE_CHECK statement here incase more branches are added in the future. Makes it slightly more extendable.

} else {
tokenCache[le.ClientToken] = true
}
} else {
if isValid {
return
} else {
m.logger.Debug("expiration: revoking lease which contains an invalid token", "lease_id", leaseID)
deletedCountInvalidToken++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

revokeLease = true
}
}

REVOKE_CHECK:
if revokeLease {
// Force the revocation and skip going through the token store
// again
err = m.revokeCommon(leaseID, true, true)
if err != nil {
tidyErrors = multierror.Append(tidyErrors, fmt.Errorf("failed to revoke an invalid lease with ID %q: %v", leaseID, err))
return
}
}
}

if err := logical.ScanView(m.idView, tidyFunc); err != nil {
return err
}

m.logger.Debug("expiration: ending tidy operation on leases")

m.logger.Debug("expiration: number of leases scanned", "total", countLease)
m.logger.Debug("expiration: number of revoked leases which had empty tokens", "deleted_count", deletedCountEmptyToken)
m.logger.Debug("expiration: number of revoked leases which had invalid tokens", "deleted_count", deletedCountInvalidToken)

return tidyErrors.ErrorOrNil()
}

// Restore is used to recover the lease states when starting.
// This is used after starting the vault.
func (m *ExpirationManager) Restore() error {
Expand Down Expand Up @@ -498,7 +610,7 @@ func (m *ExpirationManager) RenewToken(req *logical.Request, source string, toke
// Register is used to take a request and response with an associated
// lease. The secret gets assigned a LeaseID and the management of
// of lease is assumed by the expiration manager.
func (m *ExpirationManager) Register(req *logical.Request, resp *logical.Response) (string, error) {
func (m *ExpirationManager) Register(req *logical.Request, resp *logical.Response) (id string, retErr error) {
defer metrics.MeasureSince([]string{"expire", "register"}, time.Now())
// Ignore if there is no leased secret
if resp == nil || resp.Secret == nil {
Expand All @@ -515,8 +627,38 @@ func (m *ExpirationManager) Register(req *logical.Request, resp *logical.Respons
if err != nil {
return "", err
}

leaseID := path.Join(req.Path, leaseUUID)

defer func() {
// If there is an error we want to rollback as much as possible (note
// that errors here are ignored to do as much cleanup as we can). We
// want to revoke a generated secret (since an error means we may not
// be successfully tracking it), remove indexes, and delete the entry.
if retErr != nil {
revResp, err := m.router.Route(logical.RevokeRequest(req.Path, resp.Secret, resp.Data))
if err != nil {
retErr = multierror.Append(retErr, errwrap.Wrapf("an additional error was encountered revoking the newly-generated secret: {{err}}", err))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we distinguish the 4 error messages here so that if we hit them, we can know where exactly the error was caused.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

} else if revResp != nil && revResp.IsError() {
retErr = multierror.Append(retErr, errwrap.Wrapf("an additional error was encountered revoking the newly-generated secret: {{err}}", revResp.Error()))
}

if err := m.deleteEntry(leaseID); err != nil {
retErr = multierror.Append(retErr, errwrap.Wrapf("an additional error was encountered revoking the newly-generated secret: {{err}}", err))
}

if err := m.removeIndexByToken(req.ClientToken, leaseID); err != nil {
retErr = multierror.Append(retErr, errwrap.Wrapf("an additional error was encountered revoking the newly-generated secret: {{err}}", err))
}
}
}()

if req.ClientToken == "" {
return "", fmt.Errorf("expiration: cannot register a lease with an empty client token")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move the defer function to be after this check? There is no point in revoking a lease which didn't even get through the input validations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

le := leaseEntry{
LeaseID: path.Join(req.Path, leaseUUID),
LeaseID: leaseID,
ClientToken: req.ClientToken,
Path: req.Path,
Data: resp.Data,
Expand Down Expand Up @@ -548,6 +690,10 @@ func (m *ExpirationManager) Register(req *logical.Request, resp *logical.Respons
func (m *ExpirationManager) RegisterAuth(source string, auth *logical.Auth) error {
defer metrics.MeasureSince([]string{"expire", "register-auth"}, time.Now())

if auth.ClientToken == "" {
return fmt.Errorf("expiration: cannot register an auth lease with an empty token")
}

// Create a lease entry
le := leaseEntry{
LeaseID: path.Join(source, m.tokenStore.SaltID(auth.ClientToken)),
Expand Down Expand Up @@ -668,7 +814,7 @@ func (m *ExpirationManager) revokeEntry(le *leaseEntry) error {
// Revocation of login tokens is special since we can by-pass the
// backend and directly interact with the token store
if le.Auth != nil {
if err := m.tokenStore.RevokeTree(le.Auth.ClientToken); err != nil {
if err := m.tokenStore.RevokeTree(le.ClientToken); err != nil {
return fmt.Errorf("failed to revoke token: %v", err)
}

Expand Down
Loading