Batch retrieval design

Useful Links
Motivation
Design
URI
Request Body
- Headers
Response
Proposed API vs Current API
Permissions
Open questions

This document will cover the options for enhancing Batch Retrieval requests.

Useful Links

Name	Link
Secrets Provider Job Milestone - Solution Design	https://github.com/cyberark/secrets-provider-for-k8s/blob/master/design/milestone_1_2_design_doc.md
Secrets Provider - Feature Doc	https://app.zenhub.com/workspaces/palmtree-5d99d900491c060001c85cba/issues/cyberark/secrets-provider-for-k8s/163

References

https://docs.microsoft.com/en-us/graph/json-batching#first-json-batch-request
https://docs.microsoft.com/en-us/graph/json-batching#response-format

Motivation

Previously, for the init container solution, each Secrets Provider was deployed to the same Pod as the customer application containers. Because of this, when a batch retrieval request was made, the requests were dispersed across many call to the Conjur server. When making the batch retrieval request and a permission error occurred for example, then the batch request would fail and only that Pod would not spin up successfully.

Now, with the Job Milestone, the Secrets Provider is deployed outside of the customers' Pods, supporting multiple customer pods. With this, a single batch retrieval request will be made for all applications to fetch the necessary secrets to deploy properly. Therefore, if there is a failure in receiving one of the values, then the whole request will fail. In other words, no application Pod will spin up in the namespace. This leads to a faulty user experience.

The following table outlines possible solutions to overcome this:

	Solution	Pros	Cons	Effort (T-shirt size)
1	Client side: For each K8s Secret, perform a new Batch request	- No server side changes	Load on server with extra calls (*)	S, 5 days
2	Server side: Update Batch retrieval to return list of variables and their individual responses (success/failure) (**)	- Best practice / straightforward design for batch endpoint - Will help us during rotation for Milestone 2	- Requires both client/server implementation	M, 10 days
3	Stay as is	No additional work needed	- Bad UX	-

(*) Fallback solution: use this solution as a safety net, only if the original Batch Retrieval request fails

(**) This solution can be broken up in two: 1. Create a new API endpoint or 2. use existing one (return 409 response)

Decision: We decided to go with #2 and create our own API. We decided to create a new batch retrieval API because of the following reasons:

Adding onto the existing batch endpoint would lead to changing behaviors for our needs in the Secrets Provider which would have cascading impacts on other teams.
The two APIs (current and new) offer separate use-cases; partial success (new) and fail fast (current)
The current batch retrieval API accepts variable IDs as query parameters. We anticipate a long string of variableIDs to be passed in for secret retrieval and this impacts performance.
Query parameters for HTTP GET requests are limited in length (2,048 characters) which is problematic if a customer has long secrets paths.
The new route offers is more useful if we choose to expand its functionality in the future. By using POST instead of GET we can not only opt to use query parameters but also utilize the body of the request.

At this stage this new batch retrieval API endpoint will be for internal use only.

Design

Secrets Provider init container flow

As a small recap, the following explains the Secrets Provider init container flow

The cyberark-secrets-provider-for-k8s runs as an init container and authenticates with the Conjur/DAP Follower using the Kubernetes Authenticator (authn-k8s).
The cyberark-secrets-provider-for-k8s reads all Kubernetes secrets required by the applications in the same namespace.
For each mapped Kubernetes secret, the cyberark-secrets-provider-for-k8s

a. Performs a Batch Retrieval request and retrieves Conjur/DAP secrets.

b. Updates the Kubernetes secret with the Conjur/DAP secret value.
The cyberark-secrets-provider-for-k8s init container runs to completion.

The new Batch Retrieval API will effect #2a. This will effect both init and application container (Milestone Job) flows.

URI

POST /secrets

This method retuns a list of secrets, their status code, and their value if an error did not occur. For differences between proposed API and existing API see Proposed API vs Current API section below. We require account as part of the URI instead of as part of the variableIDs

POST /secrets/{account}

Request Body

The request body is a list of all variable IDs (paths to secrets in Conjur/DAP) that need to be retrieved. For example:

{
  "variables": [
    {"id": "var/path/conjurSecret"},
    {"id": "var/path/anotherConjurSecret"},
    ...
  ]
}

Headers

Content-Type: application/json
Authorization: Token token="eyJkYX...Rhb="

Response

Batch Response

# Headers
HTTP/1.1
Content-Type: application/json

# Status response code
200 OK

# Body
{
  "variables": [
  {
    "id": "var/path/name1",
    "status": 200,
    "value": "superSecret"
  },
  {
    "id": "var/path/name2",
    "status": 403,
    "error": {
      "message": "Forbidden"
    }
  },
  {
    "id": "var/path/name3",
    "status": 404,
    "error": {
      "message": "Not Found"
 		}
  },
  {
    "id": "var/path/name4",
    "status": 200,
    "value": "superSuperSecret"    
  }
 ]
}

Code	Description
200 OK	All secret values were retrieved successfully
400 Bad Request	Server cannot process request due to client error (unparsable request body)
401 Unauthorized	The request lacks valid authentication credentials

If no Conjur/DAP secrets were able to be retrieved, the response will still return a 200 OK because the [request body is parsable]([https://docs.microsoft.com/en-us/graph/json-batching#:~:text=The%20status%20code%20on%20a,requests%20inside%20the%20batch%20succeeded](https://docs.microsoft.com/en-us/graph/json-batching#:~:text=The status code on a,requests inside the batch succeeded)) and request was handled.

If an invalid request body was supplied, a 400 Bad Request will be returned.

If the request was sent with an invalid access token, a 401 Unauthorized response will be returned.

Response body

All secret values will be treated as individual entities. If a single secret fails to be retrieved, this will not impact the integrity of the request. Each secret has it's own response code.

Parameter	Description
id	ID for Conjur/DAP variable
value	Secret for variable ID
error.status	Response status code of the error. See table below
error.message	Response message of the error. See table below

Individual secret responses

The following are the possible error codes that can be returned for individual secrets.

Status/Message	Description
400 Bad Request	Server cannot process request due to client error (invalid input parameters or logical validations). For example: adding `<>` in path to secret in Conjur/DAP
403 Forbidden	The authenticated user lacks the necessary privilege
404 Not Found	Variable does not exist or variable does not have secret value

Proposed API vs Current API

	Current	Proposed
URI	`GET /secrets{?variable_ids}}`	`POST /secrets/{account}`
Request Body	-	See Request Body section
VariableID	`variable_ids=myorg:variable:var/path/conjurSecret`	`variable_ids=var/path/conjurSecret`
Response Body	JSON of only successfully retrieved secrets and their values. Single failure, fails whole request and no secret is returned.	JSON of all secrets, their status code/messages, and values for successfully retrieved secrets
Behavior	Request fails when individual request fails	Request does not fail when sub/individual requests fail

Permissions

Only an authorized Host with read and execute privileges on the Conjur/DAP secrets will be able to fetch the secret values. Otherwise, the proper 403 Forbidden error code will be returned for status and message of the response respectively.

During configuration of the Secrets Provider, the Host detailed for authnLogin is the Host that the Secrets Provider will use to authenticate to Conjur/DAP Follower and retrieve Conjur/DAP secrets. Therefore, this is the Host that will be used to make the request to this new Batch API endpoint and retrieve secrets from Conjur/DAP.

Mitigation

Now that it is easy to brute-force multiple variables at a time in the body of the request, we should implement the necessary mitigations to close the attack surface.

Audit

A new audit entry will made per secret fetched from Conjur/DAP as is done in the current Batch retrieval API endpoint. Also for all secrets that are attempted to be fetched and that the batch retrieval request was made.

Test Plan

Conjur

The following integration tests will be implemented in Conjur

Request tests

The following table are possible responses ordered in terms of prioritization.

Priority	Scenario	Response
1	Access token is invalid	401 Unauthorized
2	Account does not exist	404 Not Found (if account is in URI)
3	JSON body is not parsable	400 Bad Request

Individual secret request tests

The following table are the permutations for the possible individual sercet responses. Each permutation will be its own test.

Variable	Execute	Read	Response	Message
Not Exist	-	-	404	Not Found
Exists with value	No	No	400	Not Found
""	Yes	No	200	OK
""	Yes	Yes	200	OK
""	No	Yes	403	Forbidden *Security!
Has no value	No	No	404	Not Found
""	Yes	No	???	Has No Value
""	Yes	Yes	???	Has No value
""	No	Yes	403	Forbidden

We will ensure all the proper audit record should appear for the whole request and each individual secret. All tests will be automated.

As part of input validation, we want to ensure that we deserialize only the parts of the JSON request that we expect. For example for input in the following format:

{
  "variables": [
    {"id": "var/path/conjurSecret"},
    {"anotherId": "var/path/notToBeUsed"},
    ...
  ]
}

We will only use the parts we expect and need. For example:

{
  "variables": [
    {"id": "var/path/conjurSecret"},
    ...
  ]
}

Secrets Provider for K8s

The following integration tests will be implemented in Secrets Provider for K8s

	Scenario	Given	When	Then	Manual/Automated
1	Vanilla flow, Secret Provider Job successfully updates K8s Secrets	- Conjur is running - Authenticator is defined - Secrets defined in Conjur and mapped in K8s Secrets - All mandatory values are defined in `custom-values.yaml` - Customer installs Secrets provider Chart	Secrets Provider runs as a Job	- Secrets Provider pod authenticates and fetches Conjur secrets successfully - All K8s Secrets are updated with Conjur value - Secrets Provider Job runs to completion - Verify logs	Automated
2	Partial success	- Conjur is running - Authenticator is defined - Secrets defined in Conjur and mapped in K8s Secrets - All mandatory values are defined in `custom-values.yaml` - Host defined to Secrets Provider in `custom-values.yaml` does not have access to one of the Conjur secrets - Customer installs Secrets provider Chart	Secrets Provider runs as a Job	- Secrets Provider pod authenticates and attempts to fetch Conjur secrets - All K8s Secrets are updated with those Conjur values successfully fetched - Secrets Provider Job runs to exhaustion - Verify logs #1/2/3 appear	Automated
3	Vanilla test, Conjur server is not up-to-date	- Outdated Conjur server is running - Authenticator is defined - Secrets defined in Conjur and mapped in K8s Secrets - All mandatory values are defined in `custom-values.yaml` - Host defined to Secrets Provider in `custom-values.yaml` - Customer installs Secrets provider Chart	Secrets Provider runs as a Job	- Secrets Provider pod authenticates and attempts to fetch Conjur secrets - Application deploys successfully - Verify logs #4 appear	Automated
4	Conjur server is not up-to-date	- Outdated Conjur server is running - Authenticator is defined - Secrets defined in Conjur and mapped in K8s Secrets - All mandatory values are defined in `custom-values.yaml` - Host defined to Secrets Provider in `custom-values.yaml` does not have access to one of the Conjur secrets - Customer installs Secrets provider Chart	Secrets Provider runs as a Job	- Secrets Provider pod authenticates and attempts to fetch Conjur secrets - Secrets Provider Job runs to exhaustion - Application fails to be deployed - Verify logs #1/4 appear	Automated

Logs

	Scenario	Log message	Log level
1	Secrets Provider batch retrieval has partial success	Failed to retrieve a portion of the Conjur/DAP secrets: Reason: '%s'	Error
2	Secrets Provider batch retrieval total failure	Failed to retrieve Conjur/DAP secrets: Reason: '%s'	Error
3	Secrets Provider batch retrieval has partial success	Skipped Kubernetes Secrets '%s'. Reason: Failed to retrieve Conjur secrets	Debug
4	Secrets Provider success	Successfully updated '%d' out of '%d' Kubernetes Secrets	Info
5	Old batch retrieval endpoint	Warning: Secrets Provider cannot efficiently run because Conjur is not up to date. Please consider upgrading to the latest version	Warn

Performance tests

The following performance tests will be run using Jmeter. These tests will be run for both the new API and the existing one, and compare the results between them.

Benchmark

Section	Parameter	Value
Conjur	# of standbys	2
Conjur	# of followers	1
Environment	Kubernetes	OpenShift 3.11
Environment	# of cores	16
Environment	# of nodes	1
Test	# of requesting threads	20
Test	# of variables	1025
Test	# of variabels per policy	5
Test	# of policies	205
Test	Variable ID length	50
Test	Variable value length (secret)	50

	Scenario	Given	When	Then	Manual/Automated
1	Response time	- Conjur is running - Host is defined - 1024 Secrets are defined	For X=1, X<=1024, X=X*2: Concurrently retrieve X secrets for 5 minutes, measure average response time. Repeat each step 3 times.	Calculate the average response time between the 3 cycles for each step. Plot the average response time per number of secrets provider.	Manual
2	Throughput	- Conjur is running - Host is defined - 1024 Secrets are defined	For X=1, X<=1024, X=X*2: Concurrently retrieve X secrets for 5 minutes, measure throughput. Repeat each step 3 times.	Calculate the throughput between the 3 cycles for each step. Plot the throughput per number of secrets provider.	Manual

Open questions

Do we want this duplicated API documented? At this stage this API endpoint will be for internal use only.
Do we ultimately want to move to a single Batch Request API and depricate the existing one?
Do we want to support v4?

Future enhancements

The following are ideas for possible extension of the batch retrieval endpoint API and not necessary directly related to the Secrets Provider project.

Add secret versions as a filtering query parameter so we can get all secrets after a certain version
Fetch all secrets that the user/host has permissions to
Fetch all secrets after /var/path/* that the user/host has permissions to

Delivery Plan

Fix flaky tests 2 days
Implement new API endpoint in Conjur 5 days
Add integration tests in Conjur 4 days
Golang SDK
- Implement batch logic 2 days
- Add unit tests 2 days
Implement change in Secrets Provider 3 days
Implement test plan
- Add integration tests in Secrets Provider 3 days
Performance tests
- In Conjur 5 days
- In Secrets Provider 3 days
Team/cross-team review cycle 3 days
Logs reviewed by TW 1 day

Total: 33 days ~6.5 weeks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

batch_retrieval_design.md

batch_retrieval_design.md

Batch retrieval design

Useful Links

References

Motivation

Design

Secrets Provider init container flow

URI

POST /secrets

Request Body

Headers

Response

Batch Response

Response body

Individual secret responses

Proposed API vs Current API

Permissions

Mitigation

Audit

Test Plan

Conjur

Request tests

Individual secret request tests

Secrets Provider for K8s

Logs

Performance tests

Benchmark

Open questions

Future enhancements

Delivery Plan

Files

batch_retrieval_design.md

Latest commit

History

batch_retrieval_design.md

File metadata and controls

Batch retrieval design

Useful Links

References

Motivation

Design

Secrets Provider init container flow

URI

POST /secrets

Request Body

Headers

Response

Batch Response

Response body

Individual secret responses

Proposed API vs Current API

Permissions

Mitigation

Audit

Test Plan

Conjur

Request tests

Individual secret request tests

Secrets Provider for K8s

Logs

Performance tests

Benchmark

Open questions

Future enhancements

Delivery Plan