Seeing slow helm chart deployment #1597

MitchellGerdisch · 2021-05-28T15:04:05Z

Deploying gloo-ee helm chart on EKS cluster with a set of gloo values specified takes hours to complete.
The delay occurs in both preview and pulumi up use cases and the delay occurs before any resource updates are displayed.
If the gloo values are commented out of the values file, the deployment is speedy (under 2 minutes).

Expected behavior

Be able to handle custom values for the chart in a timely fashion.

Current behavior

Takes hours to process the gloo custom values before deploying the chart components.

Steps to reproduce

1. Launch an EKS cluster. The defaults for a cluster launched via the eks package are sufficient. Make sure it returns a stack output named kubeconfig.
2. Use the attached files (remove the .txt extension) to launch the gloo-ee chart on the EKS cluster from step 1.
   - You'll need to set a config value containing the stack name from step 1.
   - main.py.txt
   - values.yaml.txt
3. Test with the values file where the gloo: section is commented out and notice that it deploys quickly.
4. Uncomment the gloo: section and notice that it takes a long time to process before deploying the components.

Context (Environment)

Affected feature

lukehoban · 2021-05-29T00:56:19Z

This is hanging (or just taking a really long time) in Output.from_input.

The call that hangs is

pulumi-kubernetes/sdk/python/pulumi_kubernetes/helm/v3/helm.py

Lines 428 to 430 in 8d00716

    
    def to_json(self):
        return pulumi.Output.from_input(self.__dict__).apply(
            lambda x: json.dumps(x, default=lambda o: {k: v for (k, v) in o.__dict__.items() if v is not None}))

The output of print(self.__dict__) there is:

{'namespace': 'gloo-system', 'include_test_hook_resources': None, 'skip_crd_rendering': None, 'values': {'gloo': None, 'gatewayProxies': {'gatewayProxy': {'gatewaySettings': {'customHttpGateway': {'options': {'httpConnectionManagerSettings': {'tracing': {'verbose': True, 'requestHeadersForTags': ['x-user-id'], 'datadogConfig': {'clusterName': 'datadog_agent', 'service_name': 'envoy'}}}}}, 'customHttpsGateway': {'options': {'httpConnectionManagerSettings': {'tracing': {'verbose': True, 'requestHeadersForTags': ['x-user-id'], 'datadogConfig': {'clusterName': 'datadog_agent', 'service_name': 'envoy'}}}}}, 'options': {'accessLoggingService': {'accessLog': [{'fileSink': {'path': '/dev/stdout', 'jsonFormat': {'startTime': '%START_TIME(%Y/%m/%dT%H:%M:%S%z %s)%', 'requestType': '%REQ(:METHOD)%', 'requestPath': '%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%', 'protocol': '%PROTOCOL%', 'duration': '%DURATION%', 'responseCode': '%RESPONSE_CODE%', 'upstreamCluster': '%UPSTREAM_CLUSTER%', 'requestSize': '%BYTES_RECEIVED%', 'responseSize': '%BYTES_SENT%', 'clientAddress': '%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%', 'userID': '%REQ(X-USER-ID)%'}}}]}}}, 'kind': {'deployment': None, 'replicas': 10, 'customEnv': [{'name': 'DD_ENV', 'value': 'env'}, {'name': 'DD_AGENT_HOST', 'valueFrom': {'fieldRef': {'fieldPath': 'status.hostIP'}}}]}, 'service': {'extraAnnotations': {'service.beta.kubernetes.io/aws-load-balancer-type': 'nlb', 'service.beta.kubernetes.io/aws-load-balancer-proxy-protocol': '*', 'service.beta.kubernetes.io/aws-load-balancer-access-log-enabled': 'true', 'service.beta.kubernetes.io/aws-load-balancer-access-log-emit-interval': '5', 'service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name': 'gloo-access-logs.env.pinecone.io'}}}}, 'tracing': {'provider': {'name': 'envoy.tracers.datadog', 'typed_config': {'@type': 'type.googleapis.com/envoy.config.trace.v3.DatadogConfig', 'collector_cluster': 'datadog_agent', 'service_name': 'envoy'}}, 'cluster': [{'name': 
'datadog_agent', 'connect_timeout': '1s', 'type': 'STRICT_DNS', 'lb_policy': 'ROUND_ROBIN', 'load_assignment': {'cluster_name': 'datadog_agent', 'endpoints': [{'lb_endpoints': [{'endpoint': {'address': {'socket_address': {'address': 'datadog-tracing.datadog', 'port_value': 8126}}}}]}]}}]}, 'discovery': {'enabled': False}, 'crds': {'create': True}, 'grafana': {'defaultInstallationEnabled': False}, 'prometheus': {'enabled': False}, 'observability': {'enabled': False}, 'apiServer': {'enable': False, 'enterprise': False}, 'settings': {'replaceInvalidRoutes': True, 'invalidConfigPolicy': {'replaceInvalidRoutes': True, 'invalidRouteResponseCode': 404, 'invalidRouteResponseBody': '{"message": "Not found"}'}}, 'global': {'extensions': {'extAuth': {'deployment': {'replicas': 20}, 'envoySidecar': True, 'standaloneDeployment': False}}}}, 'transformations': None, 'resource_prefix': None, 'api_versions': None, 'chart': 'gloo-ee', 'repo': None, 'version': 'v1.6.2', 'fetch_opts': <pulumi_kubernetes.helm.v3.helm.FetchOpts object at 0x10f070340>, 'release_name': 'glooe-helm-chart'}

It is not at all clear why Output.from_input would take hours to process this value.

lukehoban · 2021-05-29T03:46:10Z

Here's a reduced program that doesn't use Kubernetes at all - but still hangs for at least several minutes (might well be much longer):

import yaml
import pulumi
loaded_values = yaml.load(open('./values.yaml'), Loader=yaml.FullLoader)
pulumi.Output.from_input(loaded_values)

With the values.yaml file containing just:

gatewayProxies:
  gatewayProxy:
    gatewaySettings:
      options:
        accessLoggingService:
          accessLog:
            - fileSink:
                path: /dev/stdout
                jsonFormat:
                  startTime: "%START_TIME(%Y/%m/%dT%H:%M:%S%z %s)%"

The leaf node value "/dev/stdout", for example, is visited hundreds of times by from_input in the first minute of execution.

lukehoban · 2021-05-29T03:56:15Z

Output.from_input appears to be exponential in the depth of nested objects!
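
To see why redundant recursion compounds with nesting depth, here is a toy model in plain Python. It is not Pulumi's implementation: `nest`, `walk_once`, and `walk_twice` are hypothetical helpers, and the assumption that each level walks its children exactly twice is chosen only to make the growth visible.

```python
def nest(depth, leaf="/dev/stdout"):
    """Wrap a leaf value in `depth` single-key dicts."""
    val = leaf
    for _ in range(depth):
        val = {"k": val}
    return val


def walk_twice(val):
    """Count leaf visits when every dict level walks each child twice."""
    if isinstance(val, dict):
        total = 0
        for child in val.values():
            total += walk_twice(child)  # first walk (direct recursion)
            total += walk_twice(child)  # second walk (assumed to come from a helper)
        return total
    return 1


def walk_once(val):
    """Count leaf visits when every dict level walks each child once."""
    if isinstance(val, dict):
        return sum(walk_once(child) for child in val.values())
    return 1


# Print depth, single-walk visits, and double-walk visits side by side.
for depth in (1, 5, 10, 15):
    print(depth, walk_once(nest(depth)), walk_twice(nest(depth)))
```

In this model a single leaf under ten levels of nesting is visited 2**10 = 1024 times by the redundant walk and exactly once by the single walk, which is the kind of growth that would turn a modestly nested values file into a very long wait.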

joeduffy · 2021-05-29T21:26:56Z

Wow, that's wild! I couldn't help but look. It appears Output.all takes Input[T]s, so not clear why from_input is also recursively calling from_input, before invoking Output.all -- which itself will also call from_input recursively.

I'm sure there's some subtlety with unwrapping nested outputs or somesuch, however simplifying from_input to the following causes your repro above, Luke, to drop to under 1s for me.

diff --git a/sdk/python/lib/pulumi/output.py b/sdk/python/lib/pulumi/output.py
index e7cf52faa..658320bf2 100644
--- a/sdk/python/lib/pulumi/output.py
+++ b/sdk/python/lib/pulumi/output.py
@@ -255,7 +255,7 @@ class Output(Generic[T]):
         if _types.is_input_type(typ):
             # Since Output.all works on lists early, serialize the class's __dict__ into a list of lists first.
             # Once we have a output of the list of properties, we can use an apply to re-hydrate it back as an instance.
-            items = [[k, Output.from_input(v)] for k, v in val.__dict__.items()]
+            items = val.__dict__.items()

             # pylint: disable=unnecessary-comprehension
             fn = cast(Callable[[List[Any]], T], lambda props: typ(**{k: v for k, v in props})) # type: ignore
@@ -265,15 +265,14 @@ class Output(Generic[T]):
         if isinstance(val, dict):
             # Since Output.all works on lists early, serialize this dictionary into a list of lists first.
             # Once we have a output of the list of properties, we can use an apply to re-hydrate it back into a dict.
-            dict_items = [[k, Output.from_input(v)] for k, v in val.items()]
+            dict_items = val.items()
             # type checker doesn't like returning a Dict in the apply callback
             fn = cast(Callable[[List[Any]], T], lambda props: {k: v for k, v in props}) # pylint: disable=unnecessary-comprehension
             return Output.all(*dict_items).apply(fn, True)

         if isinstance(val, list):
-            list_items: List[Union[Any, Awaitable[Any], Output[Any]]] = [Output.from_input(v) for v in val]
             # invariant: http://mypy.readthedocs.io/en/latest/common_issues.html#variance
-            output: Output[T] = cast(Output[T], Output.all(*list(list_items))) # type: ignore
+            output: Output[T] = cast(Output[T], Output.all(*list(val))) # type: ignore
             return output

         # If it's not an output, list, or dict, it must be known and not secret

These mutually recursive functions unintentionally had exponential complexity in nesting depth of objects, arg types and most likely arrays. Remove the exponential complexity by avoiding direct recursion of from_input on itself, and relying on mutual recursion with all alone to reduce nested substructure. Also simplify the implementation to aid readability. Fixes pulumi/pulumi-kubernetes#1597. Fixes pulumi/pulumi-kubernetes#1425. Fixes pulumi/pulumi-kubernetes#1372. Fixes #3987.
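
The shape described in that commit message can be sketched with a toy model. This is an illustration only, not the code merged in pulumi/pulumi#7175: `toy_from_input`, `toy_all`, and the call counter are hypothetical names, and plain Python values stand in for outputs.

```python
# Hypothetical counter used only to show how many times from_input runs.
call_count = {"from_input": 0}


def toy_all(*items):
    """Normalize each item; this is the only place that descends into children."""
    return [toy_from_input(item) for item in items]


def toy_from_input(val):
    """Normalize a value, delegating all child handling to toy_all."""
    call_count["from_input"] += 1
    if isinstance(val, dict):
        # No direct recursion here: children are handed to toy_all untouched.
        return dict(zip(val.keys(), toy_all(*val.values())))
    if isinstance(val, list):
        return toy_all(*val)
    return val


values = {"a": {"b": {"c": [{"d": "/dev/stdout"}]}}}
result = toy_from_input(values)
print(result == values, call_count["from_input"])
```

Because only `toy_all` descends into children, each node is processed once, so the call count equals the number of nodes in the structure (six for the sample above) and grows linearly with depth.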

MitchellGerdisch added the kind/bug Some behavior is incorrect or out of spec label May 28, 2021

lukehoban self-assigned this May 29, 2021

lukehoban added this to the 0.57 milestone May 29, 2021

lukehoban added the impact/performance Something is slower than expected label May 29, 2021

lblackstone mentioned this issue May 29, 2021

Kubernetes 1.18+ with server side apply featuregate causes huge CPU usage and execution time. #1372

Closed

lukehoban mentioned this issue May 31, 2021

[sdk/python] Avoid exponential complexity for from_input/all pulumi/pulumi#7175

Merged

lukehoban closed this as completed in pulumi/pulumi#7175 Jun 1, 2021

pulumi-bot added the resolution/fixed This issue was fixed label Jun 1, 2021
