
Toolkit: Python application containing N+ Resources Crashes with error: Malformed request, "API" field is required #15088

Closed
bgshacklett opened this issue Jun 11, 2021 · 30 comments
Labels: bug, language/python, needs-reproduction, p1, package/tools

Comments

@bgshacklett

bgshacklett commented Jun 11, 2021

My team has run into an issue with the CDK toolkit crashing when we reach a certain number of resources within our application. We've ended up having to split the application multiple times, at this point, to deal with this limitation, which does not appear to be documented.

Any operation which causes CDK to run app.synth() appears to result in a crash. This may be as simple as running cdk list.

The exact number of resources in question is unknown at this time, but I suspect the number is somewhere around 1000, split across about 15 stacks.

Reproduction Steps

  1. Create a CDK application which contains more than N [explicitly defined] resources, where N is a yet-to-be-determined number, likely on the order of 1000 (a rough sketch follows these steps). More details incoming shortly.
  2. Ensure that the resources specified contain a sufficiently large configuration (specific requirements are currently unknown).
  3. Run cdk list within the application directory
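
For scale, the sketch below shows the kind of loop-based layout I've been experimenting with while trying to build a shareable test case. It is illustrative only: as noted in the update below, loop-generated resources have not actually reproduced the crash for me, and our real code base (which I can't share) defines its resources explicitly.

# Illustrative sketch only (CDK v1 Python, matching the environment below).
# Loop-generated resources like these have NOT reproduced the crash so far;
# this just shows the rough scale of the application in question.
from aws_cdk import core
from aws_cdk import aws_sqs as sqs


class BigStack(core.Stack):
    def __init__(self, scope: core.Construct, construct_id: str, count: int, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        for i in range(count):
            # Each queue is a single CloudFormation resource with some configuration.
            sqs.Queue(
                self,
                f"Queue{i}",
                visibility_timeout=core.Duration.seconds(120),
                retention_period=core.Duration.days(4),
            )


app = core.App()
for s in range(15):
    BigStack(app, f"Stack{s}", count=70)  # roughly 1,000 resources across 15 stacks
app.synth()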

What did you expect to happen?

CDK should output a list of stacks.

What actually happened?

CDK crashes with an error which appears to originate from JSII:

    throw new Error('Malformed request, "api" field is required');
    ^

There is a line in the JSII code which matches this error quite well:
https://github.com/aws/jsii/blob/main/packages/@jsii/runtime/lib/host.ts#L97

Environment

  • CDK CLI Version: 1.104.0
  • Framework Version: 1.106.1
  • Node.js Version: v14.16.1
  • OS: Windows
  • Language (Version): Python v3.8

Other

Further details incoming.


Update: 2021-06-24:
I've attempted numerous ways of looping over resource definitions in an attempt to recreate the issue and I have, thus far, been unable to create a test case outside of our repository, which is, sadly, not something I can share.

The details above have been updated, as far as possible, to include recent discoveries.


This is a 🐛 Bug Report

@bgshacklett bgshacklett added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jun 11, 2021
@bgshacklett
Author

As I was writing a test case for this, I found that the most recent versions of CDK at least produce a useful error in this situation:

    raise JSIIError(resp.error) from JavaScriptError(resp.stack)
jsii.errors.JSIIError: Number of resources: 501 is greater than allowed maximum of 500

I think it makes sense to close this issue, though a limit of 500 resources is quite low given that CDK is supposed to deal with multiple stacks compiled into applications, and CDK constructs often create many resources on their own.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@bgshacklett
Author

It would appear that I closed this issue prematurely. The limit that I saw was actually a per-stack limit, not a limit across the application; effectively, my test case was wrong. In my further attempts to build a proper test case, I've found that this seems to require not just a significant number of resources; those resources must also contain enough configuration detail to increase the size of some request payload.

I'm still working on getting a proper test case written, but I'm re-opening this in the meantime for visibility's sake, in case someone else has additional information about it.

@bgshacklett bgshacklett reopened this Jun 11, 2021
@peterwoodworth peterwoodworth added the package/tools Related to AWS CDK Tools or CLI label Jun 15, 2021
@rix0rrr
Contributor

rix0rrr commented Jun 21, 2021

The 500-resources-per-stack limit is imposed by CloudFormation and not something we can influence. Sorry you're experiencing problems with that right now. You should have been seeing a warning as you started to approach this limit; unfortunately, that's the best we can do. The only thing I can recommend is breaking your application up into Stacks or Nested Stacks to keep the resource count down.
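
As a rough illustration (not the reporter's actual code), grouping resources into nested stacks keeps the parent's resource count down, since each nested stack only counts as one resource in the parent template:

# Minimal sketch (CDK v1 Python, illustrative names): each NestedStack adds a
# single AWS::CloudFormation::Stack resource to the parent stack's template.
from aws_cdk import core
from aws_cdk import aws_sqs as sqs


class QueuesNested(core.NestedStack):
    def __init__(self, scope: core.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        for i in range(200):
            sqs.Queue(self, f"Queue{i}")


class ParentStack(core.Stack):
    def __init__(self, scope: core.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # 400 queues overall, but only two resources counted against this stack's limit.
        QueuesNested(self, "QueuesA")
        QueuesNested(self, "QueuesB")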

"Malformed request" is an odd error. You're running into a jsii issue here. It might or might not be related to the synth error. Will forward to the jsii team to have them triage.

@rix0rrr rix0rrr added the language/python Related to Python bindings label Jun 21, 2021
@MrArnoldPalmer MrArnoldPalmer removed their assignment Jun 21, 2021
@MrArnoldPalmer
Contributor

"Malformed request" doesn't really fit but my first guess would be an OOM error. @RomainMuller wdyt?

@bgshacklett
Author

@rix0rrr It's not a per-stack limitation that we've run into, though. We're keeping the number of resources below 200 per stack, so far. OOM is an interesting thought; I'll investigate that front.

@RomainMuller
Contributor

You can try to move past a potential OOM situation by setting NODE_OPTIONS=--max-old-space-size=X, where X is the amount of memory to allow, in megabytes. The default can be relatively low (it depends on several factors I won't elaborate on here). I suggest trying 4096 (4 GiB) or 8192 (8 GiB).

If this allows your app to run, then your problem likely is an OOM error.

@bgshacklett
Author

I was really hopeful that this might be the answer. Alas, setting --max-old-space-size=X had no impact on the behavior.

It definitely seems like we're overflowing some kind of storage space, and I'm beginning to think it might be the full size of the code base at this point, as I've been unable to duplicate the issue by running things in a loop. It seems to happen only when the resources are defined explicitly. To get a valid test case, I feel I might have to resort to code generation of some sort.

This may mean that the use of higher level constructs to bind things together could help us, so I'll do some investigation on that front.

@NGL321 NGL321 added p1 and removed needs-triage This issue or PR still needs to be triaged. labels Jul 23, 2021
@NGL321
Contributor

NGL321 commented Jul 23, 2021

Were we able to determine whether this was caused by the CloudFormation stack resource limitation, or whether it is a separate bug unrelated to that limit?

@bgshacklett
Author

It's definitely not a stack resource limitation. We're not even coming close to the old limits on a per-stack basis.

@jordankoppole

I ran into the same issue; there is no information in the error that relates to a resource limitation.

/lib/program.js:9474
    throw new Error('Malformed request, "api" field is required');
    ^
Error: Malformed request, "api" field is required

@rix0rrr
Contributor

rix0rrr commented Feb 8, 2022

I notice the platform on which this was reported is Windows. Is everyone who is seeing this error on Windows, by any chance?

@rix0rrr rix0rrr added needs-reproduction This issue needs reproduction. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Feb 8, 2022
@rix0rrr rix0rrr removed their assignment Feb 9, 2022
@github-actions

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 10, 2022
@bgshacklett
Author

I have not seen this error outside of a Windows environment. Unfortunately, I'm no longer working with the original code base in which I saw this, but I'll reach out to the current maintainers to see if anyone can duplicate it on a non-Windows machine.

@github-actions github-actions bot removed closing-soon This issue will automatically close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Feb 11, 2022
@kgeisink

kgeisink commented Apr 6, 2022

I unfortunately ran into this issue yesterday and have been doing some debugging. A consistently reproducible scenario is still eluding me, but perhaps some of this information can help.

Environment
CDK CLI Version: 2.19.0 (build e0d3e62)
Framework Version: 2.19.0
Node.js Version: v14.17.1
OS: Windows
Language (Version): Python v3.8

  • We experienced this issue initially on Windows but were able to trigger it on Linux (via AWS CodeBuild) and macOS as well.
  • When I print the malformed request it contains the following:
    {"complete":{"cbid":"jsii::callback::21883","err":"Maximum call stack size exceeded","result":null,"api":"complete"}}
    which makes it sound like something goes wrong in JSII?
  • We could not pin down a particular number of resources at which this started, though commenting out a couple of templates more or less always removes the problem.
  • Using NODE_OPTIONS did not produce any notable change.

Edit:

  • At a certain point I am able to trigger it by toggling the feature flag "aws-iam:minimizePolicies" (see the sketch below), so it definitely has to do with stack or project size limitations.
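
For reference, the flag can also be toggled from the app code rather than cdk.json; this is just a sketch, and I'm assuming the fully qualified context key is "@aws-cdk/aws-iam:minimizePolicies":

# Sketch: toggling the feature flag via app context instead of cdk.json.
# Setting it to False (un-minimized policies) is what triggers the error for us.
from aws_cdk import App

app = App(context={"@aws-cdk/aws-iam:minimizePolicies": False})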

@MrArnoldPalmer
Contributor

@kgeisink, you're able to make the error occur by disabling "aws-iam:minimizePolicies"? I.e., disabling the feature flag (bigger, unminimized policies) === error; enabling it (minimized policies), no error?

Just double-checking. It does seem like this has only been reported by Windows users thus far, so at least that gives us a place to start the investigation.

As a workaround, you can try increasing the node --stack-size option, as this appears to be a stack overflow and not an OOM. Perhaps the default stack size on Windows is smaller, and users are therefore more likely to encounter this there. If we can identify what recursion in the JSII runtime leads to this, we may be able to optimize it.

@kgeisink

@MrArnoldPalmer, In short, yes, but I only tested it after I'd found the "breaking point".
As I started commenting out resources, there was a particular point from which CDK succeeded in synthesizing again. Eventually I got to a point where I could trigger the error by adding/uncommenting one single resource. From there I tried setting aws-iam:minimizePolicies to false (it had been true as part of our default feature flags), and this triggered the error the same as adding/uncommenting that one resource would.

(To clarify: the breaking point is not tied to one particular resource, just (presumably) the number of resources.)

In addition to Windows, we also got the same error on our CodeBuild instance running the Linux image aws/codebuild/standard:5.0.

I tried cdk ls --stack-size with 10000 as the maximum value and still get the same error. I'm unsure how else to set the stack size for node on a CDK call. Let me know if there is a particular way to go about it. :)

There is an AWS Support Case (9889364641) currently active in which I provided a project with a reproducible scenario. I hope that it will also be reproducible on your end, and help to provide some more insight into this issue.

@bgshacklett
Author

If I understand correctly, it's the node stack size which needs to be increased. This should be doable by setting the NODE_OPTIONS environment variable:

# PowerShell
$env:NODE_OPTIONS = "--stack-size=10000"
cdk deploy

@kgeisink

kgeisink commented Apr 14, 2022

@bgshacklett Unfortunately --stack-size is not one of the allowed options for NODE_OPTIONS, at least according to what I found here.

Edit: Found the right way to set it, and can confirm I do not have the issue anymore locally when temporarily increasing the stack size.
cdk ls -a node.exe --stack-size 10000

@MrArnoldPalmer
Contributor

@kgeisink thanks for the confirmation. I'm planning to look into this when I'm able to, but it will likely be another couple of days.

@kgeisink

@MrArnoldPalmer Apologies for the false positive; it appears that cdk ls -a node.exe --stack-size=10000 is not actually a valid way to modify the stack size. I'm still looking for an appropriate way to modify that Node property when running CDK commands. If I find one, I will let you know the result.
Thank you for the update!

@MrArnoldPalmer
Contributor

@kgeisink @bgshacklett wondering if either of you have code available that we can look at that causes this error? When testing the JSII Python runtime and creating large numbers of objects (10 million+), I haven't gotten this error to reproduce. Is the stack trace originating in a consistent place within your code (a specific construct, etc.)?

@kgeisink

@MrArnoldPalmer I have shared our codebase with a reproducible state via the AWS Support case that I mentioned above (9889364641). Unfortunately I am not able to share it via other means due to NDA restrictions. Would you be able to access it through there? If not I can try anonymising the code but that might take a little while given the size.

I have not been able to pinpoint it to a specific place in the code; the stack trace is also very generic. I will add it as an attachment. While commenting/uncommenting various stacks and resources, I only noticed the trend that the total number of resources did seem to matter somehow.
E.g. when I removed a stack that contained 115 resources, I would not get the error anymore, whereas if I kept that stack around, I would need to remove 2-3 smaller stacks for the error to go away.

It does appear that there is some place in our project that causes incredibly inefficient resource management, as I was also not able to reproduce it by generating a large number of resources in loops. Though of course I do not have enough insight into CDK/JSII internals to know how much benefits from reuse.

I've also added some of the JSII_DEBUG output leading up to the error including the error itself.

cdk ls stacktrace.txt
JSII_DEBUG+output.txt

@kgeisink

kgeisink commented May 4, 2022

@MrArnoldPalmer I was wondering if the code that I shared was helpful and whether you happen to have an update? If there is anything I can do to help troubleshoot on my end, please let me know.

@bgshacklett
Author

I no longer have access to the original code base, unfortunately.

@MrArnoldPalmer
Contributor

@kgeisink yes! It is very helpful, and I was able to reproduce the issue with that codebase on macOS. I spent some time digging into the debug logs, but nothing immediately jumped out, and I got pulled onto some other work. I will keep working on this and provide an update when I'm able to.

@lihonosov

I had exactly the same error, Malformed request, "API" field is required, on a stack with ~200 resources. It didn't work on Node v16 but worked fine on Node v14. I found that the root cause in my case was an aspect I used to add a permissions boundary to all roles in a stack. I removed that aspect and everything works fine now. The latest versions of CDK support this out of the box:

from aws_cdk import aws_iam as _iam

# This imports an existing policy.
boundary = _iam.ManagedPolicy.from_managed_policy_arn(
    scope=stack,
    id="Boundary",
    managed_policy_arn="arn:aws:iam::123456789012:policy/boundary",
)

# Apply the boundary to all Roles in the stack.
_iam.PermissionsBoundary.of(stack).apply(boundary)
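
For context, the aspect I removed looked roughly like this; it's a minimal sketch rather than the exact code, and the names are illustrative:

# Rough sketch (CDK v2 Python) of the kind of aspect that caused the error in my case.
import jsii
from aws_cdk import Aspects, IAspect
from aws_cdk import aws_iam as _iam
from constructs import IConstruct


@jsii.implements(IAspect)
class PermissionsBoundaryAspect:
    def __init__(self, boundary_arn: str) -> None:
        self._boundary_arn = boundary_arn

    def visit(self, node: IConstruct) -> None:
        # Called once for every construct in the tree; each call crosses the
        # Python <-> JavaScript boundary through jsii.
        if isinstance(node, _iam.CfnRole):
            node.permissions_boundary = self._boundary_arn


# Previously applied to every stack, e.g.:
# Aspects.of(stack).add(PermissionsBoundaryAspect("arn:aws:iam::123456789012:policy/boundary"))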

I hope this information will be useful for someone else

https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_iam-readme.html#permissions-boundaries
https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_iam/PermissionsBoundary.html

@bgshacklett
Author

The code base I was working with used a custom aspect for a similar purpose.

@TheRealAmazonKendra
Contributor

It would appear that the various problems in this issue have all been solved, so I'm going to go ahead and close it. If you believe this is in error, please feel free to open a new issue.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
