Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack Size increased with 1.18.x (platform-tools) #1186

Open
adrena-orex opened this issue May 4, 2024 · 20 comments
Open

Stack Size increased with 1.18.x (platform-tools) #1186

adrena-orex opened this issue May 4, 2024 · 20 comments

Comments

@adrena-orex
Copy link

Problem

Programs built with solana 1.18.x experience new stack size related issues.

Details

We started having issues after migrating from solana 1.16.16 and anchor 0.28.0 to solana 1.18.11 and anchor 0.29.0.

Started getting messages when building informing us of our ix using too much stack, i.e:

Error: Function _ZN145_$LT$adrena..instructions..add_custody..AddCustody$u20$as$u20$anchor_lang..Accounts$LT$adrena..instructions..add_custody..AddCustodyBumps$GT$$GT$12try_accounts17h13b564b67762dd31E Stack offset of 4232 exceeded max offset of 4096 by 136 bytes, please minimize large stack variables

Then at runtime:

'Program 76FLzSMB8xPPt2y4ZKWeMiQBnXnqvU2xyJtc4PogsxMa consumed 2523 of 200000 compute units',
'Program 76FLzSMB8xPPt2y4ZKWeMiQBnXnqvU2xyJtc4PogsxMa failed: Access violation in stack frame 5 at address 0x200005ff0 of size 8'

Hotfixed the issue by splitting IXs and migrating accounts to use zero_copy.

We have noticed that the init of accounts takes a lot of stack, 800 bytes+. Can't have more than 3 accounts to init within the same instruction atm.

Note: Even when testnet tests works, it may still fail on devnet.

Setup:

orex> solana --version
solana-cli 1.18.11 (src:c78a9720; feat:3469865029, client:SolanaLabs)
orex> anchor --version
anchor-cli 0.29.0
orex> solana cluster-version                                                                                             
1.18.12
orex> solana config get
Config File: /Users/orex/.config/solana/cli/config.yml
RPC URL: https://api.devnet.solana.com 
WebSocket URL: wss://api.devnet.solana.com/ (computed)
Keypair Path: /Users/orex/.config/solana/id.json 
Commitment: processed

Using https://github.com/anza-xyz/platform-tools/releases v1.41 to compile.

Related

Found an issue here talking about platform tools growing stack usage.

Related to #1158

Proposed Solution

@LucasSte
Copy link

I believe the problem is fixed is platform tools version v1.41. You can check the version of your tools using running cargo-build-sbf --version.

Either way, can you provide an example code that triggers the problem?

@etodanik
Copy link

@LucasSte we have an existing program that got into this problem when moving from 1.17 to 1.18.15, even though platform tools is 1.41, there's still significantly more stack usage compared to 1.17.

In our case anchor's init constraints were the culprit, so we fixed it by moving some ATA account inits to separate IX, but I believe 1.18 is still vastly overusing stack.

@LucasSte
Copy link

@LucasSte we have an existing program that got into this problem when moving from 1.17 to 1.18.15, even though platform tools is 1.41, there's still significantly more stack usage compared to 1.17.

In our case anchor's init constraints were the culprit, so we fixed it by moving some ATA account inits to separate IX, but I believe 1.18 is still vastly overusing stack.

Did you run into the stack access violation error as well? If so, would mind providing an example of the offending code?

@blockiosaurus
Copy link

We're seeing this all over the Metaplex Program Library as users and ourselves migrate to 1.18

@Woody4618
Copy link

Here is a simple reproduction of the stack frame problem, or at least of one possible occasion. When running the tests in this repository against a local validator that has feature: EenyoWx9UMXYKpR8mW5Jmfmy2fRjzUtM7NduYMY8bx33 diesabled the stack frame error appears, when running them locally with the feature enabled the error does not appear. You can find the code here: https://github.com/Woody4618/stackframe_repro/blob/main/README.md

@LucasSte
Copy link

Here is a simple reproduction of the stack frame problem, or at least of one possible occasion. When running the tests in this repository against a local validator that has feature: EenyoWx9UMXYKpR8mW5Jmfmy2fRjzUtM7NduYMY8bx33 diesabled the stack frame error appears, when running them locally with the feature enabled the error does not appear. You can find the code here: https://github.com/Woody4618/stackframe_repro/blob/main/README.md

The feature disables stack frame gaps, amongst other changes, to enhance the network performance. The gaps serve to detect invalid memory accesses by programs. If you enable the feature, the protection goes away, but it does not mean your contract is behaving correctly.

The Rust compiler does not guarantee a stable ABI, so default sizes may change between releases. Tempering with the compiler to change allocation would overshadow a language feature: Rust allows you to choose where to allocate items (e.g. by using Box).

The main issue I've found was the try_accounts from Anchor, which needs to be split into smaller methods to decrease stack usage.

coral-xyz/anchor#3060

@etodanik
Copy link

@LucasSte I agree that try_accounts is stack heavy, but we should also not regress like that on an already limited stack memory usage efficiency. We'll post our repro shortly as well, but it also has to do with Anchor accounts.

@Woody4618
Copy link

Woody4618 commented Jun 28, 2024

Will the feature be enabled on mainet soon? Otherwise it would help maybe disable it on localhost and devnet, so people dont get surprised when deploying to mainnet? Then they can fix the problems, by splitting instructions and boxing accounts savely.

@LucasSte
Copy link

Will the feature be enabled on mainet soon? Otherwise it would help maybe disable it on localhost and devnet, so people dont get surprised when deploying to mainnet? Then they can fix the problems, by splitting instructions and boxing accounts savely.

I've found out that the compiler has been missing stack errors. This is why you may not see a compiler error, but your contract may error out during execution. The next release will solve this. If you see the error there, your contract will hit the same error on main net.

@LucasSte
Copy link

LucasSte commented Jul 8, 2024

I discussed this issue with @acheroncrypto it looks like PR coral-xyz/anchor#2939 fixes the stack problems with try_accounts

@Woody4618
Copy link

Woody4618 commented Sep 9, 2024

Hey @LucasSte there is another ecosystem team that has stack frame issues when upgrading from 1.17 to 1.18. They created a great reproduction:
https://github.com/oeble/access_violation_repro

It has warnings like these: Error: Function ZN105$LT$vprog..InitializeStrategy$u20$as$u20$anchor_lang..Accounts$LT$vprog..InitializeStrategyBumps$GT$$GT$12try_accounts17hb64a220c71e28f36E Stack offset of 5168 exceeded max offset of 4096 by 1072 bytes, please minimize large stack variables

Are these warnings now always turn into runtime errors with version 1.18? Whats the best way to fix these? Is the only option to decrease size of the accounts?

@LucasSte
Copy link

LucasSte commented Sep 9, 2024

Are these warnings now always turn into runtime errors with version 1.18?

These warnings are an indication that your program will misbehave during execution. You do not always see the runtime error in a program with such warning, because the offending case depends on which code path you execute. In general, all stack warnings emitted will lead to a runtime error.

Whats the best way to fix these? Is the only option to decrease size of the accounts?

I identified that the compiler was missing stack warning for many cases. This rendered any fix attempt by Anchor or other developers difficult. In Agave v2.0, we shipped platform-tools v1.42, with fixes to warn about all possible scenarios the contract may cause a runtime error. This allowed the Anchor team to merge coral-xyz/anchor#2939, which decreases the stack usage of function try_accounts (the warning @Woody4618 mentioned points to this function).

I am not aware of Anchor's release schedule, but I can say that upgrading to platform tools v1.42 and using an Anchor version with such a fix will solve most problems.

@acheroncrypto
Copy link

As @LucasSte mentioned, this problem is mostly fixed from Anchor's side, but the next version is currently blocked by mpl-token-metadata being incompatible with the Solana v2 crates coral-xyz/anchor#3219 (comment). Fortunately, there is already a PR to resolve this (metaplex-foundation/mpl-token-metadata#141).

@blockiosaurus
Copy link

Yup, we're on it. As long as there isn't a blocker updating our client crate and leaving the program on 1.x then it should be done soon.

@lukenguyen0191
Copy link

Hey can you guys help? I'm using anchor v0.30.1, agave and solana-cli v1.18.26, rustc v1.83.0 and I also set the version of solana-program in Cargo.toml to "=1.18.26", but when I run anchor build it always tries to install platform tools v1.41. I want the v1.42 to be installed.
Do I have to upgrade to solana v2? Thanks!

@LucasSte
Copy link

LucasSte commented Jan 9, 2025

Hi @lukenguyen0191,
Yes, the easiest way would be to upgrade the CLI tools to v2.0 or v2.1.

@Maverick9081
Copy link

Maverick9081 commented Jan 10, 2025

Hey Guys , I'm having this issue for a while in my solana native program , Things that were previously are now not working . I tried Every versions of CLI as well.

The error stems from here in CLI 1.18.18

pub fn assert_account(key1: &Pubkey, key2: &Pubkey) -> Result<(), Error> {
    if cmp_pubkeys(key1, key2) {
        Ok(())
    } else {
        Err(PublicKeyMismatch.into())
    }
}

If i use 2.1.0, Then
I get bunch of these errors during compilation

Error: A function call in method _ZN15steward_program9processor9Processor7process17hf8cde772daf326d9E overwrites values in the frame. Please, decrease stack usage or remove parameters from the call.The function call may cause undefined behavior during execution.

Error: A function call in method _ZN15steward_program9processor9Processor7process17hf8cde772daf326d9E overwrites values in the frame. Please, decrease stack usage or remove parameters from the call.The function call may cause undefined behavior during execution.

and some error from spl-token crate

   Compiling spl-token-2022 v4.0.0
Error: Function _ZN14spl_token_20229extension21confidential_transfer12verify_proof30verify_transfer_with_fee_proof17hbae36c8691128c3bE Stack offset of 4392 exceeded max offset of 4096 by 296 bytes, please minimize large stack variables. Estimated function frame size: 4472 bytes. Exceeding the maximum stack offset may cause undefined behavior during execution.

Error: A function call in method _ZN14spl_token_20229extension21confidential_transfer12verify_proof30verify_transfer_with_fee_proof17hbae36c8691128c3bE overwrites values in the frame. Please, decrease stack usage or remove parameters from the call.The function call may cause undefined behavior during execution.

My cargo.toml file

[dependencies]
solana-program = "2.0.0"
borsh = "1.5.1"
thiserror = "1.0.20"
spl-stake-pool = { features = ["no-entrypoint"], version = "2.0.0" }

@LucasSte
Copy link

Hi @Maverick9081,
The first error is likely a problem in a function on your contract. Rust mangles function names, so you may find the problem in a function called process.

The second error will disappear when you upgrade SPL-token to the newest version (6.0 or 7.0).

@Maverick9081
Copy link

Maverick9081 commented Jan 11, 2025

@LucasSte Thanks For the Reply!
The First error just started randomly, This was not the problem previously with 1.18.18 but suddenly I get this error, This program is live on main-net, there is no error in there, But i now i recompile it with same cli 1.18.18 , i get this runtime error.

i'm not using the spl-token it is being used in spl-stake-pool and Why there are suddenly this errors during compile time

Error: A function call in method _ZN15steward_program9processor9Processor7process17hf8cde772daf326d9E overwrites values in the frame. Please, decrease stack usage or remove parameters from the call.The function call may cause undefined behavior during execution.

Error: A function call in method _ZN15steward_program9processor9Processor7process17hf8cde772daf326d9E overwrites values in the frame. Please, decrease stack usage or remove parameters from the call.The function call may cause undefined behavior during execution.

when they were previously non-existent

@LucasSte
Copy link

LucasSte commented Jan 13, 2025

i'm not using the spl-token it is being used in spl-stake-pool

You need to update spl-stake-pool then.

and Why there are suddenly this errors during compile time

The existing SBF architecture has a limitation of 4096 bytes of frame space per function. The compiler present in old versions of platform tools was generating potentially wrong code without warning users about it. I patched it so that the error is always present when there is a problem with your contract..

The 4096 bytes limitation will go away very soon with solana-foundation/solana-improvement-documents#166. When we finally release it, you won't encounter any more problems building your contracts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants