SIGSEGV on startup with large value of --max-stats if total stats size is greater than 2^32 #3463

Closed
ggreenway opened this issue May 22, 2018 · 11 comments

Comments

@ggreenway
Contributor

Repro steps:
envoy --max-stats 10000000000

Call Stack:

#0  0x00000000007784cc in Envoy::BlockMemoryHashSet<Envoy::Stats::RawStatData>::initialize (this=0x255ffa0, options=...)
    at bazel-out/k8-dbg/bin/source/common/common/_virtual_includes/block_memory_hash_set_lib/common/common/block_memory_hash_set.h:232
#1  0x0000000000777074 in Envoy::BlockMemoryHashSet<Envoy::Stats::RawStatData>::BlockMemoryHashSet (this=0x255ffa0, options=..., init=true, memory=0x7fffc78760c8 "")
    at bazel-out/k8-dbg/bin/source/common/common/_virtual_includes/block_memory_hash_set_lib/common/common/block_memory_hash_set.h:73
#2  0x0000000000773825 in Envoy::Server::HotRestartImpl::HotRestartImpl (this=0x25c2400, options=...) at source/server/hot_restart_impl.cc:127
#3  0x00000000004333c6 in Envoy::MainCommonBase::MainCommonBase (this=0x25b8998, options=...) at source/exe/main_common.cc:52
#4  0x0000000000433acc in Envoy::MainCommon::MainCommon (this=0x25b8500, argc=3, argv=0x7fffffffe438) at source/exe/main_common.cc:97
#5  0x0000000000422b6e in std::make_unique<Envoy::MainCommon, int&, char**&> () at /usr/include/c++/5/bits/unique_ptr.h:765
#6  0x0000000000419c9e in main (argc=3, argv=0x7fffffffe438) at source/exe/main.cc:19
@ggreenway ggreenway added the bug label May 22, 2018
@ggreenway
Contributor Author

Various bits of code in BlockMemoryHashSet use uint32_t for the size, so this is the result of an undetected overflow.
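As a minimal sketch of the kind of overflow described here (this is not Envoy's code; `totalBytes32` and `totalBytes64` are hypothetical helpers, and the input values are chosen only for illustration), a byte count computed in 32-bit arithmetic silently wraps modulo 2^32, while the same product in 64 bits does not:

```cpp
#include <cstdint>

// Hypothetical helper, not taken from BlockMemoryHashSet: computes a region
// size using 32-bit arithmetic. Unsigned overflow is well defined in C++ and
// wraps modulo 2^32, so no error is reported.
uint32_t totalBytes32(uint32_t num_entries, uint32_t bytes_per_entry) {
  return num_entries * bytes_per_entry;
}

// The same computation in 64 bits keeps the full product for these inputs.
uint64_t totalBytes64(uint64_t num_entries, uint64_t bytes_per_entry) {
  return num_entries * bytes_per_entry;
}
```

For example, 65536 entries of 65536 bytes each is exactly 2^32 bytes, so the 32-bit version returns 0 while the 64-bit version returns 4294967296. Code that then treats the wrapped value as the real region size will read or write outside the memory it thinks it owns.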

@ggreenway
Contributor Author

cc @jmarantz

@jmarantz
Contributor

I assume the action-item is just to check that and print a reasonable error...or do you think we should be using larger ints for the offsets in the structure?

@ggreenway
Contributor Author

I think checking/printing is good enough for now.

@jmarantz
Contributor

can you assign to me?

@mattklein123 mattklein123 added this to the 1.7.0 milestone May 23, 2018
@jmarantz
Contributor

Actually, the system can't handle 1G cells, and it should be able to, even though the number of bytes may be >4G. So I think there are two parts to this bug:

  • provide access to up to 4G stats
  • error out when asked for more than that

@jmarantz
Contributor

jmarantz commented May 24, 2018

In my workspace, I've changed some uint32 to size_t that were artificially limiting the # of stats to <4G, but I think that doesn't matter in practice, at least on my machine, because I run out of memory before I run out of addressable stats.

I can handle about 200M stats, but this bug is about 10G stats, which requires 1.6T of shared memory. My workstation doesn't have that much, but creating the shared memory and mmapping it both succeed. It only fails once you access enough pages, and the failure indication (for me) is SIGBUS. That failure happens when I ask for 250M stats, a count that can easily be indexed in 32 bits.

I think we should just declare a max # of stats of at most 100M, or maybe some smaller reasonable number, independent of available memory. What should that number be?

@jmarantz
Contributor

Fun fact: if you are debugging some boundary conditions with an egregious amount of shared memory, make sure you get a clean run with a sane size, otherwise your workstation will be left with zero bytes of shared memory available, and a lot of things will not work, in and out of Envoy...

@ggreenway
Contributor Author

In my case, I have much larger-than-default names. I think I'm setting max-obj-name-len to 384, and the crashing value for max-stats was 10M. These are much smaller than the numbers you're talking about. Was your testing after converting uint32-->size_t? If so, that may be right. I think the original bug was just an overflow of a uint32.

@jmarantz
Contributor

Yeah, there's a real limitation imposed by the impl, which I have fixed: some of the byte-size values were represented by uint32_t. Without my change I segv on 100M stats (the limit I just set), and with it I don't. There should probably also be a limit on max-stats * max-obj-name-len, so that the product is likely to fit into shared memory.
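A limit on the product could be checked along these lines. This is only a sketch under assumptions: `statsRegionFits` and its parameters are hypothetical names, not Envoy's actual option validation, and the real per-stat size includes more than the name length.

```cpp
#include <cstdint>

// Hypothetical check, not Envoy's actual validation code. Returns true when
// max_stats entries of bytes_per_stat bytes each fit within max_region_bytes.
bool statsRegionFits(uint64_t max_stats, uint64_t bytes_per_stat,
                     uint64_t max_region_bytes) {
  if (max_stats == 0 || bytes_per_stat == 0) {
    return true; // An empty region always fits; also avoids dividing by zero.
  }
  // Divide instead of multiplying so the check itself cannot overflow.
  return max_stats <= max_region_bytes / bytes_per_stat;
}
```

The caller would print an error and exit when this returns false, rather than letting the size computation wrap. Comparing against a quotient is a design choice here: the obvious `max_stats * bytes_per_stat <= limit` can overflow for very large inputs and then pass the check by accident.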

@jmarantz
Contributor

With the existing system, 25M stats works:

bazel-bin/source/exe/envoy-static --config-path $CFG --mode init_only --max-stats 25000000

and 30M stats fails:

bazel-bin/source/exe/envoy-static --config-path $CFG --mode init_only --max-stats 30000000

because it's 160 bytes per entry, and 30M * 160 = 4.8e9, which overflows 32 bits, whereas 25M * 160 = 4e9, which fits in 32 bits (2^32 is about 4.29e9). Fixing the offsets resolves this; I have that change pending.
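The arithmetic can be verified at compile time. This is an illustrative sketch only: the 160-byte figure is the one quoted in this comment, and the constant and function names are made up for the example.

```cpp
#include <cstdint>

// Per-entry size as quoted above; the names below are hypothetical.
constexpr uint64_t kBytesPerEntry = 160;
constexpr uint64_t kUint32Limit = uint64_t{1} << 32; // 4294967296

// Total region size in bytes for a given number of stats, in 64-bit math.
constexpr uint64_t regionBytes(uint64_t num_stats) {
  return num_stats * kBytesPerEntry;
}

// Largest stat count whose region size is still representable in a uint32_t.
constexpr uint64_t maxEntriesIn32Bits() {
  return (kUint32Limit - 1) / kBytesPerEntry;
}

static_assert(regionBytes(25000000) < kUint32Limit, "25M stats fit in 32 bits");
static_assert(regionBytes(30000000) >= kUint32Limit, "30M stats overflow 32 bits");
```

Under these assumptions the cutoff lands at 26,843,545 stats, which is consistent with 25M working and 30M failing.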
