Envoy Segfaults #6744

Closed

kiranisaac opened this issue Apr 29, 2019 · 9 comments · Fixed by #6785

kiranisaac commented Apr 29, 2019

Title: Envoy is segfaulting

Description:
Envoy is segfaulting. I am sending HTTP/1 GET requests. Envoy has gRPC-JSON transcoding set up, and the backend service is a gRPC service.

I am running envoyproxy/envoy-alpine-debug:v1.10.0 in a Docker container on a VM.

Repro steps:

[2019-04-29 19:13:28.720][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #29: [0x7f9f0deefcf8] 
[2019-04-29 19:13:28.720][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #28: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::_FUN() [0xfa90e5] 
[2019-04-29 19:13:28.717][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #27: Envoy::Server::WorkerImpl::threadRoutine() [0xa8d10e] 
[2019-04-29 19:13:28.715][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #26: Envoy::Event::DispatcherImpl::run() [0xa9226a] 
[2019-04-29 19:13:28.712][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #25: event_base_loop [0xdca5c8] 
[2019-04-29 19:13:28.709][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #24: event_process_active [0xdc89bf] 
[2019-04-29 19:13:28.706][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #23: event_process_active_single_queue.isra.31 [0xdc8489] 
[2019-04-29 19:13:28.703][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #22: Envoy::Event::FileEventImpl::assignEvents()::{lambda()#1}::_FUN() [0xa92bea] 
[2019-04-29 19:13:28.701][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #21: Envoy::Network::ConnectionImpl::onFileEvent() [0xa98f2a] 
[2019-04-29 19:13:28.698][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #20: Envoy::Network::ConnectionImpl::onReadReady() [0xa9881a] 
[2019-04-29 19:13:28.695][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #19: Envoy::Network::FilterManagerImpl::onContinueReading() [0xa9bfac] 
[2019-04-29 19:13:28.692][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #18: Envoy::Http::CodecClient::CodecReadFilter::onData() [0xbf0f9d] 
[2019-04-29 19:13:28.690][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #17: Envoy::Http::CodecClient::onData() [0xbf0dc1] 
[2019-04-29 19:13:28.687][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #16: Envoy::Http::Http2::ConnectionImpl::dispatch() [0xc5d6a7] 
[2019-04-29 19:13:28.684][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #15: nghttp2_session_mem_recv [0xc66a5b] 
[2019-04-29 19:13:28.681][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #14: nghttp2_session_on_data_received [0xc61f20] 
[2019-04-29 19:13:28.679][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #13: Envoy::Http::Http2::ConnectionImpl::onFrameReceived() [0xc5c4a1] 
[2019-04-29 19:13:28.676][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #12: Envoy::Http::ConnectionManagerImpl::ActiveStream::encodeData() [0xc4c36b] 
[2019-04-29 19:13:28.673][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #11: Envoy::Http::ConnectionManagerImpl::ActiveStreamFilterBase::commonHandleAfterDataCallback() [0xc499aa] 
[2019-04-29 19:13:28.670][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #10: Envoy::Http::ConnectionManagerImpl::ActiveStreamFilterBase::commonHandleBufferData() [0xc40170] 
[2019-04-29 19:13:28.668][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #9: Envoy::Http::ConnectionManagerImpl::ActiveStreamEncoderFilter::responseDataTooLarge() [0xc40347] 
[2019-04-29 19:13:28.665][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #8: Envoy::Http::ConnectionManagerImpl::doEndStream() [0xc3fa40] 
[2019-04-29 19:13:28.662][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #7: Envoy::Http::ConnectionManagerImpl::doDeferredStreamDestroy() [0xc3f91e] 
[2019-04-29 19:13:28.659][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #6: Envoy::Router::Filter::cleanup() [0xc02c9f] 
[2019-04-29 19:13:28.657][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #5: Envoy::Router::Filter::UpstreamRequest::~UpstreamRequest() [0xc0b601] 
[2019-04-29 19:13:28.654][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #4: Envoy::Router::Filter::UpstreamRequest::~UpstreamRequest() [0xc0b48f] 
[2019-04-29 19:13:28.651][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #3: Envoy::Extensions::AccessLoggers::File::FileAccessLog::log() [0xe3bc85] 
[2019-04-29 19:13:28.648][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #2: _ZNK5Envoy9AccessLog13FormatterImpl6formatB5cxx11ERKNS_4Http9HeaderMapES5_S5_RKNS_10StreamInfo10StreamInfoE [0xd5945b] 
[2019-04-29 19:13:28.645][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #1: _ZNK5Envoy9AccessLog23ResponseHeaderFormatter6formatB5cxx11ERKNS_4Http9HeaderMapES5_S5_RKNS_10StreamInfo10StreamInfoE [0xd5aa08] 
[2019-04-29 19:13:28.642][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #0: [0x7f9f0defa4b0] 
[2019-04-29 19:13:28.642][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:69] Backtrace (use tools/stack_decode.py to get line numbers): 
[2019-04-29 19:13:28.642][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:81] Caught Segmentation fault, suspect faulting address 0x35d0150 
[2019-04-29 19:13:27.809][1][info][config] [source/server/listener_manager_impl.cc:1006] all dependencies initialized. starting workers 

Envoy config:


# specify an admin endpoint for Envoy
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 10000 }

static_resources:
  listeners:
  - name: listener1
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
      # We want Envoy to support SSL.
      - tls_context:
              common_tls_context:
                alpn_protocols: h2,http/1.1
                tls_certificates:
                  # certificate chain - these could be self signed certificates
                  - certificate_chain:
                      filename: /certs/server.crt
                    private_key:
                      filename: /certs/server.key
        # The filters we want to add
        filters:
          - name: envoy.http_connection_manager
            config:
              stat_prefix: grpc_json
              codec_type: AUTO
              # idle time for the connection from google load balancer
              idle_timeout: 720s
              # drain timeout for the connection from the google load balancer
              drain_timeout: 140s
              # stream idle timeout for the connection from the google load balancer
              stream_idle_timeout: 720s
              server_name: myserver
              route_config:
                name: local_route
                virtual_hosts:
                - name: local_service
                  domains: ["*"]
                  routes:
                  - match:
                      prefix: "/"
                    route:
                      # this cluster name should be defined in the clusters section
                      cluster: my_service
                      timeout: 110s
                      retry_policy:
                        # retry only on connection failures
                        retry_on: connect-failure,refused-stream
                        num_retries: 2
                  - match:
                      # this is the grpc route. Grpc requests are actually http2 requests to the
                      # fully qualified name of the service.
                      # This should match the name of the service in the proto file
                      prefix: "/com.my.service/"
                    route:
                      timeout: 110s
                      # this cluster name should be defined in the clusters section
                      cluster: my_service
                      retry_policy:
                        # retry only on connection failures
                        retry_on: connect-failure,refused-stream
                        num_retries: 2
                # we would like to remove all three headers, but Envoy won't remove the grpc-status header
                response_headers_to_remove: ["grpc-encoding", "grpc-accept-encoding"]
              http_filters:
              # configure envoy to respond to health check
              - name: envoy.health_check
                config:
                    # we want Envoy to handle the health check
                    pass_through_mode: false
                    cluster_min_healthy_percentages:
                      # we require 100% of cluster 'my_service' to be healthy
                      my_service:
                        value: 100
                    headers:
                    - name: ":path"
                      exact_match: "/health"
              # we want envoy to accept json body and support REST endpoints
              - name: envoy.grpc_json_transcoder
                config:
                  # Provide the descriptor file generated by protobuf
                  proto_descriptor: "/etc/envoy/pcs-model.desc"
                  # Which services Envoy needs to support
                  services: ["com.my.service"]
                  print_options:
                    add_whitespace: true
                    always_print_primitive_fields: false
                    always_print_enums_as_ints: false
                    preserve_proto_field_names: true

              - name: envoy.router
                config:
                    suppress_envoy_headers: true
                    upstream_log:
                        name: envoy.file_access_log
                        filter:
                            not_health_check_filter: {}
                        config:
                            path: /host/tmp/request_log.txt

  # backends backing our envoy
  clusters:
  # this must match the cluster name provided in the routes section
  - name: my_service
    connect_timeout: 1.25s
    # static route
    type: static
    http2_protocol_options: {}
    # localhost is our cluster
    hosts:
    - socket_address:
        address: 127.0.0.1
        port_value: 8000
    # configure health check for the cluster
    health_checks:
    # health check section deleted

stats_sinks:
  - name: "envoy.statsd"
    config:
      address:
        socket_address:
          protocol: "UDP"
          address: "127.0.0.1"
          port_value: "8125"

stats_flush_interval:
  seconds: 60

watchdog:
  kill_timeout: 90s
  multikill_timeout: 90s

@kiranisaac (Author)

Update: Increasing the per_connection_buffer_limit_bytes fixed the issue. I still think Envoy shouldn't segfault in this case.
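
For reference, per_connection_buffer_limit_bytes can be set on the listener (capping buffering on the downstream connection) and on the cluster (capping buffering on the upstream connection). A minimal sketch of the placement follows; the 10 MiB value is illustrative only, not the reporter's actual setting:

static_resources:
  listeners:
  - name: listener1
    # cap for downstream (client-facing) connection buffers; Envoy's default is 1 MiB
    per_connection_buffer_limit_bytes: 10485760
    # ... rest of the listener config unchanged ...
  clusters:
  - name: my_service
    # cap for upstream (backend-facing) connection buffers; Envoy's default is 1 MiB
    per_connection_buffer_limit_bytes: 10485760
    # ... rest of the cluster config unchanged ...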

@mattklein123 (Member)

cc @lizan.

@lizan (Member) commented May 1, 2019

Looks like a flow control issue? cc @alyssawilk

@alyssawilk (Contributor)

At first glance: we're getting data from upstream, we go over the limit, send an error response, and call doDeferredStreamDestroy, which triggers router cleanup and deletes the upstream request while we're still under the stack of the upstream request. That seems pretty sketchy from a lifetime perspective. I'll have to poke around tomorrow and see what changed and/or if I can reproduce it. I know we have tests where we go over the limit and fail, and I'm not sure what's different between this setup and those tests such that they pass and this doesn't.

alyssawilk self-assigned this May 1, 2019

@alyssawilk (Contributor)

Oh, I think I know what happens, and it happens when we sendLocalReply after receiving response headers for a gRPC response.

When the router decodes headers, it latches a pointer to the upstream headers, then hands off ownership to the HCM.
If the HCM does sendLocalReply after that, the sendLocalReply is_grpc branch creates gRPC-compatible response headers, overwriting the headers the HCM has and thus invalidating the router filter's pointer.

We don't have a response-data-too-large test which has both access logs (accessing the headers after they were destroyed) and gRPC (doing the header overwrite), but I'll get started on one.

@mattklein123 we could fix either by updating the headers in place for the gRPC response path, or by making response headers accessible via callbacks and not having the router latch. I feel like the latter is more future-proof. Thoughts?

@mattklein123 (Member)

> @mattklein123 we could fix either by updating the headers in place for the gRPC response path, or by making response headers accessible via callbacks and not having the router latch. I feel like the latter is more future-proof. Thoughts?

Yeah, I've wanted to get rid of all header latching for quite some time and have headers/trailers only be accessible via callbacks, like the body. More work, but definitely the much better change IMO.

mattklein123 added this to the 1.11.0 milestone May 2, 2019

@kiranisaac (Author) commented May 2, 2019

@alyssawilk / @mattklein123: For now we have per_connection_buffer_limit_bytes set to 10MB (in both the listeners and clusters sections). Is this the right workaround? Is this a bad idea (performance/memory)? Do we need to set it in both places?

Additionally, is this setting the buffer size or just an upper bound on the buffer size? Our payload size depends on the page size requested by the user, so we don't know in advance how much buffer space is required.

@alyssawilk (Contributor)

Having a large per-connection buffer limit basically makes it easier for your server to OOM, either accidentally (by sending many large responses) or intentionally (if malicious clients try to DoS you).

If you're hitting the response-too-large callback, it means you're using buffering filters (where headers will not be sent to the user until Envoy has received and buffered the whole response) and some of your responses are too large, so if the bug were fixed you'd be failing to proxy your own responses. Figuring out how large your responses actually are and capping the buffers there may be optimal.
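
Following that advice, a sketch of a tighter cap; the ~2 MiB maximum response size is an assumed figure, and the 3 MiB limit is just an illustrative headroom choice:

clusters:
- name: my_service
  # assumed: largest buffered/transcoded response is ~2 MiB, so 3 MiB leaves headroom
  per_connection_buffer_limit_bytes: 3145728  # 3 MiB rather than the 10 MB workaround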

@lizan (Member) commented May 2, 2019 via email

alyssawilk added a commit that referenced this issue May 3, 2019
… to upstream access logs (#6785)

Risk Level: Medium (will have a perf effect for folks using access logs)
Testing: regression integration test
Docs Changes: n/a
Release Notes: n/a
Fixes #6744

Signed-off-by: Alyssa Wilk <[email protected]>
jeffpiazza-google pushed a commit to jeffpiazza-google/envoy that referenced this issue May 3, 2019
… to upstream access logs (envoyproxy#6785)

Risk Level: Medium (will have a perf effect for folks using access logs)
Testing: regression integration test
Docs Changes: n/a
Release Notes: n/a
Fixes envoyproxy#6744

Signed-off-by: Alyssa Wilk <[email protected]>

Signed-off-by: Jeff Piazza <[email protected]>