Skip to content

Commit

Permalink
Add jaeger/opentracing tracing to m3query (#1321)
Browse files Browse the repository at this point in the history
This diff adds opentracing support to m3query. Spans are started for every HTTP request (courtesy middleware from `github.com/opentracing-contrib/go-stdlib`). I then have child spans started for compilation + execution, with drilldowns into each node of the query. Sample traces are in the pull request (#1321)

Query is `sum(increase(coordinator_engine_datapoints{type="fetched"}[10s]))`

![jaeger_trace_no_children](https://user-images.githubusercontent.com/830218/51855774-0dcaac80-22fc-11e9-9b9b-b33396d57ede.png)

One implementation note: node instrumentation is a bit less clean than I would have liked; since all nodes call their children before exiting, i.e.:

```
nextBlock := doStuff()
n.controller.Process(nextBlock)
```
I instrumented them like:

```
startSpan()
nextBlock := doStuff()
stopSpan()

n.controller.Process(nextBlock)
```
The alternative would be to do:

```
startSpan()
nextBlock := doStuff()

n.controller.Process(nextBlock)
stopSpan()
```
but that gives unnecessary nesting imo (see PR for other screenshot).
  • Loading branch information
andrewmains12 authored Feb 21, 2019
1 parent 1012004 commit fa7efee
Show file tree
Hide file tree
Showing 39 changed files with 1,083 additions and 184 deletions.
80 changes: 80 additions & 0 deletions docs/operational_guide/monitoring.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
## Metrics

TODO: document how to retrieve metrics for M3DB components.

## Logs

TODO: document how to retrieve logs for M3DB components.

## Tracing

M3DB is integrated with [opentracing](https://opentracing.io/) to provide
insight into query performance and errors.

### Configuration
Currently, only [Jaeger](https://www.jaegertracing.io/) is supported as a backend.

To enable it, set tracing.backend to "jaeger" (see also our
[sample local config](https://github.com/m3db/m3/blob/master/src/query/config/m3query-local-etcd.yml):

```
tracing:
backend: jaeger # enables jaeger with default configs
jaeger:
# optional configuration for jaeger -- see
# https://github.com/jaegertracing/jaeger-client-go/blob/master/config/config.go#L37
# for options
...
```

Jaeger can be run locally with docker as described in
https://www.jaegertracing.io/docs/1.9/getting-started/.

The default configuration will report traces via udp to localhost:6831;
using the all-in-one jaeger container, they will be accessible at

http://localhost:16686

N.B.: for production workloads, you will almost certainly want to use
sampler.type=remote with
[adaptive sampling](https://www.jaegertracing.io/docs/1.10/sampling/#adaptive-sampler)
for Jaeger, as write volumes are likely orders of magnitude higher than
read volumes in most timeseries systems.

#### Alternative backends

If you'd like additional backends, we'd love to support them!

File an issue against M3 and we can work with you on how best to add
the backend. The first time's going to be a little rough--opentracing
unfortunately doesn't support Go plugins (yet--see
https://github.com/opentracing/opentracing-go/issues/133), and `glide`'s
update model means that adding dependencies directly will update
*everything*, which isn't ideal for an isolated dependency change.
These problems are all solvable though,
and we'll work with you to make it happen!

### Use cases

Note: all URLs assume a local jaeger setup as described in Jaeger's
[docs](https://www.jaegertracing.io/docs/1.9/getting-started/).


#### Finding slow queries

To find prom queries longer than <threshold>, filter for `minDuration >= <threshold>` on
`operation="GET /api/v1/query_range"`.

Sample query:
http://localhost:16686/search?end=1548876672544000&limit=20&lookback=1h&maxDuration&minDuration=1ms&operation=GET%20%2Fapi%2Fv1%2Fquery_range&service=m3query&start=1548873072544000

#### Finding queries with errors

Search for `error=true` on `operation="GET /api/v1/query_range"`
http://localhost:16686/search?operation=GET%20%2Fapi%2Fv1%2Fquery_range&service=m3query&tags=%7B%22error%22%3A%22true%22%7D

#### Finding 500 (Internal Server Error) responses

Search for `http.status_code=500`.

http://localhost:16686/search?limit=20&lookback=24h&maxDuration&minDuration&operation=GET%20%2Fapi%2Fv1%2Fquery_range&service=m3query&start=1548802430108000&tags=%7B"http.status_code"%3A"500"%7D
51 changes: 38 additions & 13 deletions glide.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 16 additions & 2 deletions glide.yaml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
package: github.com/m3db/m3
import:
- package: github.com/m3db/m3x
version: e98ec326dd7b4d4bc97390c709dacf73d49a0bfc
version: b66c9c466c4726e3c9b47b1f837abbbe0f14be81
vcs: git
subpackages:
- checked
Expand Down Expand Up @@ -64,7 +64,7 @@ import:
version: f85c78b1dd998214c5f2138155b320a4a43fbe36

- package: github.com/opentracing/opentracing-go
version: 855519783f479520497c6b3445611b05fc42f009
version: 1.0.2

- package: github.com/spaolacci/murmur3
version: 9f5d223c60793748f04a9d5b4b4eacddfc1f755d
Expand Down Expand Up @@ -201,6 +201,20 @@ import:
subpackages:
- capnslog

# START_JAEGER_DEPS
- package: github.com/uber/jaeger-lib
version: ^1.5.0

- package: github.com/uber/jaeger-client-go
version: ^2.7.0

- package: github.com/opentracing-contrib/go-stdlib
# Pin this on recommendation of the repo (no stable release yet). Still arguably better than rewriting
# the same code.
version: 77df8e8e70b403c6b13c0fffaa4867c9044ff4e9

# END_JAEGER_DEPS

# To avoid conflicting packages not resolving the latest GRPC
- package: google.golang.org/grpc
version: ~1.7.3
Expand Down
3 changes: 3 additions & 0 deletions src/cmd/services/m3query/config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,9 @@ type Configuration struct {
// Metrics configuration.
Metrics instrument.MetricsConfiguration `yaml:"metrics"`

// Tracing configures opentracing. If not provided, tracing is disabled.
Tracing instrument.TracingConfiguration `yaml:"tracing"`

// Clusters is the DB cluster configurations for read, write and
// query endpoints.
Clusters m3.ClustersStaticConfiguration `yaml:"clusters"`
Expand Down
10 changes: 8 additions & 2 deletions src/query/api/v1/handler/prometheus/native/read.go
Original file line number Diff line number Diff line change
Expand Up @@ -32,9 +32,12 @@ import (
"github.com/m3db/m3/src/query/executor"
"github.com/m3db/m3/src/query/models"
"github.com/m3db/m3/src/query/ts"
"github.com/m3db/m3/src/query/util/httperrors"
"github.com/m3db/m3/src/query/util/logging"
xhttp "github.com/m3db/m3/src/x/net/http"
opentracingutil "github.com/m3db/m3/src/query/util/opentracing"

opentracingext "github.com/opentracing/opentracing-go/ext"
opentracinglog "github.com/opentracing/opentracing-go/log"
"github.com/uber-go/tally"
"go.uber.org/zap"
)
Expand Down Expand Up @@ -124,7 +127,7 @@ func (h *PromReadHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {

result, params, respErr := h.ServeHTTPWithEngine(w, r, h.engine)
if respErr != nil {
xhttp.Error(w, respErr.Err, respErr.Code)
httperrors.ErrorWithReqInfo(w, r, respErr.Code, respErr.Err)
return
}

Expand Down Expand Up @@ -167,6 +170,9 @@ func (h *PromReadHandler) ServeHTTPWithEngine(

result, err := read(ctx, engine, h.tagOpts, w, params)
if err != nil {
sp := opentracingutil.SpanFromContextOrNoop(ctx)
sp.LogFields(opentracinglog.Error(err))
opentracingext.Error.Set(sp, true)
logger.Error("unable to fetch data", zap.Error(err))
h.promReadMetrics.fetchErrorsServer.Inc(1)
return nil, emptyReqParams, &RespError{Err: err, Code: http.StatusInternalServerError}
Expand Down
12 changes: 12 additions & 0 deletions src/query/api/v1/handler/prometheus/native/read_common.go
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,9 @@ import (
"github.com/m3db/m3/src/query/models"
"github.com/m3db/m3/src/query/parser/promql"
"github.com/m3db/m3/src/query/ts"
opentracingutil "github.com/m3db/m3/src/query/util/opentracing"

opentracinglog "github.com/opentracing/opentracing-go/log"
)

func read(
Expand All @@ -45,6 +48,15 @@ func read(
ctx, cancel := context.WithTimeout(reqCtx, params.Timeout)
defer cancel()

sp := opentracingutil.SpanFromContextOrNoop(ctx)
sp.LogFields(
opentracinglog.String("params.query", params.Query),
opentracingutil.Time("params.start", params.Start),
opentracingutil.Time("params.end", params.End),
opentracingutil.Time("params.now", params.Now),
opentracingutil.Duration("params.step", params.Step),
)

opts := &executor.EngineOptions{}
// Detect clients closing connections
handler.CloseWatcher(ctx, cancel, w)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,8 @@ import (
"github.com/m3db/m3/src/query/api/v1/handler/prometheus"
"github.com/m3db/m3/src/query/executor"
"github.com/m3db/m3/src/query/models"
"github.com/m3db/m3/src/query/util/httperrors"
"github.com/m3db/m3/src/query/util/logging"
xhttp "github.com/m3db/m3/src/x/net/http"

"go.uber.org/zap"
)
Expand Down Expand Up @@ -69,7 +69,7 @@ func (h *PromReadInstantHandler) ServeHTTP(w http.ResponseWriter, r *http.Reques
logger := logging.WithContext(ctx)
params, rErr := parseInstantaneousParams(r, h.timeoutOpts)
if rErr != nil {
xhttp.Error(w, rErr.Inner(), rErr.Code())
httperrors.ErrorWithReqInfo(w, r, rErr.Code(), rErr)
return
}

Expand All @@ -80,7 +80,7 @@ func (h *PromReadInstantHandler) ServeHTTP(w http.ResponseWriter, r *http.Reques
result, err := read(ctx, h.engine, h.tagOpts, w, params)
if err != nil {
logger.Error("unable to fetch data", zap.Error(err))
xhttp.Error(w, err, http.StatusBadRequest)
httperrors.ErrorWithReqInfo(w, r, http.StatusBadRequest, rErr)
return
}

Expand Down
30 changes: 22 additions & 8 deletions src/query/api/v1/httpd/handler.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ package httpd
import (
"encoding/json"
"errors"
"fmt"
"net/http"
_ "net/http/pprof" // needed for pprof handler registration
"time"
Expand Down Expand Up @@ -52,6 +53,8 @@ import (
"github.com/m3db/m3/src/x/net/http/cors"

"github.com/gorilla/mux"
"github.com/opentracing-contrib/go-stdlib/nethttp"
"github.com/opentracing/opentracing-go"
"github.com/uber-go/tally"
)

Expand Down Expand Up @@ -102,13 +105,7 @@ func NewHandler(
) (*Handler, error) {
r := mux.NewRouter()

// apply middleware. Just CORS for now, but we could add more here as needed.
withMiddleware := &cors.Handler{
Handler: r,
Info: &cors.Info{
"*": true,
},
}
handlerWithMiddleware := applyMiddleware(r, opentracing.GlobalTracer())

var timeoutOpts = &prometheus.TimeoutOpts{}
if embeddedDbCfg == nil || embeddedDbCfg.Client.FetchTimeout == nil {
Expand All @@ -123,7 +120,7 @@ func NewHandler(

h := &Handler{
router: r,
handler: withMiddleware,
handler: handlerWithMiddleware,
storage: downsamplerAndWriter.Storage(),
downsamplerAndWriter: downsamplerAndWriter,
engine: engine,
Expand All @@ -139,6 +136,23 @@ func NewHandler(
return h, nil
}

func applyMiddleware(base *mux.Router, tracer opentracing.Tracer) http.Handler {
withMiddleware := http.Handler(&cors.Handler{
Handler: base,
Info: &cors.Info{
"*": true,
},
})

// apply jaeger middleware, which will start a span
// for each incoming request
withMiddleware = nethttp.Middleware(tracer, withMiddleware,
nethttp.OperationNameFunc(func(r *http.Request) string {
return fmt.Sprintf("%s %s", r.Method, r.URL.Path)
}))
return withMiddleware
}

// RegisterRoutes registers all http routes.
func (h *Handler) RegisterRoutes() error {
// Wrap requests with response time logging as well as panic recovery.
Expand Down
Loading

0 comments on commit fa7efee

Please sign in to comment.