Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add jaeger/opentracing tracing to m3query #1321

Merged
merged 4 commits into from
Feb 21, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 80 additions & 0 deletions docs/operational_guide/monitoring.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
## Metrics

TODO: document how to retrieve metrics for M3DB components.

## Logs

TODO: document how to retrieve logs for M3DB components.

## Tracing

M3DB is integrated with [opentracing](https://opentracing.io/) to provide
insight into query performance and errors.

### Configuration
Currently, only [Jaeger](https://www.jaegertracing.io/) is supported as a backend.

To enable it, set tracing.backend to "jaeger" (see also our
[sample local config](https://github.com/m3db/m3/blob/master/src/query/config/m3query-local-etcd.yml):

```
tracing:
andrewmains12 marked this conversation as resolved.
Show resolved Hide resolved
backend: jaeger # enables jaeger with default configs
jaeger:
# optional configuration for jaeger -- see
# https://github.com/jaegertracing/jaeger-client-go/blob/master/config/config.go#L37
# for options
andrewmains12 marked this conversation as resolved.
Show resolved Hide resolved
...
```

Jaeger can be run locally with docker as described in
https://www.jaegertracing.io/docs/1.9/getting-started/.

The default configuration will report traces via udp to localhost:6831;
using the all-in-one jaeger container, they will be accessible at

http://localhost:16686

N.B.: for production workloads, you will almost certainly want to use
sampler.type=remote with
[adaptive sampling](https://www.jaegertracing.io/docs/1.10/sampling/#adaptive-sampler)
for Jaeger, as write volumes are likely orders of magnitude higher than
read volumes in most timeseries systems.

#### Alternative backends

If you'd like additional backends, we'd love to support them!

File an issue against M3 and we can work with you on how best to add
the backend. The first time's going to be a little rough--opentracing
unfortunately doesn't support Go plugins (yet--see
https://github.com/opentracing/opentracing-go/issues/133), and `glide`'s
update model means that adding dependencies directly will update
*everything*, which isn't ideal for an isolated dependency change.
These problems are all solvable though,
and we'll work with you to make it happen!

### Use cases

Note: all URLs assume a local jaeger setup as described in Jaeger's
[docs](https://www.jaegertracing.io/docs/1.9/getting-started/).


#### Finding slow queries

To find prom queries longer than <threshold>, filter for `minDuration >= <threshold>` on
`operation="GET /api/v1/query_range"`.

Sample query:
http://localhost:16686/search?end=1548876672544000&limit=20&lookback=1h&maxDuration&minDuration=1ms&operation=GET%20%2Fapi%2Fv1%2Fquery_range&service=m3query&start=1548873072544000
andrewmains12 marked this conversation as resolved.
Show resolved Hide resolved

#### Finding queries with errors

Search for `error=true` on `operation="GET /api/v1/query_range"`
http://localhost:16686/search?operation=GET%20%2Fapi%2Fv1%2Fquery_range&service=m3query&tags=%7B%22error%22%3A%22true%22%7D

#### Finding 500 (Internal Server Error) responses

Search for `http.status_code=500`.

http://localhost:16686/search?limit=20&lookback=24h&maxDuration&minDuration&operation=GET%20%2Fapi%2Fv1%2Fquery_range&service=m3query&start=1548802430108000&tags=%7B"http.status_code"%3A"500"%7D
andrewmains12 marked this conversation as resolved.
Show resolved Hide resolved
51 changes: 38 additions & 13 deletions glide.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 16 additions & 2 deletions glide.yaml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
package: github.com/m3db/m3
import:
- package: github.com/m3db/m3x
version: e98ec326dd7b4d4bc97390c709dacf73d49a0bfc
version: b66c9c466c4726e3c9b47b1f837abbbe0f14be81
vcs: git
subpackages:
- checked
Expand Down Expand Up @@ -64,7 +64,7 @@ import:
version: f85c78b1dd998214c5f2138155b320a4a43fbe36

- package: github.com/opentracing/opentracing-go
version: 855519783f479520497c6b3445611b05fc42f009
version: 1.0.2

- package: github.com/spaolacci/murmur3
version: 9f5d223c60793748f04a9d5b4b4eacddfc1f755d
Expand Down Expand Up @@ -201,6 +201,20 @@ import:
subpackages:
- capnslog

# START_JAEGER_DEPS
- package: github.com/uber/jaeger-lib
version: ^1.5.0

- package: github.com/uber/jaeger-client-go
version: ^2.7.0

- package: github.com/opentracing-contrib/go-stdlib
# Pin this on recommendation of the repo (no stable release yet). Still arguably better than rewriting
# the same code.
version: 77df8e8e70b403c6b13c0fffaa4867c9044ff4e9

# END_JAEGER_DEPS

# To avoid conflicting packages not resolving the latest GRPC
- package: google.golang.org/grpc
version: ~1.7.3
Expand Down
3 changes: 3 additions & 0 deletions src/cmd/services/m3query/config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,9 @@ type Configuration struct {
// Metrics configuration.
Metrics instrument.MetricsConfiguration `yaml:"metrics"`

// Tracing configures opentracing. If not provided, tracing is disabled.
Tracing instrument.TracingConfiguration `yaml:"tracing"`
andrewmains12 marked this conversation as resolved.
Show resolved Hide resolved

// Clusters is the DB cluster configurations for read, write and
// query endpoints.
Clusters m3.ClustersStaticConfiguration `yaml:"clusters"`
Expand Down
10 changes: 8 additions & 2 deletions src/query/api/v1/handler/prometheus/native/read.go
Original file line number Diff line number Diff line change
Expand Up @@ -32,9 +32,12 @@ import (
"github.com/m3db/m3/src/query/executor"
"github.com/m3db/m3/src/query/models"
"github.com/m3db/m3/src/query/ts"
"github.com/m3db/m3/src/query/util/httperrors"
"github.com/m3db/m3/src/query/util/logging"
xhttp "github.com/m3db/m3/src/x/net/http"
opentracingutil "github.com/m3db/m3/src/query/util/opentracing"

opentracingext "github.com/opentracing/opentracing-go/ext"
opentracinglog "github.com/opentracing/opentracing-go/log"
"github.com/uber-go/tally"
"go.uber.org/zap"
)
Expand Down Expand Up @@ -124,7 +127,7 @@ func (h *PromReadHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {

result, params, respErr := h.ServeHTTPWithEngine(w, r, h.engine)
if respErr != nil {
xhttp.Error(w, respErr.Err, respErr.Code)
httperrors.ErrorWithReqInfo(w, r, respErr.Code, respErr.Err)
return
}

Expand Down Expand Up @@ -167,6 +170,9 @@ func (h *PromReadHandler) ServeHTTPWithEngine(

result, err := read(ctx, engine, h.tagOpts, w, params)
if err != nil {
sp := opentracingutil.SpanFromContextOrNoop(ctx)
sp.LogFields(opentracinglog.Error(err))
opentracingext.Error.Set(sp, true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you have sp.Finish() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, no; that function is intended to extract the span from context if it exists, and return a noop/dummy if not. In the first case, the calling function is already in charge of calling Finish(); in the second, it shouldn't be needed. Does that make sense?

I updated the function to be opentracingutil.SpanFromContextOrNoop, making it return a noop span, which should ensure that it doesn't need to be closed even if called with a context without a span.

logger.Error("unable to fetch data", zap.Error(err))
h.promReadMetrics.fetchErrorsServer.Inc(1)
return nil, emptyReqParams, &RespError{Err: err, Code: http.StatusInternalServerError}
Expand Down
12 changes: 12 additions & 0 deletions src/query/api/v1/handler/prometheus/native/read_common.go
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,9 @@ import (
"github.com/m3db/m3/src/query/models"
"github.com/m3db/m3/src/query/parser/promql"
"github.com/m3db/m3/src/query/ts"
opentracingutil "github.com/m3db/m3/src/query/util/opentracing"

opentracinglog "github.com/opentracing/opentracing-go/log"
)

func read(
Expand All @@ -45,6 +48,15 @@ func read(
ctx, cancel := context.WithTimeout(reqCtx, params.Timeout)
andrewmains12 marked this conversation as resolved.
Show resolved Hide resolved
defer cancel()

sp := opentracingutil.SpanFromContextOrNoop(ctx)
sp.LogFields(
opentracinglog.String("params.query", params.Query),
opentracingutil.Time("params.start", params.Start),
opentracingutil.Time("params.end", params.End),
opentracingutil.Time("params.now", params.Now),
opentracingutil.Duration("params.step", params.Step),
)
andrewmains12 marked this conversation as resolved.
Show resolved Hide resolved

opts := &executor.EngineOptions{}
// Detect clients closing connections
handler.CloseWatcher(ctx, cancel, w)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,8 @@ import (
"github.com/m3db/m3/src/query/api/v1/handler/prometheus"
"github.com/m3db/m3/src/query/executor"
"github.com/m3db/m3/src/query/models"
"github.com/m3db/m3/src/query/util/httperrors"
"github.com/m3db/m3/src/query/util/logging"
xhttp "github.com/m3db/m3/src/x/net/http"

"go.uber.org/zap"
)
Expand Down Expand Up @@ -69,7 +69,7 @@ func (h *PromReadInstantHandler) ServeHTTP(w http.ResponseWriter, r *http.Reques
logger := logging.WithContext(ctx)
params, rErr := parseInstantaneousParams(r, h.timeoutOpts)
if rErr != nil {
xhttp.Error(w, rErr.Inner(), rErr.Code())
httperrors.ErrorWithReqInfo(w, r, rErr.Code(), rErr)
return
}

Expand All @@ -80,7 +80,7 @@ func (h *PromReadInstantHandler) ServeHTTP(w http.ResponseWriter, r *http.Reques
result, err := read(ctx, h.engine, h.tagOpts, w, params)
if err != nil {
logger.Error("unable to fetch data", zap.Error(err))
xhttp.Error(w, err, http.StatusBadRequest)
httperrors.ErrorWithReqInfo(w, r, http.StatusBadRequest, rErr)
return
}

Expand Down
30 changes: 22 additions & 8 deletions src/query/api/v1/httpd/handler.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ package httpd
import (
"encoding/json"
"errors"
"fmt"
"net/http"
_ "net/http/pprof" // needed for pprof handler registration
"time"
Expand Down Expand Up @@ -52,6 +53,8 @@ import (
"github.com/m3db/m3/src/x/net/http/cors"

"github.com/gorilla/mux"
"github.com/opentracing-contrib/go-stdlib/nethttp"
"github.com/opentracing/opentracing-go"
"github.com/uber-go/tally"
)

Expand Down Expand Up @@ -102,13 +105,7 @@ func NewHandler(
) (*Handler, error) {
r := mux.NewRouter()

// apply middleware. Just CORS for now, but we could add more here as needed.
withMiddleware := &cors.Handler{
Handler: r,
Info: &cors.Info{
"*": true,
},
}
handlerWithMiddleware := applyMiddleware(r, opentracing.GlobalTracer())

var timeoutOpts = &prometheus.TimeoutOpts{}
if embeddedDbCfg == nil || embeddedDbCfg.Client.FetchTimeout == nil {
Expand All @@ -123,7 +120,7 @@ func NewHandler(

h := &Handler{
router: r,
handler: withMiddleware,
handler: handlerWithMiddleware,
storage: downsamplerAndWriter.Storage(),
downsamplerAndWriter: downsamplerAndWriter,
engine: engine,
Expand All @@ -139,6 +136,23 @@ func NewHandler(
return h, nil
}

func applyMiddleware(base *mux.Router, tracer opentracing.Tracer) http.Handler {
withMiddleware := http.Handler(&cors.Handler{
Handler: base,
Info: &cors.Info{
"*": true,
},
})

// apply jaeger middleware, which will start a span
// for each incoming request
withMiddleware = nethttp.Middleware(tracer, withMiddleware,
nethttp.OperationNameFunc(func(r *http.Request) string {
return fmt.Sprintf("%s %s", r.Method, r.URL.Path)
}))
return withMiddleware
}

// RegisterRoutes registers all http routes.
func (h *Handler) RegisterRoutes() error {
// Wrap requests with response time logging as well as panic recovery.
Expand Down
Loading