
Towards better tracking of gateway performance #5783

Open
whyrusleeping opened this issue Nov 19, 2018 · 8 comments
Labels: kind/enhancement, topic/gateway

Comments

@whyrusleeping
Member

We need to start tracking gateway response latencies a bit better. Doing so will give us some really valuable information about where our efforts will be best applied in making ipfs faster.

For every request that comes in, unless we already have it cached, we generally spend some amount of time searching for peers who have the content, and then some amount of time connecting to said peers, and then finally some amount of time actually fetching that content.

The average request consists of a single hash. In this case, the first node gets resolved by ResolvePath. All of the content routing work and swarm dialing work happens during the execution of a single 'dag.Get' call, and I suspect this is where the bulk of the latency comes from. Within dag.Get this work is further abstracted down into bitswap, which internally searches for peers, connects to them, and then requests the data for each request.

My approach would be to pass some sort of request ID down along with the context, hoping it makes it all the way to bitswap, and then use it to tie logged timing information (how much time is spent in which activity) back to a given request.
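
As a rough illustration of the idea (these helper names are hypothetical, not existing kubo APIs), the request ID could ride along on the context like this:

```go
package gateway

import "context"

// reqIDKey is an unexported key type so other packages can't collide with it.
type reqIDKey struct{}

// WithRequestID returns a child context carrying the gateway request ID.
// The gateway handler would call this once per incoming HTTP request.
func WithRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, reqIDKey{}, id)
}

// RequestID recovers the request ID deep inside dag.Get / bitswap, so the
// timing log entries emitted there can be tied back to the originating request.
func RequestID(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(reqIDKey{}).(string)
	return id, ok
}
```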

cc @Stebalien @hannahhoward @frrist

@frrist
Member

frrist commented Nov 19, 2018

This sounds like something that could be addressed with tracing, which go-log currently supports via opentracing.

We could first instrument the methods we want to collect timing information from with Start and Finish calls. This will give us the duration of each method's execution, along with any relevant metadata, such as peer IDs and CIDs. These logs will be viewable via the ipfs log tail command, or by using a tracer like the go-jaeger-tracer plugin for ipfs.
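
As a rough sketch of what that instrumentation could look like (using the opentracing-go API directly, which go-log wraps; resolveNode and the tag names are made up for illustration):

```go
package gateway

import (
	"context"

	cid "github.com/ipfs/go-cid"
	opentracing "github.com/opentracing/opentracing-go"
)

// resolveNode stands in for a method we want timing data from
// (e.g. the work behind ResolvePath or a dag.Get call).
func resolveNode(ctx context.Context, c cid.Cid) error {
	// Start a span (a child of any span already on ctx) and finish it when
	// the method returns; the span's duration is the method's execution time.
	span, ctx := opentracing.StartSpanFromContext(ctx, "gateway.resolveNode")
	defer span.Finish()

	// Attach relevant metadata, such as the CID being fetched.
	span.SetTag("cid", c.String())

	// ... actual resolution work, passing ctx down so nested calls
	// (content routing, dialing, bitswap) can create child spans ...
	return nil
}
```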

If we want to collect traces spanning multiple peers (meaning we collect timing metrics of our local node and any other node participating in the request) we can use the serialize and deserialize methods to pass the context along with the request. This will likely require adding optional byte array fields to the DHT and Bitswap protobufs, and it will require all nodes we wish to collect traces from to be connected to the same tracer -- this will be tricky with respect to infra.
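
A sketch of what that serialization could look like using opentracing's binary carrier (the protobuf field that would carry these bytes is an assumption, not an existing Bitswap/DHT field):

```go
package tracingwire

import (
	"bytes"
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// injectSpan serializes the current span context into bytes that could be
// placed in a (hypothetical) optional field on a DHT or Bitswap message.
func injectSpan(ctx context.Context) ([]byte, error) {
	span := opentracing.SpanFromContext(ctx)
	if span == nil {
		return nil, nil // this request isn't being traced
	}
	var buf bytes.Buffer
	err := opentracing.GlobalTracer().Inject(span.Context(), opentracing.Binary, &buf)
	return buf.Bytes(), err
}

// extractSpan deserializes those bytes on the remote side so the receiving
// node's spans become children of the originating request's trace.
func extractSpan(data []byte) (opentracing.SpanContext, error) {
	return opentracing.GlobalTracer().Extract(opentracing.Binary, bytes.NewReader(data))
}
```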

frrist added a commit that referenced this issue Nov 19, 2018
- partially addresses #5783

License: MIT
Signed-off-by: frrist <[email protected]>
frrist added a commit to libp2p/go-libp2p-kad-dht that referenced this issue Nov 19, 2018
- partially addresses ipfs/kubo#5783

License: MIT
Signed-off-by: frrist <[email protected]>
@whyrusleeping
Member Author

@eefahy do you think we could run a jaeger tracer instance and feed it data from one of the gateways?

@eefahy

eefahy commented Nov 19, 2018

Yeah, absolutely. It might not be this week though unfortunately. Too swamped.

@lanzafame was just doing some of this tracing using OpenCensus vs. OpenTracing. Got any notes for us here, Adrian?

@Stebalien
Member

> If we want to collect traces spanning multiple peers (meaning we collect timing metrics of our local node and any other node participating in the request) we can use the serialize and deserialize methods to pass the context along with the request. This will likely require adding optional byte array fields to the DHT and Bitswap protobufs, and it will require all nodes we wish to collect traces from to be connected to the same tracer -- this will be tricky with respect to infra.

We're not going to want to do this on the gateways (or any public node for that matter). However, we should be using tracing to get timing like this.


But remember, if this is going to happen, someone needs to own it and see it all the way to the end.

@Stebalien added the kind/enhancement label Nov 19, 2018
@lanzafame
Contributor

I would suggest OpenCensus (OC) as it provides both tracing and metrics, with the tracing output being compatible with OpenTracing collectors.

Also, with OC you can record request latencies on the gateways, as well as several other HTTP server stats, via the ochttp plugin; then, when testing, you can turn tracing on as well to get the spans. The metrics can be exported to Prometheus, which we already have set up. OC also allows packages to provide stats views that are collected only if the user registers them -- see here for more details.
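
As a rough sketch of that wiring (the handler names, port, and exporter import path are assumptions to adapt to wherever the gateway mux actually lives):

```go
package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/stats/view"
)

func main() {
	// Register the server views ochttp provides (request count, latency
	// distribution, bytes in/out); nothing is collected unless registered.
	if err := view.Register(ochttp.DefaultServerViews...); err != nil {
		log.Fatal(err)
	}

	// Export the collected views to Prometheus, which is already being scraped.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "ipfs_gw"})
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(pe)

	mux := http.NewServeMux()
	mux.Handle("/metrics", pe)              // Prometheus scrape endpoint
	mux.Handle("/", http.NotFoundHandler()) // stand-in for the real gateway handler

	// Wrapping the mux in ochttp.Handler records stats (and trace spans, when
	// a sampler is configured) for every request.
	log.Fatal(http.ListenAndServe(":8080", &ochttp.Handler{Handler: mux}))
}
```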

@eefahy

eefahy commented Nov 20, 2018

> But remember, if this is going to happen, someone needs to own it and see it all the way to the end.

Yes please! This should also be tied to some project planning, as whatever we learn from this tracing (now and in the future) will need to be resourced and prioritized as a fix as well.

@b5
Contributor

b5 commented Dec 5, 2018

This might be obvious, but it's worth mentioning: if you're planning on attaching request tracing to the gateways, using a sampling technique of some sort would be smart. I've stupidly blown things up by tracing all requests on even a low-volume service. Turns out those contexts get very large quickly and clog the tubes. Tuning what to sample is the tougher bit. Something in the vicinity of 1% of total request volume makes for a sane starting point and usually gives lots of data for initial thinking.
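
For reference, a minimal sketch of ~1% head-based sampling with OpenCensus (OpenTracing tracers such as Jaeger expose an equivalent sampler setting):

```go
package main

import "go.opencensus.io/trace"

func init() {
	// Sample roughly 1% of new traces; spans whose parent was sampled stay
	// sampled, so distributed traces remain complete end to end.
	trace.ApplyConfig(trace.Config{
		DefaultSampler: trace.ProbabilitySampler(0.01),
	})
}
```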

It'd be really nice to be able to attach tracing to specific peer IDs: gateways would always enable tracing for requests that touch a given peer ID. Project maintainers could provision peer IDs known to the Ops team and configure their local machines to send tracing spans via a secure backchannel to the gateway loggers. That way you get a real picture of how IPFS is behaving in production, combining tracing details across the network. If this existed we'd definitely configure our own gateways to do the same, logging to our tracing systems, and would happily share our findings.

@Stebalien
Member

Really, we want to trace requests from the web. However, we can probably find quite a few issues by just tracing our own traffic (by IP) instead of tracing everyone's traffic.
