
Towards better tracking of gateway performance #5783

Open
whyrusleeping opened this issue Nov 19, 2018 · 8 comments
Labels: kind/enhancement, topic/gateway

Comments

@whyrusleeping
Member

We need to start tracking gateway response latencies a bit better. Doing so will give us some really valuable information about where our efforts will be best applied in making ipfs faster.

For every request that comes in, unless we already have it cached, we generally spend some amount of time searching for peers who have the content, and then some amount of time connecting to said peers, and then finally some amount of time actually fetching that content.

The average request consists of a single hash. In this case, the first node gets resolved by ResolvePath. All of the content routing work and swarm dialing work happens during the execution of a single 'dag.Get' call, and I suspect this is where the bulk of the latency comes from. Within dag.Get this work is further abstracted down into bitswap, which internally searches for peers, connects to them, and then requests the data for each request.

My approach would be to pass some sort of request ID down along with the context, hoping it makes it all the way to bitswap, and then use it to tie logged timing information (how much time is spent in which activity) back to a given request.
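
As a rough illustration of the idea (these helper names are hypothetical, not existing kubo APIs), the request ID could ride along on the context like this:

```go
package gateway

import "context"

// reqIDKey is an unexported key type so other packages can't collide with it.
type reqIDKey struct{}

// WithRequestID returns a child context carrying the gateway request ID.
// The gateway handler would call this once per incoming HTTP request.
func WithRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, reqIDKey{}, id)
}

// RequestID recovers the request ID deep inside dag.Get / bitswap, so the
// timing log entries emitted there can be tied back to the originating request.
func RequestID(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(reqIDKey{}).(string)
	return id, ok
}
```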

cc @Stebalien @hannahhoward @frrist

@frrist
Member

frrist commented Nov 19, 2018

This sounds like something that could be addressed with tracing, which go-log currently supports via opentracing.

We could first instrument the methods we want to collect timing information from with Start and Finish calls. This will give us the duration of each method's execution, along with any relevant metadata, such as peer IDs and CIDs. These logs will be viewable via the ipfs log tail command, or by using a tracer like the go-jaeger-tracer plugin for ipfs.
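
As a rough sketch of what that instrumentation could look like (using the opentracing-go API directly, which go-log wraps; resolveNode and the tag names are made up for illustration):

```go
package gateway

import (
	"context"

	cid "github.com/ipfs/go-cid"
	opentracing "github.com/opentracing/opentracing-go"
)

// resolveNode stands in for a method we want timing data from
// (e.g. the work behind ResolvePath or a dag.Get call).
func resolveNode(ctx context.Context, c cid.Cid) error {
	// Start a span (a child of any span already on ctx) and finish it when
	// the method returns; the span's duration is the method's execution time.
	span, ctx := opentracing.StartSpanFromContext(ctx, "gateway.resolveNode")
	defer span.Finish()

	// Attach relevant metadata, such as the CID being fetched.
	span.SetTag("cid", c.String())

	// ... actual resolution work, passing ctx down so nested calls
	// (content routing, dialing, bitswap) can create child spans ...
	return nil
}
```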

If we want to collect traces spanning multiple peers (meaning we collect timing metrics of our local node and any other node participating in the request) we can use the serialize and deserialize methods to pass the context along with the request. This will likely require adding optional byte array fields to the DHT and Bitswap protobufs, and it will require all nodes we wish to collect traces from to be connected to the same tracer -- this will be tricky with respect to infra.
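
A sketch of what that serialization could look like using opentracing's binary carrier (the protobuf field that would carry these bytes is an assumption, not an existing Bitswap/DHT field):

```go
package tracingwire

import (
	"bytes"
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// injectSpan serializes the current span context into bytes that could be
// placed in a (hypothetical) optional field on a DHT or Bitswap message.
func injectSpan(ctx context.Context) ([]byte, error) {
	span := opentracing.SpanFromContext(ctx)
	if span == nil {
		return nil, nil // this request isn't being traced
	}
	var buf bytes.Buffer
	err := opentracing.GlobalTracer().Inject(span.Context(), opentracing.Binary, &buf)
	return buf.Bytes(), err
}

// extractSpan deserializes those bytes on the remote side so the receiving
// node's spans become children of the originating request's trace.
func extractSpan(data []byte) (opentracing.SpanContext, error) {
	return opentracing.GlobalTracer().Extract(opentracing.Binary, bytes.NewReader(data))
}
```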

frrist added a commit that referenced this issue Nov 19, 2018
- partially addresses #5783

License: MIT
Signed-off-by: frrist <[email protected]>
frrist added a commit to libp2p/go-libp2p-kad-dht that referenced this issue Nov 19, 2018
- partially addresses ipfs/kubo#5783

License: MIT
Signed-off-by: frrist <[email protected]>
@whyrusleeping
Member Author

@eefahy do you think we could run a jaeger tracer instance and feed it data from one of the gateways?

@eefahy

eefahy commented Nov 19, 2018

Yeah, absolutely. It might not be this week though unfortunately. Too swamped.

@lanzafame was just doing some of this tracing using OpenCensus vs. OpenTracing. Got any notes for us here, Adrian?

@Stebalien
Member

> If we want to collect traces spanning multiple peers (meaning we collect timing metrics of our local node and any other node participating in the request) we can use the serialize and deserialize methods to pass the context along with the request. This will likely require adding optional byte array fields to the DHT and Bitswap protobufs, and it will require all nodes we wish to collect traces from to be connected to the same tracer -- this will be tricky with respect to infra.

We're not going to want to do this on the gateways (or any public node for that matter). However, we should be using tracing to get timing like this.


But remember, if this is going to happen, someone needs to own it and see it all the way to the end.

@Stebalien added the kind/enhancement label Nov 19, 2018
@lanzafame
Contributor

I would suggest OpenCensus (OC) as it provides both tracing and metrics, with the tracing output being compatible with OpenTracing collectors.

Also, with OC you can record request latencies on the gateways, as well as several other HTTP server stats, via the ochttp plugin; then, when testing, you can turn tracing on as well to get the spans. The metrics can be exported to Prometheus, which we already have set up. OC also allows packages to provide stats views that are collected only if the user registers them -- see here for more details.
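
As a rough sketch of that wiring (the handler names, port, and exporter import path are assumptions to adapt to wherever the gateway mux actually lives):

```go
package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/stats/view"
)

func main() {
	// Register the server views ochttp provides (request count, latency
	// distribution, bytes in/out); nothing is collected unless registered.
	if err := view.Register(ochttp.DefaultServerViews...); err != nil {
		log.Fatal(err)
	}

	// Export the collected views to Prometheus, which is already being scraped.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "ipfs_gw"})
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(pe)

	mux := http.NewServeMux()
	mux.Handle("/metrics", pe)              // Prometheus scrape endpoint
	mux.Handle("/", http.NotFoundHandler()) // stand-in for the real gateway handler

	// Wrapping the mux in ochttp.Handler records stats (and trace spans, when
	// a sampler is configured) for every request.
	log.Fatal(http.ListenAndServe(":8080", &ochttp.Handler{Handler: mux}))
}
```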

@eefahy

eefahy commented Nov 20, 2018

> But remember, if this is going to happen, someone needs to own it and see it all the way to the end.

Yes please! This should also be tied to some project planning, as whatever we learn from this tracing (now and in the future) will need to be resourced and prioritized as a fix as well.

@b5
Contributor

b5 commented Dec 5, 2018

This might be obvious, but it's worth mentioning: if you're planning on attaching request tracing to the gateways, using a sampling technique of some sort would be smart. I've stupidly blown things up by tracing all requests on even a low-volume service. Turns out those contexts get very large quickly and clog the tubes. Tuning what to sample is the tougher bit. Something in the vicinity of 1% of total request volume makes for a sane starting point and usually gives lots of data for initial thinking.
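
For reference, a minimal sketch of ~1% head-based sampling with OpenCensus (OpenTracing tracers such as Jaeger expose an equivalent sampler setting):

```go
package main

import "go.opencensus.io/trace"

func init() {
	// Sample roughly 1% of new traces; spans whose parent was sampled stay
	// sampled, so distributed traces remain complete end to end.
	trace.ApplyConfig(trace.Config{
		DefaultSampler: trace.ProbabilitySampler(0.01),
	})
}
```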

It'd be really nice to be able to attach tracing to specific peer IDs: gateways would always enable tracing for requests that touch a given peer ID. Project maintainers could provision peer IDs known to the Ops team and configure their local machines to send tracing spans via a secure backchannel to the gateway loggers. That way you get a real picture of how IPFS is behaving in production, combining tracing details across the network. If this existed we'd definitely configure our own gateways to do the same, logging to our tracing systems, and would happily share our findings.

@Stebalien
Member

Really, we want to trace requests from the web. However, we can probably find quite a few issues by just tracing our own traffic (by IP) instead of tracing everyone's traffic.
