
m3coordinator network usage disparity #949

Open
BertHartm opened this issue Sep 26, 2018 · 22 comments

@BertHartm (Contributor) commented Sep 26, 2018

I moved m3coordinator to a dedicated box, so that prometheus is writing to it over the network, and then it's writing to 6 m3db instances over the network.

I would expect network traffic in and out to be roughly equivalent, but I'm seeing about 9.5 MB/s coming into the coordinator and 165 MB/s going back out, which seems surprisingly disproportionate.

I'm running 0.4.1 as released

[attached screenshot]

@richardartoul (Contributor)

@BertHartm Thanks for filing the issue! Could you provide a few more details, specifically:

  1. What is your replication factor?
  2. What are your namespace configurations (i.e., do you have more than one)?

I would expect <NETWORK_OUT> to be at least <REPLICATION_FACTOR> * <NETWORK_IN>, which would explain some of the discrepancy, but there might be more going on here.

@BertHartm (Contributor, Author)

Sure, the replication factor is 2, and I only have 1 (unaggregated) namespace.
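
As a rough sanity check against the bound above: with RF = 2 and about 9.5 MB/s in, it only predicts at least 2 * 9.5 ≈ 19 MB/s out, while the observed 165 MB/s is roughly 8-9x beyond that, so replication alone doesn't explain the gap.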

I'll be spinning up a few other configurations today, so I'll try to post more statistics as I have them.

Other anecdotal stuff that may be relevant: this setup produces more outbound traffic than when I was running the coordinator as a sidecar on the prometheus box, but it also results in a lower prometheus_remote_storage_dropped_samples_total, so I think it might just be better networking making delivery more reliable.

@richardartoul (Contributor) commented Sep 27, 2018

Honestly, now that I think about this a bit more, I think it makes sense once you consider the read workload. If you're reading from this m3coordinator instance, then it is returning uncompressed data, but the data it receives from M3DB is compressed.

So M3DB --> M3Coordinator is compressed (counts towards received), while
M3Coordinator --> Prometheus for reads is uncompressed (counts towards transmitted).

I was only considering the write workload in my original response.

@BertHartm (Contributor, Author)

This is a write-only workload, though; I have a different coordinator instance set up for reads.

@BertHartm (Contributor, Author)

Some more data points from other machines I spun up today. In all cases I have multiple Prometheus instances going to 1 m3coordinator, which writes to 6 dbnodes. All are on 0.4.3 with RF=2 and 1 namespace. The Prometheus instances come in pairs because they're set up as HA pairs.

A: 2 Prometheus: m3coordinator has 4 MB/s in, just under 100 MB/s out
B: 4 Prometheus (2 HA pairs): m3coordinator has 5.5 MB/s in, ~130 MB/s out
C: 2 Prometheus: 370 kB/s in, 8.15 MB/s out

Before I send traffic to the coordinator, I'm seeing about 70k in, 50k out.

It looks like these 3 setups have an out:in ratio closer to 25:1, which is a bit higher than I was seeing earlier.

@arnikola (Collaborator)

Hey, thanks for putting this together!

We've had a quick investigation and we've found the following:

  1. Traffic coming in from Prometheus is snappy encoded, while internal coordinator -> m3db traffic isn't, which can account for at least a 2x increase, depending on how well your data compresses.

  2. We currently expand tags and add them to the series ID, which duplicates the largest part of the incoming timeseries.

  3. Incoming Prometheus metrics are bundled together per series: i.e. all datapoints with the same tags arrive from Prometheus as one list with a single set of tags, but when the coordinator sends them to the db, there is one set of tags per datapoint.

We're currently looking at ways to make some easy wins here and will keep this ticket updated (the sketch below gives a rough feel for how these factors multiply).
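
Purely as an illustration of how those factors can stack up, here is a back-of-envelope sketch; every constant in it is an assumed, made-up number (not measured from M3 or Prometheus):

```go
// Back-of-envelope sketch only: every constant below is an assumed,
// illustrative number, just to show how the factors listed above multiply.
package main

import "fmt"

func main() {
	const (
		replicationFactor = 2.0   // each write is fanned out to RF db nodes
		snappyRatio       = 2.5   // assumed snappy compression ratio on the Prometheus side
		tagBytes          = 200.0 // assumed encoded size of one series' tag/label set
		pointBytes        = 16.0  // timestamp + value
		pointsPerSeries   = 5.0   // assumed datapoints per series in one remote-write batch
	)

	// Incoming: one tag set per series plus its datapoints, snappy compressed.
	in := (tagBytes + pointsPerSeries*pointBytes) / snappyRatio

	// Outgoing: tags re-sent once per datapoint, duplicated again inside the
	// expanded series ID, uncompressed, and fanned out to every replica.
	out := replicationFactor * pointsPerSeries * (2*tagBytes + pointBytes)

	fmt.Printf("rough out/in amplification under these assumptions: %.0fx\n", out/in)
}
```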

@arnikola (Collaborator)

My bad on letting this fall off for a while:

Our findings at the time were that we currently must duplicate the IDs, since we allow additional tags to be added while keeping the IDs the same. We discussed a solution where we'd send through an ID that would then be interpreted as the default ID on the m3 side, but unfortunately that fell by the wayside. We'll sync up early next year to see if we can build that functionality, and plan on encoding/batching coordinator metrics to try to drive the network load down.

arnikola self-assigned this Jan 2, 2019
@arnikola (Collaborator)

To follow up: this PR allows us to generate IDs in a better fashion, and after some consultation with the db folks, once it is approved I may go ahead and make that logic the default when a nil ID is passed in, which should address point 2 at least.

@comradedraganov

Has anyone looked into addressing this further? I'm experiencing a similar issue, with a ~30x increase in network traffic in my coordinators, as shown below (6 dbnodes, RF 3).
[attached screenshot]

@arnikola (Collaborator)

Hey, it's still planned/ongoing work; unfortunately it's not super high priority at the moment, so it's kind of on the back burner.

@comradedraganov

@arnikola Thanks, I might look at picking up one of the unaddressed points if I can (unless people have already made progress on them? I couldn't find anything related).

@yangxiangyu

[attached screenshot]

Same issue here.

@arnikola (Collaborator)

> @arnikola Thanks, I might look at picking up one of the unaddressed points if I can (unless people have already made progress on them? I couldn't find anything related).

Yeah, nothing ongoing or planned for the near future, unfortunately... If you were interested in having a look, I'd say pushing default ID generation (i.e. when identID here is nil) down to the DBNodes is probably the easiest and highest-impact change (it should reduce output size by close to 50%).
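
As a purely hypothetical illustration of that idea (this is not M3's actual code, types, or wire format), a dbnode could regenerate the default ID deterministically from the tags whenever no explicit ID arrives, so the duplicated ID bytes never have to cross the network:

```go
// Hypothetical sketch only: derive a default series ID from the sorted tags
// when the writer did not send one.
package idgen

import (
	"sort"
	"strings"
)

// Tag is a single name/value label pair.
type Tag struct {
	Name, Value string
}

// DefaultID returns the supplied ID if present, otherwise a deterministic ID
// built from the sorted tag pairs. A real implementation would also need to
// escape delimiters so that different tag sets can never collide.
func DefaultID(id string, tags []Tag) string {
	if id != "" {
		return id // the writer supplied a custom ID; keep it unchanged
	}
	sorted := append([]Tag(nil), tags...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	var b strings.Builder
	for _, t := range sorted {
		b.WriteString(t.Name)
		b.WriteByte('=')
		b.WriteString(t.Value)
		b.WriteByte(',')
	}
	return b.String()
}
```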

@fingon (Contributor) commented Nov 22, 2019

The low-hanging fruit we identified was simply adding LZ4 compression to the whole in-cluster protocol (including the m3coordinator -> m3db write path). With it, we saw ~92% bandwidth savings.

With a ~constant input rate, looking at ifconfig on the m3coordinator node:

3.5h run of 0.14.2 - Prometheus write traffic in is about 1/15 of the output sent to m3db:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1460
        RX packets 602404  bytes 397168881 (378.7 MiB)
        TX packets 4546274  bytes 6260099487 (5.8 GiB)

15h run of 0.14.2 + LZ4 - Prometheus write traffic in is about 77% of the output sent to m3db:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1460
        RX packets 2777579  bytes 1704216244 (1.5 GiB)
        TX packets 3712787  bytes 2214657932 (2.0 GiB)

Our design was to wrap the net.Conns used in tchannel and transparently compress/decompress all of the traffic, with the following logic (a rough sketch of this kind of wrapper is below):

  • gather up to a 4 MB block of data to be compressed
  • flush earlier if:
    • 5ms have passed since the last write, or
    • 30ms have passed since the first write that has not yet been sent

We're not sure about the CPU impact yet, but having this at least as an option would be very nice for us, because we pay by the byte for intra-AZ traffic.
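
A minimal sketch of that kind of wrapper, assuming the github.com/pierrec/lz4/v4 package and the thresholds described above (this is not the actual patch, and the peer would need a matching LZ4 reader for the inbound direction):

```go
// A minimal sketch, not the actual patch: wrap a net.Conn so that all
// outbound bytes pass through an LZ4 stream, flushed on size/time thresholds.
package lz4conn

import (
	"net"
	"sync"
	"time"

	"github.com/pierrec/lz4/v4"
)

const (
	maxBlock     = 4 << 20               // flush once ~4 MB has been buffered
	idleFlush    = 5 * time.Millisecond  // flush if no write for 5ms
	maxHoldFlush = 30 * time.Millisecond // flush 30ms after the first unsent write
)

type compressedConn struct {
	net.Conn // reads still go straight to the underlying conn in this sketch

	mu         sync.Mutex
	zw         *lz4.Writer
	buffered   int       // bytes written since the last flush
	firstWrite time.Time // time of the first unflushed write
	lastWrite  time.Time // time of the most recent write
}

// Wrap returns a connection whose writes are LZ4-compressed and flushed
// according to the thresholds above.
func Wrap(c net.Conn) net.Conn {
	cc := &compressedConn{Conn: c, zw: lz4.NewWriter(c)}
	go cc.flushLoop()
	return cc
}

func (c *compressedConn) Write(p []byte) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	now := time.Now()
	if c.buffered == 0 {
		c.firstWrite = now
	}
	c.lastWrite = now

	n, err := c.zw.Write(p) // compressed and buffered, not yet on the wire
	c.buffered += n
	if err == nil && c.buffered >= maxBlock {
		err = c.flushLocked()
	}
	return n, err
}

// flushLoop periodically checks the time-based flush conditions. The ticker
// is never stopped here; a real implementation would tie it to the
// connection's lifetime.
func (c *compressedConn) flushLoop() {
	for range time.Tick(time.Millisecond) {
		c.mu.Lock()
		if c.buffered > 0 &&
			(time.Since(c.lastWrite) >= idleFlush || time.Since(c.firstWrite) >= maxHoldFlush) {
			_ = c.flushLocked()
		}
		c.mu.Unlock()
	}
}

func (c *compressedConn) flushLocked() error {
	c.buffered = 0
	return c.zw.Flush() // pushes the compressed block down to the real conn
}
```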

@robskillington (Collaborator) commented Nov 23, 2019

@fingon those are really fantastic results; we would be more than happy to get your change into master. We can even do a lot of the tests, etc. Happy to help out as much as we can to make it easy to get it in.

@fingon (Contributor) commented Jan 31, 2020

#2079 will eventually fix this. While probably slightly less efficient than my experimental branch (that we use in prod, cough cough), the PR's version will be backward and forward compatible and use less memory than our version.

@genericgithubuser

This would be very helpful to have available, either as referenced here or in #2079, since the huge bandwidth usage between our coordinator-write pool and m3db pool is currently our most heavily used resource.

@genericgithubuser

To improve the situation here, I had to go with the LZ4 route that @fingon did. It might be worth adding his approach until the longer-term options are completed.

Just to show some actual results: we're seeing a huge drop in network traffic (no config changes for inbound metrics).

[attached screenshot]

@fingon (Contributor) commented Sep 18, 2020

Why was this closed? @gibbscullen, it still isn't addressed; e.g. #2079 is not merged in.

@gibbscullen (Collaborator)

It was part of an effort to clean up stale issues. Will re-open to continue investigating.

gibbscullen reopened this Sep 18, 2020
@genericgithubuser

Since it looks like #2079 has been closed out without being implemented, would it make sense to improve things by adding @fingon's LZ4 compression approach? Is that something that would be accepted if an MR was put in with those changes, or are there any thoughts on next steps for this one?

@fingon (Contributor) commented Oct 10, 2020

We have since moved on to a Snappy-based solution with configuration in our own use. However, we can open a PR about that too if it is desired.
