
Consider improving network survey schema #4169

Closed
marta-lokhova opened this issue Jan 29, 2024 · 8 comments · Fixed by #4275

Comments

@marta-lokhova
Contributor

Currently, peers respond to a network survey with the following information (a rough struct sketch follows the list):

    * `bytesRead`: The total number of bytes read from this peer.
    * `bytesWritten`: The total number of bytes written to this peer.
    * `duplicateFetchBytesRecv`: The number of bytes received that were duplicate transaction sets and quorum sets.
    * `duplicateFetchMessageRecv`: The count of duplicate transaction sets and quorum sets received from this peer.
    * `duplicateFloodBytesRecv`: The number of bytes received that were duplicate transactions and SCP votes.
    * `duplicateFloodMessageRecv`: The count of duplicate transactions and SCP votes received from this peer.
    * `messagesRead`: The total number of messages read from this peer.
    * `messagesWritten`: The total number of messages written to this peer.
    * `nodeId`: Node's public key.
    * `secondsConnected`: The total number of seconds this peer has been connected to the surveyed node.
    * `uniqueFetchBytesRecv`: The number of bytes received that were unique transaction sets and quorum sets.
    * `uniqueFetchMessageRecv`: The count of unique transaction sets and quorum sets received from this peer.
    * `uniqueFloodBytesRecv`: The number of bytes received that were unique transactions and SCP votes.
    * `uniqueFloodMessageRecv`: The count of unique transactions and SCP votes received from this peer.
    * `version`: stellar-core version.
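
As a rough illustration, the per-peer record above could be represented by a struct like the one below. Field names mirror the list; this is only a sketch, not the actual XDR definition used by stellar-core.

```cpp
#include <cstdint>
#include <string>

// Hypothetical per-peer survey record, mirroring the fields listed above.
struct PeerSurveyStats
{
    std::string nodeId;                 // node's public key
    uint64_t bytesRead;                 // total bytes read from this peer
    uint64_t bytesWritten;              // total bytes written to this peer
    uint64_t messagesRead;              // total messages read from this peer
    uint64_t messagesWritten;           // total messages written to this peer
    uint64_t secondsConnected;          // seconds this peer has been connected
    uint64_t uniqueFloodBytesRecv;      // unique txs and SCP votes, in bytes
    uint64_t duplicateFloodBytesRecv;   // duplicate txs and SCP votes, in bytes
    uint64_t uniqueFetchBytesRecv;      // unique tx sets and quorum sets, in bytes
    uint64_t duplicateFetchBytesRecv;   // duplicate tx sets and quorum sets, in bytes
    uint64_t uniqueFloodMessageRecv;    // unique txs and SCP votes, count
    uint64_t duplicateFloodMessageRecv; // duplicate txs and SCP votes, count
    uint64_t uniqueFetchMessageRecv;    // unique tx sets and quorum sets, count
    uint64_t duplicateFetchMessageRecv; // duplicate tx sets and quorum sets, count
    std::string version;                // stellar-core version string
};
```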

In practice, some of these metrics aren't particularly useful (for example, it's a bit hard to reason about the absolute number of messages/bytes given that the rate fluctuates significantly over time). In addition, we might be missing some key information about nodes on the network:

  • connection latency (to help assess quality of the connection)
  • whether node is a validator or a watcher
  • node's quorum set
  • basic high-level metrics that can give hints about overall node health: scheduler queue average waiting time, average block application latency, ledger age. Note that these should be aggregated over a reasonably long time period to smooth out noise (e.g. over 10 minutes).
  • maybe node's upgrade status? (to understand what the network intends to vote on)

We might want to introduce different request types (node health metrics are quite a bit different from connectivity stats, for example, plus we should keep response sizes sane).

Tagging as discussion to get the conversation started.

@ire-and-curses
Member

Taking a step back, what are the goals of the survey? I can think of several possible goals:

  • measuring performance: network latency / throughput / resource starvation / redundant traffic
  • measuring graph properties (e.g. topology structure, quorum memberships)
  • measuring node health: responsiveness, uptime, connection stability, churn rate
  • summing gross network stats: number of validators, number of watchers, number of older clients, number of alternative clients
  • using visible nodes and their network contribution as the tip of the iceberg to estimate the size of the hidden node membership

I'm personally most interested in network stats and graph properties. From a decentralisation perspective, it would be extremely valuable to understand the extent of the network, the number of validators and watchers, their versions, and the impact of those groups on the communications and stability of the overlay.

For example, I would like to be able to answer questions such as

  • how many potential voting entities exist on stellar?
  • how big is the stellar network compared to last year?
  • how many validators could the overlay reasonably support?
  • how close are we to that number?
  • what is the impact of watcher nodes on the network load?

Perhaps this might not be a goal best executed by the survey mechanism. Alternatives could include recursive crawlers or IP scanners. Would love to hear thoughts on this.

@bboston7
Contributor

bboston7 commented Feb 15, 2024

Perhaps this might not be a goal best executed by the survey mechanism. Alternatives could include recursive crawlers or IP scanners. Would love to hear thoughts on this.

One of the major benefits of using a survey mechanism built into stellar-core is the ability to reach nodes behind NATs. Recursive crawlers / IP scanners will miss NATed nodes that do not accept inbound connections. A quick look at prior survey results shows many nodes on the network have no inbound peers. Of course we don't know exactly why that is, but if it's largely due to NATs then recursive crawlers / IP scanners may paint a misleading picture of the network. Moreover, if individuals are running nodes on residential ISPs then it's very likely they're behind some kind of NAT, especially with ISP level CGNAT becoming more common.

Whether or not missing these nodes is important gets more at your question of what goal the survey is trying to achieve. I think measuring network health and decentralization is an important goal, which would require reaching as many nodes as possible.

@bboston7 self-assigned this Feb 21, 2024
@bboston7
Contributor

In practice, some of these metrics aren't particularly useful (for example, it's a bit hard to reason about the absolute number of messages/bytes given that the rate fluctuates significantly overtime)

I think these metrics become useful when they're defined over time slices. If we can see how much data every node ingested for the same window of time we can start reasoning about the differences between them better.

We might want to introduce different request types (node health metrics are quite a bit different from connectivity stats, for example, plus we should keep response sizes sane)

Another idea is to break request types down by the underlying object the metric is measured over. So far, all of the metrics we're talking about either concern a node, or a connection. When we perform a survey request, the node responds with information about itself, as well as information about a subset of its connections. If that subset of connections isn't the full set of connections, we query the node again to (hopefully) get the remaining peers. This causes the node to send the information about itself again!

If we always want all of the survey information but we also want to minimize the data sent, we could send one request for node data, and separate request(s) for connection data. To clear up the difference, here's what I'm thinking of adding for each type of data based on this thread and other conversations (a rough sketch of the split follows the lists). Note that most of the existing metrics in the ticket description are per-connection.

Additional per-connection data:

  • Average latency

Additional per-node data:

  • Number of connections added or dropped (per time slice). This will help to understand network churn.
  • Latency from surveyor. Measure time taken to receive a survey response after sending a request. This will help to analyze survey timeout parameters.
  • Node type (validator or watcher)
  • Node’s quorum set
  • Scheduler queue average waiting time
  • Average block application latency
  • Upgrade status
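
A sketch of what the split could look like, assuming hypothetical type names (ConnectionSurveyData, NodeSurveyData) rather than anything that exists in stellar-core today:

```cpp
#include <cstdint>
#include <string>

// Per-connection data: the existing per-peer counters plus average latency.
struct ConnectionSurveyData
{
    // existing per-peer counters (bytes/messages read/written, dup/unique
    // flood and fetch traffic, secondsConnected, ...) would live here
    uint64_t averageLatencyMs; // average round-trip latency to this peer
};

// Per-node data: reported once per surveyed node instead of being repeated
// in every survey-response message.
struct NodeSurveyData
{
    bool isValidator;                 // validator vs. watcher
    std::string quorumSet;            // node's quorum set (or a hash, if size matters)
    uint32_t connectionsAdded;        // per time slice, to estimate churn
    uint32_t connectionsDropped;      // per time slice, to estimate churn
    uint64_t surveyorLatencyMs;       // request-to-response time, measured by the surveyor
    uint64_t schedulerQueueAvgWaitMs; // scheduler queue average waiting time
    uint64_t avgLedgerApplyMs;        // average block application latency
    std::string upgradeStatus;        // node's upgrade status
};
```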

@MonsieurNicolas
Contributor

We should probably split out the work on that front:

  • data necessary to improve overlay itself --> great candidate for this
    • a very small subset of high level metrics that are a good proxy for node health can be useful here as this impacts overlay performance. Things like: SCP latency (first to self), if the node is in sync (or which ledger it's on at a specific timestamp).
      • scheduler queue wait time is not a high level metric as having good numbers there does not imply much for the node.
  • more detailed node health
    • I don't know if survey is the right way to move forward as we really need a lot of metrics to understand what is going on. We may want to consider something that can be applied to other systems SDF maintains (core, Horizon, Soroban-RPC, etc): something like an optional telemetry stream that scrapes the metrics endpoint and uploads data to a server somewhere (we can run that server). We can coordinate to get this done cross team.
  • other information
    • quorum set information -- this can get pretty large and without a clear goal on what we'd like to do with it, I am not sure it's worth doing at this time (that being said -- it could be used to understand the resilience of the network of watcher nodes from a consensus point of view)
    • upgrade information -- same thing, wrt goals. I am actually not sure that we'd want intent to be visible as it changes voting dynamics. On the voting front, I would like to see actual votes after a protocol upgrade though (something that should end up in archives, not survey) -- I am not sure the current SCP messages uploaded allow inferring votes.

note that anything that depends on clocks being synchronized will require estimating the clock skew somehow or the data will be noisy.

@bboston7
Contributor

note that anything that depends on clocks being synchronized will require estimating the clock skew somehow or the data will be noisy

Does stellar-core do any clock synchronization? It looks like stellar-core used to synchronize with NTP, but that functionality was removed. In poking around I didn't see whether we later added a different method for synchronization.

If there is no synchronization whatsoever then clock skews could be quite large. Another idea is to support surveys over time slices by broadcasting a start-survey-recording <nonce> message, then x minutes later broadcasting a stop-survey-recording <nonce> message and querying nodes about data during the <nonce> survey. This still isn't perfect as some nodes will receive the {start, stop}-survey-recording message before others, but if that difference is measured in milliseconds/seconds and the survey duration is measured in many minutes, then the data acquired during the relatively small discrepancy in start/stop times may average out to have a negligible impact on the results.
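
A rough sketch of the surveyor-side flow under this idea; the broadcast/request helpers are hypothetical placeholders, not existing stellar-core APIs:

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

using Nonce = uint32_t;

void broadcastStartRecording(Nonce nonce); // hypothetical overlay broadcast
void broadcastStopRecording(Nonce nonce);  // hypothetical overlay broadcast
void requestSurveyData(Nonce nonce);       // hypothetical survey data request

void runTimeSlicedSurvey(Nonce nonce, std::chrono::minutes sliceLength)
{
    // 1. Ask every reachable node to start accumulating data under `nonce`.
    broadcastStartRecording(nonce);

    // 2. Let the slice elapse. Small differences in when nodes receive the
    //    start/stop messages should be negligible relative to sliceLength.
    std::this_thread::sleep_for(sliceLength);

    // 3. Freeze the accumulators network-wide, then collect the results.
    broadcastStopRecording(nonce);
    requestSurveyData(nonce);
}
```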

@MonsieurNicolas
Contributor

The NTP code was only there to warn the operator (and was buggy/not secure) so we removed it; also it didn't do anything about the system clock that we use basically everywhere else (like: we use the local clock to quantize metrics) because we actually need a steady clock there.

The new message "start-survey-recording <nonce>" could be an interesting idea:

  • are you thinking that it would basically create some sort of local accumulator identified by "nonce", which then allows reasoning based on a specific accumulator?
  • the accumulator would be deleted after some TTL in the order of something "long enough" (like 15 minutes).
  • it can also be deleted by "stop-survey-recording <nonce>"
  • its TTL would be reset when responding to survey requests that correspond to that nonce
    • this could allow accumulating data for window sizes decided by the surveyor, potentially running for much longer than the TTL.

What would nodes that don't have the nonce do? (new nodes for example)?

@bboston7
Contributor

bboston7 commented Mar 4, 2024

The NTP code was only there to warn the operator (and was buggy/not secure) so we removed it; also it didn't do anything about the system clock that we use basically everywhere else (like: we use the local clock to quantize metrics) because we actually need a steady clock there.

Got it, thanks for the clarification!

  • are you thinking that it would basically create some sort of local accumulator identified by "nonce", which then allows to reason based on a specific accumulator.

Yep!

  • the accumulator would be deleted after some TTL in the order of something "long enough" (like 15 minutes).

Yes, we'd need some TTL to prevent survey requests/data from potentially growing unbounded.

  • it can also be deleted by "stop-survey-recording "

Actually, I was thinking stop-survey-recording would define the end of a time slice and nodes would delete the accumulator some time after that, giving the surveying node ample time to collect the data from the time slice. From a surveyed node's perspective, the algorithm looks like this (a rough code sketch follows at the end of this comment):

  1. The node receives start-survey-recording n where n is the nonce for this survey
  2. The node creates an accumulator a for nonce n and begins recording data. The accumulator has some TTL t.
  3. If t passes before receiving a stop-survey-recording n, the node deletes a and the algorithm terminates here.
  4. Upon receipt of stop-survey-recording n, the node freezes the data in a and assigns it a new TTL u.
  5. Upon receipt of survey-request-data n, the node responds with the data in a.
  6. After u passes, the node deletes a.

  • its TTL would be reset when responding to survey requests that correspond to that nonce

    • this could allow accumulating data for window sizes decided by the surveyor, potentially running for much longer than the TTL.

Good point. There should be some way to extend the TTL.

What would nodes that don't have the nonce do? (new nodes for example)?

The main point of time slicing is that it makes the data easier to compare between nodes. Given that, I think nodes without the nonce (either because they're new or they missed the start-survey-recording message) should either not respond, or should respond with a different message type indicating the node exists but does not have full survey data. I'm thinking this second message type would only include non-time sliced data (such as version number, or if the node is currently in sync).

I like the partial response solution better than the no-response solution because it helps differentiate between unresponsive nodes and (likely) new nodes.
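
To make the lifecycle concrete, here is a minimal sketch of the surveyed node's side of the scheme above, assuming hypothetical names (SurveyAccumulator, TimeSlicedSurveyState, and so on); it also covers TTL extension on data requests and the fall-back for unknown nonces:

```cpp
#include <chrono>
#include <cstdint>
#include <iterator>
#include <optional>
#include <unordered_map>

using Nonce = uint32_t;
using Clock = std::chrono::steady_clock;

struct SurveyAccumulator
{
    bool frozen;              // set once stop-survey-recording arrives
    Clock::time_point expiry; // accumulator is deleted once this passes
    // ... time-sliced counters (bytes, messages, connection churn, ...) ...
};

class TimeSlicedSurveyState
{
    std::unordered_map<Nonce, SurveyAccumulator> mAccumulators;
    static constexpr std::chrono::minutes RECORDING_TTL{15};
    static constexpr std::chrono::minutes REPORTING_TTL{15};

  public:
    // Steps 1-2: create an accumulator for this nonce and start recording.
    void onStartRecording(Nonce n)
    {
        mAccumulators[n] = SurveyAccumulator{false, Clock::now() + RECORDING_TTL};
    }

    // Step 4: freeze the data and give it a fresh TTL for reporting.
    void onStopRecording(Nonce n)
    {
        auto it = mAccumulators.find(n);
        if (it != mAccumulators.end())
        {
            it->second.frozen = true;
            it->second.expiry = Clock::now() + REPORTING_TTL;
        }
    }

    // Step 5: answer a data request. nullopt means "no accumulator", and the
    // node would fall back to a partial (non-time-sliced) response.
    std::optional<SurveyAccumulator> onDataRequest(Nonce n)
    {
        auto it = mAccumulators.find(n);
        if (it == mAccumulators.end())
        {
            return std::nullopt;
        }
        // Extending the TTL here lets the surveyor keep a window alive for
        // longer than a single TTL, per the discussion above.
        it->second.expiry = Clock::now() + REPORTING_TTL;
        return it->second;
    }

    // Steps 3 and 6: drop expired accumulators (called periodically).
    void pruneExpired()
    {
        auto now = Clock::now();
        for (auto it = mAccumulators.begin(); it != mAccumulators.end();)
        {
            it = (now > it->second.expiry) ? mAccumulators.erase(it)
                                           : std::next(it);
        }
    }
};
```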

@MonsieurNicolas
Contributor

Yeah, makes sense... actually nodes can just respond with the existing survey response if they don't have the accumulator; that way we share the logic with the old clients that don't understand accumulators/timeslicing.
