Add monitoring with Kamon (disabled by default) #1126
For now:
- we only track some tasks (especially in the router, but not even `node_announcement` and `channel_update`)
- all db calls are monitored
Codecov Report
@@ Coverage Diff @@
## master #1126 +/- ##
==========================================
+ Coverage 83.44% 83.68% +0.24%
==========================================
Files 102 103 +1
Lines 7677 7743 +66
Branches 318 315 -3
==========================================
+ Hits 6406 6480 +74
+ Misses 1271 1263 -8
def timeFuture[T](name: String)(f: => Future[T])(implicit ec: ExecutionContext): Future[T] = {
  val timer = Kamon.timer(name).withoutTags().start()
  val res = f
  res onComplete { case _ => timer.stop }
Shouldn't this be `res onComplete { case Success(_) => timer.stop }`? I'd assume we don't want to mix measurements for failed calls with the successful ones.
We need to stop the timer somehow? I don't think it matters anyway, because Kamon has first-class handling of distribution of latencies, and we would detect outliers easily.
I think that's where we'd use tags or some equivalent.
Usually we should split our measurements on values of a specific tag with Success, Failure1, Failure2, etc depending on our set of possible failures.
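The tagging idea above could look roughly like the following. This is a minimal sketch, not the PR's implementation: it deliberately leaves the Kamon call out, and the `TimedFuture` object, the `record` callback and the `"success"`/`"failure"` labels are all hypothetical names chosen for illustration. In practice `record` would forward the label as a tag value on whatever timer is used.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

// Hypothetical helper, for illustration only.
object TimedFuture {
  // `record` is an assumed hook: it receives an outcome label and the elapsed
  // time in nanoseconds, and is expected to report them to a tagged timer.
  def timeFuture[T](record: (String, Long) => Unit)(f: => Future[T])(implicit ec: ExecutionContext): Future[T] = {
    val start = System.nanoTime()
    val res = f
    res.onComplete {
      // Every outcome is recorded, but under a different label, so failed
      // calls do not pollute the latency distribution of successful ones.
      case Success(_) => record("success", System.nanoTime() - start)
      case Failure(_) => record("failure", System.nanoTime() - start)
    }
    res
  }
}
```

The `Failure` case could be split further (one label per known failure type) if the set of possible failures is small and stable.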
I'm marking this as reviewable to minimize conflicts with other branches. It is just a first step. Thanks @ivantopo for all the support!
eclair-core/src/main/scala/fr/acinq/eclair/blockchain/bitcoind/rpc/ExtendedBitcoinClient.scala
// if this returns true, it means that the spending tx is *not* in the blockchain
isTransactionOutputSpendable(txid, outputIndex, includeMempool = false).map {
case res => UtxoStatus.Spent(spendingTxConfirmed = !res)
val span = Kamon.spanBuilder("validate-bitcoin-client").start()
What's Kamon's recommendation for naming spans? Do they automatically include the containing namespace/class/object? It's usually recommended to use some kind of namespacing in spans, so either Kamon does it automatically for us or maybe we should do something like `blockchain.bitcoind.rpc.extendedbitcoinclient.validate`?
That's a good point, let's address this in a 2nd iteration
@@ -43,6 +44,7 @@ class Authenticator(nodeParams: NodeParams) extends Actor with DiagnosticActorLo
def ready(switchboard: ActorRef, authenticating: Map[ActorRef, PendingAuth]): Receive = {
case pending@PendingAuth(connection, remoteNodeId_opt, address, _) =>
log.debug(s"authenticating connection to ${address.getHostString}:${address.getPort} (pending=${authenticating.size} handlers=${context.children.size})")
Kamon.counter("peers.connecting.count").withTag("state", "authenticating").increment()
It might be useful to centralize counters for each component (like the `Metrics` object you created for `Peer`).
For example in `Authenticator`'s companion object, define `private val connectingPeersCounter = Kamon.counter("peers.connecting.count")`.
This is usually easier for future maintenance.
@@ -52,6 +53,7 @@ class Server(nodeParams: NodeParams, authenticator: ActorRef, address: InetSocke
def listening(listener: ActorRef): Receive = {
case Connected(remote, _) =>
log.info(s"connected to $remote")
Kamon.counter("peers.connecting.count").withTag("state", "connected").increment()
This is the same counter used in `Authenticator`, so it is definitely worth centralizing somewhere to make sure these two actors keep using the same counter correctly.
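A shared definition along the lines suggested in these two comments might look like the sketch below. It only uses the `Kamon.counter(...)`, `withTag(...)` and `increment()` calls already present in the diff; the `ConnectionMetrics` object name and its field names are hypothetical, and where the object should live is left open. It assumes kamon-core 2.x is on the classpath.

```scala
import kamon.Kamon

// Hypothetical holder object; name and placement are illustrative only.
object ConnectionMetrics {
  // Single definition of the metric name shared by both actors.
  private val connectingPeers = Kamon.counter("peers.connecting.count")

  // One instrument per value of the "state" tag used in the diff.
  val Authenticating = connectingPeers.withTag("state", "authenticating")
  val Connected = connectingPeers.withTag("state", "connected")
}
```

`Authenticator` would then call `ConnectionMetrics.Authenticating.increment()` and `Server` would call `ConnectionMetrics.Connected.increment()`, so the metric name and tag key are spelled in exactly one place.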
@@ -520,7 +536,11 @@ class Router(val nodeParams: NodeParams, watcher: ActorRef, initialized: Option[
stay
} else {
log.info("validating shortChannelId={}", c.shortChannelId)
watcher ! ValidateRequest(c)
Kamon.runWithContextEntry(shortChannelIdKey, c.shortChannelId) {
Kamon.runWithSpan(Kamon.spanBuilder("validate-channel").tag("shortChannelId", c.shortChannelId.toString).start(), finishSpan = false) {
How will that span be stopped? I'm not sure it's useful to start it here since we're just sending a message to another actor, it's more useful to do all the spanning inside the watcher isn't it?
It is stopped in the `ValidateResult` handler. The point is to be able to measure time spent in the queue, which I think is important to measure backlogs.
Interesting, got it.
<artifactId>kamon-annotation_${scala.version.short}</artifactId>
<version>2.0.0-RC1</version>
</dependency>
<!--dependency>
Maybe remove entirely? I don't expect it to be useful to monitor akka-http since we only use it for the RPC API...
I think monitoring the API is still useful, and I'm waiting for a compatible version.
@t-bast I addressed the most pressing points. I think we can take care of the rest in a later iteration?
This reverts commit ff0b4c8.
# Conflicts:
# eclair-core/pom.xml
# eclair-node/src/main/scala/fr/acinq/eclair/Boot.scala
# pom.xml
* Update list of commands in eclair-cli help (#1091)
* Add missing API endpoints to eclair-cli help
* Documentation update (#1092)
* Typed amounts (#1088)
* Route computation: fix fee check (#1101)
  Fee check during route computation is:
  - fee is below maximum value
  - OR fee is below amount * maximum percentage
  The second check was buggy and route computation would fail when fees were above maximum value but below maximum percentage of amount being paid.
* Publish transactions during transitions (#1089)
  Follow up to #1082. The goal is to be able to publish transactions only after we have persisted the state. Otherwise we may run into corner cases like [1] where a refund tx has been published, but we haven't kept track of it and generate a different one (with different fees) the next time. As a side effect, we can now remove the special case that we were doing when publishing the funding tx, and remove the `store` function.
  NB: the new `calling` transition method isn't restricted to publishing transactions but that is the only use case for now.
  [1] ACINQ/eclair-mobile#206
* Typed cltv expiry (#1104)
  Untyped cltv expiry was confusing: delta and absolute expiries really need to be handled differently. Even variable names were sometimes misleading. Now the compiler will help us catch errors early.
* Extended queries optional (#899)
  This is the implementation of lightning/bolts#557.
* Correctly handle multiple channel_range_replies
  The scheme we use to keep track of channel queries with each peer would forget about missing data when several channel_range_replies are sent back for a single channel_range_queries.
* RoutingSync: remove peer entry properly
* Remove peer entry on our sync map only when we've received a `reply_short_channel_ids_end` message.
* Make routing sync test more explicit
* Do not send channel queries if we don't want to sync
* Router: clean our sync state when we (re)connect to a peer
  We must clean up leftovers for the previous session and start the sync process again.
* Router: reset sync state on reconnection
  When we're reconnected to a peer we will start a new sync process and should reset our sync state with that peer.
* Extended Queries: use TLV format for optional data
  Optional query extensions now use TLV instead of a custom format. Flags are encoded as varint instead of bytes as originally proposed. With the current proposal they will all fit on a single byte, but will be much easier to extend this way.
* TLV Stream: Implement a generic "get" method for TLV fields
  If we have a TLV stream of type MyTLV which is a subtype of TLV, and MyTLV1 and MyTLV2 are both subtypes of MyTLV then we can use stream.get[MyTLV1] to get the TLV record of type MyTLV1 (if any) in our TLV stream.
* Channel range queries: send back node announcements if requested (#1108)
  This PR adds support for sending back node announcements when replying to channel range queries:
  - when explicitly requested (bit is set in the optional query flag)
  - when query flags are not used and a channel announcement is sent (as per the BOLTs)
  A new configuration option `request-node-announcements` has been added in the `router` section. If set to true, we will request node announcements when we receive a channel id (through channel range queries) that we don't know of. This is a setting that we will probably turn off on mobile devices.
* Rework router data structures (#902)
  Instead of using two separate maps (for channels and channel_updates), we now use a single map, which groups channel+channel_updates. This is also true for data storage, resulting in the removal of the channel_updates table.
* Add more numeric utilities to MilliSatoshi (#1103)
  Add comparisons and postfix operators. Update most of the codebase to leverage those.
* Use unsigned comparison for 'maxHtlcValueInFlightMsat' (#1105)
* Add a sync whitelist (#954)
  We will only sync with whitelisted peers. If the whitelist is empty then we sync with everyone.
* Move http APIs to subproject eclair-node (#1102)
* Fix regression in `Commitments.availableForSend` (#1107)
  We must consider `nextRemoteCommit` when applicable. This is a regression caused in #784. The core bug only exists when we have a pending unacked `commit_sig`, but since we only send the `AvailableBalanceChanged` event when sending a signature (not when receiving a revocation), actors relying on this event to know the current available balance (e.g. the `Relayer`) will have a wrong value in-between two outgoing sigs.
* Bolt4: remove final_expiry_too_soon error message (#1106)
  It allowed probing attacks and the spec deprecated it in favor of IncorrectOrUnknownPaymentDetails. Also add better support for unknown failure messages.
* Fix maven mirror (#1120)
* Use Long to back the UInt64 type (#1109)
* Define comparison operators between UInt64 and MilliSatoshi
* Implement Bolt 11 invoice feature bits (#1121)
  lightning/bolts#656 introduced invoice feature bits as a pre-requisite for AMP and other advanced payment use-cases.
* Update docker build (#1123)
* Update docker base image to jdk11, update maven to 3.6.2 [ci skip]
* Reject expired invoices before payment flow starts (#1117)
* Made sync params configurable (#1124)
  This allows us to choose smaller parameters for tests and reduce cpu requirement during testing.
  NB: The default value of 3500 for `reply_channel_range` was wrong. Theoretical max is ~2700.
* Activate support for variable-length onion (#1087)
  This is now enabled by default. We forward variable-length onions if we receive some. We accept variable-length payments. However for maximum compatibility with the network, we send payments using legacy payloads.
* Add Semaphore CI (#1125)
* Router computes network stats (#1116)
* Add comments and fix warnings in graph processing
* Add small feature to set the htlcMaximumMsat for routing hints (otherwise the graph processing algorithm used a minimum value which slightly reduced the benefits of those routing hints)
* Add the computation of network statistics to the router: this will be useful for multi-part payments to decide what thresholds should be used to split a payment
* Add monitoring with Kamon (disabled by default) (#1126)
  For now:
  - we only track some tasks (especially in the router, but not even `node_announcement` and `channel_update`)
  - all db calls are monitored
  - kamon is disabled by default
* Check funds in millisatoshi when sending/receiving an HTLC (#1128)
  Instead of satoshi, which could introduce rounding errors. Also, we check first the balance before the max-inflight amount, because it makes more sense in terms of error management.
  Co-Authored-By: Bastien Teinturier
* Don't hardcode the channel version (#1129)
  Instead of hardcoding the channel version when we instantiate the `Commitments` object, we rather define it when the channel is instantiated. This is saner and prepares future usage.
* Removed Globals class (#1127)
  This is a prerequisite to parallelization of tests.
* Make tests run in parallel (#1112)
  There are two levels of parallelization:
  - between test suites (a suite = a test file)
  - within a suite (depends on tests suites, some rely on sequential execution of tests, some don't)
* Add codecov integration to semaphore CI (#1134)
* Remove codecov integration from travis CI
* Drop support for Java 8 (#1135)
  We already have Java 7 (for Android) and Java 11. Supporting Java 8 would require crossbuilding, which we are not doing (two recent PRs broke the build on Java 8).
* Sphinx: accept invalid downstream errors (#1137)
  When a downstream node sends us an onion error with an invalid length, we must forward the failure.
  The recipient won't be able to extract the error but at least it knows the payment failed.
* Update string to match on bitcoind while it's indexing (#1138)
* Check for bitcoind's getrawtransaction availability during startup
* Peer: disable kamon
* Payment lifecycle refactoring (#1130)
* Unify payment events (no more duplication between payment types and events)
* Factorize DB and eventStream interactions: this paves the way for sub-payments that shouldn't be stored in the DB nor emit events.
* Add more fields to the payments DB:
  * bolt 11 invoice for sent payment
  * external id (for app developers)
  * parent id (AMP)
  * target node id
  * fees
  * route (if success)
  * failures (if failed)
* Re-work the PaymentsDb interface
* Clarify use of seconds / milliseconds in DB interfaces -> milliseconds everywhere
* Run SQL migrations inside transactions
* Improve error handling when we couldn't find all the channels for a supplied route in /sendtoroute API (#1142)
* Improve error handling when we couldn't find all the channels for a supplied route in /sendtoroute
* Handle fees increases when channel is OFFLINE (#1080)
* Add 'close-on-offline-feerate-mismatch' configuration to avoid closing offline channel when the feerate mismatch is over the threshold.
* Derive channel keys from the channel funding pubkey (#1097)
  We now generate a random funding key for each new channel, and use its public key to deterministically derive all channel keys and secrets. This will let us easily recover funds using DLP even if we've lost everything but our seed: we just need to connect to the node we had a channel with, ask them to publish their commit tx, and once we see it on the blockchain we can extract our funding pubkey, recompute channel keys and spend our output.
* Add a "funding pubkey path" option to the channel version field
  This option is checked when we need to compute channel keys. For old channels it won't be set, and we always set it for new ones.
* ChannelVersion: make sure that all bits are set to 0 for legacy channels
* ChannelVersion: USE_PUBKEY_KEYPATH is set by default
* Check if remote funder can handle an updated commit fee when sending HTLC (#1084)
  If the sender of an htlc isn't the funder, then both sides will have to afford the payment:
  - the sender needs to be able to afford the htlc amount
  - the funder needs to be able to afford the greater commit tx fee incurred by the additional htlc output.
  Fixes #1081.
  Co-Authored-By: Pierre-Marie Padiou
* Fix and expand channel keypath (#1147)
* Fix funding pubkey to channel key path computation
  Channel key path is generated from 8 bytes computed from our funding pubkey, but we extracted 4 uint32 values instead of 2 (last 2 were always 0). We now use 128 bits to derive channel key paths.
* Add a channel key path compatibility test
  This test will fail if we change the way we compute channel key paths, which would break existing channels.
* Use the same chain hash reference in all channel updates
  To save memory, once we check that a channel_update's chain hash matches what we expect we just replace it with a reference to our own chain hash.
* Commitments: take HTLC fee into account (#1152)
  Our balance computation was slightly incorrect. If you want to know how much you can send (or receive), you need to take into account the fact that you'll add a new HTLC which adds weight to the commit tx (and thus adds fees).
* Android: add a spray-based API to eclair-node
  This is a copy of the spray-based API developed by @araspitzu (akka-http does not work for akka 2.3 which we use on the android branch)
* HTTP API: add type hints for payment status (#1150)
  Cleans up the JSON payment status (easier to interpret for callers).
* Use "mock" Kamon library
  Kamon does not work on Android and does not make much sense, so we replace it with a basic Mock implementation that does nothing.
* Electrum: improve coin selection (fixes #1146) (#1149)
  Our previous coin selection would sometimes fail when there was one wallet utxo and a low feerate, because our first pass used a fee estimate that was too high and could sometimes not be met.
* Extend funding key path to 256 bits (#1154)
  Our random funding key path is now 8 * 32 bits plus a 1' (funder) or 0' (fundee). Channel key paths are computed from the sha256 of the funding public key (we take all 256 bits).
* Use bitcoin 0.18.1 in the test (#1148)
* Upgrade new unit tests to bitcoin 0.18.1 API (#1157)
  We had 2 open PRs, one that added new tests using the 0.API, one that switched to 0.18.1; when they were merged the new tests failed since they had not been upgraded.
* Update netty dependency to 4.1.32 (#1160)
  Also:
  * explicitly set endpoint identification algorithm in strict mode
  * force TLS protocols 1.2/1.3 in strict mode
  Co-Authored-By: Bastien Teinturier
* Add execution time limit (#1161)
* Android: wipe channels table during db migration
  We already wipe the updates table, and this makes upgrading much simpler since we had different structures on android vs master.
* Activate extended channel range queries (#1165)
  By default we now set the `gossip_queries_ex` feature bit. We also change how we compare feature bits, and will use channel queries (or extended queries) only if the corresponding feature bit is set in both local and remote init messages.
* Use guava to compute CRC32C checksums (#1166)
  CRC32C is not available in JDK 7 which we target on Android.