feat(dup): add metrics for duplication #393
@@ -273,9 +273,6 @@ void replication_options::initialize()

     duplication_disabled = dsn_config_get_value_bool(
         "replication", "duplication_disabled", duplication_disabled, "is duplication disabled");
-    if (allow_non_idempotent_write && !duplication_disabled) {
Why drop this constraint?
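We are setting this constraint aside for now. For one thing, allow_non_idempotent_write is enabled by default in our production clusters. For another, tables with duplication (hot backup) enabled can have requirements imposed on the application at onboarding time, so the rule does not have to be hard-coded in the program.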
So can a table with duplication enabled currently accept non-idempotent writes at the same time?
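Not at the moment. If we do forbid it, we probably will not rely on configuration to forbid non-idempotent writes, since within one cluster some tables may have duplication enabled and others not.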
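OK. Then what currently guarantees that no non-idempotent operations are performed on tables with duplication enabled? If this is left to the application and our code enforces nothing, it still feels somewhat unsafe.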
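Nothing yet. The preliminary idea is to change this on the pegasus side: on encountering INCR or CHECK_AND_SET, write an empty write and then return an error. That is not implemented yet. HBase does support duplicating INCR by converting INCR into PUT during replication, but that flow is hard to implement in pegasus.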
OK, please leave a TODO. It would be best to enforce this in code; relying on the application's self-discipline is too unsafe.
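As a rough illustration of the per-table check discussed in this thread, here is a minimal sketch. It is not the pegasus implementation (none exists yet, per the thread): the enum values, `is_non_idempotent`, and `check_write` are hypothetical names invented for this example, and the empty write is only noted in a comment.

```cpp
// Hypothetical operation kinds; the real pegasus RPC codes are not shown in this thread.
enum class write_op { PUT, REMOVE, INCR, CHECK_AND_SET };

enum class write_status { OK, REJECTED_NON_IDEMPOTENT };

// True for the operations the thread names as non-idempotent.
inline bool is_non_idempotent(write_op op)
{
    return op == write_op::INCR || op == write_op::CHECK_AND_SET;
}

// Decides per table (not per cluster-wide config) whether a write may proceed.
// On rejection, the caller would be expected to commit an empty write
// and return the error to the client, as proposed in the discussion.
inline write_status check_write(write_op op, bool table_is_duplicating)
{
    if (table_is_duplicating && is_non_idempotent(op)) {
        return write_status::REJECTED_NON_IDEMPOTENT;
    }
    return write_status::OK;
}
```

Keying the decision on a per-table flag rather than a global option reflects the point above that one cluster may mix tables with and without duplication.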
This PR introduces several metrics for duplication:
log_read_bytes_rate
name:
replica*eon.replica_stub*dup.log_read_bytes_rate
Measures the rate, in bytes, at which the private-log is read.
The curve is usually identical to that of replica*eon.replica_stub*shared.log.recent.write.size, because when everything is normal, what is written is what gets duplicated:
log_read_bytes_rate = shared.log.recent.write.size = shipped_bytes_rate
In some failure conditions, however, log_read_bytes_rate may be much larger, which helps identify when log reading during duplication is behaving abnormally.
log_read_mutations_rate
name:
eon.replica_stub dup.log_read_mutations_rate
The read rate measured in number of mutations. It is used the same way as "log_read_bytes_rate".
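The comparison described for "log_read_bytes_rate" can be sketched as a simple check. This is only an illustration: `log_read_looks_abnormal` and the `tolerance` parameter are hypothetical, and what counts as "much larger" is an assumption that would need tuning against real curves.

```cpp
// Hypothetical helper: returns true when the log read rate is well above the
// log write rate it is normally expected to track.
// `tolerance` is a ratio (0.5 means the read rate may exceed the write rate by up to 50%).
inline bool log_read_looks_abnormal(double log_read_bytes_rate,
                                    double log_write_bytes_rate,
                                    double tolerance)
{
    if (log_write_bytes_rate <= 0.0) {
        // Nothing is being written, so any sustained reading is unexpected.
        return log_read_bytes_rate > 0.0;
    }
    return log_read_bytes_rate > log_write_bytes_rate * (1.0 + tolerance);
}
```

For example, with a tolerance of 0.5, a read rate of 1000 bytes/s against a write rate of 100 bytes/s is flagged, while equal rates are not.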
shipped_bytes_rate
name:
eon.replica_stub dup.shipped_bytes_rate
The rate of network output bytes for each successfully delivered duplication_request.
In some failure conditions the curve may drop to 0, for example when the inter-cluster network is unavailable.
confirmed_rate
eon.replica_stub dup.confirmed_rate
The rate of confirmed writes, that is, writes that have been duplicated and also confirmed by the meta server.
pending_mutations_count
eon.replica_stub dup.pending_mutations_count
The number of writes that have not yet been duplicated. This is one of the most important metrics for duplication: the more pending writes, the weaker the consistency. In practice, it is recommended to set an alarm threshold for this metric. Beyond the threshold, the duplication should
time_lag(ms)
eon.replica_stub dup.time_lag(ms)
The "latency" between (1) the time the client write arrives at the replica server and (2) the time the write is duplicated and applied to the remote cluster.
t0 -> t1 -> t2
client -> replica server -> remote cluster
time_lag = t2-t1
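The definition above amounts to a single subtraction on millisecond timestamps. The sketch below only illustrates it: the function name and the `ms` alias are assumptions made for this example, not code from this PR, and it presumes both timestamps are expressed on a comparable clock.

```cpp
#include <chrono>

using ms = std::chrono::milliseconds;

// t1: when the client write arrived at the replica server.
// t2: when the write was duplicated and applied to the remote cluster.
// Returns time_lag = t2 - t1.
inline ms time_lag(ms t1_arrived_at_replica, ms t2_applied_on_remote)
{
    return t2_applied_on_remote - t1_arrived_at_replica;
}
```

Note that t0, the moment the client issues the write, does not enter the calculation; only the interval between the replica server and the remote cluster is measured.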