feat(metrics): Add bloom filter related metrics #521

acelyc111 · 2020-04-23T05:43:49Z

What problem does this PR solve?

Supports monitoring more RocksDB metrics to observe how the storage system works.
Provides a way to optimize RocksDB configurations and user workloads.

New metrics

replica*app.pegasus*rdb.bf_seek_total@<gpid>

Aka rocksdb::Tickers::BLOOM_FILTER_PREFIX_CHECKED. Number of times bloom was checked before creating iterator on a file.

replica*app.pegasus*rdb.bf_seek_negatives@<gpid>

Aka rocksdb::Tickers::BLOOM_FILTER_PREFIX_USEFUL. The number of times the check was useful in avoiding iterator creation (and thus likely IOPs).

replica*app.pegasus*rdb.bf_point_positive_true@<gpid>

Aka rocksdb::Tickers::BLOOM_FILTER_FULL_TRUE_POSITIVE. Number of times the bloom FullFilter has not avoided the reads and the data actually exists.

replica*app.pegasus*rdb.bf_point_positive_total@<gpid>

Aka rocksdb::Tickers::BLOOM_FILTER_FULL_POSITIVE. Number of times the bloom FullFilter has not avoided the reads.

replica*app.pegasus*rdb.bf_point_negatives@<gpid>

Aka rocksdb::Tickers::BLOOM_FILTER_USEFUL. Number of times the bloom filter has avoided file reads, i.e., negatives.

collector*app.pegasus*app.stat.rdb_bf_seek_negatives_rate#<app_name>

Rate of avoided iterator creations (and thus likely IOPs) after checking prefix bloom filter.

value = SUM(bf_seek_negatives) / SUM(bf_seek_total)

collector*app.pegasus*app.stat.rdb_bf_point_negatives_rate#<app_name>

Rate of avoided point lookups after checking full key bloom filter.

value = SUM(bf_point_negatives) / (SUM(bf_point_negatives) + SUM(bf_point_positive_total))

collector*app.pegasus*app.stat.rdb_bf_point_false_positive_rate#<app_name>

False positive rate of checking full key bloom filter.

value = (SUM(bf_point_positive_total) - SUM(bf_point_positive_true)) / (SUM(bf_point_positive_total) - SUM(bf_point_positive_true) + SUM(bf_point_negatives))
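The three formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the collector's actual code: the `BloomCounters` class, the `safe_div` guard against an empty denominator, and the sample counter values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BloomCounters:
    """Per-app sums of the replica-level counters (hypothetical container)."""
    bf_seek_total: int
    bf_seek_negatives: int
    bf_point_positive_true: int
    bf_point_positive_total: int
    bf_point_negatives: int


def safe_div(numerator, denominator):
    # Assumed guard: report 0.0 when nothing has been counted yet.
    return numerator / denominator if denominator else 0.0


def bloom_rates(c):
    # Positives that turned out not to exist after reading the data file.
    false_positives = c.bf_point_positive_total - c.bf_point_positive_true
    return {
        "rdb_bf_seek_negatives_rate": safe_div(
            c.bf_seek_negatives, c.bf_seek_total),
        "rdb_bf_point_negatives_rate": safe_div(
            c.bf_point_negatives,
            c.bf_point_negatives + c.bf_point_positive_total),
        "rdb_bf_point_false_positive_rate": safe_div(
            false_positives, false_positives + c.bf_point_negatives),
    }


# Made-up sample values, chosen only to exercise the formulas.
rates = bloom_rates(BloomCounters(
    bf_seek_total=200, bf_seek_negatives=100,
    bf_point_positive_true=100, bf_point_positive_total=101,
    bf_point_negatives=99))
```

Note that the false positive rate is taken over lookups for keys that do not exist (false positives plus negatives), not over all lookups.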

The names of the above metrics follow the RocksDB documentation on bloom filters.

What is changed and how it works?

Add 3 columns (seek_n_rate, point_n_rate, point_fp_rate) in shell command app_stat

table level:

>>> app_stat
[app_stat]
app_name     app_id  pcount   GET  MGET   PUT  MPUT   DEL  MDEL  INCR   CAS   CAM  SCAN   RCU   WCU  expire  filter  abnormal  delay  reject  file_mb  file_num  mem_tbl_mb  mem_idx_mb  hit_rate  seek_n_rate  point_n_rate  point_fp_rate
temp             10      16  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00      0.00   0.00    0.00     0.00        16      108.18        0.03      0.99         0.50          0.99           0.01
(total:1)         0      16  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00      0.00   0.00    0.00     0.00        16      108.18        0.03      0.99         0.50          0.99           0.01

partition level:

>>> app_stat -a temp
[app_stat]
pidx           GET  MGET    PUT  MPUT   DEL  MDEL  INCR   CAS   CAM  SCAN   RCU     WCU  expire  filter  abnormal  delay  reject  file_mb  file_num  mem_tbl_mb  mem_idx_mb  hit_rate  seek_n_rate  point_n_rate  point_fp_rate
0             0.00  0.00   5.98  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00   60.00    0.00    0.00      0.00   0.00    0.00     0.00         1        6.62        0.00      0.99         0.49          0.99           0.01
1             0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00    0.00      0.00   0.00    0.00     0.00         1        5.54        0.00      0.99         0.50          0.99           0.01
2             0.00  0.00  24.03  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  241.00    0.00    0.00      0.00   0.00    0.00     0.00         1        7.48        0.00      0.99         0.49          0.99           0.01
3             0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00    0.00      0.00   0.00    0.00     0.00         1        6.45        0.00      0.99         0.50          0.99           0.01
4             0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00    0.00      0.00   0.00    0.00     0.00         1        7.49        0.00      0.99         0.50          0.99           0.01
5             0.00  0.00   4.68  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00   47.00    0.00    0.00      0.00   0.00    0.00     0.00         1        7.78        0.00      0.99         0.49          0.99           0.01
6             0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00    0.00      0.00   0.00    0.00     0.00         1        6.39        0.00      0.99         0.50          0.99           0.01
7             0.00  0.00  22.04  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  221.00    0.00    0.00      0.00   0.00    0.00     0.00         1        6.18        0.00      0.99         0.50          0.99           0.01
8             0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00    0.00      0.00   0.00    0.00     0.00         1        5.70        0.00      0.99         0.50          0.99           0.01
9             0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00    0.00      0.00   0.00    0.00     0.00         1        5.61        0.00      0.99         0.49          0.99           0.01
10            0.00  0.00   3.78  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00   38.00    0.00    0.00      0.00   0.00    0.00     0.00         1        6.35        0.00      0.99         0.50          0.99           0.01
11            0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00    0.00      0.00   0.00    0.00     0.00         1        8.06        0.00      0.99         0.49          0.99           0.01
12            0.00  0.00  22.23  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  223.00    0.00    0.00      0.00   0.00    0.00     0.00         1        7.69        0.00      0.99         0.50          0.99           0.01
13            0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00    0.00      0.00   0.00    0.00     0.00         1        5.46        0.00      0.99         0.50          0.99           0.01
14            0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    0.00    0.00    0.00      0.00   0.00    0.00     0.00         1        6.81        0.00      0.99         0.49          0.99           0.01
15            0.00  0.00   5.28  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00   53.00    0.00    0.00      0.00   0.00    0.00     0.00         1        7.95        0.00      0.99         0.49          0.99           0.01
(total:16)    0.00  0.00  88.02  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  883.00    0.00    0.00      0.00   0.00    0.00     0.00        16      107.56        0.03      0.99         0.50          0.99           0.01

Add 3 metrics (rdb_bf_point_false_positive_rate, rdb_bf_point_negatives_rate, rdb_bf_seek_negatives_rate) to the app-level entity

Check List

Tests

Manual test (add detailed scripts or steps below)

#!/usr/bin/env python
# coding:utf-8

from pypegasus.pgclient import *

from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks, Deferred


@inlineCallbacks
def basic_test():
    # init
    c = Pegasus(['meta1:port', 'meta2:port'], 'temp')

    suc = yield c.init()
    if not suc:
        reactor.stop()
        print('ERROR: connect pegasus server failed')
        return

    kCount = 10000
    # write test data set A
    print("start to set")
    for i in range(kCount):
        (ret, ign) = yield c.set('hkey_' + str(i), 'skey', 'value_' + str(i), 0, 500)
        #if ret == error_types.ERR_OK.value: continue
        #print('set hkey_' + str(i) + ' : skey => value_' + str(i) + ' ' + str(ret))

    # get test data sets A and B; B is the same size as A but was never written
    print("start to get")
    for i in range(2*kCount):
        (ret, v) = yield c.get('hkey_' + str(i), 'skey')
        #if ret != error_types.ERR_OK.value:
        #    print('hkey_' + str(i) + ' : skey => ' + v)

    # scan test data sets A and B (B is the same size as A but was never written), so
    # 'seek_n_rate' in the shell and 'rdb_bf_seek_negatives_rate' in the metrics should be around 0.5
    print("start to scan")
    o = ScanOptions()
    o.batch_size = 1
    for i in range(2*kCount):
        s = c.get_scanner('hkey_' + str(i), '6', '8', o)
        while True:
            try:
                ret = yield s.get_next()
                #print('get_next ret: ', ret)
            except Exception as e:
                print(e)
                break

            if not ret:
                break
        s.close()

    reactor.stop()


if __name__ == "__main__":
    reactor.callWhenRunning(basic_test)
    reactor.run()
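A back-of-the-envelope check of the 0.5 figure mentioned in the scan comment. This is a sketch under idealized assumptions that the script itself does not state: each scan triggers exactly one prefix bloom check, and the filter produces no false positives for the unwritten hash keys.

```python
k_count = 10000                    # same value as kCount in the script
scans_total = 2 * k_count          # one scan per hash key in sets A and B
scans_on_missing_keys = k_count    # set B was never written

# Idealized: one prefix check per scan, no false positives.
expected_seek_n_rate = scans_on_missing_keys / scans_total
```

Real values can drift slightly from this, as the 0.49 and 0.50 readings in the partition-level output suggest.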

Related changes

Need to cherry-pick to the release branch
Yes
Need to update the documentation
Yes
Need to be included in the release note
Yes

src/server/pegasus_server_impl.cpp

neverchanje · 2020-04-24T06:55:29Z

Why was BLOOM_FILTER_PREFIX_USEFUL called "seek_negatives"? I prefer what RocksDB calls it: "useful". "bf_seek_useful_rate" is definitely easier to remember than "bf_seek_negatives_rate".

acelyc111 · 2020-04-24T07:12:02Z

> Why was BLOOM_FILTER_PREFIX_USEFUL called "seek_negatives"? I prefer what RocksDB calls it: "useful". "bf_seek_useful_rate" is definitely easier to remember than "bf_seek_negatives_rate".

Of course, 'negative' is longer than 'useful', but when 'negative' stands alongside 'positive' or 'false negative', it is clearer IMO.
It is like a test result being 'negative' or 'positive' for coronavirus.

neverchanje · 2020-04-24T08:17:23Z

Yes. But if you want to make it clear, you should call the metrics "false_negative" rather than "negative".

    INIT_COUNTER(rdb_bf_seek_negatives_rate);
    INIT_COUNTER(rdb_bf_point_negatives_rate);
    INIT_COUNTER(rdb_bf_point_false_positive_rate);

Some of these names have a "false" prefix but others do not.

acelyc111 · 2020-04-24T08:26:49Z

> Yes. But if you want to make it clear, you should call the metrics "false_negative" rather than "negative".
>
>     INIT_COUNTER(rdb_bf_seek_negatives_rate);
>     INIT_COUNTER(rdb_bf_point_negatives_rate);
>     INIT_COUNTER(rdb_bf_point_false_positive_rate);
>
> Some of these names have a "false" prefix but others do not.

rdb_bf_point_negatives_rate doesn't mean rdb_bf_point_false_negatives_rate; it means rdb_bf_point_[true]_negatives_rate. That is to say, the BF says this key definitely does not exist (negative).
On the other hand, for rdb_bf_point_false_positive_rate, the BF says this key may exist (positive), but it turns out not to exist after reading the data file, so it's a false positive.
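The distinction can be seen with a toy Bloom filter. This is a minimal sketch for illustration only: the `ToyBloomFilter` class, its sizes, and its hashing scheme are assumptions made for the example and are not how RocksDB implements its filters.

```python
import hashlib


class ToyBloomFilter:
    """A tiny illustrative Bloom filter; not RocksDB's implementation."""

    def __init__(self, num_bits=64, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def may_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


stored = {f"hkey_{i}" for i in range(20)}
bf = ToyBloomFilter()
for key in stored:
    bf.add(key)

counts = {"negatives": 0, "positive_total": 0, "positive_true": 0}
for i in range(200):
    key = f"hkey_{i}"
    if not bf.may_contain(key):
        counts["negatives"] += 1     # definite miss: the read is skipped
        assert key not in stored     # a negative is never wrong
    else:
        counts["positive_total"] += 1    # the data must be read to find out
        if key in stored:
            counts["positive_true"] += 1

# Positives for keys that were never stored are the false positives.
false_positives = counts["positive_total"] - counts["positive_true"]
```

Every negative is a true negative, so a "false negative" counter would always be zero, while a positive only becomes true or false once the data has been read.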

acelyc111 force-pushed the rocksdb_bf_metrics branch 2 times, most recently from 0a5b8d8 to b0104b9 Compare April 23, 2020 10:46

acelyc111 changed the title ~~[metrics] Add bloom filter related metrics from rocksdb~~ [metrics] Add bloom filter related metrics Apr 23, 2020

acelyc111 changed the title ~~[metrics] Add bloom filter related metrics~~ [feat] Add bloom filter related metrics Apr 23, 2020

acelyc111 closed this Apr 23, 2020

acelyc111 reopened this Apr 23, 2020

feat(metrics): Add bloom filter related metrics

ee691eb

acelyc111 force-pushed the rocksdb_bf_metrics branch from 7a75f76 to ee691eb Compare April 23, 2020 12:28

acelyc111 changed the title ~~[feat] Add bloom filter related metrics~~ feat(metrics): Add bloom filter related metrics Apr 23, 2020

acelyc111 marked this pull request as ready for review April 23, 2020 12:28

format

27bea6f

hycdong previously approved these changes Apr 24, 2020

neverchanje added the type/perf-counter label Apr 24, 2020

neverchanje linked an issue Apr 24, 2020 that may be closed by this pull request

Monitor and optimize options for bloom filter #496

Closed

levy5307 reviewed Apr 24, 2020

format

7eddaf2

acelyc111 dismissed hycdong’s stale review via 7eddaf2 April 24, 2020 06:26

levy5307 approved these changes Apr 24, 2020

neverchanje approved these changes Apr 24, 2020

neverchanje merged commit fb811ba into apache:master Apr 24, 2020

neverchanje mentioned this pull request May 14, 2020

Release 2.0.0 #536

Closed

neverchanje added the v2.0.0 label Jun 5, 2020

neverchanje mentioned this pull request Jun 10, 2020

Release 1.12.4 #547

Closed

acelyc111 pushed a commit that referenced this pull request Jun 23, 2022

refactor(log): separate log_file from mutation_log (#521)

f76b3e2
