Downsampling Non Functional v0.4.5 #931

Closed · jacob-scheatzle opened this issue Sep 21, 2018 · 4 comments

jacob-scheatzle commented Sep 21, 2018

Go Runtime version: go1.10.2
Build Version: v0.4.5
Build Revision: eef049a
Build Branch: master
Build Date: 2018-09-24-23:58:17

Expected Behavior:
M3Coordinator (used as a sidecar) would automatically downsample and write metrics when a config specifying "aggregated" and "resolution" is picked up.

Actual Behavior:
M3Coordinator acknowledges the aggregated namespace but does not write anything to it. M3DB writes to the unaggregated namespace as expected. No errors regarding aggregation have been observed in the log.

Log Snippet:

2018-09-21T16:09:32.704-0400        INFO        resolved cluster namespace        {"namespace": "lowtime"}
2018-09-21T16:09:32.704-0400        INFO        resolved cluster namespace        {"namespace": "timetest"}
2018-09-21T16:09:32.708-0400        INFO        configuring downsampler to use with aggregated cluster namespaces        {"numAggregatedClusterNamespaces": 1}

Layout:
Currently, as a test setup, M3DB runs on 5 nodes, each of which is a seed.
The sidecar M3Coordinator runs on 2 nodes writing to M3DB and on 1 node for reading.

M3Coordinator Config (sidecar)

listenAddress:
  type: "config"
  value: "0.0.0.0:7201"

metrics:
  scope:
    prefix: "coordinator"
  prometheus:
    handlerPath: /metrics
    listenAddress: 0.0.0.0:7203 # until https://github.com/m3db/m3/issues/682 is resolved
  sanitization: prometheus
  samplingRate: 1.0
  extended: none

clusters:
   - namespaces:
       - namespace: timetest
         retention: 400h
         storageMetricsType: aggregated
         resolution: 2m
       - namespace: lowtime
         retention: 12h
         storageMetricsType: unaggregated
     client:
       config:
         service:
           env: default_env
           zone: embedded
           service: m3db
           cacheDir: /var/lib/m3kv
           etcdClusters:
             - zone: embedded
               endpoints:
# We have five M3DB nodes; they are listed here.
                 - <removed>
                 - <removed>
                 - <removed>
                 - <removed>
                 - <removed>
       writeConsistencyLevel: majority
       readConsistencyLevel: unstrict_majority
       writeTimeout: 20s
       fetchTimeout: 25s
       connectTimeout: 20s
       writeRetry:
         initialBackoff: 500ms
         backoffFactor: 3
         maxRetries: 2
         jitter: true
       fetchRetry:
         initialBackoff: 500ms
         backoffFactor: 2
         maxRetries: 3
         jitter: true
       backgroundHealthCheckFailLimit: 4
       backgroundHealthCheckFailThrottleFactor: 0.5

M3DB Config:

coordinator:
  listenAddress:
    type: "config"
    value: "0.0.0.0:7201"

  metrics:
    scope:
      prefix: "coordinator"
    prometheus:
      handlerPath: /metrics
      listenAddress: 0.0.0.0:7203 # until https://github.com/m3db/m3/issues/682 is resolved
    sanitization: prometheus
    samplingRate: 1.0
    extended: none

db:
  logging:
    level: info

  metrics:
    prometheus:
      handlerPath: /metrics
    sanitization: prometheus
    samplingRate: 1.0
    extended: detailed

  hostID:
    resolver: environment
    envVarName: M3DB_HOST_ID

# Fill out the following and uncomment before using.
  config:
    service:
      env: default_env
      zone: embedded
      service: m3db
      cacheDir: /var/lib/m3kv
      etcdClusters:
        - zone: embedded
          endpoints:
            - <removed>
            - <removed>
            - <removed>
            - <removed>
            - <removed>

    seedNodes:
      initialCluster:
        - hostID: m3db1
          endpoint: <removed>
        - hostID: m3db2
          endpoint: <removed>
        - hostID: m3db3
          endpoint: <removed>
        - hostID: m3db4
          endpoint: <removed>
        - hostID: m3db5
          endpoint: <removed>

  listenAddress: 0.0.0.0:9000
  clusterListenAddress: 0.0.0.0:9001
  httpNodeListenAddress: 0.0.0.0:9002
  httpClusterListenAddress: 0.0.0.0:9003
  debugListenAddress: 0.0.0.0:9004

  client:
    writeConsistencyLevel: majority
    readConsistencyLevel: unstrict_majority
    writeTimeout: 10s
    fetchTimeout: 15s
    connectTimeout: 20s
    writeRetry:
        initialBackoff: 500ms
        backoffFactor: 3
        maxRetries: 2
        jitter: true
    fetchRetry:
        initialBackoff: 500ms
        backoffFactor: 2
        maxRetries: 3
        jitter: true
    backgroundHealthCheckFailLimit: 4
    backgroundHealthCheckFailThrottleFactor: 0.5

  gcPercentage: 100

  writeNewSeriesAsync: true
  writeNewSeriesLimitPerSecond: 1048576
  writeNewSeriesBackoffDuration: 2ms

  bootstrap:
    bootstrappers:
        - filesystem
        - commitlog
        - peers
        - uninitialized_topology
    fs:
        numProcessorsPerCPU: 0.125

  cache:
    series:
      policy: lru

  commitlog:
    flushMaxBytes: 524288
    flushEvery: 1s
    queue:
        calculationType: fixed
        size: 2097152
    blockSize: 10m

  fs:
    filePathPrefix: /var/lib/m3db
    writeBufferSize: 65536
    dataReadBufferSize: 65536
    infoReadBufferSize: 128
    seekReadBufferSize: 4096
    throughputLimitMbps: 100.0
    throughputCheckEvery: 128

  repair:
    enabled: false
    interval: 2h
    offset: 30m
    jitter: 1h
    throttle: 2m
    checkInterval: 1m

  pooling:
    blockAllocSize: 16
    type: simple
    seriesPool:
        size: 262144
        lowWatermark: 0.7
        highWatermark: 1.0
    blockPool:
        size: 262144
        lowWatermark: 0.7
        highWatermark: 1.0
    encoderPool:
        size: 262144
        lowWatermark: 0.7
        highWatermark: 1.0
    closersPool:
        size: 104857
        lowWatermark: 0.7
        highWatermark: 1.0
    contextPool:
        size: 262144
        lowWatermark: 0.7
        highWatermark: 1.0
    segmentReaderPool:
        size: 16384
        lowWatermark: 0.7
        highWatermark: 1.0
    iteratorPool:
        size: 2048
        lowWatermark: 0.7
        highWatermark: 1.0
    fetchBlockMetadataResultsPool:
        size: 65536
        capacity: 32
        lowWatermark: 0.7
        highWatermark: 1.0
    fetchBlocksMetadataResultsPool:
        size: 32
        capacity: 4096
        lowWatermark: 0.7
        highWatermark: 1.0
    hostBlockMetadataSlicePool:
        size: 131072
        capacity: 3
        lowWatermark: 0.7
        highWatermark: 1.0
    blockMetadataPool:
        size: 65536
        lowWatermark: 0.7
        highWatermark: 1.0
    blockMetadataSlicePool:
        size: 65536
        capacity: 32
        lowWatermark: 0.7
        highWatermark: 1.0
    blocksMetadataPool:
        size: 65536
        lowWatermark: 0.7
        highWatermark: 1.0
    blocksMetadataSlicePool:
        size: 32
        capacity: 4096
        lowWatermark: 0.7
        highWatermark: 1.0
    identifierPool:
        size: 262144
        lowWatermark: 0.7
        highWatermark: 1.0
    bytesPool:
        buckets:
            - capacity: 16
              size: 524288
              lowWatermark: 0.7
              highWatermark: 1.0
            - capacity: 32
              size: 262144
              lowWatermark: 0.7
              highWatermark: 1.0
            - capacity: 64
              size: 131072
              lowWatermark: 0.7
              highWatermark: 1.0
            - capacity: 128
              size: 65536
              lowWatermark: 0.7
              highWatermark: 1.0
            - capacity: 256
              size: 65536
              lowWatermark: 0.7
              highWatermark: 1.0
            - capacity: 1440
              size: 16384
              lowWatermark: 0.7
              highWatermark: 1.0
            - capacity: 4096
              size: 8192
              lowWatermark: 0.7
              highWatermark: 1.0

Namespace Declaration:

curl -X POST localhost:7201/api/v1/namespace -d '{
  "name": "timetest",
  "options": {
    "bootstrapEnabled": true,
    "flushEnabled": true,
    "writesToCommitLog": true,
    "cleanupEnabled": true,
    "snapshotEnabled": true,
    "repairEnabled": false,
    "retentionOptions": {
      "retentionPeriodDuration": "400h",
      "blockSizeDuration": "4h",
      "bufferFutureDuration": "1h",
      "bufferPastDuration": "1h",
      "blockDataExpiry": true,
      "blockDataExpiryAfterNotAccessPeriodDuration": "5m"
    },
    "indexOptions": {
      "enabled": true,
      "blockSizeDuration": "4h"
    }
  }
}'

curl -X POST localhost:7201/api/v1/namespace -d '{
  "name": "lowtime",
  "options": {
    "bootstrapEnabled": true,
    "flushEnabled": true,
    "writesToCommitLog": true,
    "cleanupEnabled": true,
    "snapshotEnabled": true,
    "repairEnabled": false,
    "retentionOptions": {
      "retentionPeriodDuration": "12h",
      "blockSizeDuration": "4h",
      "bufferFutureDuration": "1h",
      "bufferPastDuration": "1h",
      "blockDataExpiry": true,
      "blockDataExpiryAfterNotAccessPeriodDuration": "5m"
    },
    "indexOptions": {
      "enabled": true,
      "blockSizeDuration": "4h"
    }
  }
}'
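
To confirm the namespaces were registered, they can be listed back from the coordinator. A minimal sketch, assuming the same /api/v1/namespace endpoint also accepts GET:

# List the namespaces currently known to the coordinator (assumed GET support)
curl localhost:7201/api/v1/namespace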

jacob-scheatzle changed the title from "Downsampling Non Functional v0.4.4" to "Downsampling Non Functional v0.4.5" on Sep 26, 2018

richardartoul (Contributor) commented:

@jacob-scheatzle Hey, sorry you ran into this issue; this is a very new feature. We've spent a lot of time over the last few days looking into this and resolved a variety of issues:

#991
#989

Do you mind rebuilding m3coordinator from master and trying again? You should also get a nice performance boost as an added benefit.
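
A minimal sketch of rebuilding the coordinator from master, assuming the repo's standard Makefile target for m3coordinator and that the resulting binary lands under ./bin:

git clone https://github.com/m3db/m3.git
cd m3
git checkout master
make m3coordinator   # assumed Makefile target; binary expected under ./bin/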

jacob-scheatzle (Author) commented:

@richardartoul I'll get a new build in tomorrow and report back with results.

jacob-scheatzle (Author) commented Oct 3, 2018

@richardartoul It looks like writing to aggregated namespaces is working as expected now. I'm able to write to both the unaggregated and aggregated namespaces and can confirm the data in them via the API.

The issue I am running into now is reading from both namespaces. Prometheus (through the coordinator) can read from the unaggregated namespace without issue until data is written to the aggregated one. At that point, no data is read by Prometheus. If I remove the aggregated namespace from the read node (the Prometheus node writing data still writes to both), I can read from the unaggregated namespace again.
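
For context, a minimal sketch of how the read node's Prometheus points at its sidecar coordinator. The handler paths are the standard m3coordinator Prometheus remote read/write endpoints; the localhost address and the read_recent setting are assumptions:

# prometheus.yml on the read node (sketch)
remote_read:
  - url: "http://localhost:7201/api/v1/prom/remote/read"
    read_recent: true

# and the corresponding write side on the write nodes:
remote_write:
  - url: "http://localhost:7201/api/v1/prom/remote/write"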

The namespace definition in the coordinator is as follows:

       - namespace: metrics
         retention: 48h
         type: unaggregated
       - namespace: metrics-15m-1y
         retention: 8760h
         type: aggregated 
         resolution: 15m
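
For context, that fragment sits under the same clusters block as the sidecar config above; a sketch of the surrounding structure, with the client settings elided:

clusters:
   - namespaces:
       - namespace: metrics
         retention: 48h
         type: unaggregated
       - namespace: metrics-15m-1y
         retention: 8760h
         type: aggregated
         resolution: 15m
     client:
       config:
         # ... etcd service discovery settings as shown in the sidecar config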

Additional Info:
I think I've narrowed the behavior I'm seeing down a little more. I had a second node begin writing its own distinct data to M3, but this time writing only to the unaggregated namespace. The Prometheus read node that is set up to read from both the aggregated and unaggregated namespaces returns results from the unaggregated namespace without issue for that second node. It seems that if data satisfying the query exists in both the unaggregated and aggregated namespaces, then nothing is returned.

Additional behavior I've observed: once the node that is writing only to the unagg namespace passes the agg resolution time, the Prometheus query returns all data satisfying the query from both namespaces. It seems that having some data that fulfills the query in only the unagg namespace somehow triggers a return of all of the proper data from both namespaces. I tested this by allowing the second node to write to both the agg and unagg namespaces, which resulted in reading only from the agg namespace after the resolution window; I then set that node back to writing to the unagg namespace only, and after waiting for the resolution time, Prometheus was back to returning data from both agg and unagg.

Scenarios:

a. There is data in the unagg immediately; there is no data in the agg until resolution time - expected and proper behavior, I believe
b. While there is no data in the agg and there is data in the unagg, Prometheus gets data
c. As soon as there is data in the agg that has similar data in the unagg, Prometheus only gets data from the agg
d. When there is data in the agg, and the unagg has both similar data and data not in the agg, Prometheus gets all data again

It's almost like once the agg namespace has data with tags similar to the unagg, the coordinator simply ignores the unagg. Under scenario c above, I ran a range query with [30m] and got results from the agg, but running the same query as an instant query I get nothing. Under scenario d (with the third node writing only to the unagg to make the data distinct), I get all results from the agg and unagg for the distinct data, as well as for the data with similar tags between the agg and unagg.
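
To make the range-versus-instant distinction concrete, a sketch of the two query forms (the metric name is hypothetical):

# Range query over the last 30 minutes - this returned data from the agg namespace
some_metric_total[30m]

# Instant query for the same series - this returned nothing under scenario c
some_metric_total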

jacob-scheatzle (Author) commented:

Everything is working as expected with release 0.4.6. Thank you everyone!
