Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[vmstorage] panic: FATAL: cannot remove "/vm-data/indexdb/...": unlinkat /vm-data/indexdb/.../metadata.json: permission denied #749

Closed
ziouf opened this issue Sep 4, 2023 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@ziouf
Copy link

ziouf commented Sep 4, 2023

Describe the bug

At startup, VMStorage crashes with the following message (full log bellow):

panic: FATAL: cannot remove "/vm-data/indexdb/177E9C9207F871EC/1780E117A2091B9B": unlinkat /vm-data/indexdb/177E9C9207F871EC/1780E117A2091B9B/metadata.json: permission denied

To Reproduce

I dont know. It happen on pod restart after upgrade to the latest operator helm chart.
VMStorage is out-of-service even if I force to rollback to previous versions by forcing value of spec.vmstorage.image.tag field in VMCluster CRD.

Version

vmstorage-20230902-002932-tags-v1.93.3-cluster-0-gf78d8b994d

Logs

2023-09-04T21:42:56.367Z info VictoriaMetrics/lib/logger/flag.go:12 build version: vmstorage-20230902-002932-tags-v1.93.3-cluster-0-gf78d8b994d2023-09-04T21:42:56.367Z info VictoriaMetrics/lib/logger/flag.go:13 command-line flags2023-09-04T21:42:56.367Z info VictoriaMetrics/lib/logger/flag.go:20 -dedup.minScrapeInterval="15s"2023-09-04T21:42:56.367Z info VictoriaMetrics/lib/logger/flag.go:20 -httpListenAddr=":8482"2023-09-04T21:42:56.367Z info VictoriaMetrics/lib/logger/flag.go:20 -retentionPeriod="90d"2023-09-04T21:42:56.367Z info VictoriaMetrics/lib/logger/flag.go:20 -search.maxConcurrentRequests="16"2023-09-04T21:42:56.367Z info VictoriaMetrics/lib/logger/flag.go:20 -storageDataPath="/vm-data"2023-09-04T21:42:56.367Z info VictoriaMetrics/lib/logger/flag.go:20 -vminsertAddr=":8400"2023-09-04T21:42:56.367Z info VictoriaMetrics/lib/logger/flag.go:20 -vmselectAddr=":8401"2023-09-04T21:42:56.367Z info VictoriaMetrics/app/vmstorage/main.go:100 opening storage at "/vm-data" with -retentionPeriod=90d2023-09-04T21:42:56.370Z info VictoriaMetrics/lib/memory/memory.go:42 limiting caches to 2576980377 bytes, leaving 1717986919 bytes to the OS according to -memory.allowedPercent=602023-09-04T21:42:56.378Z info VictoriaMetrics/lib/storage/storage.go:950 discarding /vm-data/cache/curr_hour_metric_ids, since it contains outdated hour; got 470515; want 4705172023-09-04T21:42:56.398Z info VictoriaMetrics/lib/storage/storage.go:950 discarding /vm-data/cache/prev_hour_metric_ids, since it contains outdated hour; got 470514; want 4705162023-09-04T21:42:56.399Z panic VictoriaMetrics/lib/fs/dir_remover.go:53 FATAL: cannot remove "/vm-data/indexdb/177E9C9207F871EC/1780E117A2091B9B": unlinkat /vm-data/indexdb/177E9C9207F871EC/1780E117A2091B9B/metadata.json: permission deniedpanic: FATAL: cannot remove "/vm-data/indexdb/177E9C9207F871EC/1780E117A2091B9B": unlinkat /vm-data/indexdb/177E9C9207F871EC/1780E117A2091B9B/metadata.json: permission deniedgoroutine 1 [running]:github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logMessage({0x8e5642, 0x5}, {0xc000130370, 0xa7}, 0x2?) github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:281 +0xa91github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevelSkipframes(0x1, {0x8e5642, 0x5}, {0x8ef483?, 0x1?}, {0xc0001def80?, 0x0?, 0x10?}) github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:138 +0x194github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevel(...) github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:130github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.Panicf(...) github.com/VictoriaMetrics/VictoriaMetrics/lib/logger/logger.go:126github.com/VictoriaMetrics/VictoriaMetrics/lib/fs.tryRemoveAll({0xc0000e4080, 0x32}) github.com/VictoriaMetrics/VictoriaMetrics/lib/fs/dir_remover.go:53 +0x105github.com/VictoriaMetrics/VictoriaMetrics/lib/fs.MustRemoveAll({0xc0000e4080, 0x32}) github.com/VictoriaMetrics/VictoriaMetrics/lib/fs/dir_remover.go:19 +0x25github.com/VictoriaMetrics/VictoriaMetrics/lib/mergeset.mustOpenParts({0xc0000b2300, 0x21}) github.com/VictoriaMetrics/VictoriaMetrics/lib/mergeset/table.go:1362 +0x34agithub.com/VictoriaMetrics/VictoriaMetrics/lib/mergeset.MustOpenTable({0xc0000b2300?, 0x10?}, 0x98c010, 0x98c018, 0xc00011e678) github.com/VictoriaMetrics/VictoriaMetrics/lib/mergeset/table.go:333 +0x55github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenIndexDB({0xc0000b2300, 0x21}, 0xc00011e4e0, 0x0?) github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/index_db.go:147 +0x188github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*Storage).mustOpenIndexDBTables(0xc00011e4e0, {0xc0000c8190, 0x10}) github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/storage.go:2702 +0x985github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.MustOpenStorage({0x7ffc17468764?, 0x8e5363?}, 0x1cf7c5800, 0x0, 0x0) github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/storage.go:249 +0xb45main.main() github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage/main.go:102 +0x4c5

Screenshots

No response

Used command-line flags

-dedup.minScrapeInterval=15s
-httpListenAddr=:8482
-retentionPeriod=90d
-search.maxConcurrentRequests=16
-storageDataPath=/vm-data
-vminsertAddr=:8400
-vmselectAddr=:8401

Additional information

No response

@ziouf ziouf added the bug Something isn't working label Sep 4, 2023
@hagen1778
Copy link
Contributor

Hi @ziouf!

It happen on pod restart after upgrade to the latest operator helm chart.

Could you specify the exact version please?

panic: FATAL: cannot remove "/vm-data/indexdb/177E9C9207F871EC/1780E117A2091B9B": unlinkat /vm-data/indexdb/177E9C9207F871EC/1780E117A2091B9B/metadata.json: permission denied

What filesystem do you use? Is it NFS?

@hagen1778 hagen1778 self-assigned this Sep 5, 2023
@ziouf
Copy link
Author

ziouf commented Sep 5, 2023

What filesystem do you use? Is it NFS?

I use the default storage class from my cloud provider (Scaleway). I don't know what is under the hood.
Here is the CRD value used for field vmcluster.spec.vmstorage

  vmstorage:
    replicaCount: 3
    storageDataPath: "/vm-data"
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: scw-bssd
          resources:
            requests:
              storage: 100Gi
    extraArgs:
      dedup.minScrapeInterval: 15s
      search.maxConcurrentRequests: "16"

Could you specify the exact version please?

The helm chart version is helm.sh/chart: victoria-metrics-operator-0.26.1
The operator image deployed is victoriametrics/operator:v0.37.1
The VMStorage image is victoriametrics/vmstorage:v1.93.3-cluster

@zekker6
Copy link
Contributor

zekker6 commented Sep 5, 2023

@hagen1778 This is most likely related to https://github.com/VictoriaMetrics/operator/releases/tag/v0.36.0

vmoperator parameters: Add option VM_ENABLESTRICTSECURITY and enable strict security context by default. See #637, #692 and #712 PR for details.

Maybe it is better to transfer this to operator issues?

@ziouf Could you try to add the following env variable to operator and see if that helps? (full list of supported variables can be found here)

VM_ENABLESTRICTSECURITY: false

@hagen1778
Copy link
Contributor

According to changelog, the issue is caused by the following change:

vmoperator parameters: Add option VM_ENABLESTRICTSECURITY and enable strict security context by default. See #637, #692 and #712 PR for details.

Is this correct @zekker6 ?

@zekker6
Copy link
Contributor

zekker6 commented Sep 5, 2023

Yes, it seems to me like it is.

@hagen1778
Copy link
Contributor

@Amper @Haleygo I think this change should be a part of BreakingChanges section here https://github.com/VictoriaMetrics/operator/releases/tag/v0.36.0
wdyt?

@ziouf
Copy link
Author

ziouf commented Sep 5, 2023

@zekker6 I added the given variable and it seems to solve the issue
VMStorage is starting well now.
Thank you for your quick and effective support !

@Haleygo Haleygo transferred this issue from VictoriaMetrics/VictoriaMetrics Sep 5, 2023
@Haleygo Haleygo assigned Haleygo and unassigned hagen1778 Sep 5, 2023
@Haleygo
Copy link
Contributor

Haleygo commented Sep 6, 2023

VM operator has EnableStrictSecurity=true since v0.36.0, it will add below default securityContext to all the pods[vminsert/vmselect/vmstorage/alertmanger...]

  securityContext:
    // '65534' refers to 'nobody' in all the used default images like alpine, busybox.
    fsGroup: 65534
    fsGroupChangePolicy: OnRootMismatch
    runAsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534
    seccompProfile:
      type: RuntimeDefault

If component existed before operator upgrade and had no customized securityContext, volumes will be already mounted with root:root. When operator got upgraded, statefulset will gain default securityContext, directory fsGroup will change correctly but not fsUser, then user 65534 won't be able to operate those directory[root:65534] and fail the service.

Components which could encounter this problem:

  1. vmsingle/vmselect/vmstorage
  2. other components who has extra volumeClaim

Temporary fix:

  1. disable EnableStrictSecurity by adding the following env variable to operator
    VM_ENABLESTRICTSECURITY: false
  2. set securityContext: {} to override the default securityContext if no securityContext needed
    vmstorage:
      securityContext: {} 
  1. attach extra initContainer for failed component once and remove it after permissions is right
    For example, add following to vmcluster for vmstorage:
    vmstorage:
      initContainers:
      - command: ["chown", "-R", "65534:65534", "/vm-data"] 
         image: busybox:latest
         name: busybox
         securityContext:
           runAsNonRoot: false
           runAsUser: 0
         volumeMounts:
         - mountPath: /vm-data
            name: vmstorage-db

And Fix#1 will be included in next release of operator

@Haleygo
Copy link
Contributor

Haleygo commented Sep 11, 2023

v0.38.0 is released with the fix, close this as completed.
Feel free to reopen if there is still an issue)

@Haleygo Haleygo closed this as completed Sep 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants