save transaction: database is locked on podman exec #20809

Closed
jpalus opened this issue Nov 28, 2023 · 16 comments · Fixed by #20838
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. sqlite Bugs pertaining to the sqlite database backend

Comments

@jpalus

jpalus commented Nov 28, 2023

Issue Description

Upgraded to podman 4.8.0, did a system reset, and took the sqlite db backend for a spin. Started creating 3 independent containers in parallel, followed by some setup steps done with podman exec, and one such invocation ended up with:

ERRO[0008] Container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d exec session cda744d36c8c72ce7f701cad851001f22e32432855dae12f1cda745bf4372aea error: saving container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d state: beginning container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d save transaction: database is locked

Steps to reproduce the issue

No reproducible steps, I'm afraid. I assume it's a timing issue.

Describe the results you received

podman exec failed once with:

ERRO[0008] Container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d exec session cda744d36c8c72ce7f701cad851001f22e32432855dae12f1cda745bf4372aea error: saving container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d state: beginning container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d save transaction: database is locked

Describe the results you expected

No error should be present.

podman info output

host:
  arch: arm64
  buildahVersion: 1.33.2
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: Unknown
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: 9aa8067313b1282097df84ab1a218149999397e4'
  cpuUtilization:
    idlePercent: 98.65
    systemPercent: 1.05
    userPercent: 0.29
  cpus: 8
  databaseBackend: sqlite
  distribution:
    distribution: pld
    version: "3.0"
  eventLogger: journald
  freeLocks: 2045
  hostname: rock
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.10.110-8-rockchip
  linkmode: dynamic
  logDriver: journald
  memFree: 9260699648
  memTotal: 16487534592
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: Unknown
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.9.0
    package: Unknown
    path: /usr/libexec/podman/netavark
    version: netavark 1.9.0
  ociRuntime:
    name: crun
    package: Unknown
    path: /usr/bin/crun
    version: |-
      crun version 1.12
      commit: ce429cb2e277d001c2179df1ac66a470f00802ae
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: false
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: ""
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: Unknown
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.4
  swapFree: 8096051200
  swapTotal: 8242851840
  uptime: 1284h 47m 38.00s (Approximately 53.50 days)
  variant: v8
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /home/users/jan/.config/containers/storage.conf
  containerStore:
    number: 3
    paused: 0
    running: 3
    stopped: 0
  graphDriverName: btrfs
  graphOptions: {}
  graphRoot: /mnt/build-storage/jan/containers
  graphRootAllocated: 274877906944
  graphRootUsed: 12658704384
  graphStatus:
    Build Version: Btrfs v6.6.2
    Library Version: "102"
  imageCopyTmpDir: /home/users/jan/tmp
  imageStore:
    number: 3
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /mnt/build-storage/jan/containers/volumes
version:
  APIVersion: 4.8.0
  Built: 1701166825
  BuiltTime: Tue Nov 28 11:20:25 2023
  GitCommit: bf038720ca8b9301a3aa80966759567c41199dd0
  GoVersion: go1.21.4
  Os: linux
  OsArch: linux/arm64
  Version: 4.8.0

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

No response

@jpalus jpalus added the kind/bug Categorizes issue or PR as related to a bug. label Nov 28, 2023
@Luap99
Member

Luap99 commented Nov 28, 2023

@mheon @edsantiago Did you see this error before?

@jpalus Any chance you can share what commands you were running? I agree that it looks like a race when running podman in parallel.

@Luap99 Luap99 added the sqlite Bugs pertaining to the sqlite database backend label Nov 28, 2023
@jpalus
Author

jpalus commented Nov 28, 2023

@Luap99 I suppose I could, though I doubt they would be of much use. These are pretty specific containers. From a high-level perspective:

  1. podman create of systemd enabled container running full init
  2. Followed by podman start
  3. Followed by multiple simple podman exec doing some package installations and simple setup tuning.

The command that failed with the mentioned error was a package installation -- it failed after the installation was done, as podman was returning back to the shell.

@edsantiago
Member

Ugh.... no, I haven't seen "database is locked" since May. But we do zero parallel testing.

@mheon
Member

mheon commented Nov 28, 2023

@Luap99 Yes - this is why we disabled WAL mode, which was enough to calm CI. Evidently not sufficient.

I think we may need some form of retry wrapper around SQLite operations, because that error should not be fatal - we ought to be able to spin until the DB is unlocked and continue without issue.
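
For illustration, a minimal sketch of what such a retry wrapper could look like, assuming we simply match on the "database is locked" error text; the names, limits, and matching strategy are illustrative, not the actual libpod code:

package main

import (
	"fmt"
	"strings"
	"time"
)

// withRetry re-runs op while it keeps failing with SQLite's
// "database is locked" error, giving up after a cap so a genuine
// deadlock still surfaces instead of spinning forever.
func withRetry(op func() error) error {
	const (
		maxWait  = 10 * time.Second
		interval = 50 * time.Millisecond
	)
	var err error
	for waited := time.Duration(0); waited < maxWait; waited += interval {
		if err = op(); err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "database is locked") {
			return err // some other failure: do not retry
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("database still locked after %s: %w", maxWait, err)
}

func main() {
	// Example usage: wrap a single state-saving DB operation in the retry.
	err := withRetry(func() error {
		return nil // the real operation would begin/commit a transaction here
	})
	if err != nil {
		fmt.Println("save failed:", err)
	}
}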

@mheon
Member

mheon commented Nov 28, 2023

(As a bonus, we could investigate reenabling WAL mode if we had such a wrapper)
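
For reference, a hedged sketch of what reenabling WAL mode could look like, assuming the mattn/go-sqlite3 driver; the _journal_mode DSN parameter and the PRAGMA below are documented driver/SQLite features, but the path and whether podman would set it this way are assumptions:

package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// Illustrative path. _journal_mode=WAL asks SQLite for write-ahead
	// logging, which lets readers proceed concurrently with a single writer.
	db, err := sql.Open("sqlite3", "file:/tmp/example.db?_journal_mode=WAL")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Equivalent per-connection form using a PRAGMA:
	if _, err := db.Exec("PRAGMA journal_mode=WAL;"); err != nil {
		log.Fatal(err)
	}
}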

@jpalus
Author

jpalus commented Nov 28, 2023

And a new occurrence when I wanted to remove those containers:

$ podman rm -f -a
68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d
ca58f9876ea2081d5d9881fefae04926a36444e1b85810b9b0a97229fba1824a
Error: database is locked

@Luap99
Member

Luap99 commented Nov 28, 2023

Doesn't retrying just hide the real issue? At least from reading https://www2.sqlite.org/cvstrac/wiki?p=DatabaseIsLocked, this seems like a logic bug in our code.

Looking at all the transaction code paths, I found at least one that does not call Commit() on the transaction:

func (s *SQLiteState) RemovePodContainers(pod *Pod) (defErr error) {

But this is a rare pod rm code path, so I find it unlikely that it is hit here.
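
For context, a minimal sketch of the begin / deferred-rollback / explicit-Commit pattern the other state methods follow; the function, table, and column names below are illustrative, not the actual libpod code:

package main

import (
	"database/sql"
	"fmt"
)

// saveState shows why a missing Commit() matters: without it the write
// transaction stays open and keeps the database locked for other processes.
func saveState(db *sql.DB, id string, json []byte) (defErr error) {
	tx, err := db.Begin()
	if err != nil {
		return fmt.Errorf("beginning save transaction: %w", err)
	}
	defer func() {
		// Roll back only if we are returning an error.
		if defErr != nil {
			if err := tx.Rollback(); err != nil {
				defErr = fmt.Errorf("rolling back transaction: %w (original error: %v)", err, defErr)
			}
		}
	}()

	// Hypothetical table/column names, for illustration only.
	if _, err := tx.Exec("UPDATE ContainerState SET JSON=? WHERE ID=?;", json, id); err != nil {
		return fmt.Errorf("updating container state: %w", err)
	}

	return tx.Commit()
}

func main() {}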

@Luap99
Member

Luap99 commented Nov 29, 2023

I had 10+ podman run processes running in parallel for hours but never got any error; I also tried with exec. I think there might be more to the story here.

One difference is that I am on x86_64 and you are on aarch64. Also, looking at performance during my stress testing, podman was IO bound, so I wonder if it has anything to do with the underlying fs/disk speed.
Another difference is kernel 5.10 vs 6.5 on my system.

@jpalus
Author

jpalus commented Nov 29, 2023

Checked on arm64/kernel 6.6.3 as well as x86/kernel 6.6.0 with the same outcome. I had more luck reproducing the latter error during podman rm -f -a. For me the following works pretty consistently (although it is not a 100% reproducer):

  1. podman system reset
  2. Create 100 containers (possibly more):
     for i in `seq 1 100`; do podman create docker.io/fedora; done
  3. Remove all containers, but slow it down with strace to make timing issues more likely:
     strace -o log -s1024 -f podman rm -a

@Luap99
Member

Luap99 commented Nov 29, 2023

Thanks, running with strace does indeed let me reproduce.

@Luap99 Luap99 self-assigned this Nov 29, 2023
Luap99 added a commit to Luap99/libpod that referenced this issue Nov 29, 2023
Only one process can write to the sqlite db at the same time; if another
process tries to use it at that time, it fails and a "database is locked"
error is returned. If this happens, sqlite should keep retrying until it
can write. To do that we can just set the _busy_timeout option. A 100s
timeout should be enough even on slower systems, but not too much in case
there is a deadlock, so it still returns in a reasonable time.

[NO NEW TESTS NEEDED] I think we strongly need to consider some form of
parallel stress testing to catch bugs like this.

Fixes containers#20809

Signed-off-by: Paul Holzinger <[email protected]>
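
For illustration, a minimal sketch of setting the busy timeout through the DSN, assuming the mattn/go-sqlite3 driver; the path is illustrative and the exact DSN used by the PR may differ:

package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// _busy_timeout is given in milliseconds; 100000 ms matches the 100s
	// timeout described in the commit message above.
	db, err := sql.Open("sqlite3", "file:/tmp/example.db?_busy_timeout=100000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// With the busy timeout set, a writer that finds the database locked
	// blocks and retries internally (up to the timeout) instead of failing
	// immediately with "database is locked".
}
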
@Luap99
Member

Luap99 commented Nov 29, 2023

@jpalus Any chance you can rebuild your podman with #20838 and test again?

Luap99 added a commit to Luap99/libpod that referenced this issue Nov 30, 2023
Only one process can write to the sqlite db at the same time; if another
process tries to use it at that time, it fails and a "database is locked"
error is returned. If this happens, sqlite should keep retrying until it
can write. To do that we can just set the _busy_timeout option. A 100s
timeout should be enough even on slower systems, but not too much in case
there is a deadlock, so it still returns in a reasonable time.

[NO NEW TESTS NEEDED] I think we strongly need to consider some form of
parallel stress testing to catch bugs like this.

Fixes containers#20809

Signed-off-by: Paul Holzinger <[email protected]>
@jpalus
Author

jpalus commented Nov 30, 2023

A little late to the party, but so far so good: I didn't notice any "database is locked" errors. However, #20838 makes container creation (all the setup steps combined) take about 15% more time than BoltDB or SQLite without the fix.

@jpalus
Author

jpalus commented Nov 30, 2023

One strange issue occurred just now, with stock podman 4.8.0 (without the fix) and the sqlite db backend. Not sure if it's a hidden "database is locked" issue or wrong ordering somewhere. podman system reset with 3 containers running failed with:

ERRO[0014] image used by 929b991f97c04ce7d5a203abb206626c2285e63fe45e9b690a44fce236006417: image is in use by a container: consider listing external containers and force-removing image 

@mheon
Member

mheon commented Nov 30, 2023

That's probably separate; can you open a fresh bug for that? Probably missing a --force in the system reset logic somewhere.

@jpalus
Author

jpalus commented Dec 19, 2023

Just got another instance of it with podman 4.8.1, which is supposed to fix the issue:

$ podman rm -f -a
3b804f42923c8b5674a20fc0a820d46d6a0afcbcd80883d3b7953c589a1d4c2e
8a19b804c4f766404697acf8a32e1065063802a13c04770564ba3faa1dd895c8
Error: cleaning up container 1e6a8fb3e45f5fc7139583632bff90cc048601cac11e28bf5e11816b77125e2b: unmounting container 1e6a8fb3e45f5fc7139583632bff90cc048601cac11e28bf5e11816b77125e2b storage: unmounting container 1e6a8fb3e45f5fc7139583632bff90cc048601cac11e28bf5e11816b77125e2b: saving container 1e6a8fb3e45f5fc7139583632bff90cc048601cac11e28bf5e11816b77125e2b state: beginning container 1e6a8fb3e45f5fc7139583632bff90cc048601cac11e28bf5e11816b77125e2b save transaction: database is locked

These 3 containers are slow to remove (lots of files, spinning storage), so I'm not really sure those transactions are really "short-lived".

@mheon
Member

mheon commented Dec 19, 2023 via email

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Mar 19, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 19, 2024