feat(server): enable data page checksums #9384

mmomjian · 2024-05-10T23:53:34Z

Proposal

Currently, the Immich Postgres database container does not use/enable data page checksums. Data page checksums are additional pages stored by Postgres that allow the DB to validate and detect corrupted data within the database tables. They have a minimal performance hit that I believe is more than offset by the ability to detect this type of data damage.

The official Postgres container docs outline how to enable checksums. I have tested deploying this using the pgvecto.rs database image, and it works as described. Adding this to an existing database doesn't do anything, as it's required to be specified on database initialization.

Implications

Importantly, this would not be a breaking change. I believe setting this variable only has an effect during database initialization, and not after. It is possible to restore a pg_dump from a non-checksum DB into a checksum DB, as well.

Healthcheck

This would allow users to add a healthcheck to monitor for corruption in the Postgres container if desired. For example:

healthcheck:
  test: Errs=$(psql --username=postgres --dbname=immich --tuples-only --no-align --command="SELECT checksum_failures FROM pg_stat_database WHERE datname = 'immich';"); [[ $Errs == '0' ]] || { echo 'corrupted pages detected'; exit 1; }
  interval: 60m
  retries: 1
  timeout: 10s
  start_period: 5m
  start_interval: 1m

Example logs

Checksums:

postgres=# show data_checksums;
 data_checksums
----------------
 on
(1 row)

postgres=# SELECT name, setting, category FROM pg_settings WHERE name = 'data_checksums';
      name      | setting |    category
----------------+---------+----------------
 data_checksums | on      | Preset Options
(1 row)

postgres=# SELECT datname, checksum_failures, checksum_last_failure FROM pg_stat_database WHERE datname IS NOT NULL;
  datname  | checksum_failures | checksum_last_failure
-----------+-------------------+-----------------------
 postgres  |                 0 |
 immich    |                 0 |
 template1 |                 0 |
 template0 |                 0 |
(4 rows)

Current setup PG message:

Data page checksums are disabled.

New PG message:

Data page checksums are enabled.

Adding this to an existing installation does not affect it all, checksums are not enabled:

PostgreSQL Database directory appears to contain a database; Skipping initialization

mertalev · 2024-05-11T00:48:03Z

The discussion here doesn't seem very positive for this flag. It adds a lot of WAL due to the need to enable hinting and the devs debate on what it would actually catch.

mmomjian · 2024-05-11T01:02:52Z

The discussion here doesn't seem very positive for this flag. It adds a lot of WAL due to the need to enable hinting and the devs debate on what it would actually catch.

I did read over that thread. The snippets seemed cherry-picked from the actual discussion, IMO, and the main reason against it was decided against was pg_upgrade not working with changes in defaults from one major release to another.

A number of large PG support companies recommend enabling it - whitepaper, source 2 (Amazon RDS uses checksums on all deployments per this source), 3.

If wal_compression is enabled (is it by default in the Docker image?), the I/O difference is minimal: source

mertalev · 2024-05-11T02:15:30Z

If wal_compression is enabled (is it by default in the Docker image?), the I/O difference is minimal

It isn't enabled currently, so I guess it's fine if we add that too. If we're tuning these then we may as well increase shared_buffers from the default 128MiB since it lowers the overhead and seems to be widely recommended for overall performance. 512MiB should be safe?

mmomjian · 2024-05-11T03:04:07Z

If wal_compression is enabled (is it by default in the Docker image?), the I/O difference is minimal

It isn't enabled currently, so I guess it's fine if we add that too. If we're tuning these then we may as well increase shared_buffers from the default 128MiB since it lowers the overhead and seems to be widely recommended for overall performance. 512MiB should be safe?

I think with the scope of the project and large libraries in use, it's reasonable to provide some sane DB tuning parameters to help improve performance - especially as we get some issues with the large time buckets that have been coming up lately. I think the following would be reasonable:

shared_buffers = 512MB
wal_compression = on # wal_compression = zstd is more efficient, but I think this is a PG15 feature

Some other options that we could consider is checkpoint_timeout, min_wal_size, and max_wal_size if needed.

mertalev · 2024-05-11T03:27:36Z

Of the settings you listed, max_wal_size=2gb shaved some time off of the geodata import the last time I tried it, so that would be nice to set as well. Otherwise I think this looks good.

mmomjian · 2024-05-11T04:26:52Z

Of the settings you listed, max_wal_size=2gb shaved some time off of the geodata import the last time I tried it, so that would be nice to set as well. Otherwise I think this looks good.

Do we currently specify any custom config for the PG container? Per the docs, we can either pass a custom postgresql.conf (which will conflict with pgvecto.rs) or add -c command line args.

It looks like the pgvecto.rs default CMD is ["postgres", "-c" ,"shared_preload_libraries=vectors.so", "-c", "search_path=\"$user\", public, vectors", "-c", "logging_collector=on"] (link). Would we override this in the docker-compose.yml like so?

  database:
    container_name: immich_postgres
    image: registry.hub.docker.com/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
    volumes:
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    restart: always
    command: ["postgres", "-c" ,"shared_preload_libraries=vectors.so", "-c", "search_path=\"$user\", public, vectors", "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]

mertalev · 2024-05-11T12:57:09Z

Yup, that's what I did.

mmomjian · 2024-05-11T13:55:31Z

I just started an instance with the docker-compose.yml in this commit, uploaded an image, and did a test search. Logs look clean.

postgres=# show data_checksums;
 data_checksums
----------------
 on
(1 row)

postgres=# show search_path;
       search_path
--------------------------
 "$user", public, vectors
(1 row)

postgres=# show max_wal_size;
 max_wal_size
--------------
 2GB
(1 row)

postgres=# show shared_buffers;
 shared_buffers
----------------
 512MB
(1 row)

postgres=# show wal_compression;
 wal_compression
-----------------
 on
(1 row)

postgres=# show logging_collector;
 logging_collector
-------------------
 on
(1 row)

postgres=# show shared_preload_libraries;
 shared_preload_libraries
--------------------------
 vectors.so
(1 row)

postgres=#

mertalev

Nice!

mertalev · 2024-05-11T14:03:13Z

Can you apply this change to the dev and prod Compose files as well? It's better if they follow the release file as closely as possible.

mmomjian · 2024-05-11T15:04:30Z

Done. BTW the defaults for those settings were WAL 1GB, buff 128MB, and compression off.

mmomjian · 2024-05-11T15:18:50Z

For anyone reading this, if someone pulls this new docker-compose.yml they will not get the benefit of checksums in an existing install, but they will get the wal_compression and tweaking of WAL size and shared buffers. No breaking changes.

stephen304 · 2024-05-28T20:41:04Z

If anyone is here wondering if you can enable checksumming without recreating the DB and importing an export, this seemed to do the trick to do an offline migration:

stop your immich stack before doing this
back up your DB just in case!
you must use an absolute path for DB_DATA_LOCATION, mine is set to a relative path (./volumes/postgres) which is why I added $(pwd)/
i just used whatever pgvecto-rs image I already had by grabbing the hash from docker image ls | grep tensor

docker run -it --entrypoint bash -v $(pwd)/${DB_DATA_LOCATION}:/var/lib/postgresql/data 2e2cb40c55b8

$ pg_controldata | grep -E 'state|checksum'
Database cluster state:               shut down
Data page checksum version:           0

$ time pg_checksums --enable --progress
215/215 MB (100%) computed
Checksum operation completed
Files scanned:  1437
Blocks scanned: 27557
pg_checksums: syncing data directory
pg_checksums: updating control file
Checksums enabled in cluster

real    0m11.508s
user    0m0.055s
sys     0m0.446s

$ pg_controldata | grep -E 'state|checksum'
Database cluster state:               shut down
Data page checksum version:           1

Basically I followed the steps from https://www.crunchydata.com/blog/fun-with-pg_checksums but with docker

mmomjian added 2 commits May 10, 2024 19:47

Update docker-compose.yml

d7dbb60

Update docker-compose.yml

a5bab58

mmomjian added the deployment label May 10, 2024

mmomjian requested a review from mertalev May 10, 2024 23:59

mmomjian marked this pull request as ready for review May 10, 2024 23:59

mmomjian requested a review from bo0tzz as a code owner May 10, 2024 23:59

mmomjian changed the title ~~deployment: enable data page checksums~~ feat(deployment): enable data page checksums May 11, 2024

mmomjian changed the title ~~feat(deployment): enable data page checksums~~ feat(server): enable data page checksums May 11, 2024

mmomjian added 2 commits May 11, 2024 09:40

Update docker-compose.yml

f675639

Update docker-compose.yml

4df3256

mertalev approved these changes May 11, 2024

View reviewed changes

propagate to dev/prod

504494a

mertalev approved these changes May 11, 2024

View reviewed changes

mertalev merged commit 55a7e54 into immich-app:main May 11, 2024
22 checks passed

mmomjian deleted the checksums branch May 11, 2024 15:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(server): enable data page checksums #9384

feat(server): enable data page checksums #9384

mmomjian commented May 10, 2024 •

edited

Loading

mertalev commented May 11, 2024

mmomjian commented May 11, 2024 •

edited

Loading

mertalev commented May 11, 2024

mmomjian commented May 11, 2024 •

edited

Loading

mertalev commented May 11, 2024

mmomjian commented May 11, 2024 •

edited

Loading

mertalev commented May 11, 2024

mmomjian commented May 11, 2024

mertalev left a comment

mertalev commented May 11, 2024

mmomjian commented May 11, 2024

mmomjian commented May 11, 2024

stephen304 commented May 28, 2024 •

edited

Loading

feat(server): enable data page checksums #9384

feat(server): enable data page checksums #9384

Conversation

mmomjian commented May 10, 2024 • edited Loading

Proposal

Implications

Healthcheck

Example logs

mertalev commented May 11, 2024

mmomjian commented May 11, 2024 • edited Loading

mertalev commented May 11, 2024

mmomjian commented May 11, 2024 • edited Loading

mertalev commented May 11, 2024

mmomjian commented May 11, 2024 • edited Loading

mertalev commented May 11, 2024

mmomjian commented May 11, 2024

mertalev left a comment

Choose a reason for hiding this comment

mertalev commented May 11, 2024

mmomjian commented May 11, 2024

mmomjian commented May 11, 2024

stephen304 commented May 28, 2024 • edited Loading

mmomjian commented May 10, 2024 •

edited

Loading

mmomjian commented May 11, 2024 •

edited

Loading

mmomjian commented May 11, 2024 •

edited

Loading

mmomjian commented May 11, 2024 •

edited

Loading

stephen304 commented May 28, 2024 •

edited

Loading