Skip to content

Latest commit

 

History

History
434 lines (350 loc) · 15.5 KB

logbook.md

File metadata and controls

434 lines (350 loc) · 15.5 KB

Logbook

Known Errors In New CrudTransactionConnector

  • PostgreSQL could not serialize access due to concurrent update
    • Max retries, 10, exceeded

TODO: Implement a try hard, try harder, 'reverse' exponential backoff?

  • fair and polite
  • currently testing a more patient strategy
    // retry/delay strategy
    // - delay diversity, random uniform distribution
    // - persistent retries, relative large max
    const _minRetryDelay = 1; // in ms, each retry delay is min <= random <= max
    const _maxRetryDelay = 64;
    const _maxRetries = 32;
    final _rng = Random();

Errors That Aren't Understood Yet

Only consistency anomaly consistently found

  • triggered by sustained higher transaction rates, 100 tps
  • PostgreSQL -> client db replication stops
  • introducing disconnecting/connecting, killing/starting, etc actually makes the system a bit more resilient
; can break replication and convergence
; example of all clients doing a different final read of key 0
:divergent-reads {0 {nil #{"n3"},
                     [2 5] #{"n4"},
                     [3 4 7] #{"n1"},
                     [6] #{"n2"},
                     [1] #{"n5"}}}
                 ...

Let's:

  • shrink the tests to the smallest reproducible
  • trace a few transactions end-to-end
  • insure that it's not a test artifact, incorrect usage
  • maybe explore a Dart CLI app that reproduces?

Test command:

lein run test --workload convergence --rate 100 --time-limit 30 --nodes n1,n2,n3,n4,n5 --postgres-nodes n1

Smallest example of divergence:

:divergent-reads {0 {nil #{"n3"
                           "n4"
                           "n5"},
                     [3] #{"n1" "n2"}}}

Turn on PostgreSQL statement logging to see replication from CrudTransactionConnector:

  • see correct transaction isolation
    LOG:  execute <unnamed>: SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED
    LOG:  execute <unnamed>: SHOW TRANSACTION ISOLATION LEVEL
    LOG:  execute <unnamed>: SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL REPEATABLE READ
    
  • see correct handling of concurrent conflict retries
    LOG:  statement: COMMIT;
    ERROR:  could not serialize access due to concurrent update
    STATEMENT:  UPDATE lww SET v = $1 WHERE id = $2 RETURNING *
    LOG:  statement: ROLLBACK;
    

Misc Possible Issues

  • frequent, extraneous?, duplicate SyncStatus stream messages

    [2:13:41.301] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2024-07-28 02:13:40.138093, hasSynced: true, error: null>
    [2:13:41.301] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2024-07-28 02:13:40.138093, hasSynced: true, error: null>
    
    int distinctSyncStatus = 0;
    int totalSyncStatus = 0;
    db.statusStream.distinct().listen((syncStatus) {
      distinctSyncStatus++;
      log.finest('distinctSyncStatus: $distinctSyncStatus');
    });
    db.statusStream.listen((syncStatus) {
      totalSyncStatus++;
      log.finest('totalSyncStatus: $totalSyncStatus');
    });
    [2:5:36.814] [powersync_endpoint] FINEST: distinctSyncStatus: 568
    [2:5:36.814] [powersync_endpoint] FINEST: totalSyncStatus: 7235
    
  • connect/disconnect

Mostly just an FYI on observing connecting/disconnecting and the statusStream.

Don't know if connect/disconnect are expected to be idempotent? It appears connect always does a disconnect before re-connecting.

hasSynced and lastSyncedAt don't always update in unison, e.g. lastSyncedAt can get set to null even though db has been synced and hasSynced remains true.

And so far, all local writes when disconnected are replicated on reconnection, and reconnection fully catches up the local db.

connect when connected:

[12:4:13.909] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2024-08-06 12:04:04.793216, hasSynced: true, error: null>
[12:4:13.909] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2024-08-06 12:04:04.793216, hasSynced: true, error: null>
[12:4:13.918] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: null, hasSynced: true, error: null>
[12:4:13.918] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: null, hasSynced: true, error: null>
[12:4:13.972] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: false lastSyncedAt: null, hasSynced: true, error: null>
[12:4:13.972] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: null, hasSynced: true, error: null>
[12:4:13.977] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: null, hasSynced: true, error: null>
[12:4:13.990] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: false lastSyncedAt: 2024-08-06 12:04:13.990260, hasSynced: true, error: null>

connect when disconnected:

[11:58:46.478] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2024-08-06 11:54:38.176819, hasSynced: true, error: null>
[11:58:46.487] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: null, hasSynced: true, error: null>
[11:58:46.488] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: null, hasSynced: true, error: null>
[11:58:46.549] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: false lastSyncedAt: null, hasSynced: true, error: null>
[11:58:46.549] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: null, hasSynced: true, error: null>
[11:58:46.553] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: null, hasSynced: true, error: null>
[11:58:46.569] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: false lastSyncedAt: 2024-08-06 11:58:46.569435, hasSynced: true, error: null>

disconnect when connected

[12:3:7.960] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2024-08-06 12:02:52.023069, hasSynced: true, error: null>
[12:3:7.961] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2024-08-06 12:02:52.023069, hasSynced: true, error: null>

disconnect when disconnected

[12:0:51.521] [powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2024-08-06 12:00:28.991090, hasSynced: true, error: null>

History

Exercise Backend Connector With Process Pause/Kill

(c/su
   (cu/grepkill! app-ps-name)
   ...
   (cu/grepkill! :stop app-ps-name)
   ...
   (cu/grepkill! :cont app-ps-name))

Jepsen log:

:nemesis	:info	:kill	:minority
...
:nemesis	:info	:kill	{"n1" :killed, "n2" :killed}
...
:nemesis	:info	:start	:all
...
:nemesis	:info	:start	{"n1" :started, "n2" :started, "n3" :already-running, "n4" :already-running, "n5" :already-running}

Observations:

  • the usual, can end up in a divergent state
    • pause more likely than kill to diverge
  • new Exception in client when paused
    • org.apache.http.NoHttpResponseException

Exercise Backend Connector By Partitioning

(c/su
   (c/exec :iptables
           :-A :INPUT
           :-s :powersync
           :-j :DROP
           :-w)
   (c/exec :iptables
           :-A :INPUT
           :-s :pg-db
           :-j :DROP
           :-w))

Jepsen log:

:nemesis	:info	:partition-sync	:majority
...
:nemesis	:info	:partition-sync	{"n1" :partitioned, "n3" :partitioned, "n4" :partitioned}
...
:nemesis	:info	:heal-sync	nil
...
:nemesis	:info	:heal-sync	{"n1" :healed, "n2" :healed, "n3" :healed, "n4" :healed, "n5" :healed}

Observations:

  • exercises start-up under partition
  • client always reconnects when partition heals
  • more likely to end up in a divergent state

Exercise Backend Connector By Disconnecting/Connecting

await db.connect(connector: connector);
await db.disconnect();

Jepsen log:

2024-07-18 20:03:19,792 :nemesis	:info	:disconnect	:majority
2024-07-18 20:03:19,796 :nemesis	:info	:disconnect	{"n1" :disconnected, "n3" :disconnected, "n5" :disconnected}
...
2024-07-18 20:03:24,403 :nemesis	:info	:connect	nil
2024-07-18 20:03:24,412 :nemesis	:info	:connect	{"n1" :connected, "n2" :connected, "n3" :connected, "n4" :connected, "n5" :connected}

PowerSync Endpoint log:

[powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false ...>
GET     [200] /powersync/connect
[powersync_endpoint] FINEST: statusStream: SyncStatus<connected: false connecting: false ...>
[powersync_endpoint] FINEST: statusStream: SyncStatus<connected: true connecting: false ...>

Observations:

  • disconnecting/connecting is reliable
    • client seems more resilient, likely to strongly converge, than normal operations
    • only error that's reproducible is PostgreSQL could not serialize access due to concurrent update, max retries exceeded
  • doing a connect on an already connected connection seems to trigger a disconnect then a connect

TODO / Questions:

  • are connect/disconnect idempotent?
    • check if connect causes an initial disconnect?

Initial Atomic Transaction Orientated Backend Connector

Example of a non-atomic transaction replication from a PowerSync write to a PostgreSQL read:

  • PowerSync, top transaction, writes '47' to random keys
  • PostgreSQL, bottom transaction, reads some but not all of the PowerSync writes in its transaction

Note that all writes are synced, Strong Convergence, but not on transaction boundaries.

G-single-item

CrudTransactionConnector (PostgreSQL transaction orientated)

New CrudTransactionConnector:

  • uploadData tightly coupled to a PostgreSQL transaction
  • eager
    • processes all available transactions
    • wants local db to be available for sync updates
  • PostgreSQL transactions are executed with an isolation level of repeatable-read
    • so app retries on concurrent access Exceptions

Can clearly see the effects of different connectors and atomic transactions:

atomic-matrix GitHub action results

Implementation

Specify backend connector at runtime:

  • cli: --backend-connector CLASSNAME
  • .env: BACKEND_CONNECTOR=CrudTransactionConnector

Expect no Exceptions, and mean it:

// log PowerSync status changes
// monitor for upload error messages, there should be none
db.statusStream.listen((syncStatus) {
  if (syncStatus.uploadError != null) {
    log.severe(
        'Upload error detected in statusStream: ${syncStatus.uploadError}');
    exit(127);
  }

New workloads that isolate, emphasize PowerSync -> PostgreSQL and PostgreSQL -> PowerSync replication. Simple way to show atomic/non-atomic replication with different backend connectors:

  • ps-wo-pg-ro PowerSync write only, PostgreSQL read only
  • ps-ro-pg-wo PowerSync read only, PostgreSQL write only

Sample test commands:

lein run test --workload ps-wo-pg-ro --rate 20 --time-limit 60 --nodes n1,n2 --postgres-nodes n1 --backend-connector CrudBatchConnector
lein run test --workload ps-ro-pg-wo --rate 20 --time-limit 60 --nodes n1,n2 --postgres-nodes n1 --backend-connector CrudTransactionConnector

GitHub action.


PowerSync, w/sync, Single User

CrudBatchConnector

  • fetchCredentials
    • generate permissive JWT
  • uploadData
    • CrudBatch oriented
    • one CrudBatch per call, not eager
    • one CrudEntry at a time
    • reuses single Postgres Connection
    • consistency quite casual, not at all causal, will allow:
      • non-Atomic transactions
      • Intermediate Reads
      • Monotonic Read cycling
  • Exception handling:
    • recoverable
      • throw to signal PowerSync to retry
    • unrecoverable
      • indicates architectural/implementation flaws
      • not safe to proceed
      • exit to force app restart and resync

The CrudBatchConnector was design to be the minimum needed to go end-to-end.

Test results:

  • PowerSync writing, PostgreSQL reading:

    • total availability
    • non-atomic transactions
    • Strong Convergence
  • PowerSync reading, PostgreSQL writing:

    • total availability
    • Repeatable Read
    • Strong Convergence

Notes:

  • TODO: check on performance of replication with direct writes to PostgreSQL
    • tests can cause long periods of replication
    • but always strongly converges

Test command:

lein run test --workload ps-ro-pg-wo --rate 10 --time-limit 60 --nodes n1,n2 --postgres-nodes n1
lein run test --workload ps-wo-pg-ro --rate 10 --time-limit 60 --nodes n1,n2 --postgres-nodes n1

PowerSync, localOnly: true, Single User

Demonstrates

  • PowerSync's PowerSyncDatabase driver, libpowersync.so native ffi lib
  • initial NoOpConnector for local only
    • permissive auth JWT/JWS for development
    • no uploadDate() as localOnly: true
  • Docker
    • full PowerSync service
    • config, env
  • longer test runs
    • 10k's of transactions

Test results:

  • total availability
  • strict serializability
{:stats {:count 35802,
         :ok-count 35802,
         :fail-count 0,
         :info-count 0},
 :workload {:list-append {:valid? true},
            :logs-ps-client {:valid? true, :count 0, :matches ()}},
 :valid? true}

Action and workload were a stepping stone and have been retired.

GitHub Action.

Test command:

lein run test --workload powersync-local --rate 150 --time-limit 1000 --nodes n1

Initial Endpoint Test

Implementation

  • Dart CLI app
  • PowerSync's sqlite_async driver
  • native libsqlite3-dev.so implementation
  • exposes an http post API for Jepsen to submit transactions

Demonstrates

  • basic automated building/installing, starting/stopping, of Dart apps
  • endpoint
    • correct handling of Jepsen transaction maps
    • roundtrip type handling from Jepsen through endpoint to db and back
    • correct use of driver
  • basic end-to-end functionality
  • Docker, GitHub Action integration

Test results:

  • total availability
  • strict serializability
{:stats {:valid? true,
         :count 11795,
         :ok-count 11795,
         :fail-count 0,
         :info-count 0},
 :workload {:list-append {:valid? true}},
 :valid? true}

sqlite3-local latency

Action and workload were a stepping stone and have been retired.

GitHub Action.

Test command:

lein run test --workload sqlite3-local --nodes n1