Attack Plan
After evaluating the extensive (and almost feature-complete) work provided in `postgres_fdw`, I have a clear understanding of what it does (and doesn't) provide in terms of a platform for a well-written "real-time inserts" extension.

`postgres_fdw`'s current workflow is complete and correct. In particular:
- Sessions are wrapped in a topmost remote transaction with at least `REPEATABLE READ` semantics. This is done transparently to the end user (see the sketch after this list)
- `SAVEPOINT`s are created remotely any time a new connection-using command is issued
- (Sub-)transaction callbacks are implemented to correctly roll back remote (sub-)transactions during local failures
- Query conditions are partitioned based on whether they can be safely executed remotely. If not, they are evaluated locally on returned results
- `RETURNING` clauses are correctly implemented
- Prepared statements and cursors are used for all remote operations
- The (sub-)transaction callbacks clean up all prepared statements and cursors on error
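To make the transaction handling concrete, here is roughly the statement sequence driven on the remote side for a local transaction that uses a subtransaction. The statement text is illustrative rather than an exact trace of `postgres_fdw`'s output:

```sql
START TRANSACTION ISOLATION LEVEL REPEATABLE READ;  -- opened on first use of the connection

SAVEPOINT s1;              -- created before the next connection-using command
-- ... remote work on behalf of the local subtransaction ...
ROLLBACK TO SAVEPOINT s1;  -- driven by the subtransaction callback on local failure
RELEASE SAVEPOINT s1;

COMMIT TRANSACTION;        -- when the local transaction commits
```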
Limitations
Some areas exist for improvement:
- `ANALYZE` currently sends the entire table back to the local machine rather than using commands to advance the cursor when needing to skip sampled rows (see the sketch after this list)
- `UPDATE` and `DELETE` behave kind of peculiarly: they participate in a `ForeignScan` similarly to `SELECT`, but also grab locks (`FOR UPDATE`) on rows they will touch. `postgres_fdw` secretly adds `ctid` to the remote target list and uses it to issue subsequent `UPDATE` or `DELETE` commands.
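One way to avoid shipping rows that will only be skipped is to advance the remote cursor with `MOVE` instead of fetching. A minimal sketch (the cursor name `c1` and the table name are hypothetical):

```sql
DECLARE c1 CURSOR FOR SELECT * FROM measurements;  -- remote sampling cursor
FETCH 1 FROM c1;          -- fetch a row chosen for the sample
MOVE FORWARD 999 FROM c1; -- skip the next 999 rows without transferring them
FETCH 1 FROM c1;          -- fetch the next sampled row
```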
The rationale behind the second (apparently inefficient) limitation is related to the support for qualifications that cannot be evaluated remotely: by scanning and filtering locally for a set of `ctid`s, `postgres_fdw` supports all queries.
A comment in the code points out that if no qualifications (or join conditions) must be executed locally, then issuing an `UPDATE`/`DELETE` directly during the foreign scan phase (and doing nothing during the modify phase) would be preferable. This is a change we'll want to make.
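To illustrate the difference, here is an approximate sketch of the remote statements for `UPDATE tab SET x = 0 WHERE y > 10` under the current scan-then-modify approach versus the direct pushdown that comment suggests (table, column names, and ctid values are made up):

```sql
-- Current approach: scan with row locks, then modify by ctid
SELECT ctid, x, y FROM tab WHERE y > 10 FOR UPDATE;
UPDATE tab SET x = 0 WHERE ctid = '(0,1)';   -- one command per qualifying row
UPDATE tab SET x = 0 WHERE ctid = '(0,4)';

-- Preferred when every qualification can run remotely: push the whole statement down
UPDATE tab SET x = 0 WHERE y > 10;           -- the modify phase becomes a no-op
```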
Here's what I'm doing…
A configuration schema is needed for nodes, shards, placements, etc. Some of this can live in the `OPTIONS` of the foreign table or server, but most should live in tables. This is complete.
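As a rough illustration of the kind of metadata involved (all table and column names here are hypothetical, not the extension's actual schema; shards are assumed to be identified by hash-value ranges, as in Citus):

```sql
-- Shards of a distributed (foreign) table, each covering a range of hash values
CREATE TABLE shard (
    shard_id    bigint PRIMARY KEY,
    relation_id regclass NOT NULL,  -- the distributed table
    min_value   int NOT NULL,       -- inclusive lower bound of the hash range
    max_value   int NOT NULL        -- inclusive upper bound of the hash range
);

-- Placements: which worker node holds a replica of which shard
CREATE TABLE placement (
    shard_id  bigint NOT NULL REFERENCES shard (shard_id),
    node_name text   NOT NULL,
    node_port int    NOT NULL DEFAULT 5432,
    PRIMARY KEY (shard_id, node_name, node_port)
);
```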
The above schema needs to be compatible with Citus' own representation of nodes, placements, etc., so I've written a set of `VIEW`s that can adapt the Citus metadata to what the extension expects. This is also complete, but may need some modification during testing with Citus.
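Such adapter views would present the same shape as the hypothetical tables sketched above, so the extension's lookups work unchanged against Citus metadata. A sketch along these lines (Citus catalog and column names are from memory and may differ by version):

```sql
-- Expose Citus' shard and placement catalogs in the layout the extension expects
CREATE VIEW shard AS
SELECT shardid            AS shard_id,
       logicalrelid       AS relation_id,
       shardminvalue::int AS min_value,
       shardmaxvalue::int AS max_value
FROM   pg_dist_shard;

CREATE VIEW placement AS
SELECT shardid  AS shard_id,
       nodename AS node_name,
       nodeport AS node_port
FROM   pg_dist_shard_placement;
```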
Obviously we need a good hash function to shard our data. I've written one similar to `hash_array` that provides identical output to Citus' own hash function. `hash_array` provides a natural way to calculate multi-column partition hashes, so we'll want its structure eventually.
For each of the configuration table rows, structures and algorithms will be needed to support looking up where a row lives, where it should live, where a shard's replicas live, etc. These are underway for review this week.
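In SQL terms, using the hypothetical metadata layout sketched earlier and a placeholder `partition_hash()` standing in for the Citus-compatible hash function, the row-to-shard and shard-to-replica lookups amount to something like:

```sql
-- Where does a row with partition-column value 42 live?
SELECT p.node_name, p.node_port
FROM   shard s
JOIN   placement p USING (shard_id)
WHERE  s.relation_id = 'events'::regclass
AND    partition_hash(42) BETWEEN s.min_value AND s.max_value;
```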
The `postgres_fdw` functions need modification to check the partition column and send rows to the correct machines (including duplicates to replicas). This is pretty straightforward once the above structures and algorithms exist: instead of the current assumption that a foreign table points to exactly one remote machine, we add loops to send rows to many machines. So some function signatures will change and whatnot, but the core logic remains identical.
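For example, a single local insert would fan out to every placement of the covering shard, roughly like this (the shard table name, worker names, and values are made up; each remote shard is assumed to be its own table):

```sql
-- Local statement against the distributed foreign table
INSERT INTO events (id, payload) VALUES (42, 'hello');

-- Issued remotely against each placement of the shard that covers partition_hash(42)
INSERT INTO events_10 (id, payload) VALUES (42, 'hello');  -- on worker-1
INSERT INTO events_10 (id, payload) VALUES (42, 'hello');  -- on worker-2 (replica)
```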
Scanning is also pretty straightforward: instead of opening one cursor on a remote machine, we open one cursor per shard needed for a given query. This might be a fixed number of shards (if pruning is implemented) or all shards. We'll iterate over these cursors until all remote tables have been exhausted.
So there's a bit of bookkeeping that needs to happen (i.e. removing connections from our set of queried machines as we hit end of stream), but the logic again remains mostly the same.
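A sketch of the per-shard cursors for a query that touches two shards (cursor and shard table names are made up; the batched `FETCH` mirrors how `postgres_fdw` already consumes its single remote cursor):

```sql
-- One cursor per shard placement chosen for the query
DECLARE c1 CURSOR FOR SELECT id, payload FROM events_10 WHERE payload LIKE 'h%';
DECLARE c2 CURSOR FOR SELECT id, payload FROM events_11 WHERE payload LIKE 'h%';

-- Fetch in batches from each cursor until it returns no rows, then drop
-- that connection from the set still being scanned
FETCH 100 FROM c1;
FETCH 100 FROM c2;
```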
`UPDATE` and `DELETE` are a bit trickier. Currently `postgres_fdw` scans remote tables with a `SELECT FOR UPDATE` and requests their `ctid`s. Modifications are then subsequently made directly using those `ctid`s. This is described in the Limitations section above.
We can keep track of the `ctid`s returned by each remote node (we'll even be scanning replicas) and then issue `UPDATE` or `DELETE` commands against those `ctid`s (we'll keep track of which `ctid`s came from which node and then only use those `ctid`s with that node). This fallback (as suboptimal as it is) will still be necessary to support queries with qualifications or join conditions that cannot be evaluated remotely.
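A sketch of this fallback for an update whose qualification must run locally (table names, worker names, and ctid values are made up; the key point is that each node's `ctid`s are only ever used against that same node, since `ctid`s differ between replicas):

```sql
-- Scan phase, issued against every placement of the shard (with row locks)
SELECT ctid, id, payload FROM events_10 FOR UPDATE;       -- on worker-1
SELECT ctid, id, payload FROM events_10 FOR UPDATE;       -- on worker-2 (replica)

-- Modify phase, after filtering locally: each node is addressed only with
-- the ctids it returned
UPDATE events_10 SET payload = 'x' WHERE ctid = '(0,3)';  -- on worker-1
UPDATE events_10 SET payload = 'x' WHERE ctid = '(7,1)';  -- on worker-2 (replica)
```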
If a given query has no "local qualifications", we can just skip the scan/modify route and issue an `UPDATE` or `DELETE` directly against remote nodes during the scan phase. The modify phase would then be a no-op. We'll need to augment `deparse.c` to deparse `UPDATE` and `DELETE` nodes (I think), but other than that this will act just like a foreign scan but with an `UPDATE` or `DELETE` clause instead.
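In the sharded setting, the pushed-down statement would simply be deparsed once and sent to every placement of each shard the query can touch, e.g. (names made up):

```sql
-- Deparsed once and sent during the scan phase; the modify phase does nothing
UPDATE events_10 SET payload = 'x' WHERE id = 42;  -- on worker-1
UPDATE events_10 SET payload = 'x' WHERE id = 42;  -- on worker-2 (replica)
```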
Right now `UPDATE` and `DELETE` both support `RETURNING` clauses. We'll want to be sure to only get `RETURNING` rows from a single replica rather than returning rows from all replicas. Rather than attempting to de-duplicate result rows, we'll just determine (a priori) which replica will provide the `RETURNING` for a given query (since all replicas should theoretically have identical output).
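One way this could look on the wire, assuming the first placement is designated as the `RETURNING` source (the choice of replica and the statement shapes are illustrative):

```sql
-- Designated replica: ask for the modified rows back
UPDATE events_10 SET payload = 'x' WHERE id = 42 RETURNING id, payload;  -- on worker-1

-- Remaining replicas: apply the same change, but omit (or discard) RETURNING rows
UPDATE events_10 SET payload = 'x' WHERE id = 42;                        -- on worker-2
```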