pg_recall
is a PostgreSQL extension that keeps track of changes in a separate _log
table (one _log
table for each managed data table).
For CRUD queries (SELECT
, INSERT
, UPDATE
and DELETE
) this works transparently. DDL changes have to be adapted though.
To see it in action, have a look at the examples/
directory.
The main goal of pg_recall
is to provide a quick and transparent way to keep track of changes to user-edited data (to eliminate the risk of accidential deletion or modification and to provide a safe way for them to try out different settings while being able to revert to the old state if necessary).
It allows to query the individual table rows or the entire table for arbitrary timestamps (within the logInterval
you specify for each table).
You could think of it as kind of a safety net for you and your customers (but NOT as a replacement for backups)
I'd love to see ports for other databases.
- PostgreSQL (9.2 or newer, as it uses range data types)
- the
btree_gist
extension (that one requires superuser database access!)
The code is pretty standard pl/pgsql code so it should be platform independent.
The extension can be installed by issuing make install
(you'll have to run that as root in most cases).
After that the extension has to be enabled for each database you want to use it on:
CREATE EXTENSION IF NOT EXISTS recall WITH VERSION '0.9.5';
I recommend specifying the version (if you don't, the most recent version will be installed), especially before we hit 1.0 or if you're using database migration software,
As there are some resource impacts to using pg_recall
, you have to enable it for each table you want to use it on:
SELECT recall.enable('tableName', 'log_interval', 'targetSchema');
SELECT recall.enable('tableName', 'log_interval'); -- targetSchema defaults to 'recall'
so for example
SELECT recall.enable('accounts', '6 months');
After that a trigger has been added to your accounts
table and all changes will be logged to the automatically created recall.accounts_log
table.
You can work with your data as you did before, no changes to the CRUD queries are necessary.
And if you want to change the log interval later on, simply invoke recall.enable()
again (with the new interval of course). This will also update the cached primary key columns for that table and will reset the last_cleanup
field to NULL.
recall.enable()
creates a *_tpl
and a *_log
table for each of the tables you call it for (*
being the name of the original data table).
Those two tables will be stored in targetSchema
(defaults to recall
)
The _tpl
table serves as parent table for both the _log
table and the original data table (See PostgreSQL's Inheritance Documentation for details).
It's created without constraints and won't contain any data (it only serves as the one table you have to apply schema changes to).
The _log
table looks like this:
CREATE TABLE <prefix>_log (
_log_time TSTZRANGE NOT NULL DEFAULT tstzrange(now(), null),
EXCLUDE USING gist (id WITH =, _log_time WITH &&) -- automatically includes all your pkey columns and prevents overlaps in your log data
CHECK (NOT isempty(_log_time))
) INHERITS <prefix>_tpl;
Other than the GiST index that checks for overlaps no index constraints are defined for the _log
table (no foreign keys and no unique or check constraints).
The GiST index will be used for all pkey- and/or timestamp related queries, so they'll be reasonably fast, but if you plan on querying based on non-primary key columns, you'll have to add your own indexes.
Not having any foreign keys also means it won't stop you from deleting previously referenced data (let's say you have an account
and a contract
table (and each contract references the account that created it). If you enable pg_recall
on contract but not on account (or the log interval in account is shorter than that in contract), it's possible you have references to account IDs in contract_log that point to data that's been deleted from account and are therefore not restorable).
As mentioned before, you don't have to change any queries for current data, but if you want to have a look at past records (within the logInterval
of course), you have to query the corresponding _log
table.
There currently is one convenience function, recall.at()
. It creates a temporary view resembling the data table at any given time in the past.
But if you want to do something not covered by that function, you'll have to query the _log
table yourself (:ts
being the timestamp you want to query for):
... AND _log_time @> :ts
See PostgreSQL's range operators for details on how to query based on time.
In the following examples, my_table
is the name of the original data table.
pg_recall provides a convenience function for querying log data from a certain moment in time: recall.at(tblName, timestamp)
.
It'll create a temporary view with the suffix _past
added to your data table's name that you can query exactly like the original data table.
SELECT recall.at('my_table', now() - interval '2 months');
SELECT * FROM my_table_past WHERE ...;
As mentioned before, the _past
view is temporary, so it'll only be visible from the current database session (which allows you to use recall.at()
simultaneously on the same table from different sessions).
Also, as it just creates a view, using recall.at()
should perform roughly the same as if you were querying the _log
table yourself.
recall.at()
returns the name of the temporary view.
SELECT recall.at('my_table', now() - interval '1 minute');
SELECT * FROM my_table_past WHERE id = 5;
or alternatively (also includes the _log_time
column)
SELECT * FROM my_table_log WHERE id = 5 AND _log_time @> now() - interval '1 minute';
SELECT * FROM my_table_log WHERE some_key = 'some value' ORDER BY LOWER(_log_time) DESC;
Every now and then you should run recall.cleanup('tableName')
or the more convenient
SELECT recall.cleanup_all();
It will cycle through all managed log tables and remove all outdated log entries (with UPPER(_log_time) < now() - logInterval
- logInterval is the interval you specified as second parameter of recall.enable()
).
It is up to you how you want to run this cleanup job. If you don't run it, the log tables will simply keep growing. Depending on your application a simple background task might do the trick. Alternatively you could write a cron job.
To disable logging for a table, simply call
SELECT recall.disable('tableName');
Note: This will restore the original state of that table and drop the *_log
table, so all the log data for that table will be lost!
- It adds two extra tables for each data table you enable it on. Subsequent DDL changes (i.e.
ALTER TABLE
s that add/modify or delete columns) have to be done on the_tpl
table. - The log table doesn't inherit the constraints (and foreign keys). So make sure you also enable recall on all referenced tables (with at least the same log interval) to avoid ending up with log entries pointing to nothing.
- The cleanup function has to be run manually (e.g. using a background task in your app or a cronjob)
- It creates copies of the whole record every time it changes, so you might not want to use it on tables that have a high churn rate or contain large BLOB data.
However:- The trigger function detects unchanged records (
UPDATE ... SET value = value
), so feel free to bulk-update larger quantities of records without filtering out unchanged ones beforehand. - The main issue when using pg_recall on those tables is storage. If you're ok with the storage implications, there's no reason not to use pg_recall on those tables.
- The trigger function detects unchanged records (
- It doesn't protect the log table, so it won't protect you from accidentally (or an adversary from intentionally) tampering with the log tables.
- You might wanna think twice before changing the primary key of a table (changing their value should work, but adding/removing columns from/to the primary key is untested and will most likely break things).
pg_recall
does NOT replace database backups, but that should go without saying.
It can however be a reasonable simple tool to allow you or your users to view and manage changes to pretty much arbitrary data.
Have a look at the github issues page and feel free to issue pull requests.
Note that I'm running the regression tests (make installcheck
) on a 9.5 server (other versions may trigger different notices (9.1 for example prints implicit primary key creation notices))
If you plan on porting recall to another database/framework, let me know.
pg_recall
tries to follow the generic structure of PostgreSQL Extensions.
Read the Extension manual and the Extension Build Infrastructure for further details.
expected/*
: contains the expected output of the regression testssql/*
: contains the regression testsexamples/
: example projectsrecall--0.9.5.sql
: the actual implementationrecall--0.9*--0.9*.sql
: update script(s)recall.control
: extension control fileMakefile
: PGXS make file (noteworthy targets:make install
andmake installcheck
to run the regression tests)README.md
: this fileCOPYING
: license file
This project is licensed under the terms of the PostgreSQL license (which is similar to the MIT license; see the COPYING file for details).
This is a list of other projects I found doing similar things.
Keep in mind though that for most of these I only had a quick look at how they're implemented/used, so don't count on any of the following facts to be objective or true :)
- TimeTravel for PostgreSQL (GNU GPLv3): Similar project, everything's in the
tt
database schema, seems to store the log data a little differently though (can't really say much more about them because I've just skimmed through their documentation PDF) - A PL/pgSQL Trigger Procedure For Auditing in the PostgreSQL docs
-
Temporal queries in SQL:2011
-
CouchDB's Revisions Revision support is a first class citicen of CouchDB land. Revisions are identified by sequential IDs, old data can be cleaned up by "compaction"
-
EclipseLink JPA History: Higher level implementation in EclipseLink (but using a lot of the same ideas).
-
...
Create GitHub issues/merge requests where appropriate.
For everything else contact me on Twitter or per mail (first name @ last name .net)