
[8.x] Fix issues with dumping PostgreSQL databases that contain multiple schemata #36046

Merged (4 commits, Jan 27, 2021)

Conversation

@cbj4074 (Contributor) commented Jan 26, 2021

To preface: this post is intentionally long-winded (even though the punchline is succinct), because it articulates the complexities and nuances of this and similar situations concerning PostgreSQL's search_path.

Note also that this PR arises out of a continuing discussion regarding how the search_path is handled in various contexts throughout Laravel: #35463 (comment).


The schema:dump Artisan command is designed to dump structures only — not table data. To achieve this, a --exclude-table-data argument is appended to the pg_dump command for every table that $connection->getSchemaBuilder()->getAllTables() returns, excepting only the $migrationTable, as defined in the SchemaState class and configured via config('database.migrations').
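The flag-building step described above can be sketched in a few lines. This is a minimal Python illustration, not Laravel's actual SchemaState code; the function name `exclusion_flags` and its signature are hypothetical:

```python
# Hypothetical sketch: build one --exclude-table-data switch per table,
# skipping only the migrations table, so pg_dump keeps its data.

def exclusion_flags(all_tables, migration_table="migrations"):
    """Return pg_dump switches excluding data for every non-migration table."""
    return [
        f"--exclude-table-data={table}"
        for table in all_tables
        if table != migration_table
    ]

tables = ["migrations", "password_resets", "users", "products", "failed_jobs"]
print(" ".join(exclusion_flags(tables)))
# --exclude-table-data=password_resets --exclude-table-data=users --exclude-table-data=products --exclude-table-data=failed_jobs
```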

In the existing implementation, the underlying logic does not take into account the numerous different formats that the search_path can take (as defined in the connection's configuration properties). And while noble efforts were made to add support for Laravel installations using a non-default schema name (i.e., anything other than public), the table data dumping logic doesn't account for the possibility that the database has more than one schema, which causes tables to be dumped with data when they shouldn't be.

In a multi-schema environment, and when using a search_path that contains more than one schema, the aforementioned getAllTables() method returns a query like this, where homestead and public are schemas listed in the search_path that is set on the connection:

select tablename from pg_catalog.pg_tables where schemaname in ('homestead','public');
migrations
password_resets
users
products
failed_jobs

Of particular note is that the products table is in the homestead schema, and the other tables are in public. I mention this because it demonstrates why we cannot use only a single value from the search_path to prefix these references, as is done currently. Given this specific example, doing so would cause all but one of the tables to have the wrong schema qualifier, which would cause their data not to be excluded from the dump.

Because the schemaname column is not selected in the above query, we can't fully qualify these references when pairing them with --exclude-table-data switches. This is not a problem, however: pg_dump's --exclude-table-data switch accepts a pattern (see: https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-PATTERNS ), so it is not limited to a static table reference, and, more importantly, PostgreSQL will traverse the search_path while building reference qualifiers in this context.

What does all this mean? It means that the schema qualifier should actually be omitted entirely from the reference. So, the appended switches might look something like this:

.. --exclude-table-data=password_resets --exclude-table-data=users --exclude-table-data=products --exclude-table-data=failed_jobs

Now, PostgreSQL will do the "heavy lifting" and traverse the search_path, checking each schema therein, in the order specified, until it finds the specified table.

To demonstrate, PostgreSQL will, ultimately, build the following qualified references, which is exactly what we want:

public.migrations
public.password_resets
public.users
homestead.products
public.failed_jobs
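The resolution order above can be simulated in a few lines of Python. This is an illustration only (the schema layout mirrors the example tables; it is not PostgreSQL's actual lookup code):

```python
# Simulate PostgreSQL resolving an unqualified table name: walk the
# search_path in order and take the first schema containing the table.

def resolve(table, search_path, tables_by_schema):
    for schema in search_path:
        if table in tables_by_schema.get(schema, set()):
            return f"{schema}.{table}"
    return None  # table is not visible via this search_path

tables_by_schema = {
    "public": {"migrations", "password_resets", "users", "failed_jobs"},
    "homestead": {"products"},
}
search_path = ["homestead", "public"]

for name in ["migrations", "password_resets", "users", "products", "failed_jobs"]:
    print(resolve(name, search_path, tables_by_schema))
```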

But what happens when we have tables with the same name in two different schemas, e.g., homestead.foo and public.foo? We'd end up with two identical switches, like --exclude-table-data=foo --exclude-table-data=foo, and table data will be excluded only for the table that PostgreSQL finds first when traversing the search_path.

PostgreSQL's pattern capabilities to the rescue! The simple solution to this problem is to use a wildcard schema for each table, e.g., --exclude-table-data="*.foo". This works because the only goal here is to keep the migration table data, but discard the data for every other table, and this accomplishes that objective.
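The wildcard approach also collapses duplicate table names neatly. A small Python sketch (again, an illustration with a hypothetical helper, not the framework code):

```python
# With a "*." prefix, tables of the same name in different schemas produce
# identical switches, so duplicates collapse into a single pattern that
# matches the table in every schema.

def wildcard_flags(tables):
    flags = []
    for table in tables:
        flag = f'--exclude-table-data="*.{table}"'
        if flag not in flags:  # homestead.foo and public.foo yield one flag
            flags.append(flag)
    return flags

print(wildcard_flags(["password_resets", "users", "foo", "foo"]))
```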

In other words, we can eliminate all the logic around the search_path, and offload the complexity to PostgreSQL using a wildcard.


IMPORTANT: This PR was originally scoped to fix only the dumping behavior in a multi-schema database, and not the reloading behavior, which has never worked correctly in multi-schema databases and, in my opinion, should be addressed in a separate PR. As it turns out, the PR now fixes both issues, because it includes a previous fix that appears to have been reverted accidentally.

The PostgreSQL search_path logic that this commit removes was added with the intention of enabling support for a schema named anything other than "public". While the theory was sound, the implementation didn't take into account the behavior in databases in which *multiple* schemas exist. In multi-schema databases, the list of tables for which data should not be dumped was incorrect, leading to unexpected behavior.

This revised approach takes advantage of PostgreSQL's support for pattern-based object references when specifying the list of tables for which data should not be dumped, and eliminates the need to perform complex search_path parsing altogether.

The attendant Pull Request documentation explains how this technique works in detail.
@driesvints (Member)

@cbj4074 fyi: this PR won't be reviewed until it's out of draft.

@GrahamCampbell GrahamCampbell changed the title Fix issues with dumping PostgreSQL databases that contain multiple schemas [8.x] Fix issues with dumping PostgreSQL databases that contain multiple schemata Jan 26, 2021
@cbj4074 (Contributor, Author) commented Jan 26, 2021

Regarding the other half of the problem (restoring from the dumped file), please see: #36054

laravel#36054 was fixed in a previous commit, but appears to have been lost during subsequent edits in laravel@7be50a5 .

This commit restores the changes made in laravel@502e75b .

Fixes laravel#36054
While not required in a psql interactive terminal, this pattern requires outer double-quotes to function as intended when passed as a CLI argument.

While simple in this specific instance, pattern quoting can grow complicated (depending on the pattern), but is well explained in the PostgreSQL manual:

https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-PATTERNS
@cbj4074 (Contributor, Author) commented Jan 26, 2021

Now that the exceptions noted in #36054 are resolved, I would add only the following important points of note on this subject:


  1. In Laravel 8.x, when the user wants to specify more than one schema in the search_path, the search_path value (misnamed as schema) as set on the connection must be specified as an array, e.g., ['public', 'homestead']; it cannot be set as a string, e.g., 'public,homestead' (or any quoted variation thereof). In Laravel 9.x, however, the search_path (which is named correctly on the connection and throughout Laravel) parsing is more advanced: it will happily accept a number of different formats, including an array or a string (with or without quotes, as PostgreSQL may require), either of which may contain PostgreSQL's special $user variable.
  2. In the event that the database specified on the connection contains more than one schema with a table named migrations (or whatever else config('database.migrations') returns), the data in all such tables will be preserved when running artisan migrate:fresh. I can't conceive of any realistic use-case in which this would be problematic. If this minor caveat is of any concern, however, I believe that to fix it properly, the getAllTables() method would need to return the schemaname column — or prepend it to the existing tablename value, so as not to create another column in the result set — which would potentially have search_path-related side-effects that would need to be studied very carefully to avoid breakage. In fact, I had considered doing this in a previous PR, but it seemed to widen the scope too significantly to justify at the time. Basically, the idea would be to change the compileGetAllTables() method to do this:
select concat(schemaname, '.', tablename) as tablename from pg_catalog.pg_tables where schemaname in ('homestead','public');

It's entirely possible that this would have no unwanted side-effects while solving many such problems of ambiguity; we'd just have to look at it in detail.
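As an aside, the 9.x parsing mentioned in point 1 can be approximated in a few lines. This is a hypothetical Python sketch of normalizing the various search_path formats (array, comma-separated string, quoted variants, the special $user variable); it is an illustration, not Laravel's actual parser:

```python
# Normalize a search_path given as an array or a comma-separated string,
# stripping quotes and substituting $user with the connection's username.

def parse_search_path(value, username="homestead"):
    parts = value.split(",") if isinstance(value, str) else list(value)
    schemas = []
    for part in parts:
        schema = part.strip().strip('\'"')
        if schema == "$user":
            schema = username
        schemas.append(schema)
    return schemas

print(parse_search_path(["public", "homestead"]))  # ['public', 'homestead']
print(parse_search_path('"$user", public'))        # ['homestead', 'public']
```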


For @taylorotwell or anyone else who may be interested in playing with the various scenarios, the simplest way to test this "manually" is to log into the PostgreSQL instance via CLI, e.g., psql -U homestead -W, and do something like the following, to answer, "Which objects are 'visible' when the search_path is set to this specific value?":

vagrant@homestead:~/code/laravel$ psql -U homestead -W
Password:
psql (12.4 (Ubuntu 12.4-1.pgdg20.04+1))
Type "help" for help.

homestead=# set search_path to "public", "homestead";
SET
homestead=# \du
                                   List of roles
 Role name |                         Attributes                         | Member of
-----------+------------------------------------------------------------+-----------
 homestead | Superuser                                                  | {}
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

homestead=# \d
                   List of relations
  Schema   |        Name        |   Type   |   Owner
-----------+--------------------+----------+-----------
 homestead | products           | table    | homestead
 homestead | products_id_seq    | sequence | homestead
 public    | failed_jobs        | table    | homestead
 public    | failed_jobs_id_seq | sequence | homestead
 public    | migrations         | table    | homestead
 public    | migrations_id_seq  | sequence | homestead
 public    | password_resets    | table    | homestead
 public    | users              | table    | homestead
 public    | users_id_seq       | sequence | homestead
(9 rows)

homestead=# \dt *.products
            List of relations
  Schema   |   Name   | Type  |   Owner
-----------+----------+-------+-----------
 homestead | products | table | homestead
(1 row)

homestead=# \dt *.users
         List of relations
 Schema | Name  | Type  |   Owner
--------+-------+-------+-----------
 public | users | table | homestead
(1 row)

This will give you a sense of how the search_path affects object visibility in a multi-schema database.

The last two commands, with \dt, demonstrate why the approach taken in this PR works: the tables — neither of which is fully qualified when passed to the switch — are in different schemas, yet both have their data excluded correctly.

Lastly, I suspect that we may still have some work to do when PostgreSQL extensions, such as citext, are used in one or more schemas, but my preference would be to merge this PR, since it will yield the correct behavior in the vast majority of use-cases, and I can open a subsequent PR if needed.

P.S. @GrahamCampbell Thanks! I never knew that "schemata" is perhaps a more scholarly, though no less accepted or correct (especially in this context), variant of "schemas". :)
