Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add skip_conflicts option to load function #143

Open
1 task
volcan01010 opened this issue Nov 7, 2022 · 6 comments
Open
1 task

Add skip_conflicts option to load function #143

volcan01010 opened this issue Nov 7, 2022 · 6 comments

Comments

@volcan01010
Copy link
Collaborator

volcan01010 commented Nov 7, 2022

Note: this might be a bit complicated - maybe users should just use executemany and copy_rows and write their own INSERT statements to ensure that they get exactly what they want.

Summary

As an ETL Helper user I want to use skip_conficts so that I can handle primary key violations at the database level without requiring ETL Helper's error handling.

Description

A common error in a big insert is where records already exist for a given primary key. Some databases have internal methods for handling these scenarios. For example SQLite has the ON CONFLICT or INSERT syntax:
https://www.sqlite.org/lang_conflict.html. PostgreSQL has ON CONFLICT DO NOTHING: https://www.postgresql.org/docs/current/sql-insert.html#:~:text=ON%20CONFLICT%20DO%20NOTHING%20simply,can%20perform%20unique%20index%20inference.

There should be implementations for each database. The specific way in which they are applied will need to be configured in the DbHelper classes. The SQL statement generated for the load function may need changes around the INSERT keyword, or as a suffix.

Acceptance critera

  • Calling load with skip_conflicts=True means that primary key violations pass silently and rows are unchanged
@volcan01010
Copy link
Collaborator Author

Having looked further into this, it won't be easy to implement. There are two main hurdles:

  • Writing a SQL command that includes solving of conflicts requires ETLHelper to know what the primary key is for a table. It turns out that this isn't straightforward to find out. We discovered this when we wrote the table_info command. For example, a primary key may be a composite of multiple columns.
  • The syntax for different databases is different. This is less of a problem and is what ETL Helper was designed for. But there would still be a bit of work involved.

Here is some code that ChatGPT wrote to demonstrate what the command should be for PostgreSQL. You can see that it explicitly uses the id command.

Screenshot 2023-05-05 at 09-09-02 ChatGPT
Screenshot 2023-05-05 at 09-09-14 ChatGPT

@volcan01010
Copy link
Collaborator Author

ChatGPT was also able to generate equivalent commands for Oracle, SQLite and MS SQL Server. There are two things to note:

  • Oracle and SQL Server use MERGE, where SQLite and PostgreSQL use INSERT ... ON CONFLICT
  • MERGE commands also require knowlege of the primary key for the ON clause

Screenshot 2023-05-05 at 13-49-41 ChatGPT
Screenshot 2023-05-05 at 13-49-56 ChatGPT
Screenshot 2023-05-05 at 13-50-10 ChatGPT

@volcan01010
Copy link
Collaborator Author

Here is ChatGPT's suggestion for:

Please write a SQL command that lists the primary key columns for a table in a PostgreSQL database.

SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = (
    SELECT constraint_name
    FROM information_schema.table_constraints
    WHERE table_name = 'your_table_name' AND constraint_type = 'PRIMARY KEY'
);

And the translation to different database systems via:

Can you give equivalent commands for Oracle, SQLite and MS SQL Server databases?

Oracle:

SELECT column_name
FROM all_cons_columns
WHERE constraint_name = (
    SELECT constraint_name
    FROM all_constraints
    WHERE table_name = 'your_table_name' AND constraint_type = 'P'
);

SQLite:

PRAGMA table_info(your_table_name)
WHERE pk > 0;

MS SQL Server:

SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = (
    SELECT constraint_name
    FROM information_schema.table_constraints
    WHERE table_name = 'your_table_name' AND constraint_type = 'PRIMARY KEY'
);

Note that I haven't checked them, and that the MS SQL Server value looks suspiciously similar to the PostgreSQL one.

@volcan01010
Copy link
Collaborator Author

I think that we could probably do this at some point.

It would be better to implement this as a new function, merge, rather than updating the load function as that would better reflect the fact that the underlying mechanism is different.

This would use a function behind the scenes called generate_merge_sql instead of generate_insert_sql that is used by load.

@kerberpolis
Copy link
Contributor

I've used COPY to transfer rows in a db migration; this can be 10,000s or millions. My process is:

  1. I export items from a source databse
  2. Augment the data
  3. In batches of 5000 copy the augmented data to the target table.

In one table there is a composite unqiue key (column_X, column_Y) where if there is a conflict in 1 of the 5000, because it is a single tranaction and the cursor is not commited, none of rows are copied. On a conflict I would like to ignore ON CONFLICT (column_X, column_Y) DO NOTHING as they have already been migrated (so this is relevant for skip_conflict=True only.)

You can also skip conflicts using ON CONFLICT ON CONSTRAINT [constraint_name] DO NOTHING.

Unfortunately you cannot ON CONFLICT (id, column_X, column_Y) ... as the PK and UQ constraints are independant. Neither can you chain them ON CONFLICT (id) ... ON CONFLICT (column_X, column_Y) ...

With that in mind, if it fits your use cases too it would be great if a copy function could specify a constraint_name instead of only being a PK. For case witth more than one constraint, it would require the developer to either pre-filter their data beforehand. Or perform some kind of upsert but at that point it might be easier to supply their own SQL statement...

@volcan01010 volcan01010 changed the title Add skip_conflicts and replace_conflicts options to load and copy_table_rows Create merge function Jul 20, 2023
@volcan01010 volcan01010 changed the title Create merge function Add skip_conflicts option to load function Jul 20, 2023
@volcan01010
Copy link
Collaborator Author

The first task, whatever we do is to determine the primary keys of a table.

The specific queries are given above.

The primary key columns can be added as an extra attribute to the table_info response.

def table_info(table, conn, schema=None):

The current table_info response looks like:

def test_table_info_no_schema_no_duplicates(pgtestdb_conn, pgtestdb_test_tables):

We would need to add a primary_key_query to each DbHelper that performed a similar job to the table_info_query.
https://github.com/BritishGeologicalSurvey/etlhelper/blob/456f0db009524354d0377621067ba56e8ac441da/etlhelper/db_helpers/postgres.py#L13C4-L13C21

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants