cursor.copy_from table argument no longer accepting schema #1294

Closed
edublancas opened this issue Jun 17, 2021 · 17 comments

@edublancas

Hi,

In 2.8.6 and earlier, I was able to pass a schema.table argument to cursor.copy_from:

cursor.copy_from(f, table='schema.table')

But this is no longer working on 2.9 and later:

UndefinedTable: relation "schema.table" does not exist
@dvarrazzo
Member

The change was needed to avoid the chance of SQL injection. Please use cursor.copy_expert("COPY schema.table FROM STDIN").
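
A minimal sketch of that workaround, assuming an open connection and placeholder file/table names:

import psycopg2

with psycopg2.connect("dbname=example") as conn:
    with conn.cursor() as cur:
        with open("data.tsv") as f:
            # copy_expert() takes the full COPY statement, so the
            # schema-qualified name can be written out directly
            cur.copy_expert("COPY my_schema.my_table FROM STDIN", f)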

@edublancas
Author

Ah, that makes sense. I couldn't find anything in the docs so I thought it was a bug. Thanks a lot!

@dvarrazzo
Member

TBH when making that change it didn't occur to me to think about passing a schema-qualified table. However, there is no way to accept them and remove the security concern at the same time: the only way would be to make the function accept a sql.Identifier object. But if someone has to change their code that way, they may as well use copy_expert() (copy_from() and copy_to() are really just ancient and incomplete methods).
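
(For comparison, a rough sketch of what that route looks like with copy_expert() and sql composition; cur, schema, table, and f are placeholder names:)

from psycopg2 import sql

stmt = sql.SQL("COPY {} FROM STDIN").format(
    sql.Identifier(schema, table)  # rendered as "schema"."table"
)
cur.copy_expert(stmt.as_string(cur), f)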

So, I sincerely apologise for the inconvenience, but I think it's for the best :)

@edublancas
Author

No need to apologize, psycopg2 is an amazing library! Thanks for all the hard work! I just made the change and everything is back to normal.

@dstarkebaum

Hi @dvarrazzo ,
Thanks for gathering all of this discussion (and explanation) in one place!
Can you elaborate on the SQL injection concern you mentioned, and how this solution helps to prevent it? (I thought I looked through all of the linked discussions for this question, but I could not find it...)

I was able to adapt my existing function by adding this line just prior to calling "copy_from":
curs.execute(f'SET search_path TO {schema}')

I wonder why something like this could not be implemented within the psycopg2 library?
Here is an excerpt from my code:

import psycopg2

def copy_from(
        conn_info,
        file,
        schema,
        table,
        sep=',',
        null='',
        size=8192,
        columns=None,
    ):
        with psycopg2.connect(**conn_info) as conn:
            with conn.cursor() as curs:
                curs.execute(f'SET search_path TO {schema}')
                result = curs.copy_from(
                    file=file,
                    table=table,
                    sep=sep,
                    null=null,
                    size=size,
                    columns=columns,
                )
                return result

@dvarrazzo
Member

I was able to adapt my existing function by adding this line just prior to calling "copy_from":
curs.execute(f'SET search_path TO {schema}')

This affects the whole session. It might be a good solution for you, and within an application it may well be, but it's not a good solution for a driver, because it has global (session-wide) repercussions.
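
(As a hedged aside: if the search_path route is used anyway, SET LOCAL keeps the change scoped to a single transaction instead of the whole session; identifiers below are placeholders.)

with conn:  # psycopg2 wraps this block in a transaction
    with conn.cursor() as curs:
        curs.execute("SET LOCAL search_path TO my_schema")
        curs.copy_from(f, table='my_table', sep=',')
# once the transaction commits, search_path is back to its previous value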

@dstarkebaum

Thanks for the quick response!

I was able to adapt my existing function by adding this line just prior to calling "copy_from":
curs.execute(f'SET search_path TO {schema}')

This affects the whole session. It might be a good solution for you, and within an application it may well be, but it's not a good solution for a driver, because it has global (session-wide) repercussions.

That is a good point. The fact that I am opening and closing a new connection as part of the operation should help with that, but it makes it a much narrower solution...

I suppose it might work to cache the existing search path ('SHOW search_path;') and try to restore it afterwards, but that may end up having its own complications?

Is there any further discussion you could link to, which talks about the SQL injection concern that led to this change?

I saw mention that the "copy_from" method is considered "legacy" at this point? Does that mean I should prefer to use copy_expert instead?

I am using this in the context of trying to bulk upload medium-sized pandas dataframes (in the 10M's of rows), which can be held in memory (or processed in chunks).
I have had performance issues with df.to_sql(), so I have been using my own "upload_dataframe" function, which streams a dataframe to a string buffer to make use of the postgres "COPY FROM STDIN" syntax. I originally used "copy_expert", but ran into some issues and had better luck with "copy_from" instead (until I got around to upgrading to the latest version).

import io
import typing

import pandas as pd


def df_to_string_io(
    df: pd.DataFrame,
    **kwargs
) -> typing.IO:
    f = io.StringIO()  # No need to worry about closing StringIO object
    df.to_csv(f, **kwargs)
    f.seek(0)
    return f


def upload_dataframe(
        self,
        df,
        schema,
        table,
        **kwargs,
    ):
        kwargs['sep'] = kwargs.get('sep', '|')
        kwargs['index'] = kwargs.get('index', False)
        kwargs['na_rep'] = kwargs.get('na_rep', 'NULL')
        kwargs['header'] = kwargs.get('header', False)
        # kwargs['engine'] = kwargs.get('engine', 'python')
        kwargs['encoding'] = kwargs.get('encoding', 'cp1252')
        print(f'Uploading DataFrame with {df.shape[0]} rows to {schema}.{table}')
        print(list(df.columns))
        # print(f'Uploading DataFrame: {list(df.columns)}')
        # print(kwargs)

        file = df_to_string_io(df, **kwargs)
        # query = f'''COPY {schema}.{table} FROM STDIN WITH(FORMAT CSV, HEADER TRUE, DELIMITER '{kwargs["sep"]}', NULL '{kwargs["na_rep"]}');'''
        # print(query)
        # self.copy_expert(query,file)
        
        self.copy_from(file=file, schema=schema, table=table, sep=kwargs['sep'], null=kwargs['na_rep'])

@dvarrazzo
Member

I suppose it might work to cache the existing search path ('SHOW search_path;') and try to restore it afterwards, but that may end up having its own complications?

I suppose you don't consider multithreaded or otherwise concurrent programs.

That solution is good for you: we are very happy about that, but it's just not a general solution.

Is there any further discussion you could link to, which talks about the SQL injection concern that led to this change?

No public discussion, no. But if someone exposes that function to unchecked input it is open to SQL injection. It doesn't need further literature.

I saw mention that the "copy_from" method is considered "legacy" at this point? Does that mean I should prefer to use copy_expert instead?

Look at the parameters supported by copy_to/from, look at the parameters supported by COPY and draw your own conclusions. For extra LOL points use the links to the previous versions in the PostgreSQL documentation to figure out for how long those methods have been obsolete.
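
To make the comparison concrete: copy_from() only exposes sep, null, size, and columns, while COPY itself also understands options like FORMAT csv, HEADER, QUOTE, and FORCE_NULL, all reachable through copy_expert(). A small sketch with placeholder names:

with open("data.csv") as f:
    cur.copy_expert(
        "COPY my_schema.my_table FROM STDIN WITH (FORMAT csv, HEADER true, DELIMITER ',')",
        f,
    )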

I am using this in the context of trying [...]

This is a bug tracker for psycopg and this is a ticket about a psycopg issue. If you wish to discuss your code please write to the mailing list.

@tothandor

Though this issue may not be an issue, the discussion helped me a lot to understand the reasons behind this change.

@taylorsmithgg

taylorsmithgg commented Jan 29, 2022

Not to be too critical here, but this is a breaking change.
Wouldn't it have made more sense to make that more evident by bumping to 3.0?
Furthermore, since this is a known issue, a less vague error message than "relation does not exist" would be great. Maybe even a link to this issue?
I only ask because we have thousands of scripts using the copy_from function from a shared lib and found this out the hard way.
It's going to require hundreds of work hours for us to reliably replace all of the references due to lack of backwards compatibility.

Promise I'm not just here to complain, would be happy to help fix this.

P.S. Stack Overflow google-fu leads to a lot of references about putting schemas in double quotes due to casing issues, and it takes a while to track down this issue.

@dvarrazzo
Member

@taylorsmithgg I know that the semver police is after us on this. It was a mistake to make this change: the original intention was to fix the column names only and the breakage was not intentional.

I assume that, if you are using a shared library to do this operation, you can modify it instead of changing all the thousands of scripts using it and spending hundreds of hours, right? Using copy_expert() would be even better. You can even peg psycopg2 to <2.9, which should take even less time.

I see easy fixes to your problem: surely easier than making psycopg2 safe for all the possible use cases while keeping the track record of not allowing people to shoot themselves in the foot unless they tick all the boxes declaring that they want to do so on their own responsibility.

@taylorsmithgg

taylorsmithgg commented Jan 30, 2022

@dvarrazzo Unfortunately not. Our implementation was just built to orchestrate returning the Connection object with our secret store, so we didn't hassle too much with reconstructing it in every project. The utility of copy_from is all externalized from the lib in our scripts. We have pinned <2.9 for the moment.

@anant-matelabs

anant-matelabs commented Jul 14, 2022

only way would be to make the function accept a sql.Identifier object. But if someone has to change the code this way they may as well use copy_expert()

There's a difference between copy_from() and copy_expert() for my scenario:

My DB is in Azure PostgreSQL and it cannot access files on my backend service inside a Kubernetes pod.
I have a table in a schema other than 'public'.
The CSV file is in the local file system of my server.

copy_expert() gives psycopg2.errors.InsufficientPrivilege: must be superuser or a member of the pg_read_server_files role to COPY from a file

while copy_from() copies the rows into the table (when it's in the default 'public' schema).

@dvarrazzo
Member

@anant-matelabs you must use COPY ... FROM STDIN with copy_expert(). What copy_from() does internally is just to craft such a string. There is no difference in permissions.

"COPY %s%s FROM stdin WITH DELIMITER AS %s NULL AS %s";

@anant-matelabs

anant-matelabs commented Jul 15, 2022

I was missing the STDIN and was instead giving the path to the file.
Got it! It works as expected, thanks!

@aymather

aymather commented Aug 11, 2022

First just wanna say thank you for the amazing library and all your hard work <3

I found a very interesting edge case I want to place here, as I can't tell if it is a bug or not, but hopefully I can save someone else the many hours this just took me to figure out.

The problem was that I was still getting the "relation does not exist" error even after I switched to using the copy_expert method. So I set up a test case.

(My psycopg2-binary version is 2.9.3)

Here is my schema/table setup:

create schema if not exists test_schema;
create table if not exists test_schema.test_table (
  id bigserial not null primary key,
  value text
);

Then I used the copy_expert strategy to upload data to my test schema/table:

import pandas as pd
import psycopg2
from psycopg2 import sql

FILENAME = 'tmp_df.csv'
TABLE = 'test_schema.test_table'
CONNECTION_STRING = '<MY CONNECTION STRING>'

# Connect
conn = psycopg2.connect(CONNECTION_STRING)
cur = conn.cursor()

# Create dataframe of test data
df = pd.DataFrame(['joe', 'schmo'], columns=['value'])
df.to_csv(FILENAME, index=False)

# Build query string for copy
string = sql.SQL("""
    copy {} (value)
    from stdin (
        format csv,
        null "NULL",
        delimiter ',',
        header
    );
""").format(sql.Identifier(TABLE))

# Open csv and copy_expert the data into table
with open(FILENAME) as csv_file:
    cur.copy_expert(string, csv_file)

# Commit
conn.commit()

# Close
cur.close()
conn.close()

This code was still giving me the relation "test_schema.test_table" does not exist error. It turns out the error was coming from the way the string was formatted: when the format happened, it added double quotes around the whole test_schema.test_table string.

# This is what I had, and it doesn't work
string = sql.SQL("""
    copy {} (value)
    from stdin (
        format csv,
        null "NULL",
        delimiter ',',
        header
    );
""").format(sql.Identifier(TABLE))

# This is what it formats to, and it also doesn't work hardcoded like this
string = sql.SQL("""
    copy "test_schema.test_table" (value)
    from stdin (
        format csv,
        null "NULL",
        delimiter ',',
        header
    );
""")

# If you remove the double quotes, it works fine
string = sql.SQL("""
    copy test_schema.test_table (value)
    from stdin (
        format csv,
        null "NULL",
        delimiter ',',
        header
    );
""")

It would be great to get an explanation as to what's happening here, and what the recommended method of dynamically adding table names to SQL strings is. Also, if I'm a complete moron and using this package incorrectly, feel free to let me know of my stupidity.

Thank you again! <3

@aymather

Sorry, my question was answered in #1383 by this comment
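
For anyone landing here later, that fix boils down to passing the schema and table to sql.Identifier as separate arguments, so each part is quoted on its own ("test_schema"."test_table") instead of the dotted string being quoted as a single identifier:

string = sql.SQL("""
    copy {} (value)
    from stdin (
        format csv,
        null "NULL",
        delimiter ',',
        header
    );
""").format(sql.Identifier('test_schema', 'test_table'))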

cabutlermit added a commit to MITLibraries/slingshot that referenced this issue Sep 13, 2022
* Update Makefile with Dev1 commands from ECR repo
* Create Dev1 build GitHub Action
* Add a PR template
* update the .gitignore to ignore .DS_Store files

This also includes a fix for a problem introduced by a newer version of
the psycopg2-binary package. There was a change introduced after 2.8.6
that impacted how this app loaded data into tables in the PostGIS
database. For now, instead of trying to fix the code, I just restricted
the version of the psycopg2-binary to 2.8.6 or earlier.

 See
* psycopg/psycopg2#1294
and
* psycopg/psycopg2#1383
for more details.
ikanashov pushed a commit to ikanashov/data-detective that referenced this issue Oct 12, 2022
delhomer added a commit to delhomer/luigi that referenced this issue Nov 28, 2024
copy_from does not accept 'schema.table' notation since psycopg2 2.9. A bug
then arises in Luigi if the table refers to a schema name.

This PR modifies the 'copy' method behavior by using copy_expert instead of
copy_from, as suggested in psycopg/psycopg2#1294.

Fix spotify#3198