Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_sql_table bug on columns with dots #10279

Closed
Gerenuk opened this issue Jun 4, 2015 · 9 comments
Closed

read_sql_table bug on columns with dots #10279

Gerenuk opened this issue Jun 4, 2015 · 9 comments
Labels
IO SQL to_sql, read_sql, read_sql_query

Comments

@Gerenuk
Copy link

Gerenuk commented Jun 4, 2015

read_sql_table handles columns with dots in the name incorrectly (removes name before dot)

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('sqlite://')
conn = engine.connect()
conn.execute("create table test (a int, `b.c` int)")
conn.execute("insert into test values (1,2)")
df = pd.read_sql_table("test", engine)
print(df)

->

    a  c
0  1  2
@jorisvandenbossche jorisvandenbossche added the IO SQL to_sql, read_sql, read_sql_query label Jun 5, 2015
@jorisvandenbossche
Copy link
Member

This is the result of sqlalchemy returning this as the key of the result set:

In [65]: res = conn.execute("select * from test")

In [66]: res.keys()
Out[66]: [u'a', u'c']

But, in the Table object it has the correct column name:

In [67]: meta = sqlalchemy.MetaData(bind=engine)

In [68]: meta.reflect()

In [69]: table = meta.tables['test']

In [70]: table.columns.keys()
Out[70]: ['a', 'b.c']

@zzzeek Is there a reason for this difference?

@zzzeek
Copy link

zzzeek commented Jun 5, 2015

in the former case you're looking at sqlite3's cursor.description, and I believe we have some workarounds in place for when sqlite3 does the wrong thing with UNION statements that is specifically truncating off dotted names. periods inside of table and column names are a really bad idea. see item #1 here: https://periscope.io/blog/better-sql-schema.html

@zzzeek
Copy link

zzzeek commented Jun 5, 2015

call the "sqlite_raw_colnames" execution option to turn this off:

res = conn.execution_options(sqlite_raw_colnames=True).execute("select * from test")

or

eng = create_engine("sqlite://", execution_options={"sqlite_raw_colnames": True})

@zzzeek
Copy link

zzzeek commented Jun 5, 2015

here's that bug in case you're curious

import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

cursor.execute("create table x (a integer, b integer)")
cursor.execute("insert into x (a, b) values (1, 1)")
cursor.execute("insert into x (a, b) values (2, 2)")

cursor.execute("select x.a, x.b from x")
assert [c[0] for c in cursor.description] == ['a', 'b']

cursor.execute("""
    select x.a, x.b from x where a=1
    union
    select x.a, x.b from x where a=2
""")
assert [c[0] for c in cursor.description] == ['a', 'b'], \
    [c[0] for c in cursor.description]

output:

  File "test.py", line 18, in <module>
    assert [c[0] for c in cursor.description] == ['a', 'b'], [c[0] for c in cursor.description]
AssertionError: ['x.a', 'x.b']

see the problem? add a UNION, the driver totally blows the cursor.description. those break SQLAlchemy queries, so we filter out the dots.

Feel free to report upstream to the sqlite3/pysqlite driver.

@Gerenuk
Copy link
Author

Gerenuk commented Jun 6, 2015

This seems like a questionable hack. The select union doesn't blow up, but just gives too much information which the user still can use. Whereas cutting out random parts of names loses information and should strictly speaking be documented. Moreover, this isn't even common to write the union with a superfluous x..
But worst of all, it introduces dead/buggy code once the issue has been fixed in sqlite.
I hope there aren't more surprises.

@zzzeek
Copy link

zzzeek commented Jun 6, 2015

This seems like a questionable hack. The select union doesn't blow up, but just gives too much information which the user still can use.

Your defense of the sqlite driver in asserting that this is not a bug at all, just that the driver is in this case deciding to "give too much information that the user can still use", suggests that SQLAlchemy is merely recognizing that a column name of the form "x.y" is just the SQLite driver giving us "too much information", which for SQLAlchemy's purposes we necessarily truncate, because it confuses other parts of the system. If this was in fact only "extra" information that's not needed, then it should be safe to truncate. This is of course totally wrong, because this is not "extra" information; it's garbage which without context we cannot distinguish between a table-qualified column name and a standalone column name with a dot in it.

Whereas cutting out random parts of names loses information and should strictly speaking be documented.

This isn't a "random" part of the name, it is explicitly the part preceding the dot which in the vast majority of cases is the table or alias name. The lack of documentation was a bug and is resolved as of yesterday:

http://docs.sqlalchemy.org/en/rel_1_0/dialects/sqlite.html#dotted-column-names

Moreover, this isn't even common to write the union with a superfluous x..

It is absolutely common if your query happens to contain more than one table/alias/subquery, and further is fully necessary if there is any kind of column name overlap between those relations. Vastly more common than the "questionable" practice of putting dots in column names.

The SQLAlchemy Core and ORM querying features, which account for the overwhelming vast majority of queries performed under normal use of SQLAlchemy, render all SELECT statements with the leading table name. Historically, the Core also has relied upon a predictable set of column names coming back from cursor.description in order to match result columns to the Core table constructs which they correspond towards.

Calling this "too much information" is not accurate at all because it is entirely ambiguous and without context (short of parsing the SQL statement to see if it has the word "UNION" in it - now that would be a hack). We have no idea if the "x." prefix is part of the column name or is part of a table/alias name. That's not information, it's noise. Very recent SQLAlchemy releases no longer rely as heavily on the names in cursor.description and we could likely just switch this behavior off at this point, however it is very likely that real-world applications are themselves relying upon consistency in these names; Pandas itself is one such example. If SQLAlchemy did not do what it does here, then the user whose Pandas application ends up calling upon a UNION against multiple tables/aliases would report the same issue from the other end. It's a no-win situation as long as sqlite3 has this issue.

Certainly the sqlite3 driver is most responsible here, as it has a lot of silly SQL-parsing issues like these which often go on for years without fixes (see http://bugs.python.org/issue21718 for another example).

Pandas itself is also not without blame, as it naively constructs SQL schemas using the raw naming schemes from an entirely different domain instead of applying reasonable escaping patterns to avoid poor SQL practices such as putting dots / spaces / other things in column names.

But worst of all, it introduces dead/buggy code once the issue has been fixed in sqlite.

To my knowledge, this is not even acknowledged as a bug in sqlite. I hardly bother reporting them myself because in ten years I've hardly seen the pysqlite driver ever fix any of the bugs we're reported. Bugs like these in sqlite and the driver linger on for years. I can assure you the moment a fixed is released, SQLAlchemy will immediately release an accommodation for this - on my end there's a lot of ugly crap that had to happen to make these use cases work.

It is by nature that database abstraction libraries have dependencies on the specific behaviors of the drivers, so accommodating continued change in those libraries is necessary in any case; there are hundreds of critical behaviors in SQLAlchemy dialects that would classify as "dead/buggy" if the upstream driver made a change in behavior; something as simple as a new version of the psycopg2 driver suddenly deciding to parse JSON result sets into json rather than returning them as strings, for example. SQLAlchemy clearly had to parse the JSON beforehand and the psycopg2 driver clearly benefits by building in this parsing themselves and releasing it. That's just how it goes in this domain.

Without this name-truncation at some level, common SQLAlchemy queries like query(X).union(query(Y)) would historically simply fail. We felt it was appropriate to keep the public-facing keys() consistent with this. Because using dots in column names is vastly less common than emitting a SELECT/UNION with table-qualified column names.

@Gerenuk
Copy link
Author

Gerenuk commented Jun 6, 2015

Thanks for the explanation.
The dots come in from reading external data and without a warning on driver limitations it's hard to foresee.
Since a work-around suggestion was available quickly, it's fine. Thanks.

@kokes
Copy link
Contributor

kokes commented Aug 29, 2018

Using the original snippet, I get the expected output

$ python3 10279.py 
   a  b.c
0  1    2

That's probably because the .keys method returns the correct column names, as illustrated here:

$ cat 10279b.py 
from sqlalchemy import create_engine
engine = create_engine('sqlite://')
conn = engine.connect()
conn.execute("create table test (a int, `b.c` int)")
conn.execute("insert into test values (1,2)")
ex = conn.execute('select * from test')

print(ex.keys())
$ python3 10279b.py 
['a', 'b.c']

So this has been fixed upstream, so I guess we can close this? Or is there a need for a test to avoid regressions?

@jorisvandenbossche
Copy link
Member

Yes, as also noted in http://docs.sqlalchemy.org/en/latest/dialects/sqlite.html#dotted-column-names, this seems to be fixed now.

So closing this (anyhow, it was not a pandas issues, but rather sqlite/sqlalchemy interaction). Thank @kokes for noting

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Sep 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

No branches or pull requests

4 participants