# SNOW-1730905: Support retrieving empty Arrow result #1219
## Comments
Hi, and thank you for raising this requirement. To me it makes sense indeed: being able to handle results the same way, regardless of whether they are empty or non-empty. We'll look into this (no timeline estimated).
---

So, we had a word with the team, and for now we won't change our approach, mostly because if there's no resultset, then there's no schema to create the Arrow record with. If your code seems to need error handling when a resultset is empty (i.e. the result is an empty array, not an arrow.Record), perhaps changing the processing approach could help the fastest. Here's pseudocode:

```go
// Pseudocode: iterate over batches, then over the records in each batch.
// If the result is empty, both loops simply execute zero times.
batches := result.GetBatches()
for _, batch := range batches {
	records := batch.Records()
	for _, record := range records {
		// do something with record only if it exists
	}
}
```

and here's an actually working example of how to process arrow batches. Hope this helps.
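A minimal sketch of that pattern, following the arrow-batches usage in the gosnowflake documentation; the DSN and query here are placeholders, not the originally linked example:

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"fmt"
	"log"

	sf "github.com/snowflakedb/gosnowflake"
)

func main() {
	// Placeholder DSN -- substitute real credentials.
	db, err := sql.Open("snowflake", "user:password@account/database/schema?warehouse=wh")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Ask the driver to keep results as Arrow batches.
	ctx := sf.WithArrowBatches(context.Background())

	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// database/sql has no Arrow API, so drop down to the raw driver connection.
	var rows driver.Rows
	err = conn.Raw(func(x interface{}) error {
		var qerr error
		rows, qerr = x.(driver.QueryerContext).QueryContext(ctx, "select 1 a where 1 = 0", nil)
		return qerr
	})
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	batches, err := rows.(sf.SnowflakeRows).GetArrowBatches()
	if err != nil {
		log.Fatal(err)
	}

	// For an empty result, batches is empty and neither loop body runs.
	for _, batch := range batches {
		records, err := batch.Fetch()
		if err != nil {
			log.Fatal(err)
		}
		for _, record := range *records {
			fmt.Println("rows in record:", record.NumRows())
			record.Release()
		}
	}
}
```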
---

@sfc-gh-pfus do you have any insight that might be worth sharing here?
---

I just tried the Python connector and it outputs a proper empty result:

```python
import snowflake.connector
import pyarrow as pa

ctx = snowflake.connector.connect(
    user=user,
    password=password,
    account=account,
    warehouse=warehouse,
    database=database,
    login_timeout=3,
)
cur = ctx.cursor()
sql = "select 1 a where 1 = 0"
cur.execute(sql)

# fetch_pandas_all returns an empty DataFrame that still carries the column
df = cur.fetch_pandas_all()
print(df.to_string())
print("--\n")
print("num columns: " + str(len(df.columns)))
print("size:" + str(df.size))
print("-----\n\n")

# the schema survives the round trip into Arrow as well
table = pa.Table.from_pandas(df)
print(table)
```

Output:
*(output omitted)*
---

Hi @datbth! Let me summarize a couple of things.
---

Thank you for the response.

My point was that my query returns a schemaful relation, so I expect to be able to retrieve that schema, especially when Arrow is also schemaful. I'm trying to use Arrow to benefit from its performance characteristics (minimal serialization cost, columnar processing, etc.).

I agree that this makes some sense. But then how about using the minimal data type (e.g. …)?

I also don't know whether it is possible to retrieve Arrow records directly or not, but at least the library could provide a natively supported way to retrieve the schema. Could you try discussing this matter with the other client teams or the database team? For me, currently, I'm able to construct the Arrow schema from the Snowflake result schema, so this issue is not urgent to me.
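For anyone needing the same workaround, a rough sketch of constructing such a schema yourself, assuming the apache/arrow-go library; the `toArrowType` mapping and import version are illustrative, not the driver's own:

```go
package emptyresult

import (
	"database/sql"

	"github.com/apache/arrow/go/v15/arrow"
	"github.com/apache/arrow/go/v15/arrow/array"
	"github.com/apache/arrow/go/v15/arrow/memory"
)

// toArrowType maps a Snowflake column type name (as reported by the driver)
// to an Arrow type. Hypothetical: extend the mapping to the types you use.
func toArrowType(dbType string) arrow.DataType {
	switch dbType {
	case "FIXED":
		return arrow.PrimitiveTypes.Int64
	case "TEXT":
		return arrow.BinaryTypes.String
	default:
		return arrow.BinaryTypes.String
	}
}

// emptyRecord builds a zero-row arrow.Record that still carries the
// result's schema, derived from the database/sql column metadata.
func emptyRecord(cols []*sql.ColumnType) arrow.Record {
	fields := make([]arrow.Field, len(cols))
	for i, c := range cols {
		fields[i] = arrow.Field{Name: c.Name(), Type: toArrowType(c.DatabaseTypeName()), Nullable: true}
	}
	schema := arrow.NewSchema(fields, nil)
	b := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	defer b.Release()
	return b.NewRecord() // zero rows, schema intact
}
```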
---

I see your point @datbth. Nevertheless, we currently have 7 native drivers. We try to minimize code duplication and keep consistency across our codebases, especially if something can be implemented on the backend side. And the second big point is: we don't create Arrow data in the driver, we just propagate what the backend sends us, and we don't want to change that approach (we have had this discussion more than once before). My suggestion is to work with your AM/SE and file a backend request to fill in the schema on that end, and hopefully someone will implement it :) 🤞 🤞 🤞
---

I see. I was hoping this would get propagated to your backend team. Thank you.
---

I think it may be faster if it is an external requirement :) Can we close the issue now?
---

Yeah, ok. Thanks.
---

Such varying schemas in a single resultset would be hard to deal with. I just went ahead and tried this query:

```sql
select 1 FROM table (generator(rowcount => 10000))
union all
select 123456789123456789.123456789 FROM table (generator(rowcount => 10000))
```

It yields 2 batches, containing 8192 rows (in 1 Arrow Record) and 11808 rows (in 2 Arrow Records). But both batches (all Arrow Records) have this same schema: `arrow.Decimal128Type{Precision: 27, Scale: 9}`.

Tried another one:

```sql
select 1 FROM table (generator(rowcount => 10000))
union all
select 123456789123456789 FROM table (generator(rowcount => 10000))
```

and also found that all the result Arrow Records have this same schema: `arrow.Int64Type{}`.

So I think there has been a mistake in your statement.
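For reference, a sketch of how such a check can be done, reusing `ArrowBatch.Fetch` from the gosnowflake docs; the function name and output format are illustrative:

```go
package main

import (
	"fmt"
	"log"

	sf "github.com/snowflakedb/gosnowflake"
)

// dumpSchemas prints each record's row count and schema so differences
// between batches (if any) become visible. Pass in the slice returned by
// rows.(sf.SnowflakeRows).GetArrowBatches().
func dumpSchemas(batches []*sf.ArrowBatch) {
	for i, batch := range batches {
		records, err := batch.Fetch()
		if err != nil {
			log.Fatal(err)
		}
		for j, record := range *records {
			fmt.Printf("batch %d record %d rows=%d schema=%v\n",
				i, j, record.NumRows(), record.Schema())
			record.Release()
		}
	}
}
```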
---

My guess is that it is because you just select static data, but I'm not 100% sure. I think the server does not compute it at select time, but at storage time; this is an optimization to save disk resources. I'm not an expert on how the backend compresses integers or what the rules behind it are. I only know that we have already had customers who were unhappy with this compression and the schema variations, and there is an ongoing task to implement a parameter to disable this behaviour on the backend side.
---

Thank you for the answer. Is there any way I can get notified when such a parameter is released?
---

## Issue description

**What is the current behavior?**

When the query returns an empty result (e.g. `SELECT 1 WHERE 1 = 0`), `GetArrowBatches` returns an empty array.

**What is the desired behavior?**
An empty result is still a valid result, so `GetArrowBatches` should return an empty Arrow Record. Otherwise, downstream processing has to treat empty results as errors, which does not make sense.
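To illustrate, a hypothetical consumer sketch (`ArrowBatch` and `Fetch` as documented by gosnowflake; everything else is illustrative):

```go
// With the current behavior, a consumer handing records to downstream Arrow
// processing has no record -- and therefore no schema -- when the result is
// empty, and must special-case it.
func consume(batches []*sf.ArrowBatch) error {
	if len(batches) == 0 {
		// Current behavior: an empty result produces no batches at all.
		// With the desired behavior (one zero-row record carrying the
		// schema), this branch would disappear.
		return fmt.Errorf("empty result: no schema available")
	}
	for _, batch := range batches {
		records, err := batch.Fetch()
		if err != nil {
			return err
		}
		for _, record := range *records {
			// hand record to downstream Arrow processing here
			record.Release()
		}
	}
	return nil
}
```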
**How would this improve `gosnowflake`?**

It would allow downstream processing/usage to consume empty results properly.
**References, Other Background**

This could be a server issue if the server is not responding with any "chunks".