
Make log function to be in sync with PostgreSQL #5259

Closed
comphead opened this issue Feb 12, 2023 · 11 comments
Labels: enhancement (New feature or request)

Comments

@comphead (Contributor):

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The log function has some discrepancies with Postgres for edge cases, e.g.

❯ select log(1, 64), log(0);
+-------------------------+---------------+
| log(Int64(1),Int64(64)) | log(Int64(0)) |
+-------------------------+---------------+
| inf                     | -inf          |
+-------------------------+---------------+

The query executes OK, but in PostgreSQL the query fails fast:

postgres=# select log(0);
ERROR:  cannot take logarithm of zero
postgres=# select log(1, 64);
ERROR:  division by zero

Describe the solution you'd like
Investigate and implement the log function so that it matches PostgreSQL's behavior for the edge cases above, and perhaps for others such as negative numbers.
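
For illustration only, a minimal sketch of what PG-style argument checking could look like for a two-argument log. pg_like_log is a made-up name, not DataFusion's actual implementation; the error strings simply mirror the PostgreSQL messages above:

/// Hypothetical PG-style validation for log(base, x); an illustration of
/// the desired checks, not DataFusion's actual code.
fn pg_like_log(base: f64, x: f64) -> Result<f64, String> {
    if x == 0.0 || base == 0.0 {
        return Err("cannot take logarithm of zero".to_string());
    }
    if x < 0.0 || base < 0.0 {
        return Err("cannot take logarithm of a negative number".to_string());
    }
    if base == 1.0 {
        // ln(1) == 0, so ln(x) / ln(base) would divide by zero.
        return Err("division by zero".to_string());
    }
    Ok(x.log(base))
}

fn main() {
    assert!(pg_like_log(1.0, 64.0).is_err());  // division by zero
    assert!(pg_like_log(10.0, 0.0).is_err());  // logarithm of zero
    assert!(pg_like_log(10.0, -1.0).is_err()); // logarithm of a negative number
    assert!((pg_like_log(2.0, 8.0).unwrap() - 3.0).abs() < 1e-12); // well-defined case
}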

Describe alternatives you've considered
Not doing this

Additional context
Found when working on #5245

@comphead (Contributor, Author):

Other edge cases:

PG: log(-1), log(-10, 1), log(10, -1)
cannot take logarithm of a negative number

DF:
+----------------+--------------------------+--------------------------+
| log(Int64(-1)) | log(Int64(-1),Int64(10)) | log(Int64(1),Int64(-10)) |
+----------------+--------------------------+--------------------------+
| NaN            | NaN                      | NaN                      |
+----------------+--------------------------+--------------------------+

@ozankabak (Contributor) commented Feb 13, 2023:

In short, this behavior doesn't seem unreasonable to me. IIRC, it also agrees with IEEE 754.

Still, let's dig a little deeper and think about this in a larger context. What should we do when such values occur in the middle of a large table? What does PG do in that case? What should we do in case of an unbounded table/stream?

Theoretically, this is a local out-of-domain issue and the output value indicates this clearly. Therefore, erroring out does not seem reasonable to me. Having worked on numerics for a long time in a wide variety of contexts, I find DataFusion's current behavior quite reasonable 🙂

On the other hand, we want PG compatibility in general. Maybe we can add a configuration flag that controls what we do in such cases.
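
For reference, a quick standalone check (plain Rust, not DataFusion code) of what the f64 primitives, which follow IEEE 754, produce for the inputs from the issue description:

fn main() {
    // IEEE 754 / Rust f64 behavior that DataFusion currently inherits:
    println!("{}", (0.0_f64).ln());    // -inf : matches log(0) above
    println!("{}", (-1.0_f64).ln());   // NaN  : logarithm of a negative number
    println!("{}", 64.0_f64.log(1.0)); // inf  : ln(64) / ln(1) divides by zero
}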

@comphead (Contributor, Author):

@ozankabak It's probably the subject of long-running dispute which is better: fail fast or return null/NaN, etc.
I still think we should stick to PG behavior, as it is consistent and what users migrating their queries expect.

From a personal point of view, failing fast seems the better solution anyway, as it doesn't silently corrupt the data and leaves the user responsible for deciding how to handle bad data on their side. A math function should be simple and accept only inputs that are valid for that function.

Old Spark versions used to turn the value into null if it could not be cast to a timestamp, and this was so frustrating that Spark moved away from it; now the to_timestamp function fails fast.

@ozankabak (Contributor):

As I said, I see the value in failing fast, but doing that unconditionally is simply not an option in certain applications where you are executing a long-running query on a stream. That's why I suggested having a configuration option to control this. Having both behaviors available allows us to cater to all use cases without impacting anyone badly, and has a nice side effect of obviating the need for long philosophical discussions about failing fast vs. special values.

@comphead (Contributor, Author):

@ozankabak Thanks for the feedback; the discussion is already much broader than the log function itself.
You may want to create a separate topic to discuss fail-fast/fail-safe approaches in detail. Moreover, this behavior has to be consistent across all math functions, and probably cast functions as well.

I just noticed that since we use Rust math functions, they return NaN values for out-of-domain inputs, and this is going to be consistent, even though it does not comply with how PG works:

❯ select sqrt(-1);
+-----------------+
| sqrt(Int64(-1)) |
+-----------------+
| NaN             |
+-----------------+

whereas PG fails with:

ERROR:  cannot take square root of a negative number

I'll leave it for @alamb @andygrove to decide

@ozankabak (Contributor) commented Feb 16, 2023:

Yes, Rust is following the usual numerics convention (which is unsurprising) and the behavior naturally propagates as we rely on these functions. I agree with the emphasis on consistency. If we want to support PG-like behavior through a general configuration option, we will need to write wrappers around these functions that consult the config.
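
A rough sketch of what such a config-consulting wrapper could look like; MathErrorMode and checked_unary are made-up names for illustration, not existing DataFusion APIs:

/// Hypothetical setting controlling edge-case handling for math functions.
#[derive(Clone, Copy)]
enum MathErrorMode {
    /// Current behavior: propagate IEEE 754 results (NaN, +/-inf).
    Ieee754,
    /// PG-compatible behavior: fail fast on non-finite results.
    FailFast,
}

/// Wrap a unary float function so that non-finite results become errors
/// when the session is configured for PG-style semantics.
fn checked_unary(
    mode: MathErrorMode,
    name: &str,
    f: impl Fn(f64) -> f64,
    x: f64,
) -> Result<f64, String> {
    let y = f(x);
    match mode {
        MathErrorMode::Ieee754 => Ok(y),
        MathErrorMode::FailFast if y.is_finite() => Ok(y),
        MathErrorMode::FailFast => Err(format!("{}: out-of-domain input {}", name, x)),
    }
}

fn main() {
    // NaN passes through in IEEE mode, but errors in fail-fast mode.
    assert!(checked_unary(MathErrorMode::Ieee754, "sqrt", f64::sqrt, -1.0)
        .unwrap()
        .is_nan());
    assert!(checked_unary(MathErrorMode::FailFast, "sqrt", f64::sqrt, -1.0).is_err());
}

In practice the check would presumably happen at the array/kernel level rather than per scalar call, but the way the wrapper consults the configuration would be similar.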

@alamb (Contributor) commented Feb 16, 2023:

If we want to support PG-like behavior through a general configuration option, we will need to write wrappers around these functions that consult the config.

I agree the only way everyone will be satisfied is with a config that controls it.

I suspect it will take a non-trivial amount of effort to implement this consistently, so we should be aware of that as we start down the road.

@ozankabak (Contributor):

Agreed

@comphead (Contributor, Author):

Thanks @alamb. So we can close this ticket then and leave all math scalar functions working in sync with the Rust math functions and out of sync with PG. It's probably worth documenting somewhere.

@alamb (Contributor) commented Feb 16, 2023:

It's probably worth documenting somewhere.

Agree -- the best place would be https://github.com/apache/arrow-datafusion/blob/main/docs/source/user-guide/expressions.md perhaps. Do either of you have some time to write one? Otherwise I will do so

@comphead (Contributor, Author) commented Feb 16, 2023:

I'll update the doc. Filed #5312.
