Incorrect distance constraint using CoordinateBlocker #74
Comments
jstammers changed the title from "Improve performance of CoordinateBlocker" to "Incorrect distance constraint using CoordinateBlocker" on Oct 22, 2024
Thanks @jstammers, almost definitely a bug. I'll take a look.
It looks like a bug in how ibis compiles the floor divide: it doesn't preserve the needed parentheses.

import ibis

reg = ibis.literal(10) / (1 / ibis.literal(2))
floor = ibis.literal(10) // (1 / ibis.literal(2))
print(ibis.to_sql(reg))
SELECT
  10 / (
    1 / 2
  ) AS "Divide(10, Divide(1, 2))"
print(ibis.to_sql(floor))
SELECT
  CAST(FLOOR(10 / 1 / 2) AS BIGINT) AS "FloorDivide(10, Divide(1, 2))"

I will link here to the ibis issue/PR that I will make.
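The effect of the dropped parentheses can be checked with plain arithmetic, independent of ibis. This is a minimal sketch in Python; the variable names are illustrative, and it assumes `/` in the generated SQL behaves as true division evaluated left to right.

```python
import math

# What the ibis expression means: 10 // (1 / 2)
intended = 10 // (1 / 2)

# What the generated SQL computes: FLOOR(10 / 1 / 2),
# i.e. (10 / 1) / 2 because the inner parentheses were dropped
compiled = math.floor(10 / 1 / 2)

print(intended, compiled)
```

The two results differ, so any expression that floor-divides by a parenthesised quotient is compiled to a different computation than the one written. The thread does not show the exact expression CoordinateBlocker uses, so how this maps onto the blocking logic is not spelled out here.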
PR up at ibis-project/ibis#10353

In the meantime, until that is fixed, I added a workaround on our side in 4598a9e. So this should be fixed, please let me know if not!
I have a dataset of ~1M records containing lat/long coordinates that I would like to block using a CoordinateBlocker. I'm finding that I'm running into memory issues when doing this.

As an example, I've simulated some data using a grid of centroids and sampling from a 2D normal distribution, choosing a standard deviation that ensures the overlap between clusters is fairly small.
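The original simulation code did not survive in this copy of the issue. The following is only a sketch of the kind of setup described, using the standard library; the function name, grid size, spacing, and standard deviation are all assumptions chosen for illustration, not the values used in the report.

```python
import random

def simulate_points(n_side=10, per_cluster=100, spacing_deg=1.0,
                    std_deg=0.05, seed=0):
    """Sample (lat, lon) points around a square grid of centroids.

    All parameter values are hypothetical. std_deg is kept much smaller
    than spacing_deg so that neighbouring clusters barely overlap.
    """
    rng = random.Random(seed)
    points = []
    for i in range(n_side):
        for j in range(n_side):
            lat0, lon0 = i * spacing_deg, j * spacing_deg
            for _ in range(per_cluster):
                # Independent normal noise in each coordinate (isotropic 2D normal)
                points.append((rng.gauss(lat0, std_deg),
                               rng.gauss(lon0, std_deg)))
    return points

points = simulate_points()
print(len(points))
```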
I can use the sklearn.neighbors.BallTree class to calculate the number of points within a given radius of each point, which in this case gives around 300 points on average.
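The BallTree snippet is also missing from this copy. A neighbour count of this kind could look roughly like the sketch below, which assumes the haversine metric, a 1 km radius, and a small synthetic sample; none of these values come from the report.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

rng = np.random.default_rng(0)
# Hypothetical sample: 2,000 (lat, lon) points in degrees around one centroid
coords_deg = rng.normal(loc=[51.5, -0.1], scale=0.05, size=(2000, 2))

# The haversine metric expects (lat, lon) in radians
coords_rad = np.radians(coords_deg)
tree = BallTree(coords_rad, metric="haversine")

radius_km = 1.0  # assumed radius
# The query radius is an angle in radians, so divide by the Earth radius
counts = tree.query_radius(coords_rad, r=radius_km / EARTH_RADIUS_KM,
                           count_only=True)

# Every point is within the radius of itself, so subtract 1
neighbours = counts - 1
# Ordered pairs (a, b) with a != b; halve this for unordered pairs
total_pairs = int(neighbours.sum())
print(neighbours.mean(), total_pairs)
```

Summing the per-point counts gives a baseline for how many pairs a distance-based blocker should produce, which is the comparison made in the rest of the report.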
On my machine, it takes around 40s to calculate this. However, when I try to block these using CoordinateBlocker, I run out of memory.

In this example, I would expect around 300M pairs, but blocked.count().execute() returns around 29 billion records.

Restricting to just the first 10k records, I can see there are some pairs with a much larger distance than expected, which may be related to how the coordinates are being used to block records together.
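One way to check blocked pairs against the distance constraint is to recompute the great-circle distance for each pair and list those beyond the threshold. This is a sketch with hypothetical helper names and made-up coordinates; it does not use the blocker's output format, which is not shown in the report.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def pairs_beyond(pairs, max_km):
    """Return the (lat1, lon1, lat2, lon2) tuples farther apart than max_km."""
    return [p for p in pairs if haversine_km(*p) > max_km]

# Made-up pairs: identical points, ~0.1 km apart, ~111 km apart
pairs = [
    (51.5, -0.1, 51.5, -0.1),
    (51.5, -0.1, 51.501, -0.1),
    (51.5, -0.1, 52.5, -0.1),
]
too_far = pairs_beyond(pairs, max_km=1.0)
print(too_far)
```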