Incorrect distance constraint using CoordinateBlocker #74

Closed

jstammers opened this issue Oct 22, 2024 · 6 comments

@jstammers (Contributor)

I have a dataset of ~1M records containing lat/long coordinates that I would like to block using a CoordinateBlocker, but I'm running into memory issues when doing this.

As an example, I've simulated some data using a grid of centroids, sampling from a 2D normal distribution around each one with a standard deviation chosen so that the overlap between clusters is fairly small:

import numpy as np
import pandas as pd

np.random.seed(0)
# Parameters
n_clusters = 10000
n_in_cluster = 100
dist_between_clusters = 1000  # Distance between clusters in meters (e.g., 1 km)
lat0 = 0.0  # Starting latitude in degrees
lon0 = 0.0  # Starting longitude in degrees

# Earth's radius and conversions
earth_radius = 6371000  # in meters

# Function to calculate meters per degree latitude
def meters_per_degree_lat():
    return 111132.92 - 559.82 * np.cos(2 * np.radians(lat0)) + \
           1.175 * np.cos(4 * np.radians(lat0)) - 0.0023 * np.cos(6 * np.radians(lat0))

# Function to calculate meters per degree longitude
def meters_per_degree_lon(lat):
    return (np.pi * earth_radius * np.cos(np.radians(lat))) / 180

# Calculate meters per degree at starting latitude
meters_per_deg_lat = meters_per_degree_lat()
meters_per_deg_lon_initial = meters_per_degree_lon(lat0)

# Calculate grid dimensions
dim = int(np.ceil(np.sqrt(n_clusters)))
n_clusters_actual = dim * dim  # Actual number of clusters

# Calculate degree differences between clusters
delta_lat = dist_between_clusters / meters_per_deg_lat
delta_lon = dist_between_clusters / meters_per_deg_lon_initial

# Standard deviations for clusters in degrees
sigma_lat = (dist_between_clusters / 2) / meters_per_deg_lat
sigma_lon_initial = (dist_between_clusters / 2) / meters_per_deg_lon_initial

# Initialize list to hold points
points = []

# Generate grid of centroids and points around them
for i in range(dim):
    for j in range(dim):
        # Compute centroid latitude and longitude
        lat = lat0 + i * delta_lat
        lon = lon0 + j * delta_lon
        # Update meters per degree longitude and sigma_lon for current latitude
        meters_per_deg_lon_current = meters_per_degree_lon(lat)
        sigma_lon = (dist_between_clusters / 2) / meters_per_deg_lon_current
        # Generate points around centroid
        cluster_lats = np.random.normal(loc=lat, scale=sigma_lat, size=n_in_cluster)
        cluster_lons = np.random.normal(loc=lon, scale=sigma_lon, size=n_in_cluster)
        # Combine latitudes and longitudes
        cluster_points = np.column_stack((cluster_lats, cluster_lons))
        points.extend(cluster_points)

# Convert points to a DataFrame
df = pd.DataFrame(points, columns=['latitude', 'longitude'])

I can use the sklearn.neighbors.BallTree class to calculate the number of points within a given radius of each point as follows:

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Convert coordinates to radians
df['lat_rad'] = np.radians(df['latitude'])
df['lon_rad'] = np.radians(df['longitude'])
coordinates = df[['lat_rad', 'lon_rad']].to_numpy()

# Build the BallTree
tree = BallTree(coordinates, metric='haversine')

def count_in_distance(distance_m):
    # haversine expects radians, so convert the radius from meters
    # (earth_radius is defined in the simulation snippet above)
    radius_in_radians = distance_m / earth_radius
    indices = tree.query_radius(coordinates, r=radius_in_radians)
    # subtract 1 so that each point does not count itself
    return [len(ind) - 1 for ind in indices]

df['count_within_1km'] = count_in_distance(1000)

In this case, this gives around 300 points within 1 km of each point, on average:

[histogram of the count_within_1km distribution]
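For a quick numeric check of that average, using the count_within_1km column computed above:

print(df['count_within_1km'].describe())  # mean should come out near 300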

On my machine, it takes around 40s to calculate this. However, when I try to block these using CoordinateBlocker, I run out of memory:

import ibis
from mismo.lib.geo import CoordinateBlocker
con = ibis.get_backend()
t = con.create_table("clusters", df, overwrite=True)
coord_blocker = CoordinateBlocker(lat="latitude", lon="longitude", distance_km=1)
blocked = coord_blocker(t, t).cache() # runs OOM

In this example, I would expect around 300M pairs (~1M records × ~300 neighbours each), but blocked.count().execute() returns around 29 billion records.
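For reference, a rough sketch of that 300M estimate (reusing n_clusters and n_in_cluster from the simulation above; ordered vs. unordered pairs only changes it by a factor of 2):

# Back-of-envelope pair count
n_records = n_clusters * n_in_cluster        # 10,000 clusters * 100 points = 1,000,000
avg_neighbours = 300                         # mean neighbour count from the BallTree step
expected_pairs = n_records * avg_neighbours  # ~3.0e8
observed_pairs = 29e9                        # what blocked.count().execute() returned
print(f"expected ~{expected_pairs:.1e}, observed ~{observed_pairs:.1e}, "
      f"ratio ~{observed_pairs / expected_pairs:.0f}x")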

Restricting to just the first 10k records, I can see some pairs whose distance is much larger than expected, which may be related to how the coordinates are being used to block records together:

from mismo.lib.geo import distance_km

coord_blocker = CoordinateBlocker(lat="latitude", lon="longitude", distance_km=1)
blocked = coord_blocker(t.limit(10000), t.limit(10000))
blocked = blocked.mutate(
    distance=distance_km(
        lat1=blocked.latitude_l,
        lat2=blocked.latitude_r,
        lon1=blocked.longitude_l,
        lon2=blocked.longitude_r,
    )
)
blocked

┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ longitude_l ┃ longitude_r ┃ distance  ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ float64     │ float64     │ float64   │
├─────────────┼─────────────┼────────────┼────────────┼─────────────┼─────────────┼───────────┤
│           0 │        9997 │   0.007977 │   0.005547 │    0.008468 │    0.890071 │ 98.030167 │
│           1 │          95 │   0.001809 │   0.003195 │   -0.006060 │   -0.000771 │  0.607952 │
│           2 │          95 │   0.004426 │   0.003195 │   -0.005713 │   -0.000771 │  0.566254 │
│           3 │        9997 │   0.010133 │   0.005547 │    0.004359 │    0.890071 │ 98.487991 │
│           4 │          95 │   0.008445 │   0.003195 │   -0.005275 │   -0.000771 │  0.769126 │
│           5 │        9999 │  -0.004419 │  -0.005760 │    0.008740 │    0.888860 │ 97.865038 │
│           6 │          95 │   0.004296 │   0.003195 │   -0.001860 │   -0.000771 │  0.172167 │
│           7 │         182 │  -0.000684 │  -0.003129 │   -0.003361 │   -0.001149 │  0.366605 │
│           8 │        9999 │  -0.000467 │  -0.005760 │    0.008647 │    0.888860 │ 97.877034 │
│           9 │        9997 │   0.001857 │   0.005547 │    0.006657 │    0.890071 │ 98.231970 │
│           … │           … │          … │          … │           … │           … │         … │
└─────────────┴─────────────┴────────────┴────────────┴─────────────┴─────────────┴───────────┘
jstammers changed the title from "Improve performance of CoordinateBlocker" to "Incorrect distance constraint using CoordinateBlocker" on Oct 22, 2024
@jstammers (Contributor, Author)

I've tried varying the distance_km value to see if there's any pattern. The pair count doesn't scale linearly with distance as I would expect, so I wonder if this is related to how the coordinates are being hashed to approximate the distance between two locations:

[plot: number of blocked pairs vs. distance_km]
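To make that guess concrete, here's a minimal sketch of the grid-binning technique I have in mind (purely illustrative, not mismo's actual implementation; it reuses df, np, and meters_per_deg_lat from the simulation snippet above):

# Hypothetical grid-cell blocking: two points can only become a candidate
# pair if they fall into the same (or an adjacent) grid cell, where the
# cell size is tied to the blocking distance.
cell_deg = 1000 / meters_per_deg_lat                      # ~1 km, in degrees
df['lat_cell'] = np.floor(df['latitude'] / cell_deg).astype(int)
df['lon_cell'] = np.floor(df['longitude'] / cell_deg).astype(int)
# If the floor division is computed incorrectly, the cells come out far
# coarser than intended and distant points end up in the same block.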

@NickCrews (Owner)

Thanks @jstammers, almost definitely a bug. I'll take a look.

@NickCrews (Owner)

It looks like a bug in how ibis compiles floor division: it doesn't preserve the needed parentheses:

import ibis

reg = ibis.literal(10) / (1 / ibis.literal(2))
floor = ibis.literal(10) // (1 / ibis.literal(2))

print(ibis.to_sql(reg))
# SELECT
#   10 / (
#     1 / 2
#   ) AS "Divide(10, Divide(1, 2))"

print(ibis.to_sql(floor))
# SELECT
#   CAST(FLOOR(10 / 1 / 2) AS BIGINT) AS "FloorDivide(10, Divide(1, 2))"

I'll link the ibis issue/PR here once I make it.
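In the meantime, one way to sidestep the miscompilation (a sketch of the general idea, not necessarily the workaround we'll ship) is to avoid // on a divisor that is itself a division, and floor a plain division explicitly:

import ibis

# Plain division keeps its parentheses when compiled (see `reg` above),
# so taking the floor of it explicitly should produce FLOOR(10 / (1 / 2)).
safe_floor = (ibis.literal(10) / (1 / ibis.literal(2))).floor()
print(ibis.to_sql(safe_floor))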

@NickCrews (Owner)

PR up at ibis-project/ibis#10353

@NickCrews (Owner)

Until that is fixed upstream, I've added a workaround on our side in 4598a9e, so this should now be fixed. Please let me know if not!

@jstammers (Contributor, Author)

Thanks for fixing this! For reference, this is what I have now, which is much more manageable:

[plot: number of blocked pairs vs. distance_km after the fix]
