Incorrect distance constraint using CoordinateBlocker #74

Closed

jstammers opened this issue Oct 22, 2024 · 6 comments

@jstammers (Contributor)

I have a dataset of ~1M records containing lat/long coordinates that I would like to block using a CoordinateBlocker, but I'm running into memory issues when doing this.

As an example, I've simulated some data using a grid of centroids, sampling from a 2D normal distribution around each one with a standard deviation chosen so that the overlap between clusters is fairly small:

import numpy as np
import pandas as pd

np.random.seed(0)
# Parameters
n_clusters = 10000
n_in_cluster = 100
dist_between_clusters = 1000  # Distance between clusters in meters (e.g., 1 km)
lat0 = 0.0  # Starting latitude in degrees
lon0 = 0.0  # Starting longitude in degrees

# Earth's radius and conversions
earth_radius = 6371000  # in meters

# Function to calculate meters per degree latitude
def meters_per_degree_lat():
    return 111132.92 - 559.82 * np.cos(2 * np.radians(lat0)) + \
           1.175 * np.cos(4 * np.radians(lat0)) - 0.0023 * np.cos(6 * np.radians(lat0))

# Function to calculate meters per degree longitude
def meters_per_degree_lon(lat):
    return (np.pi * earth_radius * np.cos(np.radians(lat))) / 180

# Calculate meters per degree at starting latitude
meters_per_deg_lat = meters_per_degree_lat()
meters_per_deg_lon_initial = meters_per_degree_lon(lat0)

# Calculate grid dimensions
dim = int(np.ceil(np.sqrt(n_clusters)))
n_clusters_actual = dim * dim  # Actual number of clusters

# Calculate degree differences between clusters
delta_lat = dist_between_clusters / meters_per_deg_lat
delta_lon = dist_between_clusters / meters_per_deg_lon_initial

# Standard deviations for clusters in degrees
sigma_lat = (dist_between_clusters / 2) / meters_per_deg_lat
sigma_lon_initial = (dist_between_clusters / 2) / meters_per_deg_lon_initial

# Initialize list to hold points
points = []

# Generate grid of centroids and points around them
for i in range(dim):
    for j in range(dim):
        # Compute centroid latitude and longitude
        lat = lat0 + i * delta_lat
        lon = lon0 + j * delta_lon
        # Update meters per degree longitude and sigma_lon for current latitude
        meters_per_deg_lon_current = meters_per_degree_lon(lat)
        sigma_lon = (dist_between_clusters / 2) / meters_per_deg_lon_current
        # Generate points around centroid
        cluster_lats = np.random.normal(loc=lat, scale=sigma_lat, size=n_in_cluster)
        cluster_lons = np.random.normal(loc=lon, scale=sigma_lon, size=n_in_cluster)
        # Combine latitudes and longitudes
        cluster_points = np.column_stack((cluster_lats, cluster_lons))
        points.extend(cluster_points)

# Convert points to a DataFrame
df = pd.DataFrame(points, columns=['latitude', 'longitude'])

I can use the sklearn.neighbors.BallTree class to calculate the number of points within a given radius of each point as follows:

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Convert coordinates to radians
df['lat_rad'] = np.radians(df['latitude'])
df['lon_rad'] = np.radians(df['longitude'])
coordinates = df[['lat_rad', 'lon_rad']].to_numpy()

# Build the BallTree
tree = BallTree(coordinates, metric='haversine')

def count_in_distance(distance_m):
    # haversine expects radians, so convert the radius from meters
    # (earth_radius is defined in the simulation snippet above)
    radius_in_radians = distance_m / earth_radius
    indices = tree.query_radius(coordinates, r=radius_in_radians)
    # subtract 1 so that each point does not count itself
    return [len(ind) - 1 for ind in indices]

df['count_within_1km'] = count_in_distance(1000)

In this case, this gives around 300 points within 1 km of each point, on average:

[histogram of the count_within_1km distribution]
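For a quick numeric check of that average, using the count_within_1km column computed above:

print(df['count_within_1km'].describe())  # mean should come out near 300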

On my machine, it takes around 40s to calculate this. However, when I try to block these using CoordinateBlocker, I run out of memory:

import ibis
from mismo.lib.geo import CoordinateBlocker
con = ibis.get_backend()
t = con.create_table("clusters", df, overwrite=True)
coord_blocker = CoordinateBlocker(lat="latitude", lon="longitude", distance_km=1)
blocked = coord_blocker(t, t).cache() # runs OOM

In this example, I would expect around 300M pairs (~1M records × ~300 neighbours each), but blocked.count().execute() returns around 29 billion records.
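For reference, a rough sketch of that 300M estimate (reusing n_clusters and n_in_cluster from the simulation above; ordered vs. unordered pairs only changes it by a factor of 2):

# Back-of-envelope pair count
n_records = n_clusters * n_in_cluster        # 10,000 clusters * 100 points = 1,000,000
avg_neighbours = 300                         # mean neighbour count from the BallTree step
expected_pairs = n_records * avg_neighbours  # ~3.0e8
observed_pairs = 29e9                        # what blocked.count().execute() returned
print(f"expected ~{expected_pairs:.1e}, observed ~{observed_pairs:.1e}, "
      f"ratio ~{observed_pairs / expected_pairs:.0f}x")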

Restricting to just the first 10k records, I can see some pairs whose distance is much larger than expected, which may be related to how the coordinates are being used to block records together:

from mismo.lib.geo import distance_km

coord_blocker = CoordinateBlocker(lat="latitude", lon="longitude", distance_km=1)
blocked = coord_blocker(t.limit(10000), t.limit(10000))
blocked = blocked.mutate(
    distance=distance_km(
        lat1=blocked.latitude_l,
        lat2=blocked.latitude_r,
        lon1=blocked.longitude_l,
        lon2=blocked.longitude_r,
    )
)
blocked

┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ longitude_l ┃ longitude_r ┃ distance  ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ float64     │ float64     │ float64   │
├─────────────┼─────────────┼────────────┼────────────┼─────────────┼─────────────┼───────────┤
│           0 │        9997 │   0.007977 │   0.005547 │    0.008468 │    0.890071 │ 98.030167 │
│           1 │          95 │   0.001809 │   0.003195 │   -0.006060 │   -0.000771 │  0.607952 │
│           2 │          95 │   0.004426 │   0.003195 │   -0.005713 │   -0.000771 │  0.566254 │
│           3 │        9997 │   0.010133 │   0.005547 │    0.004359 │    0.890071 │ 98.487991 │
│           4 │          95 │   0.008445 │   0.003195 │   -0.005275 │   -0.000771 │  0.769126 │
│           5 │        9999 │  -0.004419 │  -0.005760 │    0.008740 │    0.888860 │ 97.865038 │
│           6 │          95 │   0.004296 │   0.003195 │   -0.001860 │   -0.000771 │  0.172167 │
│           7 │         182 │  -0.000684 │  -0.003129 │   -0.003361 │   -0.001149 │  0.366605 │
│           8 │        9999 │  -0.000467 │  -0.005760 │    0.008647 │    0.888860 │ 97.877034 │
│           9 │        9997 │   0.001857 │   0.005547 │    0.006657 │    0.890071 │ 98.231970 │
│           … │           … │          … │          … │           … │           … │         … │
└─────────────┴─────────────┴────────────┴────────────┴─────────────┴─────────────┴───────────┘
jstammers changed the title from "Improve performance of CoordinateBlocker" to "Incorrect distance constraint using CoordinateBlocker" on Oct 22, 2024
@jstammers (Contributor, Author)

I've tried varying the distance_km value to see if there's any pattern. The pair count doesn't scale linearly with distance as I would expect, so I wonder if this is related to how the coordinates are being hashed to approximate the distance between two locations:

[plot: number of blocked pairs vs. distance_km]
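To make that guess concrete, here's a minimal sketch of the grid-binning technique I have in mind (purely illustrative, not mismo's actual implementation; it reuses df, np, and meters_per_deg_lat from the simulation snippet above):

# Hypothetical grid-cell blocking: two points can only become a candidate
# pair if they fall into the same (or an adjacent) grid cell, where the
# cell size is tied to the blocking distance.
cell_deg = 1000 / meters_per_deg_lat                      # ~1 km, in degrees
df['lat_cell'] = np.floor(df['latitude'] / cell_deg).astype(int)
df['lon_cell'] = np.floor(df['longitude'] / cell_deg).astype(int)
# If the floor division is computed incorrectly, the cells come out far
# coarser than intended and distant points end up in the same block.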

@NickCrews (Owner)

Thanks @jstammers, almost definitely a bug. I'll take a look.

@NickCrews (Owner)

It looks like a bug in how ibis compiles floor division: it doesn't preserve the needed parentheses:

import ibis

reg = ibis.literal(10) / (1 / ibis.literal(2))
floor = ibis.literal(10) // (1 / ibis.literal(2))

print(ibis.to_sql(reg))
# SELECT
#   10 / (
#     1 / 2
#   ) AS "Divide(10, Divide(1, 2))"

print(ibis.to_sql(floor))
# SELECT
#   CAST(FLOOR(10 / 1 / 2) AS BIGINT) AS "FloorDivide(10, Divide(1, 2))"

I'll link the ibis issue/PR here once I make it.
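In the meantime, one way to sidestep the miscompilation (a sketch of the general idea, not necessarily the workaround we'll ship) is to avoid // on a divisor that is itself a division, and floor a plain division explicitly:

import ibis

# Plain division keeps its parentheses when compiled (see `reg` above),
# so taking the floor of it explicitly should produce FLOOR(10 / (1 / 2)).
safe_floor = (ibis.literal(10) / (1 / ibis.literal(2))).floor()
print(ibis.to_sql(safe_floor))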

@NickCrews (Owner)

PR up at ibis-project/ibis#10353

@NickCrews (Owner)

Until that is fixed upstream, I've added a workaround on our side in 4598a9e, so this should now be fixed. Please let me know if not!

@jstammers (Contributor, Author)

Thanks for fixing this! For reference, this is what I have now, which is much more manageable:

[plot: number of blocked pairs vs. distance_km after the fix]
