A collection of geospatial bug fixes #4444
    p = Point(s)
    return (p.latitude, p.longitude)

df[key] = df[spatial.get('lonlatCol')].apply(tupleify)
We should definitely benchmark this with our current data before deploying, since it's using regular expressions to do the job: https://github.com/geopy/geopy/blob/master/geopy/point.py#L310-L339
In [2]: from geopy.point import Point
In [3]: Point('234,239')
Out[3]: Point(54.0, -121.0, 0.0)
In [4]: %timeit Point('234,239')
The slowest run took 4.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 11.7 µs per loop
In [5]: %timeit (float(v) for v in '234,239'.split(','))
The slowest run took 6.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 593 ns per loop
So roughly double the time. On 1M points, that means 0.5 seconds, which to me is fine since it is almost negligible compared to the network time it takes to bring that data over. Note that there's probably a numpy way of doing this that would be much faster.
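One caveat on the numbers above: the generator expression timed in `In [5]` is never consumed, so `float()` is not actually called in that measurement. A minimal sketch of a like-for-like comparison, using only the standard library, could look like the following. `parse_pair` and `lazy_pair` are hypothetical helper names, and absolute timings will vary by machine.

```python
import timeit


def parse_pair(s):
    """Split 'lat,lng' and convert both halves to float, eagerly."""
    lat, lng = s.split(',')
    return float(lat), float(lng)


def lazy_pair(s):
    """Mirror the timed expression: build a generator without consuming it."""
    # float() does not run until something iterates over the generator.
    return (float(v) for v in s.split(','))


n = 100_000
eager = timeit.timeit(lambda: parse_pair('234,239'), number=n) / n
lazy = timeit.timeit(lambda: lazy_pair('234,239'), number=n) / n
print(f"eager: {eager * 1e9:.0f} ns per call, lazy: {lazy * 1e9:.0f} ns per call")
```

Timing `Point('234,239')` the same way would need geopy installed, so it is left out of this sketch.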
Wait, you mean 20x the time. Calling Point on 1M points would take ~12 seconds (11.7e-6 s * 1e6).
I looked at the code and I'm not sure how to optimize this with Numpy.
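The arithmetic behind that estimate can be checked directly from the two `%timeit` results quoted earlier in the thread. This is only a worked calculation; the constant and function names are made up for illustration.

```python
# Per-call timings as reported by %timeit earlier in the thread.
POINT_SECONDS = 11.7e-6   # Point('234,239')
SPLIT_SECONDS = 593e-9    # the split-based expression
ROWS = 1_000_000


def total_seconds(per_call, rows=ROWS):
    """Scale a per-call timing to a whole column of `rows` values."""
    return per_call * rows


ratio = POINT_SECONDS / SPLIT_SECONDS
print(f"Point is about {ratio:.1f}x slower per call")
print(f"Point on 1M rows: {total_seconds(POINT_SECONDS):.1f} s")
print(f"split on 1M rows: {total_seconds(SPLIT_SECONDS):.2f} s")
```

So the gap is roughly 20x, and 1M rows land near 12 seconds rather than 0.5 seconds.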
Oh snap. 12 seconds is long.
LGTM, and we can look at improving the perf of Point later if we need.
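If perf does become a problem, one possible direction (not something this PR implements) is a fast path that handles plain decimal `lat,lng` strings with `str.split` and only hands everything else to the slower parser. The sketch below is an assumption-laden illustration: `fast_latlng` and `demo_fallback` are hypothetical names, and the fallback is passed in as a callable so the example does not depend on geopy. Out-of-range values such as `'234,239'` are deliberately sent to the fallback, since `Point` normalizes them (see `Out[3]` above) and a plain `float()` conversion would not.

```python
def fast_latlng(s, fallback):
    """Parse 'lat,lng' cheaply when possible, else defer to `fallback(s)`."""
    try:
        lat_str, lng_str = s.split(',')   # ValueError unless exactly two parts
        lat, lng = float(lat_str), float(lng_str)
    except ValueError:
        # Not a plain decimal pair (e.g. degrees/minutes notation).
        return fallback(s)
    # Only trust the fast path for values already in the usual ranges.
    if -90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0:
        return (lat, lng)
    return fallback(s)


def demo_fallback(s):
    """Stand-in for the slow parser; records that it was reached."""
    return ('fallback', s)


print(fast_latlng('41.5,-81.0', demo_fallback))
print(fast_latlng('234,239', demo_fallback))
print(fast_latlng('41 N, 81 W', demo_fallback))
```

In the PR's context the fallback would presumably wrap `Point(s)` and return `(p.latitude, p.longitude)`, as `tupleify` does. Whether the two paths agree on every in-range input would need to be checked against geopy before relying on this.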
Let's optimize
Also using geopy to parse single-column spatial info, which supports many different types of lat/lng styles