Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

geohash-int encoding not providing correct range queries #11

Closed
winash12 opened this issue Sep 22, 2018 · 4 comments
Closed

geohash-int encoding not providing correct range queries #11

winash12 opened this issue Sep 22, 2018 · 4 comments

Comments

@winash12
Copy link

@yinqiwen I believe there is a bug in your code.

I went to this github site and downloaded their geohash encoding software - https://github.com/hkwi/python-geohash/blob/master/geohash.py and I have some results to share. It would be great if you can fix the bug because I would like to continue using your code -

import geohash as gh

#query geohash
lonTest = 172.76843
latTest = 61.560745

geoha = gh.encode_uint64(latTest,lonTest)

gg = geoha >> 45
print(gg)
print(gg.bit_length())
# Showing values below of print
#515311
#19


#data geohash
gridLon = 172.85000000008023
gridLat = 61.65143600000049

geoha1 = gh.encode_uint64(gridLat,gridLon)

gg1 = geoha1 >> 38

print(gg1)
print(gg1.bit_length())
# Showing values below of gg1 and bit length of gg1
#65959859
#26
rangeMin = gg << 7
rangeMax = rangeMin + (1 << (26-19) - 1) 

print(rangeMax-rangeMin,rangeMin,rangeMax,gg1)
#64 65959808 65959872 65959859
assert(rangeMin <=gg1 and gg1 < rangeMax)

As you can see gg1 is "in range" of rangeMin and rangeMax. With your code I am not able to reproduce this. Can you show where I am wrong and your code is correct ?

@yinqiwen
Copy link
Owner

to use geohash-int search by a radius value , u need to search the 8 neighbors, not only the geohash box area.

@winash12
Copy link
Author

Still not getting it in a range query. When I add the data geohashes to a skip list(not the query geohashes) must I add the 8 neighbors as well to the skip list in order for a range query to work ?

Tomorrow I will reduce the number of geohashes I am adding to the skip list and then do the test again. At the moment I am unable to do a successful range query. I can share my code offline if you request.

@yinqiwen
Copy link
Owner

yinqiwen commented Oct 9, 2018

geohash value represent a box in the map, if a point is at the edge of the box, u need to search the neighbors box to find the nearest points, and a box in the 2D map have 8 neighbors.

@winash12
Copy link
Author

winash12 commented Oct 9, 2018

I will put my code in a github site(my skip list code and your geohash code) in the future after some more testing. Then you can download, build and then verify yourself. Or suggest modifications.

@winash12 winash12 closed this as completed Jun 6, 2019
mattsta added a commit to mattsta/geohash-int that referenced this issue Dec 28, 2020
This is a cleanup of geohash-int from a few years ago where the big
improvement is adding one-shot SSE4 bitmap merging using BMI2
instructions for parallel bit deposit and parallel bit extraction.

Relevant notes:
=============
Uses Haswell (mid-2013+) CPU intrinsics to deposit and extract bits
to/from an integer given a selector mask.

The original "deposit/extract by magic numbers" method takes ~18.6 cycles
on my 2013 laptop (163 million encodes or decodes per second at -O2 and
-O3).

The new method takes ~5.7 cycles on my 2013 laptop (500 million
encodes or decodes per second at -O2 and -O3, 264 million per second at
-O0).
==============

All included commit messages in this squash:

- This is a combination of 52 commits.
- This is the 1st commit message:

Convert all \r\n to \n

Generated by: dos2unix *

- This is the commit message yinqiwen#2:

Fix permissions

We don't need executable text files so: chmod 644 *

- This is the commit message yinqiwen#3:

Fix formatting

This is exactly:
clang-format -style="{BasedOnStyle: llvm, IndentWidth: 4}"

- This is the commit message yinqiwen#4:

Move main() to its own file

main() is formatted with clang-format too and we're also adding
a simple build infrastructure for making a working test executable.

- This is the commit message yinqiwen#5:

Fix license formatting

- This is the commit message yinqiwen#6:

Style clean up (manually)

Fixes:
  - Easier to read offsets in blocks of actions
  - Initialize i in for loop instead of before for loop
  - Make functions void if they never return failure
  - Move variable declarations to top of open braces
  - Remove unneeded stdio.h header (we extracted main() elsewhere)

- This is the commit message yinqiwen#7:

Fix format specifier for unsigned long long

If other platforms complain, we can use system-defined type specifiers.

- This is the commit message yinqiwen#8:

Add geohash_helper

This is geohash_helper.{cpp,hpp} converted to C from
https://github.com/yinqiwen/ardb/blob/d42503/src/geo/geohash_helper.cpp
https://github.com/yinqiwen/ardb/blob/d42503/src/geo/geohash_helper.hpp

It compiles cleanly, but is untested so far.

- This is the commit message yinqiwen#9:

Rename lon to long and fix some interfaces

We don't need pointers as arguments everywhere because:
  - a.) we aren't modifying the parameters
  - b.) the structres aren't that big

- This is the commit message yinqiwen#10:

Improve naming and add helper functions

No more _under and combinedNames, only camelCase.

Also added some helper encoders so callers don't have to manually
create ranges when they don't care about getting any range
data returned.

- This is the commit message yinqiwen#11:

Fix params on decode and add decode helpers

- This is the commit message yinqiwen#12:

Fix parameter error from upstream

It sure looks like an error to pass lat_range, lat_range instead of
lat_range, long_range.

- This is the commit message yinqiwen#13:

Add more paranoid static initializers

- This is the commit message #14:

Add radius helpers for projection types

Always nicer if callers don't have to know internal #define symbols.

- This is the commit message #15:

Add helpers and remove static for radius search

- This is the commit message #16:

Remove all static linkage from helpers

- This is the commit message #17:

Return area from radius search too

- This is the commit message #18:

Fix naming (allign -> align)

- This is the commit message #19:

Avoid inlining helper functions

That's what be compiler optimizations fo, 'yo.

- This is the commit message #20:

Rename step estimation function

- This is the commit message #21:

Rename thingX to Xthing

- This is the commit message #22:

Return boolean instead of -1 for failure

- This is the commit message #23:

Fix infinite loop if 0 passed as parameter

- This is the commit message #24:

Add cap for max bit decode depth estimation

Tiny ranges like asking for 0.1 meter radii resulted
in steps of 28, which isn't valid because that decodes
28*2=56 bits instead of the 52 bits we store.  We don't
want to decode non-encoded data.

Also fix an edge case where asking for a radius of zero gives you zero steps.
A radius of 0 should imply "my exact location at highest precision
available," which is 26 steps.

- This is the commit message #25:

Add direct hash->lat/long decode helpers

- This is the commit message #26:

Change getDistanceSquare to just getDistance

Square it yourself -- what do you think this is, the united socialist
republic of doing your math for free?

Note: this update is only for WGS84 coordinates.

Mercator coordinates still return squared distance, so it's
getDistanceSquared when using the Mercator helper.

- This is the commit message #27:

Add define for maximum step size

All decodes must be done at maximum step size.

Encodes can be done to a less precise step size, but then
they must be shifted to compensate for their lack of meaningful
bits.

- This is the commit message #28:

Fix great circle distance function

Upstream was missing a few carefully needed /2's.

Now it's a properly un-broken haversine arc length calcumajigger.

- This is the commit message #29:

Note not to use step estimation for decodes

Use ALL THE BITS

- This is the commit message #30:

Update EPSG:3785 latitude ranges

Now it's a little more precise, and we have a note saying where
these magic numbers came from.

- This is the commit message #31:

Use actual distance calculation for bounding box

The previous weak estimate wasn't accurate enough, so let's
calculate actual bounding boxes.

- This is the commit message #32:

Claim some ownership

Ego addition because I've spent over a dozen hours improving
and fixing this library.

- This is the commit message #33:

Add bit interleaving optimizations

These optimizations speed up encoding from 7 million
to 1.7 billion encodes per second. (Decodes go from 10 million
to the same 1.7 billion decodes per second.)

Cleaned up to fit the new code organization from
yinqiwen@8840c4c

- This is the commit message #34:

Fix C99 not having M_PI

This will fix some annoying compile issues on
strictly standards compliant platforms.

(C99 doesn't define M_PI, but OS X gives it to us anyway; doesn't
work so great on Linux.)

- This is the commit message #35:

Remove library function from decode routine

Instead of calling a math function, we can just shift things around.

Speed up is over 2x from the previous version.

Adapted from: yinqiwen@97b79f1

- This is the commit message #36:

Refactor update

No new functionality, but lots of code reworking.

There are many major code organization changes in this one commit.

(sorry, it should have broken out into many commits, but it didn't
happen)

Notable changes include:

  - Removed individual ranges for lat/long, now we have one range struct
  to encapsulate lat+long range.  Improves argument flow, usage, and
  efficiency since we can use one shared range struct instead of needing
  to set it every time manually.

  - CamelCased all the things.

  - Added names to structs instead of just typedefs.  Now
    debuggers/analyzers will know their names easier.

  - Standardized on __{BEGIN,END}_DECLS for headers

  - Fixed dumb cmake error causing compile flags to not work.

  - Made return values from deinterleave explicit as inout params
  instead of double packing the result in a uint64_t the caller needs
  to manually unpack.

  - Capitalized some hex constants that looked ugly in lowercase.

  - Made some lat/long pairs more explicit as lat=y and long=x

  - Combined move_x and move_y functions into just move() with an
  argument of whether to move X or Y

  - Added a diagram of how we find all the geohash neighbors around
  ourself (NW, N, NE, W, E, SW, S, SE)

  - Added a #define for our bit width (currently 52) in case we want to change
  it in the future, we only have to update one place.

  - Added tiny python program in header comment showing how you can compute the
  accuracy of geohashes at any given (valid) bit width.

  - Improved struct efficiency by using bitfields instead of requiring
  7 bytes of padding in the hash struct (which gets repeated 8 times
  in the neighbor struct — that's a lot of wasted space we don't have
  anymore).

  - Improved the (really crapy) test main we have so it reports the actual
  step size used and not just a static value potentially far removed
  from reality.

That's about all for now.

These changes completely break any previous API compatability (function
names changed, function arguments changed, struct layouts changed),
so this marks a new major version number for the geohash-int API toolkit.

- This is the commit message #37:

Make maigic numbers common constants

We save 40 bytes of code size by using one table of magic numbers
instead of two.  Potentially.  Under optimization, the compiler probably
just directly inlines the values anyway since they are in const arrays.

But, using one common table shows we're using the same method to
encode/decode.

The deinterleave table is the same as the interleave table except
the deinterleave table has one additional constant at the end.

The per-function S[] array makes sure we only use the constants
we need in each function.

- This is the commit message #38:

Use GEO_STEP_MAX instead of fixed integers

These didn't get caught in prior renamings/refactorings.

- This is the commit message #39:

Add optional 'fast' radius lookup

Just use triangular distance for a small radius.  Probably very
error prone.  Use at your own risk.

- This is the commit message #40:

Cleanup formatting

- This is the commit message #41:

Add fastest geohash integer calculation possible

Uses Haswell (mid-2013+) CPU intrinsics to deposit and extract bits
to/from an integer given a selector mask.

The previous "deposit/extract by magic numbers" method takes ~18.6 cycles
on my 2013 laptop (163 million encodes or decodes per second at -O2 and
-O3).

The new method takes ~5.7 cycles on my 2013 laptop (500 million
encodes or decodes per second at -O2 and -O3, 264 million per second at
-O0).

[[Note: the previous update saying "1.7 billion encodes per second" was
using incorrect benchmarking.  The previous implementation was doing
less than 200 million per second, not 1.7 billion.
Sometimes testing these micro-improvements is tricky beacuse compilers like
to inline so much or optimize away some checks during compile time.]]

We only use the new ability if the ability to run the intrinsics is
detected at compile time.  This also means if you compile on one machine
with the intrinsics enabled, then relocate your machine to a different
hardware platform without the Haswell BMI2 instruction set, you'll just
crash.

- This is the commit message #42:

Make BMI2 support optional instead of automatic

To enable the binary assembly intrinsics speedup, build with:
        mkdir -p build/; cd build
        cmake -DBMI2=ON ..

- This is the commit message #43:

Add API to check if BMI2 enabled during compile

This reports against the static build configuration, not against any
runnable features.

It's possible to build with BMI2 support but not have it, so this would
return true even though if you try to run a geohash construction, you
would segfault due to missing assembly features.

- This is the commit message #44:

Optimize from float division to float multiply

Tehnically faster to multiply than divide.

- This is the commit message #45:

Fix comment about project origin

- This is the commit message #46:

Update license to explicit Apache2

- This is the commit message #47:

Standardize on lat/lng

It's uglier in some places but prettier in other places.

May as well use pretty alignment and shorter typing distance instead of
three syllable words for identifiers.

- This is the commit message #48:

Fix old comment style

- This is the commit message #49:

Update distance fast with better implementation

Still "use at your own risk" — only works between points where if you
were standing on a hill, the accuracy is everything you can see before
the horizon.

- This is the commit message #50:

Add ULL to large constants

It's more correct even though assigning a constant to a uint64_t should
be fairly explcit by itself.

- This is the commit message #51:

Factor out S[]

We can factor out S[] like we did with B[].

Does it matter?  The compiler probably inlines the values anyway when
compiled under optimization beacuse the array is const and the positions
are defined at compile time.

But, it shows we're using the same method and the same data to
encode/decode along the way.

Reduces (logical) code usage from 2 arrays of 5 and 6 bytes each to just
one array of 6 bytes.  But, with optimizations it's possible everything
is just directly placed as values and the space savings doesn't make a
difference.

- This is the commit message #52:

Retrofit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants