Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeseries need to use a stable hash algorithm for computing keys #4008

Closed
bnaecker opened this issue Sep 1, 2023 · 0 comments · Fixed by #4251
Closed

Timeseries need to use a stable hash algorithm for computing keys #4008

bnaecker opened this issue Sep 1, 2023 · 0 comments · Fixed by #4251
Assignees

Comments

@bnaecker
Copy link
Collaborator

bnaecker commented Sep 1, 2023

When oximeter fetches samples from a producer, it extracts a schema and a timeseries key. This is used to ID the specific instantiation of the timeseries within the schema, which is critical for being able to associate the records for a specific sample across the various tables. We use this to determine which fields go with which measurements.

We're currently using std::collections::hash_map::DefaultHasher to do this. There are no real stability guarantees around that though, which means we may lose the ability to associate records if that hash algorithm changes. We need to switch to a stable algorithm. XXHash or CityHash are attractive candidates.

I see no way to do this without breaking existing data. We need to do this ASAP, to minimize the data loss.

@smklein smklein self-assigned this Oct 11, 2023
smklein added a commit that referenced this issue Oct 12, 2023
#4251)

- Uses [bcs](https://crates.io/crates/bcs) as a stable binary format for
inputs to the `timeseries_key` function
- Uses [highway](https://crates.io/crates/highway) as a stable hash
algorithm (it's keyed, fast, portable, and well-distributed).
- Additionally, adds an EXPECTORATE test validating the stability of
timeseries_key values.

Fixes #4008 , and also
addresses the issue raised in
#4221 regarding stable
input

NOTE: This PR itself *also* breaks the stability of the `timeseries_key`
(hopefully for the last time), and will rely on
#4246 to wipe the metrics
DB for the next release.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants