Skip to content
This repository has been archived by the owner on Jan 20, 2023. It is now read-only.

merge into empty t-digest produces strange numbers #5

Open
igorwwwwwwwwwwwwwwwwwwww opened this issue Oct 4, 2017 · 2 comments
Open

Comments

@igorwwwwwwwwwwwwwwwwwwww
Copy link
Contributor

i've been playing around with MergeInto, and noticed some strange behaviour when merging into an empty t-digest. this might be an edge case, i'm not quite sure.

here is the test program:

package main

import (
	"fmt"
	"math/rand"

	"github.com/spenczar/tdigest"
)

func main() {
	rand.Seed(5678)
	values := make(map[float64]interface{})

	// Generate 100k uniform random data between 0 and 100
	var (
		n        int     = 100000
		min, max float64 = 0, 100
	)
	for i := 0; i < n; i++ {
		val := min + rand.Float64()*(max-min)
		values[val] = nil
	}

	for i := 0; i < 10; i++ {
		a := tdigest.New()
		for v := range values {
			a.Add(v, 1)
		}

		b := tdigest.New()
		a.MergeInto(b)

		fmt.Println(a.Quantile(1), b.Quantile(1))
	}
}

i would expect a and b to be similar if not identical, but that is not the case. a is around 99.99 which is the expected result. however, b is all over the place.

any ideas what might be going on here?

@spenczar
Copy link
Owner

spenczar commented Oct 4, 2017

Thanks for the detailed issue report, and thanks especially for the reproducer.

You're right that this looks quite wrong. The empty tdigest is getting very confused when it receives any of the heavy, high-weight centroids from the middle of a.

This is either a problem in the algorithm, or (much more likely) a serious problem in the implementation of it. I'm afraid this might take longer to turn around than #2 did, this looks pretty nasty.

@spenczar
Copy link
Owner

spenczar commented Oct 5, 2017

This might actually be a problem with the algorithm: tdunning/t-digest#78. There's a blog post that suggests a workaround I'll play with.

Since I wrote this library, Ted Dunning rewrote much of the paper to describe a different implementation which he calls a "merging TDigest" which solves most of these problems. Implementing merging TDigests would be a lot more work though - it would be a rewrite of this package. I'm up for that, it could just take a while.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants