
v1.0.3: panic: interface conversion: interface {} is string, not float64 #433

Closed
WesleyDavid opened this issue May 15, 2020 · 4 comments
Labels: bug (Something isn't working), influxdb (Issue affects InfluxDB)

Comments

@WesleyDavid
Contributor

Trickster Version: 1.0.3 (buildInfo: 2020-04-16T14:37:36+0000 41fbf794f4457d1d6f3453ab480612d5eb10ae92, goVersion: go1.13.8, goArch: amd64)
Cache kind: in-memory
Backend: InfluxDB 1.7
Host OS: CoreOS / Linux
Runtime: Docker

There's a decent amount of traffic coming through the cluster, 60-90 queries per second. We're mirroring some customer queries that would normally go straight to a time series backend, and that mirrored traffic is pushed through a Trickster instance. As of last night, we noticed Trickster had crashed, which is very uncharacteristic; we had never seen a crash until now. The stack trace is rather short:

May 15 14:16:23 trickster: panic: interface conversion: interface {} is string, not float64 
May 15 14:16:23 trickster: goroutine 105279725 [running]: 
May 15 14:16:23 trickster: github.com/Comcast/trickster/internal/proxy/origins/influxdb.(*SeriesEnvelope).Sort.func1(0x0, 0xc15feea0e8, 0xc703650390, 0xc703714620, 0xc703650360, 0xc15feea140, 0xc42311dc20, 0x2, 0x2) 
May 15 14:16:23 trickster: 	/go/src/github.com/Comcast/trickster/internal/proxy/origins/influxdb/series.go:415 +0x2a2 
May 15 14:16:23 trickster: created by github.com/Comcast/trickster/internal/proxy/origins/influxdb.(*SeriesEnvelope).Sort 
May 15 14:16:23 trickster: 	/go/src/github.com/Comcast/trickster/internal/proxy/origins/influxdb/series.go:414 +0x21f 
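
The panic message corresponds to a failed Go type assertion: it suggests a value the code expected to be a float64 (such as a timestamp) arrived as a string and was asserted directly. A minimal sketch (not Trickster's actual code) that reproduces the same message:

```go
package main

// Minimal sketch (not Trickster's code): a direct type assertion on an
// interface{} value panics when the underlying type is not the one asserted.
func main() {
	var v interface{} = "2020-05-15T21:16:23Z" // a timestamp arriving as a string

	// Panics with: interface conversion: interface {} is string, not float64
	_ = v.(float64)
}
```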

Prior to that, things appear normal. We have debug logging turned on, and requests were coming in and returning with status code=200. These were the last few lines just before the panic above (I removed the query responses with code=200 because they show customer data):

May 15 14:16:23 trickster: time=2020-05-15T21:16:23.829246636Z app=trickster caller=proxy/engines/deltaproxycache.go:364 level=debug event="delta routine completed" cacheKey=hostname:8086.ae613707bc6c18d064990f968838a72a cacheStatus=phit reqStart=1589577083 reqEnd=1589577383 extentsFetched=1589577381-1589577383 
May 15 14:16:23 trickster: time=2020-05-15T21:16:23.832029195Z app=trickster caller=cache/memory/memory.go:130 level=debug event="memory cache retrieve" cacheKey=hostname:8086.1306f7307221e2a67988174d2575f76c 
May 15 14:16:23 trickster: time=2020-05-15T21:16:23.83222794Z app=trickster caller=runtime/asm_amd64.s:1357 level=debug event="memorycache cache store" is_direct=true cacheKey=hostname:8086.ae613707bc6c18d064990f968838a72a length=0 ttl=24h0m0s 
May 15 14:16:23 trickster: time=2020-05-15T21:16:23.83234411Z app=trickster caller=cache/memory/memory.go:130 level=debug event="memory cache retrieve" cacheKey=hostname:8086.06a06cf54975626baf9f9675edebcd1c 
May 15 14:16:23 trickster: time=2020-05-15T21:16:23.832442905Z app=trickster caller=proxy/engines/deltaproxycache.go:364 level=debug event="delta routine completed" cacheKey=hostname:8086.06a06cf54975626baf9f9675edebcd1c cacheStatus=hit reqStart=1589576400 reqEnd=1589577000 
May 15 14:16:23 trickster: time=2020-05-15T21:16:23.841262719Z app=trickster caller=proxy/engines/deltaproxycache.go:364 level=debug event="delta routine completed" cacheKey=hostname:8086.0a8b348365e76cb5a6f64b8b3f1a7457 cacheStatus=kmiss reqStart=1589576700 reqEnd=1589577300 
May 15 14:16:23 trickster: time=2020-05-15T21:16:23.841711551Z app=trickster caller=runtime/asm_amd64.s:1357 level=debug event="memorycache cache store" cacheKey=hostname:8086.0a8b348365e76cb5a6f64b8b3f1a7457 length=0 ttl=24h0m0s is_direct=true 
May 15 14:16:23 trickster: time=2020-05-15T21:16:23.881875306Z app=trickster caller=proxy/engines/deltaproxycache.go:364 level=debug event="delta routine completed" cacheKey=hostname:8086.c2900bc5172618fe8d1854323d3d5b44 cacheStatus=kmiss reqStart=1589576760 reqEnd=1589577360 
May 15 14:16:23 trickster: time=2020-05-15T21:16:23.882575724Z app=trickster caller=cache/memory/memory.go:130 level=debug event="memory cache retrieve" cacheKey=hostname:8086.cfe57a045bdb40592c38a09bf14d68ec 
May 15 14:16:23 trickster: time=2020-05-15T21:16:23.882677636Z app=trickster caller=proxy/engines/deltaproxycache.go:364 level=debug event="delta routine completed" cacheKey=hostname:8086.cfe57a045bdb40592c38a09bf14d68ec cacheStatus=hit reqStart=1589576460 reqEnd=1589577360 
May 15 14:16:23 trickster: time=2020-05-15T21:16:23.882792281Z app=trickster caller=runtime/asm_amd64.s:1357 level=debug event="memorycache cache store" cacheKey=hostname:8086.c2900bc5172618fe8d1854323d3d5b44 length=0 ttl=24h0m0s is_direct=true 

On our end, the infrastructure side of things, we haven't (knowingly) changed anything; we've been letting it run for days to gather statistics for us to analyze. However, since this is a mirror of customer traffic, any number of arbitrary queries could be coming through. Our current working theory is that some new query that InfluxDB otherwise absorbs (either by responding with a 200 or by gracefully erroring out with a 400 or 500 response) is tripping a bug in Trickster.

Some Context

As a result of this experience, we decided to do a quick test of the latest beta build, 1.1 beta 3, which resulted in seemingly unrelated crashes within minutes of the process coming up. That experience resulted in a separate issue, which may or may not be related, but the context and timing seemed relevant to note.

@WesleyDavid
Contributor Author

I was able to grab another stack trace, this one seemingly more complete, for a similar panic that happened a little after the one that opened this issue:

2020/05/15 19:37:10 http: panic serving 104.198.20.141:60806: interface conversion: interface {} is string, not float64 
goroutine 156820450 [running]: 
net/http.(*conn).serve.func1(0xc8998bcfa0) 
	/usr/local/go/src/net/http/server.go:1767 +0x139 
panic(0xad2fc0, 0xc046c98a50) 
	/usr/local/go/src/runtime/panic.go:679 +0x1b2 
github.com/Comcast/trickster/internal/proxy/origins/influxdb.(*SeriesEnvelope).CropToRange(0xc3fac908c0, 0x0, 0xed650e530, 0x1168260, 0x0, 0xed650e620, 0x1168260, 0x0, 0x0, 0x0) 
	/go/src/github.com/Comcast/trickster/internal/proxy/origins/influxdb/series.go:354 +0xdca 
github.com/Comcast/trickster/internal/proxy/engines.DeltaProxyCacheRequest(0xca1e60, 0xc046c98780, 0xc6deb8f200) 
	/go/src/github.com/Comcast/trickster/internal/proxy/engines/deltaproxycache.go:303 +0x16e1 
github.com/Comcast/trickster/internal/proxy/origins/influxdb.(*Client).QueryHandler(0xc0000a58c0, 0xca1e60, 0xc046c98780, 0xc6deb8f200) 
	/go/src/github.com/Comcast/trickster/internal/proxy/origins/influxdb/handler_query.go:39 +0x156 
net/http.HandlerFunc.ServeHTTP(0xc0001e4560, 0xca1e60, 0xc046c98780, 0xc6deb8f200) 
	/usr/local/go/src/net/http/server.go:2007 +0x44 
github.com/Comcast/trickster/internal/util/middleware.WithResourcesContext.func1(0xca1e60, 0xc046c98780, 0xc6deb8f100) 
	/go/src/github.com/Comcast/trickster/internal/util/middleware/config_context.go:35 +0x265 
net/http.HandlerFunc.ServeHTTP(0xc0000e77c0, 0xca1e60, 0xc046c98780, 0xc6deb8f100) 
	/usr/local/go/src/net/http/server.go:2007 +0x44 
github.com/Comcast/trickster/internal/util/middleware.Decorate.func1(0xca20a0, 0xc0ccf212c0, 0xc6deb8f100) 
	/go/src/github.com/Comcast/trickster/internal/util/middleware/metrics.go:35 +0x145 
net/http.HandlerFunc.ServeHTTP(0xc0000e7810, 0xca20a0, 0xc0ccf212c0, 0xc6deb8f100) 
	/usr/local/go/src/net/http/server.go:2007 +0x44 
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0001a4000, 0xca20a0, 0xc0ccf212c0, 0xcc6a321d00) 
	/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:212 +0xe2 
github.com/gorilla/handlers.CompressHandlerLevel.func1(0xca27a0, 0xc644842460, 0xcc6a321d00) 
	/go/pkg/mod/github.com/gorilla/[email protected]/compress.go:148 +0x19b 
net/http.HandlerFunc.ServeHTTP(0xc0001dfa40, 0xca27a0, 0xc644842460, 0xcc6a321d00) 
	/usr/local/go/src/net/http/server.go:2007 +0x44 
net/http.serverHandler.ServeHTTP(0xc0001ae1c0, 0xca27a0, 0xc644842460, 0xcc6a321d00) 
	/usr/local/go/src/net/http/server.go:2802 +0xa4 
net/http.(*conn).serve(0xc8998bcfa0, 0xca4a60, 0xc5f5f76b00) 
	/usr/local/go/src/net/http/server.go:1890 +0x875 
created by net/http.(*Server).Serve 
	/usr/local/go/src/net/http/server.go:2928 +0x384 

I looked around to see if there was any recorded interaction with the socket 104.198.20.141:60806 (that IP is an AWS NLB that sits in front of Trickster), hoping to find a record of the HTTP payload. The NLB logs do not contain that information, and Trickster did not log anything concerning that socket before the panic, so it doesn't appear there is any more detailed information on what exactly was sent that could have triggered the panic.

@jranson
Member

jranson commented May 19, 2020

Thanks for sharing this panic. This is an interesting one, because it means the data in the time field was not a float64, or that we were looking at the wrong column for the timestamp for some reason. In beta4 (for which the PR should be approved shortly), we cleaned this up a bit: we no longer assume a float64 for a field named time, and we only cache the point if its value is convertible to a float64. That should prevent the panic, and if you happen to see any interesting artifacts in your dashboards in beta4 and beyond, they may be related to this condition, so we'll be able to capture those queries more easily for analysis going forward. Ultimately we'll want to log this as a warning and attach an event to the tracer, but in the short term, let's see if we can further root-cause why the type conversion sometimes fails, based on any now non-fatal anomalies. That way I don't add a warning that could spam the log on every timestamp of a problematic query every time someone requests it.
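
For illustration, here is a minimal sketch of the guard described above, using a hypothetical helper (the actual beta4 change may differ): the two-value "comma ok" type assertion lets the point be skipped, and a warning logged, instead of panicking when the value is not a float64.

```go
// timestampAsFloat is a hypothetical helper illustrating the comma-ok guard
// described above; it is not the actual beta4 code.
func timestampAsFloat(v interface{}) (float64, bool) {
	f, ok := v.(float64) // two-value assertion: never panics
	if !ok {
		// Value arrived as a string (or some other type); the caller can
		// skip caching this point and log a warning instead of crashing.
		return 0, false
	}
	return f, true
}
```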

@WesleyDavid
Contributor Author

WesleyDavid commented May 19, 2020

Very interesting. It's puzzling to us because we've had literally billions of queries go through that Trickster instance, and then suddenly panic after panic with this type error, out of nowhere. We still haven't been able to figure out what the query string was, and now that we're back on 1.0.3 we haven't seen any of them since last week. ¯\_(ツ)_/¯

I'll roll out beta 4 when it's ready and see what we catch in the logs!

jranson pushed a commit that referenced this issue May 19, 2020
* fix cross-goroutine map collision panic on headers #432

* if timestamp does not cast to float64, don't cache and log temp warning #433

* allow customizable health check path

* show good /trickster/config text post cfg reload

* bump version
jranson added the bug and influxdb labels on Jun 5, 2020
@jranson
Member

jranson commented Apr 24, 2021

@WesleyDavid closing this one as you indicated you've been using v1.1.5 without problems

jranson closed this as completed on Apr 24, 2021