
1.1 beta 3: "fatal error: concurrent map iteration and map write" #432

Closed

WesleyDavid opened this issue May 15, 2020 · 6 comments

@WesleyDavid
Contributor

Previously we were testing out 1.0.3, and I just dropped 1.1 beta 3 in, pulled fresh from Docker Hub:

~ $ trickster --version
Trickster version: 1.1.0-beta3, buildInfo: 2020-05-15T04:04:07+0000 55386eebcc1b87685ee5eee1403183d669838981, goVersion: go1.14.2, copyright: © 2018 Comcast Corporation

We're using an in-memory cache. Our backend is InfluxDB 1.7. We're sending about 5,000 queries per minute through Trickster, and within just a minute or two of the container being up we're seeing:

May 15 13:41:38 trickster: fatal error: concurrent map iteration and map write 
May 15 13:41:38 trickster: goroutine 6884098 [running]: 
May 15 13:41:38 trickster: runtime.throw(0xbcd7eb, 0x26) 
May 15 13:41:38 trickster: 	/opt/hostedtoolcache/go/1.14.2/x64/src/runtime/panic.go:1116 +0x72 fp=0xc002b5cb28 sp=0xc002b5caf8 pc=0x4349d2 
[...]

...and the usual avalanche of about a million stack-trace log lines after that. Our running config is fairly simple, with just one cache and one origin. We've given Trickster about 200GB of memory and run it in a Docker container on Linux (technically CoreOS). I can provide any further details that would be helpful in tracking this down, but so far that's about all I've got. 😄
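For anyone unfamiliar with this failure mode: Go's runtime deliberately crashes the process when it detects one goroutine ranging over a plain map while another goroutine writes to it. A minimal illustration (not Trickster code) that reliably triggers the same fatal error:

package main

func main() {
	m := map[string]string{"a": "1"}
	// Writer goroutine mutates the map in a tight loop.
	go func() {
		for {
			m["b"] = "2"
		}
	}()
	// Iterating the same map here trips the runtime's detector:
	// "fatal error: concurrent map iteration and map write".
	for {
		for range m {
		}
	}
}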

@jranson
Member

jranson commented May 15, 2020

Thanks! It would be good to see deeper into the stack trace, to know which lines of code triggered the panic (e.g., the last tricksterproxy-packaged .go file and line number before the trace drops into runtime/panic.go), so that we can identify which map it is and then implement the proper locking around it. Is that something you'd be able to dig out?

@WesleyDavid
Contributor Author

Catching up on this: there were three "fatal error: concurrent map iteration and map write" panics in rapid succession. We manually restarted the process each time; five minutes or so of traffic would pass through, and then another panic would hit. After the third panic in about 15 minutes, we reverted back to 1.0.3.

I figured more data is best data, so here are the full stack traces, each around 7,000 lines:

@jranson
Member

jranson commented May 18, 2020

Excellent, thanks! We're taking a look!

@jranson
Member

jranson commented May 18, 2020

	/opt/hostedtoolcache/go/1.14.2/x64/src/runtime/map.go:853 +0x552 fp=0xc002b5cba8 sp=0xc002b5cb28 pc=0x40f872 
net/http.Header.Clone(0xc000eb2330, 0xcca640) 
	/opt/hostedtoolcache/go/1.14.2/x64/src/net/http/header.go:97 +0xaa fp=0xc002b5cc98 sp=0xc002b5cba8 pc=0x6c1aca 
github.com/tricksterproxy/trickster/pkg/proxy/engines.DeltaProxyCacheRequest(0xcbdcc0, 0xc00debcf00, 0xc00daa4b00) 
	/home/runner/work/trickster/trickster/pkg/proxy/engines/deltaproxycache.go:421 +0x126b fp=0xc002b5d6b0 sp=0xc002b5cc98 pc=0x9a1f3b 

Looks like a concurrent map access on the request Headers in all three traces. Should be an easy one to track down and fix up. We'll aim to have beta4 with this patch ready tomorrow.
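The trace shows net/http.Header.Clone iterating the header map while another goroutine writes to it. As a rough sketch of the usual remedy (this is not the actual Trickster patch, and the sharedHeaders type here is hypothetical): serialize writes behind a mutex and hand each goroutine its own clone to iterate:

package main

import (
	"net/http"
	"sync"
)

// sharedHeaders is a hypothetical wrapper that makes a header map safe
// to mutate and snapshot from multiple goroutines.
type sharedHeaders struct {
	mu sync.Mutex
	h  http.Header
}

// Set writes a key while holding the lock.
func (s *sharedHeaders) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.h.Set(key, value)
}

// Snapshot clones under the lock, so Clone never iterates the map while
// a concurrent Set is writing to it. Callers may then read or range
// over the returned copy freely.
func (s *sharedHeaders) Snapshot() http.Header {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.h.Clone()
}

func main() {
	s := &sharedHeaders{h: make(http.Header)}
	done := make(chan struct{})
	go func() {
		for i := 0; i < 10000; i++ {
			s.Set("X-Example", "1") // concurrent writes, now safe
		}
		close(done)
	}()
	for i := 0; i < 10000; i++ {
		_ = s.Snapshot() // concurrent clones, now safe
	}
	<-done
}

An alternative is to clone the headers once, before the request fans out across goroutines, so that no goroutine ever shares the live map.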

jranson pushed a commit that referenced this issue May 19, 2020
* fix cross-goroutine map collision panic on headers #432

* if timestamp does not cast to float64, don't cache and log temp warning #433

* allow customizable health check path

* show good /trickster/config text post cfg reload

* bump version
@WesleyDavid
Contributor Author

WesleyDavid commented May 20, 2020

Pulled down the 1.1.0 Beta 4 Docker image and unleashed the traffic on it. Got a "fatal error: concurrent map iteration and map write" panic after about 10 minutes. We peaked at around 4,000 QPS just before it cratered.

Stack trace here. EDIT: Fixed the gist link. If you clicked the gist via an email alert, it pointed to an older Gist. Sorry!

Back to 1.0.3 for now! 😄

@jranson
Member

jranson commented Jun 2, 2020

Wesley and I worked through this in a separate channel, and the fix for the map iteration was applied in Beta 6. We'll close for now, and Wesley can reopen if any more issues arise relating to InfluxDB.

@jranson jranson closed this as completed Jun 2, 2020