possible to resolve the issue for golang? #23
Comments
Thanks for the comment, @ultperf. I'm not quite sure what is meant by "pre-fork mode", because Golang also doesn't have fork()ing, right? "means one process per thread": do you mean "one thread per process"? One thread per Golang process would make it easier to create a binding for SharedHashFile for a Golang process running like this. But who runs their Golang processes like this? What are you hoping to achieve with SharedHashFile that cannot be achieved without it, e.g. by using a different architecture and/or components? |
yes "one thread per process".
p.s. : possible to make this happen? i can sponsor a few coffees for this. |
By [...]

Ideally, ultra-high-performing C programs can still use threads and be in one process, but you want to avoid context switching overhead, and you also want to avoid the overhead of allocating memory and cleaning it up (AKA GC for Golang).

In theory, a Golang program could use as many goroutines as CPU cores, so there'd be no context switching overhead. And the same program could also pre-allocate most or all of the memory it uses, thus nullifying the "have to share and manage memory between cores" statement above. In this scenario you wouldn't need [...]

But you also wouldn't want to use Golang's built-in hash table either. Why not? It relies on allocating memory which later has to be GC'd, and thus does not nullify the "have to share and manage memory between cores" statement. And general purpose LRU caches for Golang (e.g. [2]) suffer from the same issue because they are effectively built on top of Golang's built-in hash table / associative array, which will churn memory and GC at run-time. But if you implemented your own Golang hash table using pre-allocated memory to avoid GC, then it might end up looking something like [...]

So I'm not saying that I won't port [...]

[1] https://pkg.go.dev/github.com/valyala/fasthttp/prefork#section-readme |
https://www.techempower.com/benchmarks/#section=data-r21&test=cached-query

check all prefork mode of golang vs non prefork.

possible to get the binding done for golang? it's not critical but really hope to see it going though |
@ultperf, thanks for sending the link to the techempower round 21 cached query results.

Looking at the 'Cached queries' results for '20 queries (bar)', 'fasthttp-prefork' managed 353,362 responses per second, and 'fasthttp' only 68,312 responses per second. So the prefork version is about 5.2 times faster? And if I understand you correctly, the prefork version runs n x 1 thread processes instead of the non-prefork version running a 1 x n thread process?

The difference is so big that I'm wondering why one version is so much faster than the other. You hinted before that it's about Golang's GC. Is that the reason, or one of the reasons, and has anybody published an analysis online somewhere?

Having a look at the prefork code here [1], I noticed that its README publishes wrk stats with prefork (104,861 requests per second) and without prefork (97,553 requests per second). This is only 1.07 times faster for prefork. So why is the techempower benchmark 5.2 times faster for prefork, but prefork's own benchmark is only 1.07 times faster?

[1] https://github.com/valyala/fasthttp/tree/master/prefork |
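For anyone wanting to double-check the two ratios quoted above, here is the arithmetic as a tiny Go program. `speedup` is just a hypothetical helper name; the four throughput figures are the ones quoted in the comment.

```go
package main

import "fmt"

// speedup returns how many times larger throughput a is than throughput b.
func speedup(a, b float64) float64 {
	return a / b
}

func main() {
	// techempower round 21, cached queries, 20 queries:
	// fasthttp-prefork vs fasthttp, in responses per second.
	fmt.Printf("techempower: %.2fx\n", speedup(353362, 68312)) // 5.17x

	// fasthttp prefork README wrk stats:
	// with vs without prefork, in requests per second.
	fmt.Printf("README wrk:  %.2fx\n", speedup(104861, 97553)) // 1.07x
}
```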
in areas of memory use / caching / with regards to GC: once memory is used, the issue with golang as a GC + single process, multithreaded goroutine features is that the mem handling between multiple threads will be doing a lot of context switches between those threads with consideration to [...]

for server production use, the "cached query" of techempower is more "real life" because it's using a 20+ core CPU as benchmark, and caching for a golang program is the reason why i am asking u for the "sharedhashfile" golang binding.

pls help? i can sponsor a few coffees for this? truly need this... not sure if u can write it in pure go or cgo. preferably pure go binding or c->go asm or goasm

p.s. : it will be great if u can optionally make it a shared mem ipc for golang. that'll really be something. this module will be used everywhere. |
@ultperf, thanks for the info and clarifications. More questions for clarification :-)

Let's say, for example's sake, that the test is running on the 20+ core CPU and the number of processes or threads is 20. Running in non-prefork mode, there would be 1 process with 20 threads, and presumably 20 goroutines, or 1 goroutine per thread?

Are the 'cached queries' just cached in memory upon startup in whatever internal data / memory format is convenient to the implementation?

Presumably the prefork mode code starts, loads the cached query data into memory, and then fork()s 20 times? In this way it uses a similar amount of memory to the non-prefork mode, because the cached query data is in shared memory due to the fork()ing?
Confused by this statement: [...] In non-prefork mode, if the number of goroutines is kept less than the number of CPU cores, surely there wouldn't be any context switches?

Presumably the prefork and non-prefork modes will generate the same amount of garbage for GC at run-time, assuming the same amount of incoming queries? So, all other things being similar for processing a query, we might assume that the difference in performance, and the reason the prefork mode code is ~ 5 times faster, is largely due to the higher efficiency of the prefork mode GC at run-time?

When Golang performs GC, it needs to loop through every heap allocation whether it's going to be GC'd or not. This means the bigger the heap, the slower the GC. So this is already bad for the non-prefork mode code. Its heap will be ~ 20 times bigger than the heap of a single prefork process? Which means a 20 times longer concurrent GC?

Also, although the Golang GC is largely concurrent to minimize 'stop the world' GC processing tasks, in tests I have found in the past that new Golang heap allocations become much slower during concurrent GC, which in turn slows down the regular code that is running.

The prefork mode code does exactly the same thing as the non-prefork code but:

1. Presumably the heap for each process is going to be 20 times smaller, meaning the concurrent GC is going to be 20 times faster?
2. Assuming not all 20 processes GC at the same time, there will be a good chance of a new query being handled by a process not currently slowed down by concurrent GC, and presumably the query will be handled faster?

It would be interesting to experiment further with the prefork mode code to dynamically disable Golang GC altogether while handling queries, and periodically enable GC but disable handling queries. This way, queries will only ever be handled by processes guaranteed not to be subject to concurrent GC, in theory giving better overall performance? There could also be some kind of IPC mechanism to ensure that GC is always spread out evenly between the 20 processes?

That's enough questions and speculation for now :-) |
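The "disable GC while handling queries, then collect in between" experiment could start from something like the sketch below. It is only an illustration under assumptions: `handleBatch` and `serveWithManualGC` are hypothetical names, the "queries" are fake allocations rather than fasthttp requests, and there is no IPC between processes. It relies on two standard library calls: `debug.SetGCPercent(-1)`, which turns off automatic collection (unless a memory limit is configured), and `runtime.GC()`, which runs a collection and blocks until it completes.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// sink keeps the most recent buffer reachable so the allocations below
// really happen on the heap.
var sink []byte

// handleBatch is a hypothetical stand-in for serving n queries; each
// "query" allocates a buffer that becomes garbage on the next iteration.
func handleBatch(n int) int {
	for i := 0; i < n; i++ {
		sink = make([]byte, 1024)
		sink[0] = byte(i)
	}
	return n
}

// serveWithManualGC turns automatic GC off, serves queries in batches, and
// collects explicitly between batches, i.e. while no query is in flight.
// It returns the number of queries served and the GC cycles observed.
func serveWithManualGC(batches, perBatch int) (int, uint32) {
	old := debug.SetGCPercent(-1) // disable automatic GC
	defer debug.SetGCPercent(old) // restore the previous setting

	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	served := 0
	for b := 0; b < batches; b++ {
		served += handleBatch(perBatch) // no automatic GC during this call
		runtime.GC()                    // "stop taking queries, collect now"
	}

	runtime.ReadMemStats(&after)
	return served, after.NumGC - before.NumGC
}

func main() {
	served, cycles := serveWithManualGC(5, 10000)
	fmt.Println("queries served:", served)
	fmt.Println("GC cycles:", cycles)
}
```

In a prefork setup, each child process would run its own loop like this, and the parent (or some shared counter) would have to stagger the collection windows so that not all children stop serving at once.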
no, the goroutine is thread specific so each thread will spawn their own goroutines.
for fasthttp, there's only 1 goroutine i think, but in terms of the use of "goroutine" in 1 process with 20 threads, the goroutine can choose which cpu to run on (assuming we disregard numa). normally spawning multiple goroutines will create a lot of memory issues so that's why panjf2000/ants was created to address this problem. how golang works depends on developer execution (and how he writes the code). with regards to sharedhashfile, it is kind of irrelevant to non-expert golang coders, either on goroutine usage or feature / function placement etc
i have no idea. but in prefork mode, the reason it's faster is mostly because there's no context switching and memory management between 20+ cpu threads. i dont want to confuse u further but mostly because golang doesnt have context switching memory management between multiple threads in a single process (non pre-fork). the overhead to manage non-prefork mode is much higher than just pure single thread per cpu.
in prefork mode, each cached query data is being used by the thread that spawns it only. they are not shared between forked processes. no memory sharing between processes in prefork mode. that's why i took notice of your sharedhashfile.
there still will be. each goroutine is like spawning a separate thread. each thread means context switching and the overhead of keeping such cpu core threads irrespective of num of cpu cores.
keeping the overhead at 1 thread managing the single garbage collection "stuff" is cheaper than single process, multi threaded garbage collection "cycles". talking about real world programs, garbage will be there for golang programs (real world), unavoidable. so prefork mode with sharedhashfile can be significantly better. however, the only overhead will be the CGO call at around 100ns per call. i've tested lru cache written in c interfaced with cgo and it's more consistent though 10-20x lower throughput but more predictable in usage instead of latencies created by GC.
techempower benchmark shows 1 side BUT depending on the skill of the programmer for real life program. i'm talking big programs here not the ones written for benchmarking purposes, single thread prefork only need to manage the thread's GC. single GC because CPU is pinned to the thread/process.
sort of i think but GC is supposed to be working in the background so if u want to think that way, i guess it's "right". i think u also have to think about the fact that when experts use prefork mode, they already have the knowhow to work around other areas of efficiencies... like looking at your sharedhashfile etc.
expert coder will have ways to work around such issues etc.
there are plenty of ways to manage GC and to tame it. so depending on what kind of app is being written and how it's run, u can write it in ways that take GC into consideration
for IPC, i'm actually waiting for bytedance to opensource their ShmIPC this year. can u do a golang binding for ur sharedhashfile to test first?
|
i can sponsor a few coffees for this.
also, the server is running in "pre-fork mode", means one process per thread. so there's thread isolation "goroutine is using gnet's ants" https://github.com/panjf2000/ants or bytedance's gopool
pls do a golang binding for this and reduce the issues for golang with the use case scenario i've mentioned above.
with ref to this:
#22
thx in advance! really appreciate this. will repay ur kindness in future in other ways.