memory leak in 0.8.2 #941
Well, memory usage is not that much on the first server, but on the second server memory usage is far higher...
Here is the pmap of the process:
As you can see, there is a 1.5G anon block at the start... I'll run the same command in an hour to see which part grows...
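For anyone who wants to repeat this check, a minimal sketch of the kind of pmap invocation involved (this assumes the process binary is named influxdb; adjust the lookup as needed):

```sh
# list mappings with their RSS, largest last; column 3 of `pmap -x` is RSS in KB
pmap -x $(pidof influxdb) | sort -n -k3 | tail -n 20
```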
After a day or so, the second server, acting only as a replica, is using 1.8G of resident RAM:
Got help debugging with pprof from jvshahid; here are some results:
Here is the result after some time:
The process memory consumption seen from the OS is around 1G, with a really big chunk of 800M... As proposed, I'm going to run even more debugging with the heap profiler enabled.
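A minimal sketch of how such a heap profile is typically inspected afterwards with pprof (the binary and profile paths below are placeholders, not the exact files shared in this thread):

```sh
# show the top heap allocators from a saved profile against the matching binary
go tool pprof --text ./influxdb /tmp/influxdb.profile.heap | head -n 30
```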
The Influx process is stuck at 200% CPU; trying to dump the thread stacks with gdb.
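A minimal sketch of a non-interactive gdb attach that dumps every thread's backtrace (the pid lookup assumes the process is named influxdb):

```sh
# attach, dump all thread backtraces, then detach
gdb -p $(pidof influxdb) -batch -ex 'thread apply all bt' > /tmp/influxdb-threads.txt 2>&1
```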
The profile of the debug run (which ended with a lock) is on Dropbox:
I had a better run with memory usage going up to 1+ gig. https://dl.dropboxusercontent.com/u/1965631/profile-prune-2.tar.gz
As requested on IRC, LOG files from rocksdb from both servers. https://dl.dropboxusercontent.com/u/1965631/rocksdb-logs-prune-master.tar.gz
The leader process (also the one receiving reads and writes) crashed with out of memory (not killed by the kernel OOM killer):
Well, this is not the complete message, it is TOO big, but I think you get the point: lots of goroutines, some in "chan receive" and some in "IO wait".
In total there are 73 goroutines in IO wait and 2148 in chan receive.
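Those counts can be pulled straight out of the crash output; a small sketch, assuming the dump was saved to a file:

```sh
# goroutine headers in a Go crash dump look like: "goroutine 1234 [chan receive]:"
grep -c '\[IO wait\]' influxdb-crash.log
grep -c '\[chan receive\]' influxdb-crash.log
```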
Re-installed the cluster on 4 nodes: 2 physical servers and 2 VMware VMs. Ubuntu 14.04, dedicated, influxdb 0.8.2. I provided everything I could, here and on IRC. Everything requested, including logs from a debug binary and LOG files. I'm still waiting for an answer on your side. Could you please, at least, change the status from "more information" to "acknowledged and working on this huge memory problem influx is having"? The last thing I could do, which will not be easy, is give ssh access to someone so they could have a look themselves...
I'm now seeing similar behavior here. We have three nodes in a cluster and they have been working fine until this morning. Today, one of the nodes suddenly started growing in terms of memory usage and keeps growing so far, although the other two nodes look normal. We're using 0.8.2.
We have a single node running 0.8.2 which started leaking memory right after upgrading it from 0.7.3.
Anyone on this thread who can provide a script to reproduce this issue would be very helpful. I tried to reproduce it a few days ago with no luck and had to take a look at other bugs and cut a release, which distracted me. I'll keep trying to reproduce the issue today. @localhots let me know if you can provide any help, since you're using one node only and that will make it easier to isolate the problem and repro it locally.
What are the sizes of your databases as reported by the filesystem?
Database size is 23GB. What kind of information would help you?
Also experiencing memory leaking in 0.8.2 that results in a database crash. I'm throwing only ~100-200 metrics per minute at the database and after a day or so it has eaten most of its 4GB of RAM (single-node setup). With less memory it would crash earlier. DB size is 313M but the issues started way before that. I'm not using the database for anything exotic, simply 2 shard spaces (one for metrics, one for grafana). Roughly 30 shards total (due to grafana/grafana#663). Running it on top of EC2 (Amazon Linux 2014.03). The machine does not have swap associated.
Here is an app that can be used to stress influxdb. Still need to add a few things to it, but it's working as is. https://github.com/dgnorton/influxdb_stress
Modifying the stress app to only write new series showed huge memory consumption. If you stop the app before the OOM error, the server stays up with the memory used, never giving it back, even if you don't write to the series anymore... This is easily reproducible with only one client and a batch size of 100 series:
As these series contain only one value and are never accessed afterwards, it's not write contention. A "list series" also locks the process at 100% CPU. I'm going to write a write/read script to add values to existing series and see how memory changes over time, but you already have something to look at now...
I've just confirmed that I can reproduce the leak @prune998 described above. As for the other leak in a cluster node I mentioned earlier, it happens even when there are no writes but some reads on the node. I haven't looked into it closely yet, though.
I was able to reproduce the read-only memory leak with this Haskell script consistently a little while ago. But now I can't reproduce the issue after I ported the script to Ruby, for some reason. Probably this leak only happens occasionally. When I was able to reproduce it, the behavior was as follows:
If I remember correctly, the memory usage was increasing faster than 1MB/s until I restarted the influxdb process. Unfortunately I cannot reproduce the behavior now with either of the scripts, so I guess there must be another trigger for the issue. EDIT: I've been using influxdb v0.8.3 on Ubuntu 12.04 throughout the tests and the client libraries are the latest ones.
@jvshahid We also don't introduce that many series per day (maybe ~50 on average and ~200 at maximum). I also remember the DB crashing over the weekend, when there were no new series introduced at all.
Has this always been the case, or is it a regression? I'm seeing something like this locally but haven't really had time to investigate. I was just thinking that if it's a regression, the quickest way to find the problem is probably a git bisect using the "stress script" mentioned above.
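A rough sketch of what that bisect could look like, taking 0.7.3 as the last known-good release mentioned earlier (repo URL and tag names are assumptions):

```sh
git clone https://github.com/influxdb/influxdb && cd influxdb
git bisect start
git bisect bad v0.8.2    # leaks, per the reports above
git bisect good v0.7.3   # reported as fine before the upgrade
# at each step: build, start the server, run the stress tool, watch RSS,
# then mark the revision with `git bisect good` or `git bisect bad`
```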
Can you guys report the number of series that you have so far and the output of running (the full invocation is restated a couple of comments below):

```sh
GODEBUG='gctrace=1' HEAP_PROFILE_MMAP=true ./influxdb --stdout --profile
```
You should use the binary found here
The command from the last comment is:
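```sh
# as reproduced in the follow-up report below
GODEBUG='gctrace=1' HEAP_PROFILE_MMAP=true ./influxdb --stdout --profile /tmp/influxdb.profile 2>&1 | tee /tmp/influxdb.stdout
```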
Following the instructions, when issuing the given command, I get an error "flag provided but not defined: -profile", and InfluxDB is not started. See below:

```sh
sudo GODEBUG='gctrace=1' HEAP_PROFILE_MMAP=true ./influxdb --stdout --profile /tmp/influxdb.profile 2>&1 | tee /tmp/influxdb.stdout
```

```
gc1(1): 3+2+733+2 us, 0 -> 0 MB, 18 (19-1) objects, 0/0/0 sweeps, 0(0) handoff, 0(0) steal, 0/0/0 yields
```
Perhaps a hard memory usage limit read from configuration is the way to go here. If series names are read into memory that's fine, but with no limit on how much memory series names can take, or no total limit, memory usage will surely run out of control given enough time. There should be a way to set a hard limit on how much memory influxdb will use, without having to guess by adjusting individual cache settings. In case this info is useful, we are running a four node cluster here with two replicas, and over a couple of days both replicas died with OOM while the two non-replicas are still going strong with no suspect memory usage. We can also easily replicate the memory issue by running the stress tool that @prune998 posted earlier, as well as our own graphite metric creating test tool, which I can provide if needed. Profiling can be done locally with either of these tools.
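For reference, the knob being guessed at today is the per-engine cache size; a sketch of the relevant config fragment, with section and key names assumed from the 0.8 sample config (worth double-checking against your own config.toml):

```toml
# LevelDB block-cache cap; a per-engine tuning knob rather than a hard
# process-wide memory limit (section/key names are assumptions)
[storage.engines.leveldb]
  lru-cache-size = "500m"
  max-open-files = 1000
```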
@perqa sorry about that, I think I forgot to enable profiling when I built that binary. I updated the binary to have profiling enabled; you can use the same link I posted earlier.
https://github.com/dgnorton/influxdb_stress ... added an option to have readers (clients executing queries). The queries aren't configurable from the command line or a file, but you can edit them in the source.
@jvshahid: Thanks for the update. I tried the new binary, but now I get a different error message:

```
./influxdb: error while loading shared libraries: libtcmalloc.so.4: cannot open shared object file: No such file or directory
```

I'm running Ubuntu 14 on Vagrant.
You may need to run
Yes, indeed, as well as ... The log files are available at
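For anyone else hitting the libtcmalloc.so.4 error with that binary: on Ubuntu the library normally comes from the gperftools runtime package. A sketch, assuming the package name (not confirmed in this thread):

```sh
# install the tcmalloc runtime and refresh the linker cache
sudo apt-get install libgoogle-perftools4
sudo ldconfig
```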
@perqa The memory profile shows very low memory usage; are you sure these profile files are from the right run?
It's the one and only profiling run I've done... so it must be the right one. What I did during the session was to log in to the web interface at port 8083, navigate to the right database, and issue one query:
That's not the point; you have to trigger the same memory leak behavior during the profiling run.
Excuse me @toddboom, I just noticed you added this thread to the 0.9.0 milestone. Does that mean this is not going to be fixed in the 0.8.x releases? I am running 0.8.3 in production because of aggregation function errors (which are fixed in 0.8.8, I assume), but as far as I know you have quite a different approach starting from 0.9.0, so I don't think I will migrate to 0.9.x in the near future :( So what would be the cause, or is there any workaround to reduce or free the memory used by influxdb, other than restarting the instance?
Hi @toddboom @jvshahid, I can confirm one situation that causes huge memory consumption and leaking. I am using the stress tool provided by @dgnorton, with 10 writers and 10 readers; the write batch is 3000 series per second per writer and there are 30000 series in total. Two instances run as a cluster on two ubuntu 12.04 precise servers. The LRU cache for leveldb is 500m and the other configs are pretty much the defaults from installation. I noticed the memory only leaks when I have queries like the ones below:
I am inclined to think that the last query
is the troublemaker here; I also have similar queries in production which will eventually use up all memory on the server. The query in question involves 3000 series in the db, and once the query gets executed things start going south... I can see memory usage ever increasing. In a nutshell, what I've seen is that whenever InfluxDB gets stuck in a query involving a large number of series, it triggers the memory exhaustion (see the sketch after this comment for the kind of query I mean). So the idea would be to manually control the query size to avoid the memory issue? Also, the split setting in the shard config has an impact on the following servers, I assume.
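The original queries didn't survive here; a hypothetical example of a query that fans out over thousands of series, using the 0.8-style regex in the from clause (database name, credentials, field name and series pattern are all placeholders):

```sh
# a single query matching every series written by the stress tool
curl -G 'http://localhost:8086/db/stress_db/series?u=root&p=root' \
  --data-urlencode 'q=select mean(value) from /.*/ group by time(1m)'
```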
Hi @jvshahid, could you post instructions for how to build an influxdb binary with profiling enabled? The version you posted above was compiled against GLIBC_2.17, which I do not have on Ubuntu 12.04, so I will have to build the binary myself in my own environment.
Current testing of v0.9.0 shows that this is no longer an issue, so we're closing it out. |
I still have a memory leak in Influxdb 0.8.2
Servers are VMware hosts, Ubuntu 14.04 LTS:

```
Linux poplar 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
```
It is a two node cluster with dedicated servers (nothing else is running except admin tools like logstash and diamond, not using much CPU or RAM).
The database has a one-day shard duration with replication = 2 and split = 1.
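A sketch of that shard-space definition as it would be created through the 0.8 HTTP API (endpoint and field names as I recall them from the 0.8 docs, so treat them as assumptions; the database name is a placeholder):

```sh
curl -X POST 'http://localhost:8086/cluster/shard_spaces/mydb?u=root&p=root' \
  -d '{"name": "default", "regex": "/.*/", "retentionPolicy": "inf",
       "shardDuration": "1d", "replicationFactor": 2, "split": 1}'
```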
Here is the influxdb process info:
Limits on the process:
Process mapping:
Global mem info:
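A sketch of the commands typically behind the four dumps above (the process-name lookup is an assumption):

```sh
PID=$(pidof influxdb)
ps -o pid,vsz,rss,pcpu,etime,args -p "$PID"   # influxdb process info
cat /proc/$PID/limits                         # limits on the process
pmap -x "$PID"                                # process mapping
free -m                                       # global mem info
```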
Let me know if I can provide more info... or tell me how to take a memory dump if this is what you need...