InfluxDB takes 3 hours to start #5764
Comments
I can provide the startup log and any stats, including stack traces taken with gdb during startup.
Can you upload your startup log?
Looks like loading the WAL files takes a while... That's a real problem, as the service may be down for a while after a restart and customers may lose data points, which may incur penalties under some SLA. What about parallelizing WAL loading? Or start accepting connections before the WALs are loaded, store incoming data points in a new WAL, and merge everything after startup completes?
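A rough sketch of what parallel shard/WAL loading could look like in Go; `loadShard`, `loadAllShards`, and the example paths are illustrative placeholders, not InfluxDB's actual startup code:

```go
package main

import (
	"fmt"
	"sync"
)

// loadShard is a stand-in for the per-shard work done at startup
// (replaying WAL segments, rebuilding the in-memory index, ...).
func loadShard(path string) error {
	fmt.Println("loading", path)
	return nil
}

// loadAllShards loads shards concurrently, capping the number of
// in-flight loads so a large database does not spawn unbounded goroutines.
func loadAllShards(paths []string, workers int) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	sem := make(chan struct{}, workers)

	for _, p := range paths {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(path string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := loadShard(path); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = fmt.Errorf("%s: %v", path, err)
				}
				mu.Unlock()
			}
		}(p)
	}
	wg.Wait()
	return firstErr
}

func main() {
	// Hypothetical shard directories, just to exercise the sketch.
	paths := []string{"/var/lib/influxdb/data/db/rp/1", "/var/lib/influxdb/data/db/rp/2"}
	if err := loadAllShards(paths, 4); err != nil {
		fmt.Println("load failed:", err)
	}
}
```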
The WAL files are all zero length, since there have been no writes for a long time and I did several restarts. The startup log does not provide good info here. From my observations, something happens after loading the WAL files and before the next log records appear. I assume the server can't spend hours loading zero-length WAL files.
Here are 2 stack traces, taken at a time when the log says nothing and the server just burns CPU. I omitted all sleeping/idle threads; the only work seems to be done by this one thread, calling "NewFieldCodec". That is where most of the startup time is spent.
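The comment doesn't say how the traces were captured. For reference, one way to get goroutine stacks out of a running Go service without killing it (sending SIGQUIT also dumps stacks, but terminates the process) is a small signal handler like the sketch below; `dumpStacksOnSignal` is a hypothetical helper, not something InfluxDB ships:

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// dumpStacksOnSignal writes every goroutine's stack to stderr when the
// process receives SIGUSR1, so you can see where a service is spending
// its time during a long startup without stopping it.
func dumpStacksOnSignal() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
}

func main() {
	dumpStacksOnSignal()
	select {} // block forever; send SIGUSR1 to this process to dump stacks
}
```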
That stack trace points to loading the in-memory index which is taking too much time. If you have a test env, I would be curious to see if #5372 helps. There are some issues I still need to fix with that PR so if you try it out, make sure it's on a test env.
I can test it, but I'd need a Debian package to install it. If somebody can build it for me, or there is CI doing that, that would be very nice. I'm not a Go guy, sorry...
I have a similar case, it takes ~1:20h to start up my 19GB database.
Same problem here in combination with a graphite listener.
Same problem here. Startup performance and overall performance have again degraded on 0.10.1, to the point where it cannot handle a database of 28 days at less than 1 million data points per day on a dedicated machine with 16GB RAM and dual SSDs. Pre-0.9 I was able to keep a year of the same data in one database; starting and loading was never an issue, and only queries spanning months could take minutes to run. Here is a snapshot of the log after 45 minutes of startup (still processing and killing my system): `2016/02/28 15:11:16 InfluxDB starting, version 0.10.1, branch 0.10.0, commit b8bb32e, built unknown`
We are having this problem as well, but to be fair we've had it for a long time. I'm not sure if it is even the same thing I reported initially in #4952, but we haven't had any huge improvements in startup time since then. While we've put up with it, it became painful when we tripped over #5965. I was reminded of it again today when I had to make a config change, which required a restart and caused quite lengthy downtime. We have only about 70G of data on 8 CPUs and 60G of memory, with about 600k series. During startup, the influxdb server uses only 1.5 CPUs, there is no real disk IO to speak of, and it takes ages to do whatever it is doing. If there is anything I can contribute to sorting this out, please let me know. EDIT - We are using version influxdb-0.10.3-1
@zstyblik Yes, but it would be very helpful if someone with a slow load time could try that PR out to see if it helps.
@cheribral How many shards do you have?
@jwilder if somebody can provide a build with that fix, then I'll bet it won't be a problem. Quote from a couple of comments above:
@jwilder We have 410 shards (counted from the output of 'show shards') across 10 databases.
@jwilder I set up a similar box to our production one, and tried the branch on a copy of the real database. The startup time was 7 minutes, and the limiting factor seems to have been disk IO this time, which is much better than it was, so the change seems to have worked. Since the disk doesn't really ever break a sweat once the server is up and we are running normally, it would seem a bit silly to spend more on disks simply to cut down the startup time. Do all the files absolutely have to be loaded before accepting writes? Is there any chance the loading could be made lazier so that the listening socket can be available sooner?
Just as a test, I swapped in a much faster disk, and while it didn't saturate CPU or the disk by any means, it started in under 2 minutes.
@cheribral That's good news. How long was it taking before? There is definitely more room for improvement in the load time. Lazy loading shards might be possible. The shard has to be loaded before it can take a write or serve queries though.
+1 for lazy load. That would solve my case completely. In my case a shard takes writes for a short time, then it is just stored for possible reads. The reader is OK with waiting for "cold" shards to load. Can we have an option in the config file to enable/disable it? Just so we don't harm those who prefer to load everything upfront...
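A minimal sketch of what such lazy loading could look like, assuming a hypothetical `Shard` type guarded by a `sync.Once`; this is illustrative only, not InfluxDB's shard API:

```go
package main

import (
	"fmt"
	"sync"
)

// Shard is an illustrative stand-in for a storage shard that defers its
// expensive load (WAL replay, index rebuild) until the first read or write.
type Shard struct {
	path     string
	loadOnce sync.Once
	loadErr  error
}

// load does the expensive startup work for this shard.
func (s *Shard) load() error {
	fmt.Println("loading shard at", s.path)
	return nil
}

// ensureLoaded runs load exactly once, the first time the shard is touched.
func (s *Shard) ensureLoaded() error {
	s.loadOnce.Do(func() { s.loadErr = s.load() })
	return s.loadErr
}

// WritePoints lazily loads the shard before accepting the write.
func (s *Shard) WritePoints(points [][]byte) error {
	if err := s.ensureLoaded(); err != nil {
		return err
	}
	// ... append points to the WAL / in-memory cache ...
	return nil
}

func main() {
	s := &Shard{path: "/var/lib/influxdb/data/db/rp/42"}
	// The expensive load happens here, on first write, not at process start.
	if err := s.WritePoints(nil); err != nil {
		fmt.Println("write failed:", err)
	}
}
```

The trade-off the commenter asks about (a config toggle) would decide whether shards are opened eagerly at startup or on first access, at the cost of slower first queries against "cold" shards.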
@jwilder Before it was taking over 30 minutes.
Fixed via #5372
I have a POC with InfluxDB and a 14GB database. When I restart the service or restart the machine, it takes 3 hours to start up and begin serving. This is unacceptable, since my production can't afford 3 hours of downtime. The maximum downtime I can accept is ~1 minute.
Please suggest what can be changed in the configuration, or anything else, to make it come up instantly. I understand that it might execute requests a bit slower at the beginning, while memory is still "cold". But it makes much more sense to start serving, even with higher response times, than to not serve at all for 3 hours.
My server is an 8-CPU, 32GB RAM, SSD machine (my local testing env; the move to an AWS-hosted env depends on this POC's success). It is running Ubuntu 14.04,
InfluxDB version 0.10.0, branch 0.10.0, commit b8bb32e, built 2016-02-04T17:06:04.850564.