
Optimize indexing (translog.durability=async) by using LinkedBlockingQueue. Closes #45371 #45450

Closed

Conversation

dengweisysu
Contributor

Using a LinkedBlockingQueue to decouple Translog-Writing from Translog-Flushing.
This request contains two commits:

  1. Extract an ITranslog interface from Translog
  2. Add AsyncTranslog to replace Translog when in async translog durability mode
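
A minimal, self-contained sketch of the idea described above, with hypothetical names (AsyncTranslogSketch, drainLoop, writeToTranslogFile) rather than the classes in this PR: indexing threads enqueue serialized operations into a bounded LinkedBlockingQueue and return immediately, while a single background thread drains the queue and performs the actual translog file writes.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the decoupling idea, not the AsyncTranslog code in this PR.
public class AsyncTranslogSketch {
    private final BlockingQueue<byte[]> queue;   // bounded buffer of serialized operations
    private final Thread writerThread;
    private volatile boolean closed;

    public AsyncTranslogSketch(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.writerThread = new Thread(this::drainLoop, "async-translog-writer");
        this.writerThread.start();
    }

    // Called by indexing threads: never blocks on translog I/O, only on a full queue.
    public void add(byte[] serializedOperation) throws InterruptedException {
        queue.put(serializedOperation);          // back-pressure once the buffer is full
    }

    private void drainLoop() {
        try {
            while (!closed || !queue.isEmpty()) {
                byte[] op = queue.poll(100, TimeUnit.MILLISECONDS);
                if (op != null) {
                    writeToTranslogFile(op);     // stand-in for the buffered write + periodic fsync
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void writeToTranslogFile(byte[] op) {
        // placeholder: in the real proposal this would append to the current translog generation
    }

    public void close() throws InterruptedException {
        closed = true;
        writerThread.join();
    }
}

The trade-offs raised by the reviewers below follow directly from this shape: callers no longer get a Translog.Location back, and the queue itself holds pending operations in memory.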

@elasticcla

Hi @dengweisysu, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@dengweisysu
Contributor Author

Hi @dengweisysu, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

I have added the e-mails to my GitHub profile. Is it OK now?

@dakrone dakrone added the :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. label Aug 12, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

@dnhatn left a comment

Thank you for working on the proposal. I don't think we should change the translog this way (see my comments). However, we can work together to improve the performance of the translog. Can you share more information about your benchmark with us? How does the translog take 6 minutes out of 18 minutes in your benchmark? Thank you!

//produce using ring buffer
dispatcher.produce(new OperationEvent(operation, this.translog, this.unConsumeCounter));
//async return no location
return null;
Member

This breaks wait_for_refresh and other refresh features.

Contributor Author

This breaks wait_for_refresh and other refresh features.

As far as I know, wait_for_refresh is not suitable for scenarios that require high indexing performance, because it makes write threads wait and produces smaller segments (which in turn increases the time to resolve doc versions when indexing).

So is it really necessary to return the low-level info (the offset in the translog file) under async durability?

@@ -69,6 +70,9 @@
public static final Setting<Translog.Durability> INDEX_TRANSLOG_DURABILITY_SETTING =
new Setting<>("index.translog.durability", Translog.Durability.REQUEST.name(),
(value) -> Translog.Durability.valueOf(value.toUpperCase(Locale.ROOT)), Property.Dynamic, Property.IndexScope);
public static final Setting<Integer> TRANSLOG_ASYNC_EVENT_BUFFER_SIZE =
Setting.intSetting("indices.translog.async_event_buffer_size", 256 * 1026, Property.NodeScope);
Member

This would consume a significant amount of memory.

Contributor Author

This can be set to a smaller size.
But using memory to buffer translog operations while the writer is unavailable is the key point of this proposal; it's a trade-off between indexing speed and memory.
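
For a rough sense of scale, and only as an estimate (it assumes each pending event retains its serialized operation bytes, which the diff does not state): at the ~3 KB per operation reported in the benchmark later in this thread, a buffer of 256 * 1024 pending events could pin on the order of 256 × 1024 × 3 KB ≈ 768 MB, which is presumably the footprint the comment above is pointing at.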

IndexSettings indexSettings = engineConfig.getIndexSettings();
if (translogConfig.isDurabilityAsync()) {
//using ayncTranslog to improve write performance
ExecutorService executor = engineConfig.getThreadPool().executor(Names.ASYNC_WRITE_TRANSLOG);
Member

Changing the translog durability setting would no longer work.

@dengweisysu
Contributor Author

Thank you for working on the proposal. I don't think we should change the translog this way (see my comments). However, we can work together to improve the performance of the translog. Can you share more information about your benchmark with us? How does the translog take 6 minutes out of 18 minutes in your benchmark? Thank you!

As I mentioned in issue #45371, writing is easily blocked waiting for the lock, like this:
[screenshot 20190814141129]

In my benchmark, one translog entry (for one index operation) is about 3 KB, and TranslogConfig.DEFAULT_BUFFER_SIZE is 8 KB. That means roughly every 3 operations trigger a flush of BufferedChannelOutputStream (see BufferedOutputStream#write), which then blocks other threads from adding to the translog.
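
A quick standalone demo of this arithmetic (the 3 KB and 8 KB figures come from the comment above; BufferSpillDemo is just an illustration, not Elasticsearch code): with an 8 KB BufferedOutputStream and ~3 KB writes, the buffer overflows and spills to the underlying stream roughly every third write.

import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class BufferSpillDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(sink, 8 * 1024);
        byte[] op = new byte[3 * 1024];              // ~3 KB per index operation
        for (int i = 1; i <= 9; i++) {
            out.write(op);
            // sink.size() only grows when the 8 KB buffer overflows and gets flushed through
            System.out.println("after op " + i + ": " + sink.size() + " bytes reached the underlying stream");
        }
        out.close();
    }
}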

@ywelsch
Contributor

ywelsch commented Aug 14, 2019

Is there only contention on a rollover?
What if you set index.translog.flush_threshold_size: 50G, index.translog.sync_interval: 1h, and index.translog.generation_threshold_size: 50G (i.e. no rollover)?
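
For anyone reproducing this experiment, the suggested values expressed with the Elasticsearch Settings builder might look roughly like the sketch below (the class name is hypothetical, and whether generation_threshold_size can be applied dynamically on a given version is worth checking):

import org.elasticsearch.common.settings.Settings;

// Hypothetical helper, not part of this PR: the three settings suggested above,
// built with the Elasticsearch Settings API (values as given in the comment).
public class NoRolloverTranslogSettings {
    public static Settings build() {
        return Settings.builder()
            .put("index.translog.flush_threshold_size", "50gb")
            .put("index.translog.sync_interval", "1h")
            .put("index.translog.generation_threshold_size", "50gb")
            .build();
    }
}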

@dengweisysu
Contributor Author

Is there only contention on a rollover?
What if you set index.translog.flush_threshold_size: 50G, index.translog.sync_interval: 1h, and index.translog.generation_threshold_size: 50G (i.e. no rollover)?

This config is really effective! Indexing performance improves significantly!

  • translog open with async durability mode: 18 minutes
  • translog closed (changed source code): 12 minutes
  • translog writing async using disruptor (changed source code): 12 minutes
  • translog writing async using LinkedBlockingQueue (changed source code): about 13 minutes
  • translog with the config above: about 13 minutes

But is there any negative impact from using this config?

@ywelsch
Contributor

ywelsch commented Aug 14, 2019

I wouldn't recommend running with this config (as you will lose 50G worth of data in case of a crash), but the above experiment allows us to look at much simpler solutions (e.g. parallelizing the writing to a new generation while fsyncing a previous generation).

@dengweisysu
Contributor Author

I printed the elapsed time of rollGeneration and found that:

rollGeneration: StopWatch '': running time = 760.9ms

ms    %    Task name
00644 085% closeIntoReader
00062 008% copyCheckpointTo
00054 007% createWriter

This test was done with the default config:
index.translog.generation_threshold_size=64mb

That means rollGeneration happens about every 2 seconds and blocks all other write threads.
This test was done on v6.6.2.

@dnhatn
Member

dnhatn commented Aug 15, 2019

Thanks @dengweisysu.

@s1monw
Contributor

s1monw commented Aug 15, 2019

@dengweisysu this is quite interesting. I do wonder what happens if you run your benchmarks with this change:

diff --git a/server/src/main/java/org/elasticsearch/index/translog/Translog.java b/server/src/main/java/org/elasticsearch/index/translog/Translog.java
index b98f0f2b643..8335503001f 100644
--- a/server/src/main/java/org/elasticsearch/index/translog/Translog.java
+++ b/server/src/main/java/org/elasticsearch/index/translog/Translog.java
@@ -1641,6 +1641,7 @@ public class Translog extends AbstractIndexShardComponent implements IndexShardC
      * @throws IOException if an I/O exception occurred during any file operations
      */
     public void rollGeneration() throws IOException {
+        current.sync(); // make sure we move most of the data to disk outside of the lock
         try (Releasable ignored = writeLock.acquire()) {
             try {
                 final TranslogReader reader = current.closeIntoReader();

This will make sure that we sync most of the data outside of the lock. If this saves 85% of the time, that should reduce the time the lock is held significantly.
Would be great to see the impact.

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

7 similar comments

@dengweisysu
Contributor Author

@dengweisysu this is quite interesting. I do wonder what happens if you run your benchmarks with this change: [the current.sync()-before-writeLock diff quoted above]
This will make sure that we sync most of the data outside of the lock. If this saves 85% of the time, that should reduce the time the lock is held significantly.
Would be great to see the impact.

I will run the benchmark with this change today, but I don't think it will have an effect, because current.sync() will hold the TranslogWriter lock and therefore block all other write threads too.

@dengweisysu
Contributor Author

@dengweisysu this is quite interesting. I do wonder what happens if you run your benchmarks with this change: [the current.sync()-before-writeLock diff quoted above]
This will make sure that we sync most of the data outside of the lock. If this saves 85% of the time, that should reduce the time the lock is held significantly.
Would be great to see the impact.

This change is effective!

  • translog open with async durability mode: 18 minutes
  • translog closed (changed source code): 12 minutes
  • translog writing async using disruptor (changed source code): 12 minutes
  • translog writing async using LinkedBlockingQueue (changed source code): about 13 minutes
  • translog using a large generation_threshold_size: about 13 minutes
  • translog sync before rollGeneration: about 14 minutes 40 seconds, much better than 18 minutes with the default config

I printed more logging in sync() and found that most of the time is spent in channel.force:

running time = 592ms
ms    %    Task name
00000 000% get TranslogWriter lock
00000 000% outputStream.flush
00578 098% channel.force
00013 002% writeCheckpoint in sync

The sync() method only holds the TranslogWriter lock while flushing, and releases it before channel.force:
[screenshot 20190816202329]

Calling sync() before rollGeneration reduces the time the writeLock is held during rollGeneration.
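
To make the locking pattern described above concrete, here is a small self-contained sketch (a hypothetical class, not the actual TranslogWriter): the writer lock is held only while the in-memory buffer is copied to the channel, and released before the expensive FileChannel.force, so concurrent writers are not blocked during the fsync.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.ReentrantLock;

public class SyncOutsideLockSketch {
    private final ReentrantLock writerLock = new ReentrantLock();
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocate(8 * 1024); // assumes operations smaller than 8 KB

    public SyncOutsideLockSketch(Path path) throws IOException {
        this.channel = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    public void add(byte[] operation) throws IOException {
        writerLock.lock();
        try {
            if (buffer.remaining() < operation.length) {
                flushBufferLocked();             // cheap buffer-to-channel copy under the lock
            }
            buffer.put(operation);
        } finally {
            writerLock.unlock();
        }
    }

    public void sync() throws IOException {
        writerLock.lock();
        try {
            flushBufferLocked();                 // the fast outputStream.flush-style part
        } finally {
            writerLock.unlock();                 // release before the expensive part
        }
        channel.force(false);                    // the 98% (channel.force) runs outside the lock
    }

    private void flushBufferLocked() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }
}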

@dengweisysu
Contributor Author

@s1monw Using sync() may be better than current.sync() in rollGeneration.

@dengweisysu
Contributor Author

For rolling a generation, the order is currently:

  1. close the previous one
  2. open a new one

If this order could be changed to:

  1. open a new one
  2. replace the current reference with the new one
  3. close the previous one

then closing the older generation would not block other write threads.
But that is impossible with the current design of the checkpoint file, because the checkpoint file has to be renamed to the target generation before a new one is created.

@dengweisysu
Contributor Author

Why is the generation_threshold_size setting not open to change in the guide document ("https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html")?
The more write threads that concurrently write to the same shard, the more seriously performance is impacted, because generations roll more quickly.

@s1monw
Contributor

s1monw commented Aug 20, 2019

For rolling a generation, the order is currently:

close the previous one
open a new one

If this order could be changed to:

open a new one
replace the current reference with the new one
close the previous one

then closing the older generation would not block other write threads.
But that is impossible with the current design of the checkpoint file, because the checkpoint file has to be renamed to the target generation before a new one is created.

I am afraid this would violate many of the assumptions we make in the transaction log. The problem here is mostly that certain files must not exist if we fail during a sync or while rolling a generation. In order to change that I think we need a fundamental change in how the translog writes its files, which I'd personally like to avoid. I think using the sync method outside of the lock would give us a good speedup and is low risk. Would you mind opening a PR for this change?

@s1monw
Contributor

s1monw commented Aug 20, 2019

In your case moving to something like index.translog.generation_threshold_size: 512M would likely help too. Also, given that we won't use the translog for recovery anymore in the future, most of the downsides would go away here. @dnhatn @ywelsch should we consider changing the default generation size here? 64M might be quite small? But on the other hand we don't want to keep too much data.

@dnhatn
Member

dnhatn commented Aug 20, 2019

If we increase index.translog.generation_threshold_size, then we can keep more translog if a Lucene commit contains gaps in sequence numbers; then recovering from the store would take a bit longer. I don't think these are big issues. I am wondering if we should increase the default threshold from 64MB to 256MB in 7.x and remove it entirely in 8.x? /cc @jasontedor.

@s1monw
Contributor

s1monw commented Aug 21, 2019

I am wondering if we should increase the default threshold from 64MB to 256MB in 7.x and remove it entirely in 8.x?

I am not sure I am following @dnhatn are you suggesting we make the size fixed and remove the setting?

@dengweisysu
Contributor Author

I am wondering if we should increase the default threshold from 64MB to 256MB in 7.x and remove it entirely in 8.x?

I am not sure I am following @dnhatn are you suggesting we make the size fixed and remove the setting?

@s1monw
Actually, with the default config, the translog files generated are usually much bigger than 64M (in my benchmark about 200M~500M).

That is because the flush operation (triggered by flush_threshold_size) and rollGeneration use the same flushOrRollRunning lock (in IndexShard).
With the default flush_threshold_size config:
[flush frequently] ---> [too many segments] ---> [slow down resolveDocVersion] ---> [slow down writing]
To solve this problem, I tuned up flush_threshold_size, which resulted in:
[flush infrequently] ---> [roll generation frequently] ---> [slow down writing]

So in my benchmark:
with the default config: 130 translog files, took about 70s
with a larger flush_threshold_size config: 424 translog files, took about 242s

So maybe it is unnecessary to simply change the default threshold from 64MB to 256MB.

@dnhatn
Member

dnhatn commented Aug 22, 2019

I am not sure I am following @dnhatn are you suggesting we make the size fixed and remove the setting?

@s1monw Sorry I wasn't clear. I meant to increase the default setting from 64MB to 256MB in 7.x. If there's no significant issue with the new default setting, we can remove the setting and the rolling logic entirely in 8.x.

@dengweisysu Can you please run a benchmark with the default flush_threshold_size and generation_threshold_size= 256MB or 1024MB (to disable the auto rolling)? Thank you.

@dengweisysu
Contributor Author

@dengweisysu Can you please run a benchmark with the default flush_threshold_size and generation_threshold_size= 256MB or 1024MB (to disable the auto rolling)? Thank you.

What does "1024MB (to disable the auto rolling)" means? To disable the auto rolling should be set larger than 35G.

@dengweisysu
Contributor Author

Because a flush operation usually takes about 5-10 seconds (maybe the commit triggers a Lucene merge), with the default config the rolling operation is triggered much less often than expected, which results in the generated files being larger than the configured size.

@dengweisysu
Contributor Author

@dnhatn I did the benchmark with the code that does sync before rolling the generation:

    1. default flush_threshold_size (512M), generation_threshold_size=256MB: 16 min (77 translog files generated, total time spent in closeIntoReader: 16 seconds); with generation_threshold_size=64MB, 114 translog files were generated.
    2. default flush_threshold_size (512M), generation_threshold_size=40GB: 13 min (7 translog files generated, total time spent in closeIntoReader: about 2 seconds). This result is so strange that I can't explain it. Since changing generation_threshold_size should only save time in closeIntoReader, the second benchmark should have taken 16 min - (16 sec - 2 sec) = 15 min 46 sec. 13 min is really strange; I will check it again tomorrow.

@dengweisysu
Contributor Author

@dnhatn @s1monw Why do we do current.sync() when doing trimUnreferencedReaders? This is done under the writeLock too!
[screenshot 20190825162502]

@dnhatn
Member

dnhatn commented Aug 25, 2019

@dengweisysu The comment above that line should explain the reason. When we trim unreferenced readers, we advance the minimum retaining translog generation, hence we need to write that new value to the current checkpoint.

We don't call trimUnreferencedReaders too often. We call it after flushing and periodically (defaults to every 10 minutes). If this method blocks the write threads, we can do the same trick as rollGeneration by calling sync() outside writeLock.

@dengweisysu
Contributor Author

We don't call trimUnreferencedReaders too often. We call it after flushing and periodically (defaults to every 10 minutes). If this method blocks the write threads, we can do the same trick as rollGeneration by calling sync() outside writeLock.

@dnhatn trimUnreferencedReaders is called when InternalEngine.rollTranslogGeneration is called, so it is as frequent as rollTranslogGeneration. I think we can move sync() outside the writeLock as below:
[screenshot 20190825162502]

@dnhatn
Member

dnhatn commented Aug 26, 2019

Thanks @dengweisysu. Your change looks good to me. Some comments before opening a PR:

  • We still need to make sure to closeOnTragicEvent if deleteReaderFiles throws an exception.
  • Can you run a benchmark to see if this change improves the translog performance?
