
Optimize indexing (translog.durability=async) by using LinkedBlockingQueue. Closes #45371 #45450

Closed

Conversation

dengweisysu
Contributor

Using a LinkedBlockingQueue to decouple Translog-Writing from Translog-Flushing.
This request contains two commits:

  1. Extract an ITranslog interface from Translog
  2. Add AsyncTranslog to replace Translog when in async translog durability mode
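
A minimal, self-contained sketch of the idea described above, with hypothetical names (AsyncTranslogSketch, drainLoop, writeToTranslogFile) rather than the classes in this PR: indexing threads enqueue serialized operations into a bounded LinkedBlockingQueue and return immediately, while a single background thread drains the queue and performs the actual translog file writes.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the decoupling idea, not the AsyncTranslog code in this PR.
public class AsyncTranslogSketch {
    private final BlockingQueue<byte[]> queue;   // bounded buffer of serialized operations
    private final Thread writerThread;
    private volatile boolean closed;

    public AsyncTranslogSketch(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.writerThread = new Thread(this::drainLoop, "async-translog-writer");
        this.writerThread.start();
    }

    // Called by indexing threads: never blocks on translog I/O, only on a full queue.
    public void add(byte[] serializedOperation) throws InterruptedException {
        queue.put(serializedOperation);          // back-pressure once the buffer is full
    }

    private void drainLoop() {
        try {
            while (!closed || !queue.isEmpty()) {
                byte[] op = queue.poll(100, TimeUnit.MILLISECONDS);
                if (op != null) {
                    writeToTranslogFile(op);     // stand-in for the buffered write + periodic fsync
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void writeToTranslogFile(byte[] op) {
        // placeholder: in the real proposal this would append to the current translog generation
    }

    public void close() throws InterruptedException {
        closed = true;
        writerThread.join();
    }
}

The trade-offs raised by the reviewers below follow directly from this shape: callers no longer get a Translog.Location back, and the queue itself holds pending operations in memory.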

@elasticcla

Hi @dengweisysu, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@dengweisysu
Contributor Author

Hi @dengweisysu, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

I have added the e-mails to my GitHub profile. Is it OK now?

@dakrone dakrone added the :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. label Aug 12, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

@dnhatn left a comment

Thank you for working on the proposal. I don't think we should change the translog this way (see my comments). However, we can work together to improve the performance of the translog. Can you share more information about your benchmark with us? How does the translog take 6 minutes out of 18 minutes in your benchmark? Thank you!

//produce using ring buffer
dispatcher.produce(new OperationEvent(operation, this.translog, this.unConsumeCounter));
//async return no location
return null;
Member

This breaks wait_for_refresh and other refresh features.

Contributor Author

This breaks wait_for_refresh and other refresh features.

As far as I know, wait_for_refresh is not suitable for scenarios that require high indexing performance, because it makes write threads wait and produces smaller segments (which in turn increases the time to resolve doc versions when indexing).

So is it really necessary to return the low-level info (the offset in the translog file) under async durability?

@@ -69,6 +70,9 @@
public static final Setting<Translog.Durability> INDEX_TRANSLOG_DURABILITY_SETTING =
new Setting<>("index.translog.durability", Translog.Durability.REQUEST.name(),
(value) -> Translog.Durability.valueOf(value.toUpperCase(Locale.ROOT)), Property.Dynamic, Property.IndexScope);
public static final Setting<Integer> TRANSLOG_ASYNC_EVENT_BUFFER_SIZE =
Setting.intSetting("indices.translog.async_event_buffer_size", 256 * 1026, Property.NodeScope);
Member

This would consume a significant amount of memory.

Contributor Author

This can be set to a smaller size.
But using memory to buffer translog operations while the writer is unavailable is the key point of this proposal; it's a trade-off between indexing speed and memory.
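
For a rough sense of scale, and only as an estimate (it assumes each pending event retains its serialized operation bytes, which the diff does not state): at the ~3 KB per operation reported in the benchmark later in this thread, a buffer of 256 * 1024 pending events could pin on the order of 256 × 1024 × 3 KB ≈ 768 MB, which is presumably the footprint the comment above is pointing at.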

IndexSettings indexSettings = engineConfig.getIndexSettings();
if (translogConfig.isDurabilityAsync()) {
//using ayncTranslog to improve write performance
ExecutorService executor = engineConfig.getThreadPool().executor(Names.ASYNC_WRITE_TRANSLOG);
Member

Changing the translog durability setting would no longer work.

@dengweisysu
Contributor Author

Thank you for working on the proposal. I don't think we should change the translog this way (see my comments). However, we can work together to improve the performance of the translog. Can you share more information about your benchmark with us? How does the translog take 6 minutes out of 18 minutes in your benchmark? Thank you!

As I mentioned in issue #45371, writing is easily blocked waiting for the lock, like this:
[screenshot 20190814141129]

In my benchmark, one translog entry (for one index operation) is about 3 KB, and TranslogConfig.DEFAULT_BUFFER_SIZE is 8 KB. That means roughly every 3 operations trigger a flush of BufferedChannelOutputStream (see BufferedOutputStream#write), which then blocks other threads from adding to the translog.
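
A quick standalone demo of this arithmetic (the 3 KB and 8 KB figures come from the comment above; BufferSpillDemo is just an illustration, not Elasticsearch code): with an 8 KB BufferedOutputStream and ~3 KB writes, the buffer overflows and spills to the underlying stream roughly every third write.

import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class BufferSpillDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(sink, 8 * 1024);
        byte[] op = new byte[3 * 1024];              // ~3 KB per index operation
        for (int i = 1; i <= 9; i++) {
            out.write(op);
            // sink.size() only grows when the 8 KB buffer overflows and gets flushed through
            System.out.println("after op " + i + ": " + sink.size() + " bytes reached the underlying stream");
        }
        out.close();
    }
}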

@ywelsch
Contributor

ywelsch commented Aug 14, 2019

Is there only contention on a rollover?
What if you set index.translog.flush_threshold_size: 50G, index.translog.sync_interval: 1h, and index.translog.generation_threshold_size: 50G (i.e. no rollover)?
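
For anyone reproducing this experiment, the suggested values expressed with the Elasticsearch Settings builder might look roughly like the sketch below (the class name is hypothetical, and whether generation_threshold_size can be applied dynamically on a given version is worth checking):

import org.elasticsearch.common.settings.Settings;

// Hypothetical helper, not part of this PR: the three settings suggested above,
// built with the Elasticsearch Settings API (values as given in the comment).
public class NoRolloverTranslogSettings {
    public static Settings build() {
        return Settings.builder()
            .put("index.translog.flush_threshold_size", "50gb")
            .put("index.translog.sync_interval", "1h")
            .put("index.translog.generation_threshold_size", "50gb")
            .build();
    }
}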

@dengweisysu
Contributor Author

Is there only contention on a rollover?
What if you set index.translog.flush_threshold_size: 50G, index.translog.sync_interval: 1h, and index.translog.generation_threshold_size: 50G (i.e. no rollover)?

This config is really effective! Indexing performance improves significantly!

  • translog open with async durability mode: 18 minutes
  • translog closed (changed source code): 12 minutes
  • translog writing async using disruptor (changed source code): 12 minutes
  • translog writing async using LinkedBlockingQueue (changed source code): about 13 minutes
  • translog with the config above: about 13 minutes

But is there any negative impact from using this config?

@ywelsch
Contributor

ywelsch commented Aug 14, 2019

I wouldn't recommend running with this config (as you will lose 50G worth of data in case of a crash), but the above experiment allows us to look at much simpler solutions (e.g. parallelizing the writing to a new generation while fsyncing a previous generation).

@dengweisysu
Contributor Author

I printed the elapsed time of rollGeneration and found that:

rollGeneration: StopWatch '': running time = 760.9ms

ms    %    Task name
00644 085% closeIntoReader
00062 008% copyCheckpointTo
00054 007% createWriter

This test was done with the default config:
index.translog.generation_threshold_size=64mb

That means rollGeneration happens about every 2 seconds and blocks all other write threads.
This test was done on v6.6.2.

@dnhatn
Member

dnhatn commented Aug 15, 2019

Thanks @dengweisysu.

@s1monw
Contributor

s1monw commented Aug 15, 2019

@dengweisysu this is quite interesting. I do wonder what happens if you run your benchmarks with this change:

diff --git a/server/src/main/java/org/elasticsearch/index/translog/Translog.java b/server/src/main/java/org/elasticsearch/index/translog/Translog.java
index b98f0f2b643..8335503001f 100644
--- a/server/src/main/java/org/elasticsearch/index/translog/Translog.java
+++ b/server/src/main/java/org/elasticsearch/index/translog/Translog.java
@@ -1641,6 +1641,7 @@ public class Translog extends AbstractIndexShardComponent implements IndexShardC
      * @throws IOException if an I/O exception occurred during any file operations
      */
     public void rollGeneration() throws IOException {
+        current.sync(); // make sure we move most of the data to disk outside of the lock
         try (Releasable ignored = writeLock.acquire()) {
             try {
                 final TranslogReader reader = current.closeIntoReader();

This will make sure that we sync most of the data outside of the lock. If this saves 85% of the time, that should reduce the time the lock is held significantly.
Would be great to see the impact.

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

7 similar comments

@dengweisysu
Contributor Author

@dengweisysu this is quite interesting. I do wonder what happens if you run your benchmarks with this change: [the current.sync()-before-writeLock diff quoted above]
This will make sure that we sync most of the data outside of the lock. If this saves 85% of the time, that should reduce the time the lock is held significantly.
Would be great to see the impact.

I will run the benchmark with this change today, but I don't think it will have an effect, because current.sync() will hold the TranslogWriter lock and therefore block all other write threads too.

@dengweisysu
Contributor Author

@dengweisysu this is quite interesting. I do wonder what happens if you run your benchmarks with this change: [the current.sync()-before-writeLock diff quoted above]
This will make sure that we sync most of the data outside of the lock. If this saves 85% of the time, that should reduce the time the lock is held significantly.
Would be great to see the impact.

This change is effective!

  • translog open with async durability mode: 18 minutes
  • translog closed (changed source code): 12 minutes
  • translog writing async using disruptor (changed source code): 12 minutes
  • translog writing async using LinkedBlockingQueue (changed source code): about 13 minutes
  • translog using a large generation_threshold_size: about 13 minutes
  • translog sync before rollGeneration: about 14 minutes 40 seconds, much better than 18 minutes with the default config

I printed more logging in sync() and found that most of the time is spent in channel.force:

running time = 592ms
ms    %    Task name
00000 000% get TranslogWriter lock
00000 000% outputStream.flush
00578 098% channel.force
00013 002% writeCheckpoint in sync

The sync() method only holds the TranslogWriter lock while flushing, and releases it before channel.force:
[screenshot 20190816202329]

Calling sync() before rollGeneration reduces the time the writeLock is held during rollGeneration.
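
To make the locking pattern described above concrete, here is a small self-contained sketch (a hypothetical class, not the actual TranslogWriter): the writer lock is held only while the in-memory buffer is copied to the channel, and released before the expensive FileChannel.force, so concurrent writers are not blocked during the fsync.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.ReentrantLock;

public class SyncOutsideLockSketch {
    private final ReentrantLock writerLock = new ReentrantLock();
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocate(8 * 1024); // assumes operations smaller than 8 KB

    public SyncOutsideLockSketch(Path path) throws IOException {
        this.channel = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    public void add(byte[] operation) throws IOException {
        writerLock.lock();
        try {
            if (buffer.remaining() < operation.length) {
                flushBufferLocked();             // cheap buffer-to-channel copy under the lock
            }
            buffer.put(operation);
        } finally {
            writerLock.unlock();
        }
    }

    public void sync() throws IOException {
        writerLock.lock();
        try {
            flushBufferLocked();                 // the fast outputStream.flush-style part
        } finally {
            writerLock.unlock();                 // release before the expensive part
        }
        channel.force(false);                    // the 98% (channel.force) runs outside the lock
    }

    private void flushBufferLocked() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }
}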

@dengweisysu
Contributor Author

@s1monw Using sync() may be better than current.sync() in rollGeneration.

@dengweisysu
Contributor Author

For rolling a generation, the order is currently:

  1. close the previous one
  2. open a new one

If this order could be changed to:

  1. open a new one
  2. replace the current reference with the new one
  3. close the previous one

then closing the older generation would not block other write threads.
But that is impossible with the current design of the checkpoint file, because the checkpoint file has to be renamed to the target generation before a new one is created.

@dengweisysu
Contributor Author

Why is the generation_threshold_size setting not open to change in the guide document ("https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html")?
The more write threads that concurrently write to the same shard, the more seriously performance is impacted, because generations roll more quickly.

@s1monw
Contributor

s1monw commented Aug 20, 2019

For rolling a generation, the order is currently:

close the previous one
open a new one

If this order could be changed to:

open a new one
replace the current reference with the new one
close the previous one

then closing the older generation would not block other write threads.
But that is impossible with the current design of the checkpoint file, because the checkpoint file has to be renamed to the target generation before a new one is created.

I am afraid this would violate many of the assumptions we make in the transaction log. The problem here is mostly that certain files must not exist if we fail during a sync or while rolling a generation. In order to change that I think we need a fundamental change in how the translog writes its files, which I'd personally like to avoid. I think using the sync method outside of the lock would give us a good speedup and is low risk. Would you mind opening a PR for this change?

@s1monw
Contributor

s1monw commented Aug 20, 2019

In your case moving to something like index.translog.generation_threshold_size: 512M would likely help too. Also, given that we won't use the translog for recovery anymore in the future, most of the downsides would go away here. @dnhatn @ywelsch should we consider changing the default generation size here? 64M might be quite small? But on the other hand we don't want to keep too much data.

@dnhatn
Member

dnhatn commented Aug 20, 2019

If we increase index.translog.generation_threshold_size, then we can keep more translog if a Lucene commit contains gaps in sequence numbers; then recovering from the store would take a bit longer. I don't think these are big issues. I am wondering if we should increase the default threshold from 64MB to 256MB in 7.x and remove it entirely in 8.x? /cc @jasontedor.

@s1monw
Contributor

s1monw commented Aug 21, 2019

I am wondering if we should increase the default threshold from 64MB to 256MB in 7.x and remove it entirely in 8.x?

I am not sure I am following @dnhatn are you suggesting we make the size fixed and remove the setting?

@dengweisysu
Contributor Author

I am wondering if we should increase the default threshold from 64MB to 256MB in 7.x and remove it entirely in 8.x?

I am not sure I am following @dnhatn are you suggesting we make the size fixed and remove the setting?

@s1monw
Actually, with the default config, the translog files generated are usually much bigger than 64M (in my benchmark about 200M~500M).

That is because the flush operation (triggered by flush_threshold_size) and rollGeneration use the same flushOrRollRunning lock (in IndexShard).
With the default flush_threshold_size config:
[flush frequently] ---> [too many segments] ---> [slow down resolveDocVersion] ---> [slow down writing]
To solve this problem, I tuned up flush_threshold_size, which resulted in:
[flush infrequently] ---> [roll generation frequently] ---> [slow down writing]

So in my benchmark:
with the default config: 130 translog files, took about 70s
with a larger flush_threshold_size config: 424 translog files, took about 242s

So maybe it is unnecessary to simply change the default threshold from 64MB to 256MB.

@dnhatn
Member

dnhatn commented Aug 22, 2019

I am not sure I am following @dnhatn are you suggesting we make the size fixed and remove the setting?

@s1monw Sorry I wasn't clear. I meant to increase the default setting from 64MB to 256MB in 7.x. If there's no significant issue with the new default setting, we can remove the setting and the rolling logic entirely in 8.x.

@dengweisysu Can you please run a benchmark with the default flush_threshold_size and generation_threshold_size= 256MB or 1024MB (to disable the auto rolling)? Thank you.

@dengweisysu
Contributor Author

@dengweisysu Can you please run a benchmark with the default flush_threshold_size and generation_threshold_size= 256MB or 1024MB (to disable the auto rolling)? Thank you.

What does "1024MB (to disable the auto rolling)" means? To disable the auto rolling should be set larger than 35G.

@dengweisysu
Contributor Author

Because a flush operation usually takes about 5-10 seconds (maybe the commit triggers a Lucene merge), with the default config the rolling operation is triggered much less often than expected, which results in the generated files being larger than the configured size.

@dengweisysu
Contributor Author

@dnhatn I did the benchmark with the code that does sync before rolling the generation:

    1. default flush_threshold_size (512M), generation_threshold_size=256MB: 16 min (77 translog files generated, total time spent in closeIntoReader: 16 seconds); with generation_threshold_size=64MB, 114 translog files were generated.
    2. default flush_threshold_size (512M), generation_threshold_size=40GB: 13 min (7 translog files generated, total time spent in closeIntoReader: about 2 seconds). This result is so strange that I can't explain it. Since changing generation_threshold_size should only save time in closeIntoReader, the second benchmark should have taken 16 min - (16 sec - 2 sec) = 15 min 46 sec. 13 min is really strange; I will check it again tomorrow.

@dengweisysu
Contributor Author

@dnhatn @s1monw Why do we do current.sync() when doing trimUnreferencedReaders? This is done under the writeLock too!
[screenshot 20190825162502]

@dnhatn
Member

dnhatn commented Aug 25, 2019

@dengweisysu The comment above that line should explain the reason. When we trim unreferenced readers, we advance the minimum retaining translog generation, hence we need to write that new value to the current checkpoint.

We don't call trimUnreferencedReaders too often. We call it after flushing and periodically (defaults to every 10 minutes). If this method blocks the write threads, we can do the same trick as rollGeneration by calling sync() outside writeLock.

@dengweisysu
Contributor Author

We don't call trimUnreferencedReaders too often. We call it after flushing and periodically (defaults to every 10 minutes). If this method blocks the write threads, we can do the same trick as rollGeneration by calling sync() outside writeLock.

@dnhatn trimUnreferencedReaders is called when InternalEngine.rollTranslogGeneration is called, so it is as frequent as rollTranslogGeneration. I think we can move sync() outside the writeLock as below:
[screenshot 20190825162502]

@dnhatn
Member

dnhatn commented Aug 26, 2019

Thanks @dengweisysu. Your change looks good to me. Some comments before opening a PR:

  • We still need to make sure to closeOnTragicEvent if deleteReaderFiles throws an exception.
  • Can you run a benchmark to see if this change improves the translog performance?
