Optimize indexing (translog.durability=async) by using LinkedBlockingQueue. Closes #45371 #45450
Conversation
Hi @dengweisysu, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails to your GitHub profile (they can be hidden), so we can match your e-mails to your GitHub profile?
I have added the e-mails to my GitHub profile. Is it OK now?
Pinging @elastic/es-distributed
Thank you for working on the proposal. I don't think we should change the translog this way (see my comments). However, we can work together to improve the performance of the translog. Can you share more information about your benchmark with us? How does the translog take 6 minutes out of 18 minutes in your benchmark? Thank you!
//produce using ring buffer
dispatcher.produce(new OperationEvent(operation, this.translog, this.unConsumeCounter));
//async return no location
return null;
This breaks wait_for_refresh and other refresh features.
As far as I know, wait_for_refresh is not suitable for scenarios that require high indexing performance, because it makes the writing thread wait and produces smaller segments (which in turn increases the time needed to look up doc versions when indexing).
So is it really necessary to return the low-level info (the offset in the translog file) under async durability?
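A minimal sketch of why the returned location matters, with hypothetical names rather than Elasticsearch's actual RefreshListeners API: refresh=wait_for works by comparing the location returned from the index operation against the last refreshed location, so returning null leaves nothing to wait on.

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical listener registry, only to illustrate the dependency on the
// returned location; not Elasticsearch's actual RefreshListeners class.
class SketchRefreshListeners {
    private long lastRefreshedLocation = -1;
    private final List<Runnable> pending = new ArrayList<>();

    // Callers pass the location returned by the index operation; with a null
    // location there is nothing to compare against, so wait_for-style refresh
    // semantics cannot be honoured.
    synchronized void addOrNotify(Long operationLocation, Runnable onVisible) {
        Objects.requireNonNull(operationLocation, "a null translog location breaks wait_for refresh");
        if (operationLocation <= lastRefreshedLocation) {
            onVisible.run();        // the operation is already searchable
        } else {
            pending.add(onVisible); // run once a later refresh covers this location
        }
    }

    synchronized void afterRefresh(long refreshedUpTo) {
        lastRefreshedLocation = refreshedUpTo;
        pending.forEach(Runnable::run); // simplified: assume the refresh covered everything pending
        pending.clear();
    }
}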
@@ -69,6 +70,9 @@
public static final Setting<Translog.Durability> INDEX_TRANSLOG_DURABILITY_SETTING =
new Setting<>("index.translog.durability", Translog.Durability.REQUEST.name(),
(value) -> Translog.Durability.valueOf(value.toUpperCase(Locale.ROOT)), Property.Dynamic, Property.IndexScope);
public static final Setting<Integer> TRANSLOG_ASYNC_EVENT_BUFFER_SIZE =
Setting.intSetting("indices.translog.async_event_buffer_size", 256 * 1026, Property.NodeScope);
This would consume a significant amount of memory.
This can be set to a smaller size.
But using memory to buffer translog operations while the writer is unavailable is the key point of this proposal; it's a trade-off between indexing speed and memory.
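To make this concern concrete, here is a rough back-of-the-envelope estimate; the ~3 KB per buffered operation is taken from the author's benchmark figure later in the thread and is an assumption about what each queued event retains, not a measured footprint:

public class AsyncBufferEstimate {
    public static void main(String[] args) {
        long capacity = 256L * 1026;      // proposed default for indices.translog.async_event_buffer_size
        long bytesPerEvent = 3 * 1024;    // assumed ~3 KB per buffered operation (author's benchmark figure)
        long worstCaseBytes = capacity * bytesPerEvent;
        // prints: capacity=262656 events, worst case ~769 MB of heap
        System.out.printf("capacity=%d events, worst case ~%d MB of heap%n",
            capacity, worstCaseBytes / (1024 * 1024));
    }
}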
IndexSettings indexSettings = engineConfig.getIndexSettings();
if (translogConfig.isDurabilityAsync()) {
    //using asyncTranslog to improve write performance
    ExecutorService executor = engineConfig.getThreadPool().executor(Names.ASYNC_WRITE_TRANSLOG);
Changing the translog durability setting would no longer work.
As I mentioned in issue #45371, writing can easily block while waiting for the lock. In my benchmark, one translog entry (for one index operation) takes about 3 KB, and TranslogConfig.DEFAULT_BUFFER_SIZE is 8 KB, which means roughly every third operation triggers a flush of BufferedChannelOutputStream (see BufferedOutputStream#write) and then blocks the other threads from adding to the translog.
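A minimal sketch of that contention pattern, using a hypothetical simplified writer rather than the real TranslogWriter: every add goes through a single lock, and with ~3 KB operations against an 8 KB buffer roughly every third call overflows the buffer and performs real I/O while still holding the lock.

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical simplification of the pattern described above, not the real TranslogWriter.
class SketchBufferedWriter {
    private static final int BUFFER_SIZE = 8 * 1024; // TranslogConfig.DEFAULT_BUFFER_SIZE
    private final OutputStream out;

    SketchBufferedWriter(OutputStream channel) {
        // BufferedOutputStream writes through to the channel whenever the 8 KB buffer fills.
        this.out = new BufferedOutputStream(channel, BUFFER_SIZE);
    }

    // Every indexing thread funnels through this one lock. With ~3 KB per serialized
    // operation, roughly every third call overflows the buffer, so the write to the
    // underlying channel happens while the lock is held and stalls the other threads.
    synchronized void add(byte[] serializedOperation) throws IOException {
        out.write(serializedOperation);
    }
}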
Is there only contention on a rollover?
This config is really effective! Indexing performance improves significantly! But is there any negative impact to using this config?
I wouldn't recommend running with this config (as you will lose 50G worth of data in case of a crash), but the above experiment allows us to look at much simpler solutions (e.g. parallelizing the writing to a new generation while fsyncing a previous generation).
I printed the elapsed time of rollGeneration and found: StopWatch running time = 760.9 ms, with closeIntoReader taking 644 ms (85%). This test was done with the default config, which means rollGeneration happens roughly every 2 seconds and blocks all the other write threads.
Thanks @dengweisysu.
@dengweisysu this is quite interesting. I do wonder what happens if you run your benchmarks with this change:
diff --git a/server/src/main/java/org/elasticsearch/index/translog/Translog.java b/server/src/main/java/org/elasticsearch/index/translog/Translog.java
index b98f0f2b643..8335503001f 100644
--- a/server/src/main/java/org/elasticsearch/index/translog/Translog.java
+++ b/server/src/main/java/org/elasticsearch/index/translog/Translog.java
@@ -1641,6 +1641,7 @@ public class Translog extends AbstractIndexShardComponent implements IndexShardC
* @throws IOException if an I/O exception occurred during any file operations
*/
public void rollGeneration() throws IOException {
+ current.sync(); // make sure we move most of the data to disk outside of the lock
try (Releasable ignored = writeLock.acquire()) {
try {
final TranslogReader reader = current.closeIntoReader();
This will make sure that we sync most of the data outside of the lock. If this saves 85% of the time, that should reduce the time the lock is held significantly.
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?
I will run the benchmark with this change today, but I don't think it will have an effect, because current.sync() will hold the TranslogWriter lock and thus block all the other write threads too.
This change is effective!
I printed more logging in sync() and found that most of the time is spent in channel.force(). The sync() method only holds the TranslogWriter lock while flushing and releases it before channel.force(), so calling sync() before rollGeneration reduces the time the writeLock is held during rollGeneration.
@s1monw using sync() may be better than current.sync() in rollGeneration.
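A minimal sketch of the behaviour described above, with hypothetical names rather than the actual TranslogWriter code: only the in-memory buffer flush happens while holding the writer lock, and the expensive channel.force() (the fsync) runs after the lock is released, which is why syncing before rollGeneration shortens the time the write lock is held.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical simplified writer; operations are assumed to be small (~3 KB).
class SketchWriter {
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocate(8 * 1024);

    SketchWriter(FileChannel channel) { this.channel = channel; }

    synchronized void add(byte[] op) throws IOException {
        if (buffer.remaining() < op.length) {
            flushBuffer();
        }
        buffer.put(op);
    }

    // Pattern described in the comment above: only the in-memory flush happens under
    // the writer lock; the expensive channel.force() (fsync) runs after the lock is released.
    void sync() throws IOException {
        synchronized (this) {
            flushBuffer();
        }
        channel.force(false);
    }

    private void flushBuffer() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }
}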
For rolling a generation, the order is currently:
If this order could be changed to:
then closing the older generation would not block the other writing threads.
Why is the generation_threshold_size config not documented as user-changeable in the guide (https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html)?
I am afraid this would violate many of the assumptions we make in the transaction log. The problem here is mostly that certain files must not exist if we fail during a sync or while rolling a generation. In order to change that, I think we would need a fundamental change to how the translog writes its files, which I'd personally like to avoid. I think using the sync method outside of the lock would give us a good speedup and is low risk. Would you mind opening a PR for this change?
In your case moving to something like
If we increase
I am not sure I am following, @dnhatn. Are you suggesting we make the size fixed and remove the setting?
@s1monw Because the flush operation (triggered by flush_threshold_size) and rollGeneration use the same flushOrRollRunning lock (in IndexShard), in my benchmark it may be unnecessary to simply change the default threshold from 64MB to 256MB.
@s1monw Sorry I wasn't clear. I meant to increase the default setting from 64MB to 256MB in 7.x. If there's no significant issue with the new default setting, we can remove the setting and the rolling logic entirely in 8.x. @dengweisysu Can you please run a benchmark with the default
What does "1024MB (to disable the auto rolling)" mean? To disable auto rolling, it should be set larger than 35G.
Because a flush operation will usually take about 5-10 seconds (the commit may trigger a Lucene merge), with the default config the rolling operation is triggered much less often than expected, resulting in the generated files being larger than the configured size.
@dnhatn I ran the benchmark with the "do sync before rollGeneration" change:
@dengweisysu The comment above that line should explain the reason. When we trim unreferenced readers, we advance the minimum retaining translog generation, hence we need to write that new value to the current checkpoint. We don't call
@dnhatn trimUnreferencedReaders is called when InternalEngine.rollTranslogGeneration is called, so it happens as often as rollTranslogGeneration. I think we can move sync() outside the writeLock as below.
Thanks @dengweisysu. Your change looks good to me. Some comments before opening a PR:
Using a LinkedBlockingQueue to decouple translog writing from translog flushing.
This pull request contains two commits:
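A minimal, self-contained sketch of the queue-based decoupling idea, with hypothetical class and method names (the PR's actual dispatcher/OperationEvent code is not reproduced here): indexing threads enqueue serialized operations and return immediately, while a single background thread drains the queue, writes, and fsyncs. The bounded capacity turns the memory trade-off discussed above into back-pressure rather than unbounded heap growth; anything still in the queue is lost on a crash, which is the async-durability trade-off this thread is about.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class SketchAsyncTranslog implements AutoCloseable {
    private final LinkedBlockingQueue<byte[]> queue;
    private final FileChannel channel;
    private final Thread writerThread;
    private volatile boolean closed;

    SketchAsyncTranslog(Path file, int capacity) throws IOException {
        this.queue = new LinkedBlockingQueue<>(capacity); // bounded: full queue means back-pressure, not OOM
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        this.writerThread = new Thread(this::drainLoop, "sketch-async-translog-writer");
        this.writerThread.start();
    }

    // Called by indexing threads: enqueue and return without touching the file or its lock.
    void add(byte[] serializedOperation) throws InterruptedException {
        queue.put(serializedOperation); // blocks only if the queue is full
    }

    private void drainLoop() {
        List<byte[]> batch = new ArrayList<>();
        try {
            while (!closed || !queue.isEmpty()) {
                byte[] first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) continue;
                batch.add(first);
                queue.drainTo(batch);                 // grab whatever else is queued
                for (byte[] op : batch) {
                    ByteBuffer buf = ByteBuffer.wrap(op);
                    while (buf.hasRemaining()) {
                        channel.write(buf);           // single writer thread: no lock contention
                    }
                }
                channel.force(false);                 // fsync per batch, off the indexing threads
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();       // shutting down
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() throws Exception {
        closed = true;
        writerThread.join();
        channel.close();
    }
}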