
HBASE-27621 Also clear the Dictionary when resetting when reading compressed WAL file #5016

Merged 1 commit on Feb 11, 2023

Conversation

@Apache9 (Contributor) commented Feb 8, 2023:

No description provided.

@Apache9 Apache9 self-assigned this Feb 8, 2023
@thangTang (Contributor) commented:

This seems like an ingenious idea, but I want to confirm: given the eviction mechanism of LRUMap, even if findEntry is used instead of addEntry, is there still a possibility of inconsistent read/write-path behavior in theory?

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 21s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+0 🆗 mvndep 0m 14s Maven dependency ordering for branch
+1 💚 mvninstall 5m 12s master passed
+1 💚 compile 3m 33s master passed
+1 💚 checkstyle 0m 55s master passed
+1 💚 spotless 0m 46s branch has no errors when running spotless:check.
+1 💚 spotbugs 2m 54s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 10s Maven dependency ordering for patch
+1 💚 mvninstall 4m 34s the patch passed
+1 💚 compile 3m 19s the patch passed
+1 💚 javac 3m 19s the patch passed
+1 💚 checkstyle 0m 15s The patch passed checkstyle in hbase-common
+1 💚 checkstyle 0m 36s hbase-server: The patch generated 0 new + 5 unchanged - 2 fixed = 5 total (was 7)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 20m 14s Patch does not cause any errors with Hadoop 3.2.4 3.3.4.
+1 💚 spotless 0m 57s patch has no errors when running spotless:check.
+1 💚 spotbugs 3m 10s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 20s The patch does not generate ASF License warnings.
59m 6s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #5016
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux 0655804e6aa8 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 1a9e465
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 84 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/1/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

```diff
 if (status == Dictionary.NOT_IN_DICTIONARY) {
   int tagLen = StreamUtils.readRawVarint32(src);
   offset = Bytes.putAsShort(dest, offset, tagLen);
   IOUtils.readFully(src, dest, offset, tagLen);
-  tagDict.addEntry(dest, offset, tagLen);
+  tagDict.findEntry(dest, offset, tagLen);
```
A Contributor commented on this diff:

Do you think this change could be expensive? In the normal case the entry will not be in the dict, but now we're adding an extra map lookup on every call. Granted it's O(1), but it involves CPU for the hash code, allocating a lookup key, etc.

I wonder if we could trigger findEntry only if the context has been reset, and otherwise use addEntry on the first pass?

May not be a big issue, just checking.

@Apache9 (Contributor, Author) replied:

It is slower than before, but I always think correctness comes first, and only then do we consider performance. For log splitting and replication, reading is usually not the bottleneck.

We can file a follow-on issue for the optimization; maybe we could also add a reset flag in CompressionContext, to indicate whether we need to do a lookup first.

Thanks.
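The reset-flag idea floated above could look roughly like this. All names here (ToyDict, ToyCompressionContext, addWord) are invented for illustration and are not the real HBase API; it is only a sketch of "plain addEntry on the first pass, findEntry only after a reset":

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a dictionary with both access styles.
class ToyDict {
  private final List<String> words = new ArrayList<>();

  // addEntry-style: always appends, returns the new index.
  int add(String word) {
    words.add(word);
    return words.size() - 1;
  }

  // findEntry-style: lookup first, append only if absent.
  int find(String word) {
    int idx = words.indexOf(word);
    return idx >= 0 ? idx : add(word);
  }
}

// Hypothetical context that remembers whether it was reset, so the
// extra lookup is only paid on the replay after a reset.
class ToyCompressionContext {
  private final ToyDict dict = new ToyDict();
  private boolean wasReset;

  void markReset() {
    wasReset = true;
  }

  int addWord(String word) {
    return wasReset ? dict.find(word) : dict.add(word);
  }
}
```

On the first pass every word is appended directly; after markReset(), re-encountered words are found at their existing index instead of being duplicated.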

@bbeaudreault (Contributor) replied on Feb 8, 2023:

Sounds good, agree on correctness first.

I also agree about the bottleneck for splitting/replication. However, this uncompressTags method is in the hot path of normal reads when DataBlockEncoding is used: here.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 3m 5s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 16s Maven dependency ordering for branch
+1 💚 mvninstall 4m 7s master passed
+1 💚 compile 1m 23s master passed
+1 💚 shadedjars 5m 25s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 47s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for patch
+1 💚 mvninstall 3m 44s the patch passed
+1 💚 compile 0m 55s the patch passed
+1 💚 javac 0m 55s the patch passed
+1 💚 shadedjars 4m 37s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 34s the patch passed
_ Other Tests _
+1 💚 unit 2m 1s hbase-common in the patch passed.
-1 ❌ unit 202m 55s hbase-server in the patch failed.
234m 28s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #5016
Optional Tests javac javadoc unit shadedjars compile
uname Linux 89a079cfa76f 5.4.0-1094-aws #102~18.04.1-Ubuntu SMP Tue Jan 10 21:07:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 1a9e465
Default Java Eclipse Adoptium-11.0.17+8
unit https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/1/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-server.txt
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/1/testReport/
Max. process+thread count 2668 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/1/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 19s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 14s Maven dependency ordering for branch
+1 💚 mvninstall 4m 25s master passed
+1 💚 compile 1m 10s master passed
+1 💚 shadedjars 5m 8s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 49s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for patch
+1 💚 mvninstall 4m 10s the patch passed
+1 💚 compile 1m 16s the patch passed
+1 💚 javac 1m 16s the patch passed
+1 💚 shadedjars 5m 15s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 46s the patch passed
_ Other Tests _
+1 💚 unit 2m 19s hbase-common in the patch passed.
-1 ❌ unit 215m 9s hbase-server in the patch failed.
247m 1s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR #5016
Optional Tests javac javadoc unit shadedjars compile
uname Linux 099dc80b3993 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 1a9e465
Default Java Temurin-1.8.0_352-b08
unit https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/1/artifact/yetus-jdk8-hadoop3-check/output/patch-unit-hbase-server.txt
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/1/testReport/
Max. process+thread count 2338 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/1/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache9 (Contributor, Author) commented Feb 8, 2023:

> This seems like an ingenious idea, but I want to confirm: given the eviction mechanism of LRUMap, even if findEntry is used instead of addEntry, is there still a possibility of inconsistent read/write-path behavior in theory?

The most important thing here is to read WAL entries in order and not skip any entries. If these two rules are guaranteed, it is OK to restart as many times as you want. And I think for replication we must follow these two rules, otherwise there will be data loss...

@thangTang (Contributor) commented Feb 8, 2023:

> This seems like an ingenious idea, but I want to confirm: given the eviction mechanism of LRUMap, even if findEntry is used instead of addEntry, is there still a possibility of inconsistent read/write-path behavior in theory?

> The most important thing here is to read WAL entries in order and not skip any entries. If these two rules are guaranteed, it is OK to restart as many times as you want. And I think for replication we must follow these two rules, otherwise there will be data loss...

Agreed, but I don't think I expressed my question clearly.

For WAL compression, the core logic is to build an index (an LRUMap) in memory while writing/reading the WAL. There is another key point here: when operating on a WAL file, the behavior of the read and write paths needs to be exactly the same.

Using findEntry instead of addEntry in this patch solves part of the problem. However, for example, we do not resetPosition when writing the WAL, but a given position may be reset many times when reading it. The implicit operation here is that the node has been moved to the head of the LRUMap many times. So is it possible that a node evicted on the write path (writing the WAL) becomes inconsistent on the read path (replication)?
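This concern can be made concrete with a toy model. The sketch below is deliberately simplified and is NOT the real LRUDictionary API (all names and the capacity of 2 are invented): each word occupies a fixed slot, the least recently used word is evicted and its slot reused, and every access moves the word to the MRU position. Replaying entries after a reset without clearing the dictionary can leave a word in a different slot than on the write path, while clearing first (which is what the PR title describes) restores agreement:

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

// Toy slot-based LRU dictionary. When full, the least recently used word
// is evicted and the new word reuses its slot index.
class ToySlotDict {
  private final int capacity;
  private final Map<String, Integer> slotOf = new HashMap<>();
  private final LinkedHashSet<String> recency = new LinkedHashSet<>(); // LRU first

  ToySlotDict(int capacity) {
    this.capacity = capacity;
  }

  // findEntry-style: return the word's slot, adding it (with eviction) if absent.
  int find(String word) {
    Integer slot = slotOf.get(word);
    if (slot == null) {
      if (slotOf.size() < capacity) {
        slot = slotOf.size();                   // hand out a fresh slot
      } else {
        String lru = recency.iterator().next(); // least recently used word
        recency.remove(lru);
        slot = slotOf.remove(lru);              // reuse the evicted word's slot
      }
      slotOf.put(word, slot);
    }
    recency.remove(word);                       // moveToHead: word becomes MRU
    recency.add(word);
    return slot;
  }

  void clear() {
    slotOf.clear();
    recency.clear();
  }
}

public class ResetDivergence {
  // Run one pass over the WAL words, returning the slot "b" ended up in.
  static int slotOfB(ToySlotDict dict, String[] wal) {
    int slot = -1;
    for (String w : wal) {
      int s = dict.find(w);
      if (w.equals("b")) {
        slot = s;
      }
    }
    return slot;
  }

  public static void main(String[] args) {
    String[] wal = { "a", "b", "c", "a", "d" };

    // Write path: every entry is processed exactly once -> "b" gets slot 1.
    ToySlotDict writer = new ToySlotDict(2);
    System.out.println("writer b -> slot " + slotOfB(writer, wal));

    // Buggy reset: full pass, then replay WITHOUT clearing -> "b" gets slot 0,
    // so a writer-encoded index for "b" would decode to the wrong word.
    ToySlotDict reader = new ToySlotDict(2);
    slotOfB(reader, wal);
    System.out.println("stale reader b -> slot " + slotOfB(reader, wal));

    // Fixed reset: clear the dictionary before replaying -> slot 1 again.
    ToySlotDict reader2 = new ToySlotDict(2);
    slotOfB(reader2, wal);
    reader2.clear();
    System.out.println("cleared reader b -> slot " + slotOfB(reader2, wal));
  }
}
```

The divergence appears because leftover entries from the pre-reset pass change which word is evicted during the replay, even though findEntry never adds duplicates.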

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 25s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+0 🆗 mvndep 0m 13s Maven dependency ordering for branch
+1 💚 mvninstall 3m 30s master passed
+1 💚 compile 2m 55s master passed
+1 💚 checkstyle 0m 43s master passed
+1 💚 spotless 0m 40s branch has no errors when running spotless:check.
+1 💚 spotbugs 1m 56s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 9s Maven dependency ordering for patch
+1 💚 mvninstall 3m 13s the patch passed
+1 💚 compile 2m 51s the patch passed
+1 💚 javac 2m 51s the patch passed
+1 💚 checkstyle 0m 12s The patch passed checkstyle in hbase-common
+1 💚 checkstyle 0m 31s hbase-server: The patch generated 0 new + 5 unchanged - 2 fixed = 5 total (was 7)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 12m 43s Patch does not cause any errors with Hadoop 3.2.4 3.3.4.
+1 💚 spotless 0m 38s patch has no errors when running spotless:check.
+1 💚 spotbugs 2m 3s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 16s The patch does not generate ASF License warnings.
40m 45s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/2/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #5016
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux af035ff80124 5.4.0-1094-aws #102~18.04.1-Ubuntu SMP Tue Jan 10 21:07:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 6a34aa8
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 86 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/2/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 26s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 16s Maven dependency ordering for branch
+1 💚 mvninstall 3m 22s master passed
+1 💚 compile 0m 57s master passed
+1 💚 shadedjars 4m 36s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 35s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 11s Maven dependency ordering for patch
+1 💚 mvninstall 3m 16s the patch passed
+1 💚 compile 0m 56s the patch passed
+1 💚 javac 0m 56s the patch passed
+1 💚 shadedjars 4m 34s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 35s the patch passed
_ Other Tests _
+1 💚 unit 1m 58s hbase-common in the patch passed.
+1 💚 unit 204m 47s hbase-server in the patch passed.
230m 42s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/2/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #5016
Optional Tests javac javadoc unit shadedjars compile
uname Linux 685adc51353b 5.4.0-1094-aws #102~18.04.1-Ubuntu SMP Tue Jan 10 21:07:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 6a34aa8
Default Java Eclipse Adoptium-11.0.17+8
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/2/testReport/
Max. process+thread count 2707 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/2/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 49s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 11s Maven dependency ordering for branch
+1 💚 mvninstall 2m 45s master passed
+1 💚 compile 0m 56s master passed
+1 💚 shadedjars 4m 15s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 39s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 13s Maven dependency ordering for patch
+1 💚 mvninstall 2m 49s the patch passed
+1 💚 compile 0m 58s the patch passed
+1 💚 javac 0m 58s the patch passed
+1 💚 shadedjars 4m 17s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 37s the patch passed
_ Other Tests _
+1 💚 unit 1m 48s hbase-common in the patch passed.
+1 💚 unit 209m 56s hbase-server in the patch passed.
234m 51s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/2/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR #5016
Optional Tests javac javadoc unit shadedjars compile
uname Linux 8b13800be3c3 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 6a34aa8
Default Java Temurin-1.8.0_352-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/2/testReport/
Max. process+thread count 2387 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/2/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 12s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+0 🆗 mvndep 0m 14s Maven dependency ordering for branch
+1 💚 mvninstall 5m 19s master passed
+1 💚 compile 3m 41s master passed
+1 💚 checkstyle 0m 58s master passed
+1 💚 spotless 0m 46s branch has no errors when running spotless:check.
+1 💚 spotbugs 2m 55s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 9s Maven dependency ordering for patch
+1 💚 mvninstall 4m 41s the patch passed
+1 💚 compile 3m 17s the patch passed
+1 💚 javac 3m 17s the patch passed
+1 💚 checkstyle 0m 13s The patch passed checkstyle in hbase-common
+1 💚 checkstyle 0m 43s hbase-server: The patch generated 0 new + 5 unchanged - 2 fixed = 5 total (was 7)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 19m 58s Patch does not cause any errors with Hadoop 3.2.4 3.3.4.
+1 💚 spotless 0m 55s patch has no errors when running spotless:check.
+1 💚 spotbugs 3m 11s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 21s The patch does not generate ASF License warnings.
58m 38s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/3/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #5016
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux 3f1e83ca043a 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 6a34aa8
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 86 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/3/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 4m 39s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for branch
+1 💚 mvninstall 3m 56s master passed
+1 💚 compile 1m 6s master passed
+1 💚 shadedjars 4m 38s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 42s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 10s Maven dependency ordering for patch
+1 💚 mvninstall 3m 33s the patch passed
+1 💚 compile 1m 7s the patch passed
+1 💚 javac 1m 7s the patch passed
+1 💚 shadedjars 4m 33s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 40s the patch passed
_ Other Tests _
+1 💚 unit 2m 15s hbase-common in the patch passed.
+1 💚 unit 209m 12s hbase-server in the patch passed.
240m 56s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/3/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #5016
Optional Tests javac javadoc unit shadedjars compile
uname Linux 6d7399d360d4 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 6a34aa8
Default Java Eclipse Adoptium-11.0.17+8
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/3/testReport/
Max. process+thread count 2638 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/3/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 14s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 13s Maven dependency ordering for branch
+1 💚 mvninstall 4m 37s master passed
+1 💚 compile 1m 7s master passed
+1 💚 shadedjars 5m 13s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 49s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 11s Maven dependency ordering for patch
+1 💚 mvninstall 4m 9s the patch passed
+1 💚 compile 1m 14s the patch passed
+1 💚 javac 1m 14s the patch passed
+1 💚 shadedjars 5m 17s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 38s the patch passed
_ Other Tests _
+1 💚 unit 2m 17s hbase-common in the patch passed.
+1 💚 unit 214m 48s hbase-server in the patch passed.
246m 17s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/3/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR #5016
Optional Tests javac javadoc unit shadedjars compile
uname Linux 6a2964fb26a9 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 6a34aa8
Default Java Temurin-1.8.0_352-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/3/testReport/
Max. process+thread count 2462 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/3/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache9 (Contributor, Author) commented Feb 9, 2023:

> This seems like an ingenious idea, but I want to confirm: given the eviction mechanism of LRUMap, even if findEntry is used instead of addEntry, is there still a possibility of inconsistent read/write-path behavior in theory?

> The most important thing here is to read WAL entries in order and not skip any entries. If these two rules are guaranteed, it is OK to restart as many times as you want. And I think for replication we must follow these two rules, otherwise there will be data loss...

> Agreed, but I don't think I expressed my question clearly.
>
> For WAL compression, the core logic is to build an index (an LRUMap) in memory while writing/reading the WAL. There is another key point here: when operating on a WAL file, the behavior of the read and write paths needs to be exactly the same.
>
> Using findEntry instead of addEntry in this patch solves part of the problem. However, for example, we do not resetPosition when writing the WAL, but a given position may be reset many times when reading it. The implicit operation here is that the node has been moved to the head of the LRUMap many times. So is it possible that a node evicted on the write path (writing the WAL) becomes inconsistent on the read path (replication)?

After deeper consideration I think you are right. The solution here can only work perfectly when the dict is infinite, i.e., there is no eviction. Once we consider eviction: if we go back a long distance, the word at a given index will have changed due to eviction, so when reading, if we use an index to look up a word (a qualifier or a row, for example), we may get an incorrect word for that index.
In the real world it is unlikely that we go back a very long distance, but a single WAL entry can contain a lot of cells, and our qualifier dict capacity is only 127, so it is still possible to hit this scenario...

So it seems that rebuilding the dict is necessary when resetting. Anyway, I could try to refactor the readNext method in ProtobufLogReader to get more fine-grained control over whether we need to reconstruct the dict. For example, if we return before reading the actual WAL entry, i.e., we quit early after checking the available bytes, we do not need to reconstruct the dict.

Thanks for pointing this out!

@thangTang (Contributor) commented:

> So, it seems that rebuilding the dict is necessary when resetting.

Agree. Although this solution has a performance cost, it is the best way I can think of to completely solve the problem.
Another idea is to refactor the dict and design an LRUMap that supports precise rollback. I've spent some time in that direction but found nothing workable. At the very least, it can't be free either (e.g., memory overhead)...

@Apache9 Apache9 marked this pull request as draft February 9, 2023 07:31
@Apache9 (Contributor, Author) commented Feb 9, 2023:

This PR cannot solve the whole problem, so I am converting it to a draft to avoid others merging it accidentally.

Thanks everyone for helping review and test, especially @thangTang for pointing out the problem.

I will change the title and provide a new PR soon.

@sunhelly (Contributor) commented Feb 9, 2023:

To ensure that compression and decompression construct the same dictionary, should we only use LRUDictionary#findEntry() to add entries, but keep LRUDictionary#getEntry() from moving the entry to the head?

@Apache9 (Contributor, Author) commented Feb 9, 2023:

> To ensure that compression and decompression construct the same dictionary, should we only use LRUDictionary#findEntry() to add entries, but keep LRUDictionary#getEntry() from moving the entry to the head?

This is still not enough... As said above, if we go back a long distance, the word at a given index could be completely different, which leads to an incorrect result when a field is found to be 'in dictionary'...

@Apache9 (Contributor, Author) commented Feb 9, 2023:

I tried refactoring a bit, but the implementation of ProtobufLogReader is too complicated. I think we'd better abstract two types of WAL.Reader for reading WAL files.
One is a StreamingReader, used in most cases (WAL splitting, WAL printing, etc.) where we only need to read the file once, usually for closed WAL files; there is no need to support reset and seek.
The other is a TailingReader, used by replication, where we need to support reset and seek, and also to tell the upper layer whether the compression context must be reset when resetting. The logic is more complicated, as we need to handle tailing a WAL file that is currently being written.
The refactoring will be fairly large, so I do not think we should apply it to branch-2.5 and branch-2.4. Let's apply the simple fix here and file another issue for the big refactoring.

Thanks.
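A rough sketch of what the two proposed reader types might look like. The interface and method names here are hypothetical (the real WAL.Reader API differs); the in-memory implementation exists only to exercise the contract:

```java
import java.io.Closeable;
import java.io.IOException;

// Streaming access: read a (usually closed) WAL file once, in order.
interface StreamingReader extends Closeable {
  /** Next entry in order, or null at the end of a closed file. */
  String next() throws IOException;
}

// Tailing access for replication: supports re-positioning, and reports
// whether the caller must rebuild the compression context.
interface TailingReader extends Closeable {
  String next() throws IOException;

  /**
   * Re-position the reader. Returns true if the compression context must
   * be rebuilt, i.e. the seek went back over already-read entries.
   */
  boolean seekTo(long position) throws IOException;
}

// Trivial in-memory TailingReader over an array of entries.
class ArrayTailingReader implements TailingReader {
  private final String[] entries;
  private int pos;

  ArrayTailingReader(String[] entries) {
    this.entries = entries;
  }

  @Override
  public String next() {
    return pos < entries.length ? entries[pos++] : null;
  }

  @Override
  public boolean seekTo(long position) {
    boolean rebuild = position < pos; // moving backwards re-reads entries
    pos = (int) position;
    return rebuild;
  }

  @Override
  public void close() {
  }
}
```

The key design point is the boolean returned by seekTo: a forward seek (e.g. quitting early before the actual entry, as described above) needs no dictionary rebuild, while a backward seek does.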

@thangTang (Contributor) commented:

> I tried refactoring a bit, but the implementation of ProtobufLogReader is too complicated. I think we'd better abstract two types of WAL.Reader for reading WAL files. One is a StreamingReader, used in most cases (WAL splitting, WAL printing, etc.) where we only need to read the file once, usually for closed WAL files; there is no need to support reset and seek. The other is a TailingReader, used by replication, where we need to support reset and seek, and also to tell the upper layer whether the compression context must be reset when resetting. The logic is more complicated, as we need to handle tailing a WAL file that is currently being written. The refactoring will be fairly large, so I do not think we should apply it to branch-2.5 and branch-2.4. Let's apply the simple fix here and file another issue for the big refactoring.
>
> Thanks.

I understand this is a complicated and dirty job; I am ashamed I didn't solve it thoroughly before...
By the way, just for this PR, would you mind taking a look at https://issues.apache.org/jira/browse/HBASE-26850 and #4233?
At the time I thought it could not fundamentally solve the problem, so I did not push it forward, but those two patches seem somewhat similar? The difference is that I changed the implementation of addEntry.

@thangTang (Contributor) commented:

@apurtell FYI, I think you may also be interested in this patch~

@thangTang (Contributor) commented:

Anyway, a manual +1 from me :)

@Apache9 Apache9 marked this pull request as ready for review February 9, 2023 15:17
@Apache9 Apache9 changed the title HBASE-27621 Always use findEntry to fill the Dictionary when reading … HBASE-27621 Also clear the Dictionary when resetting when reading compressed WAL file Feb 9, 2023
@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 24s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for branch
+1 💚 mvninstall 3m 31s master passed
+1 💚 compile 0m 56s master passed
+1 💚 shadedjars 4m 36s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 36s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 11s Maven dependency ordering for patch
+1 💚 mvninstall 3m 15s the patch passed
+1 💚 compile 0m 56s the patch passed
+1 💚 javac 0m 56s the patch passed
+1 💚 shadedjars 4m 35s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 33s the patch passed
_ Other Tests _
+1 💚 unit 2m 0s hbase-common in the patch passed.
+1 💚 unit 197m 53s hbase-server in the patch passed.
224m 25s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/4/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #5016
Optional Tests javac javadoc unit shadedjars compile
uname Linux 8f57a4aad181 5.4.0-1093-aws #102~18.04.2-Ubuntu SMP Wed Dec 7 00:31:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / a854cba
Default Java Eclipse Adoptium-11.0.17+8
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/4/testReport/
Max. process+thread count 2493 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/4/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 46s Docker mode activated.
-0 ⚠️ yetus 0m 4s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for branch
+1 💚 mvninstall 2m 45s master passed
+1 💚 compile 0m 55s master passed
+1 💚 shadedjars 4m 16s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 37s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 13s Maven dependency ordering for patch
+1 💚 mvninstall 2m 53s the patch passed
+1 💚 compile 0m 56s the patch passed
+1 💚 javac 0m 56s the patch passed
+1 💚 shadedjars 4m 15s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 38s the patch passed
_ Other Tests _
+1 💚 unit 1m 45s hbase-common in the patch passed.
+1 💚 unit 211m 19s hbase-server in the patch passed.
235m 55s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/4/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR #5016
Optional Tests javac javadoc unit shadedjars compile
uname Linux 32aa9c6efaca 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / a854cba
Default Java Temurin-1.8.0_352-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/4/testReport/
Max. process+thread count 2250 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/4/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Copy link
Contributor Author

Apache9 commented Feb 10, 2023

I tried refactoring a bit, but the implementation of ProtobufLogReader is too complicated. I think we'd better abstract two types of WAL.Reader for reading WAL files. One is StreamingReader, which is used in most cases (for example, WAL splitting and WAL printing) where we only need to read the file once, usually for closed WAL files; there is no need to support reset and seek. The other is TailingReader, which is used by replication, where we need to support reset and seek, and also need to tell the upper layer whether it must reset the compression context when calling reset. The logic will be more complicated, as we need to consider the requirements for tailing a WAL file which is currently being written. The refactoring will be fairly big, so I do not think we should apply it to branch-2.5 and branch-2.4. Let's apply the simple fix here and file another issue to implement the big refactoring.
Thanks.
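To make the proposed split concrete, the two reader roles might look roughly like this. This is an illustrative sketch only: the interface and method names below are made up for this discussion, not the actual HBase `WAL.Reader` API, and the toy implementation just demonstrates the intended contract.

```java
import java.util.List;

// Illustrative sketch only -- these interface and method names are hypothetical,
// not the actual HBase WAL.Reader API.
interface StreamingReader {
  // Returns the next entry, or null at end of file.
  String next();
}

interface TailingReader extends StreamingReader {
  // Re-check the underlying file for newly written data. Returns true when the
  // caller must also rebuild its compression context (e.g. the dictionary),
  // which is exactly the subtlety this PR deals with.
  boolean reset();

  void seek(long position);

  long getPosition();
}

// Toy in-memory tailing reader over a growing list of entries, just to show
// the contract; a real implementation would wrap a stream over the WAL file.
class ListTailingReader implements TailingReader {
  private final List<String> entries;
  private int pos = 0;

  ListTailingReader(List<String> entries) {
    this.entries = entries;
  }

  @Override
  public String next() {
    return pos < entries.size() ? entries.get(pos++) : null;
  }

  @Override
  public boolean reset() {
    // Conservative choice, matching this PR's fix: always tell the caller to
    // rebuild the dictionary after a reset.
    return true;
  }

  @Override
  public void seek(long position) {
    pos = (int) position;
  }

  @Override
  public long getPosition() {
    return pos;
  }
}
```

A streaming reader would only ever call `next()` until it returns null, while a tailing reader (replication) would call `reset()` when it hits the end of a file that is still being written, and rebuild its compression context whenever `reset()` asks it to.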

I understand that this is a complicated and dirty job, and I am ashamed that I didn't solve it thoroughly before... But by the way, just for this PR, would you mind taking a look at https://issues.apache.org/jira/browse/HBASE-26850 and #4233? At the time, I thought they could not fundamentally solve the problem, so I did not continue to push forward, but those two patches seem a bit similar to this one? The difference is that I changed the implementation of addEntry.

I think this can be done step by step.
First, we apply the patch here to fix the problem, even though performance may be worse than before. Then we refactor the Reader to introduce two types of Reader, so we can focus on how to improve the performance of tailing a WAL file which is currently being written, in Replication, without affecting the WAL splitting logic. Then we could try to introduce fine-grained control over whether we should reconstruct the dictionary. Finally, we could try to improve the LRUDictionary to support checkpoint and rollback, do a checkpoint at a proper place, and use rollback instead of clear-and-reconstruct, to get all the performance back.
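The checkpoint/rollback idea in the last step could be sketched roughly as below. This is a simplified, hypothetical sketch only: the class and method names are invented for illustration, and it deliberately omits LRU eviction, which is what makes the real change to LRUDictionary non-trivial (a rollback would also have to restore evicted entries).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a dictionary that can snapshot its state and roll
// back to it, so a tailing reader can retry a partial read without a full
// clear-and-reconstruct. Not the real HBase LRUDictionary API: eviction is
// omitted here, and handling it is the hard part of the real change.
public class CheckpointingDictionary {
  private final List<byte[]> entries = new ArrayList<>();
  private int checkpointSize = 0;

  // Record the current state; entries added later can be discarded by rollback().
  public void checkpoint() {
    checkpointSize = entries.size();
  }

  // Drop every entry added since the last checkpoint.
  public void rollback() {
    while (entries.size() > checkpointSize) {
      entries.remove(entries.size() - 1);
    }
  }

  public short addEntry(byte[] data) {
    entries.add(data);
    return (short) (entries.size() - 1);
  }

  public byte[] getEntry(short idx) {
    return entries.get(idx);
  }

  public int size() {
    return entries.size();
  }
}
```

The reader would call `checkpoint()` after each successfully parsed entry, and `rollback()` instead of clearing the dictionary when a partial read has to be retried.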

WDYT?

Thanks.

@thangTang
Copy link
Contributor

I tried refactoring a bit, but the implementation of ProtobufLogReader is too complicated. I think we'd better abstract two types of WAL.Reader for reading WAL files. One is StreamingReader, which is used in most cases (for example, WAL splitting and WAL printing) where we only need to read the file once, usually for closed WAL files; there is no need to support reset and seek. The other is TailingReader, which is used by replication, where we need to support reset and seek, and also need to tell the upper layer whether it must reset the compression context when calling reset. The logic will be more complicated, as we need to consider the requirements for tailing a WAL file which is currently being written. The refactoring will be fairly big, so I do not think we should apply it to branch-2.5 and branch-2.4. Let's apply the simple fix here and file another issue to implement the big refactoring.
Thanks.

I understand that this is a complicated and dirty job, and I am ashamed that I didn't solve it thoroughly before... But by the way, just for this PR, would you mind taking a look at https://issues.apache.org/jira/browse/HBASE-26850 and #4233? At the time, I thought they could not fundamentally solve the problem, so I did not continue to push forward, but those two patches seem a bit similar to this one? The difference is that I changed the implementation of addEntry.

I think this can be done step by step. First, we apply the patch here to fix the problem, even though performance may be worse than before. Then we refactor the Reader to introduce two types of Reader, so we can focus on how to improve the performance of tailing a WAL file which is currently being written, in Replication, without affecting the WAL splitting logic. Then we could try to introduce fine-grained control over whether we should reconstruct the dictionary. Finally, we could try to improve the LRUDictionary to support checkpoint and rollback, do a checkpoint at a proper place, and use rollback instead of clear-and-reconstruct, to get all the performance back.

WDYT?

Thanks.

Make sense.
+1 from me.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 23s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+0 🆗 mvndep 0m 18s Maven dependency ordering for branch
+1 💚 mvninstall 3m 15s master passed
+1 💚 compile 2m 52s master passed
+1 💚 checkstyle 0m 43s master passed
+1 💚 spotless 0m 37s branch has no errors when running spotless:check.
+1 💚 spotbugs 1m 46s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 10s Maven dependency ordering for patch
+1 💚 mvninstall 3m 20s the patch passed
+1 💚 compile 3m 0s the patch passed
+1 💚 javac 3m 0s the patch passed
+1 💚 checkstyle 0m 12s The patch passed checkstyle in hbase-common
+1 💚 checkstyle 0m 31s hbase-server: The patch generated 0 new + 9 unchanged - 2 fixed = 9 total (was 11)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 12m 45s Patch does not cause any errors with Hadoop 3.2.4 3.3.4.
+1 💚 spotless 0m 37s patch has no errors when running spotless:check.
+1 💚 spotbugs 2m 3s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 14s The patch does not generate ASF License warnings.
40m 21s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/5/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #5016
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux e86193e8d189 5.4.0-1094-aws #102~18.04.1-Ubuntu SMP Tue Jan 10 21:07:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / a854cba
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 86 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/5/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Copy link
Contributor Author

Apache9 commented Feb 10, 2023

@sunhelly Could you please try to see if this PR can also solve your problem?

And is it possible to contribute your replication test case to hbase-it?

Thanks.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 48s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 15s Maven dependency ordering for branch
+1 💚 mvninstall 2m 48s master passed
+1 💚 compile 0m 56s master passed
+1 💚 shadedjars 4m 15s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 38s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for patch
+1 💚 mvninstall 2m 46s the patch passed
+1 💚 compile 0m 56s the patch passed
+1 💚 javac 0m 56s the patch passed
+1 💚 shadedjars 4m 18s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 37s the patch passed
_ Other Tests _
+1 💚 unit 1m 45s hbase-common in the patch passed.
+1 💚 unit 211m 35s hbase-server in the patch passed.
236m 15s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/5/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR #5016
Optional Tests javac javadoc unit shadedjars compile
uname Linux cd3608f7d3ac 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / a854cba
Default Java Temurin-1.8.0_352-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/5/testReport/
Max. process+thread count 2345 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/5/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 58s Docker mode activated.
-0 ⚠️ yetus 0m 2s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for branch
+1 💚 mvninstall 3m 36s master passed
+1 💚 compile 1m 6s master passed
+1 💚 shadedjars 4m 34s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 38s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 11s Maven dependency ordering for patch
+1 💚 mvninstall 3m 36s the patch passed
+1 💚 compile 1m 5s the patch passed
+1 💚 javac 1m 5s the patch passed
+1 💚 shadedjars 4m 35s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 39s the patch passed
_ Other Tests _
+1 💚 unit 2m 13s hbase-common in the patch passed.
+1 💚 unit 210m 12s hbase-server in the patch passed.
238m 2s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/5/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #5016
Optional Tests javac javadoc unit shadedjars compile
uname Linux 144cedb192cc 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / a854cba
Default Java Eclipse Adoptium-11.0.17+8
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/5/testReport/
Max. process+thread count 2622 (vs. ulimit of 30000)
modules C: hbase-common hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5016/5/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@sunhelly
Copy link
Contributor

I tested this morning; sadly, something is still wrong... The problem is focused on one scenario: replicating mostly whole-row deletes. It seems it should not be relevant to the operation type, but I can't find any more relevant changes.
We had already reset the compression context to fix the issue in recent months; that resolved most problems and seems more stable than before. But in one circumstance the replication always gets stuck. The scenario is as follows.
There is two-way replication between cluster A and cluster B (both using WAL groups), A without WAL compression, B with WAL compression, and write operations only on A. When there are many whole-row deletes on A, the replication of A->B is OK, but the replication of B->A always gets stuck, and the stuck is not rare; it happens very easily.
I cannot reproduce this problem locally so far. Maybe it's not relevant to the decompression process; maybe something goes wrong during compression and the WAL is corrupt. I used WALPrettyPrinter to read these WALs; the printer always stopped at the same position for one WAL, with no exception output, but the end read position of the printer is in the middle of the WAL.

@Apache9
Copy link
Contributor Author

Apache9 commented Feb 10, 2023

I tested this morning; sadly, something is still wrong... The problem is focused on one scenario: replicating mostly whole-row deletes. It seems it should not be relevant to the operation type, but I can't find any more relevant changes. We had already reset the compression context to fix the issue in recent months; that resolved most problems and seems more stable than before. But in one circumstance the replication always gets stuck. The scenario is as follows. There is two-way replication between cluster A and cluster B (both using WAL groups), A without WAL compression, B with WAL compression, and write operations only on A. When there are many whole-row deletes on A, the replication of A->B is OK, but the replication of B->A always gets stuck, and the stuck is not rare; it happens very easily. I cannot reproduce this problem locally so far. Maybe it's not relevant to the decompression process; maybe something goes wrong during compression and the WAL is corrupt. I used WALPrettyPrinter to read these WALs; the printer always stopped at the same position for one WAL, with no exception output, but the end read position of the printer is in the middle of the WAL.

If WALPrettyPrinter cannot output the correct result, I think the problem is not in the replication implementation; it should be something going wrong when writing the WAL file. And I believe it will also make WAL splitting incorrect?

Did you also enable WAL value compression? Or just the dictionary-based compression...

Thanks.

@sunhelly
Copy link
Contributor

Yes, I also enabled WAL value compression. I'll check whether the stuck recurs after disabling it.
And there have been no WAL splitting issues so far. Thanks.

@Apache9
Copy link
Contributor Author

Apache9 commented Feb 10, 2023

Maybe the problem is that, in replication, we check whether we have parsed all the bytes, but in WAL splitting, we just return after getting EOF...
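That difference can be illustrated with a toy check. This is a hypothetical, heavily simplified sketch, not the actual ProtobufLogReader logic: it only shows why a silent mid-file stop surfaces as "stuck" in replication but goes unnoticed during splitting.

```java
// Toy illustration, not real HBase code: WAL splitting treats any stop as a
// normal end of file, while replication also wants to know whether the whole
// file was consumed, so leftover unparsed bytes are only noticed there.
public class EofCheckDemo {
  // Pretend-parse: consumes fixed-size entries until fewer than entrySize
  // bytes remain, and returns how many bytes were consumed.
  static int parse(byte[] file, int entrySize) {
    int consumed = 0;
    while (file.length - consumed >= entrySize) {
      consumed += entrySize;
    }
    return consumed;
  }

  // Splitting-style: any stop counts as a clean end of file.
  static boolean splitOk(byte[] file, int entrySize) {
    parse(file, entrySize);
    return true;
  }

  // Replication-style: flag the file if bytes were left unparsed.
  static boolean replicationOk(byte[] file, int entrySize) {
    return parse(file, entrySize) == file.length;
  }
}
```

With a truncated "file", the splitting-style check passes while the replication-style check fails, which matches the behavior described above.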

@sunhelly
Copy link
Contributor

Oh, cluster B really does have a data loss issue...

@Apache9
Copy link
Contributor Author

Apache9 commented Feb 10, 2023

The stuck still occurs after disabling WAL value compression.

Is it OK for your company to upload the WAL file somewhere? Then we can look at the content of the WAL file and check what the problem is...

@sunhelly
Copy link
Contributor

OK. I'll prepare one.

@sunhelly
Copy link
Contributor

It works well after disabling WAL value compression with this fix PR on our cluster. We can reproduce the replication stuck by enabling WAL value compression, while the WALPrettyPrinter stops at a position in the middle of the file without any exceptions. The stuck issue is now not related to the dictionary.
Great job! Thanks.

@Apache9
Copy link
Contributor Author

Apache9 commented Feb 11, 2023

Thanks @sunhelly for providing the useful feedback.

Let me merge this PR first to solve the dictionary problem.

For WAL value compression, it seems there are still other bugs, and @apurtell also pointed out that there are some tricks in the buffer reuse mechanism; I will dig more and file other issues to try to fix them.

Thanks.

@Apache9 Apache9 merged commit 833b10e into apache:master Feb 11, 2023
Apache9 added a commit that referenced this pull request Feb 11, 2023
…pressed WAL file (#5016)

Signed-off-by: Xiaolin Ha <[email protected]>
(cherry picked from commit 833b10e)
Apache9 added a commit that referenced this pull request Feb 11, 2023
…pressed WAL file (#5016)

Signed-off-by: Xiaolin Ha <[email protected]>
(cherry picked from commit 833b10e)
Apache9 added a commit that referenced this pull request Feb 11, 2023
…pressed WAL file (#5016)

Signed-off-by: Xiaolin Ha <[email protected]>
(cherry picked from commit 833b10e)
bbeaudreault pushed a commit to HubSpot/hbase that referenced this pull request Feb 9, 2024
…g when reading compressed WAL file (apache#5016)

Signed-off-by: Xiaolin Ha <[email protected]>
(cherry picked from commit 833b10e)
vinayakphegde pushed a commit to vinayakphegde/hbase that referenced this pull request Apr 4, 2024
…pressed WAL file (apache#5016)

Signed-off-by: Xiaolin Ha <[email protected]>
(cherry picked from commit 833b10e)
(cherry picked from commit 8df3212)
Change-Id: I469fa5b5a7ba6a41c3b8b28acb57a60f33c27fe9