[improve] Introduce the sync() API to ensure consistency on reads during critical metadata operation paths #18518
@eolivelli Please add the following content to your PR description and select a checkbox:
|
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #18518 +/- ##
============================================
+ Coverage 47.11% 47.21% +0.09%
- Complexity 10595 10685 +90
============================================
Files 710 711 +1
Lines 69423 69455 +32
Branches 7449 7452 +3
============================================
+ Hits 32709 32792 +83
+ Misses 33037 32993 -44
+ Partials 3677 3670 -7
Apart from the fact that Is this to fix the problem of a partitioned topic created and immediately used by a client and seeing error? The part that is important to highlight is that That would mean that even when using a fully-consistent data store, we would still have the problem that we cannot cache this information. I would rather find a better way, where we only force the sync on very specific occasions. eg:
|
LGTM. By the way, should we add a test for this?
Hi @merlimat
No, What this PR is trying to fix is that when the metadata attribute
I agree. +1 |
Should we add a test for this?
Thank you @merlimat and @poorbarcode for your comments. @merlimat this proposal is only to cover very specific cases, as you say we are already doing well for most of the cases. The case in which you need "sync()" is when there is this combo happening:
In this case, when broker B executes op2, we must guarantee that the value it observes includes the effects of op1; otherwise op3 may be executed using stale data (any compare-and-set operation in op3 is not on the same znode as op2, so we have no protection, no BadVersion). In the partitioned topic deletion example we have:
You need sync() (on broker B) before op2 because the action that causes op2 is out-of-band with respect to ZooKeeper, so broker B may use a stale version of the znode (either because it is connected to a follower that is not up-to-date, or because it has a stale value in the local cache and the watch notification still hasn't arrived, mostly because the follower is still not up-to-date). So we have to use this "consistent refreshAndGet" only when this pattern happens. There is no need to change Pulsar everywhere, only SOME of the write paths that touch metadata. I would really reject any proposal to call sync() before every read, as it would defeat the purpose of the local cache and also kill performance (as you say, because sync() is a dummy write), or before every write, as CAS already covers 99% of the operations |
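The stale-read pattern described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Leader`, `Follower`, the `/topic` path and the `"exists"`/`"deleted"` values are hypothetical and do not correspond to the ZooKeeper client API or to Pulsar code. The follower's `sync()` simply models "catch up with everything already committed on the leader".

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a leader/follower store; NOT the ZooKeeper client API. */
class StaleReadSketch {

    /** The leader holds the authoritative state. */
    static final class Leader {
        final Map<String, String> data = new HashMap<>();
    }

    /** A follower serves reads from a local copy that may lag behind the leader. */
    static final class Follower {
        private final Leader leader;
        private final Map<String, String> applied = new HashMap<>();

        Follower(Leader leader) {
            this.leader = leader;
        }

        /** Local read: returns whatever this follower has applied so far. */
        String read(String path) {
            return applied.get(path);
        }

        /** Models sync(): apply everything already committed on the leader. */
        void sync() {
            applied.clear();
            applied.putAll(leader.data);
        }
    }

    /** Runs op1 on "broker A", then op2 on "broker B", optionally preceded by sync(). */
    static String readAfterRedirect(boolean syncFirst) {
        Leader leader = new Leader();
        leader.data.put("/topic", "exists");
        Follower followerOfBrokerB = new Follower(leader);
        followerOfBrokerB.sync(); // broker B starts with an up-to-date view

        // op1: broker A marks the topic as deleted (the write goes through the leader).
        leader.data.put("/topic", "deleted");

        // The request is redirected to broker B out-of-band (hypothetically over HTTP),
        // so nothing tells broker B's follower that it has to catch up.
        if (syncFirst) {
            followerOfBrokerB.sync();
        }

        // op2: broker B reads the value it will base op3 on.
        return followerOfBrokerB.read("/topic");
    }

    public static void main(String[] args) {
        System.out.println("without sync: " + readAfterRedirect(false));
        System.out.println("with sync:    " + readAfterRedirect(true));
    }
}
```

Without the `sync()` call, op2 still sees the value from before op1, which is exactly the window in which op3 would act on stale data.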
@poorbarcode I will add tests only when we reach consensus on the approach. |
@merlimat do you have more comments please ? |
The pr had no activity for 30 days, mark with Stale label. |
@merlimat @codelipenghui @hangc0276 @rdhabalia I believe that this PR adds value and that we need to complete this work. I can't find any other way to achieve the goals of this PR |
@eolivelli I want to cherry-pick this PR to |
I sent a discuss to cherry-pick this PR to |
Cherry-picked. Since the method |
This is only a POC, opened for discussion
Explanation
When an operation is redirected to another broker, the local view of the metadata on that broker may not be up-to-date with the view seen by the initial broker.
To guarantee causal consistency among different peers, you have two ways using the ZooKeeper APIs:
Both operations force (or wait for) the ZooKeeper server to which the client is connected to be in sync with the leader ZooKeeper server, and also ensure that the local ZooKeeper client is up-to-date.
See more at https://zookeeper.apache.org/doc/r3.8.0/zookeeperProgrammers.html at the "Consistency Guarantees" paragraph.
When control passes from one broker to another, the second broker needs to ensure that its local view is not stale.
We can do this by using the sync() API and then invalidating the local MetadataCache.
The guarantee here is that the operation executed on the initial broker "happened before" the operation on the second broker.
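A minimal sketch of the "sync, invalidate, then read" sequence follows. The `Store` and `Cache` types, their method signatures, and the `LaggingStore` test double are assumptions made for illustration; they are not Pulsar's actual `MetadataStore` or `MetadataCache` interfaces, and only the method name `refreshAndGetAsync` is taken from this PR.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class RefreshAndGetSketch {

    /** Hypothetical stand-in for a metadata store; not Pulsar's real interface. */
    interface Store {
        CompletableFuture<Void> sync(String path);
        CompletableFuture<Optional<String>> get(String path);
    }

    /** Hypothetical read-through cache in front of the store. */
    static final class Cache {
        private final Store store;
        private final Map<String, Optional<String>> local = new ConcurrentHashMap<>();

        Cache(Store store) {
            this.store = store;
        }

        /** Normal read path: serve from the local cache, load from the store on a miss. */
        CompletableFuture<Optional<String>> getAsync(String path) {
            Optional<String> cached = local.get(path);
            if (cached != null) {
                return CompletableFuture.completedFuture(cached);
            }
            return store.get(path).thenApply(value -> {
                local.put(path, value);
                return value;
            });
        }

        /** Consistent read path: sync the store, drop the cached entry, then re-read. */
        CompletableFuture<Optional<String>> refreshAndGetAsync(String path) {
            return store.sync(path).thenCompose(ignored -> {
                local.remove(path);
                return getAsync(path);
            });
        }
    }

    /** In-memory store whose reads lag behind committed writes until sync() is called. */
    static final class LaggingStore implements Store {
        final Map<String, String> committed = new ConcurrentHashMap<>();
        private final Map<String, String> visible = new ConcurrentHashMap<>();

        @Override
        public CompletableFuture<Void> sync(String path) {
            visible.clear();
            visible.putAll(committed);
            return CompletableFuture.completedFuture(null);
        }

        @Override
        public CompletableFuture<Optional<String>> get(String path) {
            return CompletableFuture.completedFuture(Optional.ofNullable(visible.get(path)));
        }
    }

    /** Returns the values seen by a cached read and by a refreshed read after a remote write. */
    static String[] demo() {
        LaggingStore store = new LaggingStore();
        store.committed.put("/p", "v1");
        store.sync("/p").join();
        Cache cache = new Cache(store);
        cache.getAsync("/p").join(); // warms the local cache with v1

        store.committed.put("/p", "v2"); // write made by another broker, not yet visible here

        String cached = cache.getAsync("/p").join().orElse(null);
        String fresh = cache.refreshAndGetAsync("/p").join().orElse(null);
        return new String[] {cached, fresh};
    }

    public static void main(String[] args) {
        String[] result = demo();
        System.out.println("getAsync:           " + result[0]);
        System.out.println("refreshAndGetAsync: " + result[1]);
    }
}
```

Keeping `getAsync` as the default path and reserving `refreshAndGetAsync` for the few redirect-driven write paths preserves the benefit of the local cache for everything else.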
A good example of the need for this feature is described in PR #18193, at comment https://github.com/apache/pulsar/pull/18193/files#r1005919247
Modifications
Add to MetadataStore the support for using the sync() API and add refreshAndGetAsync() to MetadataCache.
Fix the problem reported by https://github.com/apache/pulsar/pull/18193/files#r1005919247 regarding Partitioned Topic deletion (the same fix should be applied to Namespace deletion, about the Policies#deleted flag)
PR in the forked repository: eolivelli#20
doc
doc-required
doc-not-needed
doc-complete