-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][monitor] Fix the partitioned publisher topic stat aggregation bug #18807
[fix][monitor] Fix the partitioned publisher topic stat aggregation bug #18807
Conversation
Codecov Report
@@ Coverage Diff @@
## master #18807 +/- ##
============================================
+ Coverage 47.46% 48.60% +1.14%
- Complexity 10727 11308 +581
============================================
Files 711 712 +1
Lines 69456 72833 +3377
Branches 7452 8303 +851
============================================
+ Hits 32964 35404 +2440
- Misses 32810 33577 +767
- Partials 3682 3852 +170
Flags with carried forward coverage won't be shown. Click here to find out more.
|
if (index == this.publishers.size()) { | ||
PublisherStatsImpl newStats = new PublisherStatsImpl(); | ||
newStats.setSupportsPartialProducer(false); | ||
this.publishers.add(newStats); | ||
} | ||
this.publishers.get(index) | ||
.add((PublisherStatsImpl) s); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nits:
The publishers
field is not thread-safe from before. Potentially access to it from concurrent threads.
Moreover, it doesn't guarantee the size is index + 1
in L262 because it isn't held mutual lock between L259 and L262.
pulsar/pulsar-common/src/main/java/org/apache/pulsar/common/policies/data/stats/TopicStatsImpl.java
Line 111 in 1107af2
private List<PublisherStatsImpl> publishers; |
pulsar/pulsar-common/src/main/java/org/apache/pulsar/common/policies/data/stats/TopicStatsImpl.java
Line 179 in 1107af2
this.publishers = new ArrayList<>(); |
We want to avoid IndexOutOfBoundsException completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
I see that *TopicStatsImpl
is not a thread-safe class. Currently, the caller needs to externally synchronize the access if concurrent threads access the same instance.
Potentially access to it from concurrent threads.
Could you provide this example? I see that PartitionedTopicStatsImpl
is instantiated by a single thread, and add(..)
gets called from the same thread.
PartitionedTopicStatsImpl stats = new PartitionedTopicStatsImpl(partitionMetadata);
...
FutureUtil.waitForAll(topicStatsFutureList).handle((result, exception) -> {
...
stats.add(statFuture.get());
if (perPartition) {
stats.getPartitions().put(topicName.getPartition(i).toString(), statFuture.get());
...
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case may not be in the current codes. If contributors already know "the caller needs to externally synchronize the access," we don't care in this section. I'm worried about misimplementation. So, I think it's okay to write comments about the current spec instead of fixing it to follow my comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, sorry. Do you mean we need to comment that PartitionedTopicStatsImpl
is not thread-safe?
Otherwise, PTAL again (or could you approve this?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we need to comment that PartitionedTopicStatsImpl is not thread-safe?
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments.
@eolivelli could you review this PR by any chance? |
7be4cdd
to
7a7aaec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to split it into 2 PRs? One is to fix the aggregation issue, another change the default value.
So that we can ship the fix to release branches, and the default configuration change only applies to the master branch (release in the next major version).
if (index == this.nonPersistentPublishers.size()) { | ||
NonPersistentPublisherStatsImpl newStats = new NonPersistentPublisherStatsImpl(); | ||
newStats.setSupportsPartialProducer(false); | ||
this.nonPersistentPublishers.add(newStats); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to add some description for this if
branch. Like why we need to use a newStats if index == this.nonPersistentPublishers.size()
. And what does index == this.nonPersistentPublishers.size()
exactly means. Frankly, it's not easy to understand. I think if we have some description, it will help the readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments.
…ult to true" This reverts commit bc37696.
I reverted the config default change in this PR. |
I have added the label |
…ug (apache#18807) (cherry picked from commit 8790ed1) (cherry picked from commit b5b2de6)
Motivation
The index-based publisher stat aggregation(configured by
aggregatePublisherStatsByProducerName
=false, default, triggered bypulsar-admin topics partitioned-stats api
) can burst memory and wrongly aggregate publisher metrics if each partition stat returns a different size or order of the publisher stat list.In the worst case, if there are many partitions and publishers created and closed concurrently, the current code can create PublisherStatsImpl objects exponentially, and this can cause a high GC time or OOM.
Discussion Thread:
https://lists.apache.org/thread/vofv1oz0wvzlwk4x9vk067rhkscn8bqo
Issue Code reference:
2c428f7#diff-02e50674125a597f8ae3405a884590759f2fdaa10104cea511d5ea44b6ff6490R224-R247
Modifications
Verifying this change
This change added unit tests.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: heesung-sn#15