Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix Broker was failing to create producer with broken schema ledger #23395

Merged
merged 1 commit into from
Oct 4, 2024

Conversation

rdhabalia
Copy link
Contributor

@rdhabalia rdhabalia commented Oct 3, 2024

Motivation

When schema ledger is broken and producer tries to connect without schema then broker fails to add that producer with below error.

It fixes below errors

03:46:47.861 [pulsar-io-4-63] DEBUG org.apache.pulsar.broker.service.BrokerService - No autoTopicCreateOverride policy found for persistent://ncp/global/content.ncp/url_uuid_alias.prod
03:46:47.861 [pulsar-io-4-63] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ncp/global/content.ncp/url_uuid_alias.prod] Storage size = [0], backlog quota limit
 [53687091200]
03:46:47.861 [pulsar-io-4-63] DEBUG org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage - [ncp/content.ncp/url_uuid_alias.prod] Fetching schema from store
03:46:47.862 [pulsar-io-4-63] DEBUG org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage - [ncp/content.ncp/url_uuid_alias.prod] Got schema locator Optional[org.apache.pulsar.broker.se
rvice.schema.BookkeeperSchemaStorage$LocatorEntry@2a675a45]
03:46:47.862 [pulsar-io-4-63] DEBUG org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage - Reading schema entry from ledgerId: 742584758
entryId: 0

03:46:47.862 [pulsar-io-4-63] DEBUG org.apache.pulsar.metadata.bookkeeper.PulsarLedgerManager - No such ledger: 742584758 at path /ledgers/07/4258/L4758
03:46:47.862 [pulsar-io-4-63] ERROR org.apache.bookkeeper.meta.CleanupLedgerManager - Failed on operating ledger metadata: -25
03:46:47.862 [pulsar-io-4-63] DEBUG org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage - [ncp/content.ncp/url_uuid_alias.prod] Get operation completed. res=null -- ex=java.util.concu
rrent.CompletionException: org.apache.pulsar.broker.service.schema.exceptions.SchemaException: No such ledger exists on Metadata Server -  ledger=742584758 - operation=Failed to open ledger

Modifications

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@rdhabalia rdhabalia self-assigned this Oct 3, 2024
@rdhabalia rdhabalia added area/broker doc-not-needed Your PR changes do not impact docs ready-to-test labels Oct 3, 2024
@codecov-commenter
Copy link

codecov-commenter commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 74.59%. Comparing base (bbc6224) to head (e8fddf0).
Report is 636 commits behind head on master.

Files with missing lines Patch % Lines
...rg/apache/pulsar/broker/service/AbstractTopic.java 66.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23395      +/-   ##
============================================
+ Coverage     73.57%   74.59%   +1.02%     
- Complexity    32624    34047    +1423     
============================================
  Files          1877     1936      +59     
  Lines        139502   145368    +5866     
  Branches      15299    15895     +596     
============================================
+ Hits         102638   108439    +5801     
+ Misses        28908    28623     -285     
- Partials       7956     8306     +350     
Flag Coverage Δ
inttests 27.76% <33.33%> (+3.18%) ⬆️
systests 24.51% <33.33%> (+0.19%) ⬆️
unittests 73.94% <66.66%> (+1.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...rg/apache/pulsar/broker/service/AbstractTopic.java 87.48% <66.66%> (-0.51%) ⬇️

... and 615 files with indirect coverage changes

@rdhabalia rdhabalia merged commit aa12561 into apache:master Oct 4, 2024
54 of 55 checks passed
@rdhabalia rdhabalia deleted the schema_producer branch October 4, 2024 17:51
@Technoboy- Technoboy- added this to the 4.0.0 milestone Oct 9, 2024
@rdhabalia
Copy link
Contributor Author

I am also adding email thread reply here

I want to revert this PR. Background: When the schemas of a topic are lost, all of the messages in the topic can not be consumed successfully, and producers can not publish messages anymore.

We must not revert this PR and broker must not allow topic unavailability in any cases and we can't rely on manual operational efforts to fix any topic server side issue
There are already many issues created related to the lost schema ledgers so, it is a very well-known issue that the topic becomes unavailable and producers are not able to publish messages anymore just because of server issues which is extremely critical for mission-critical usecases and not acceptable as well. Because of that brokers must be fault tolerant to recover in case of missing ledgers. Broker already handles such issues during manged-ledger or managed-cursor legers but it should also handle in case of schema ledger as well and make sure that broker won't impact topic's unavailability. It is also very important to just not rely on operational effort to fix such unavailability issues manually but broker should have a mechanism to recover by itself,
Therefore, this PR is important to make sure tenants don't face topic unavailability due to well known issues of missing schema ledgers.

@poorbarcode
Copy link
Contributor

poorbarcode commented Oct 9, 2024

@rdhabalia

Answered in discussion channel, we should do this. User can delete the broken schemas manually when they received such error if they wanted


@vineeth1995

Do you know the risk of this PR before you approved it? Could you join this discussion and throw your point?

@rdhabalia
Copy link
Contributor Author

User can delete the broken schemas manually when they received such error if they wanted

First of all user can't delete ledgers from the metadata store as ledger concept has to be abstract to user. System admin may be able to delete such ledgers but that is not possible for the systems which is running millions of topics and this issue is happening with multiple ledgers at the same time and manual efforts are not possible to address this issue.

@rdhabalia
Copy link
Contributor Author

created PR: #23428 it will switch back to previous behavior unless force-flag is configured and ledger error is non-recoverable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker doc-not-needed Your PR changes do not impact docs ready-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants