After enabling tiered storage, residual logs are occasionally left on the replica. #562

Open
funky-eyes opened this issue Jun 22, 2024 · 3 comments

Comments

@funky-eyes
Contributor

What happened?

After enabling tiered storage, occasional residual logs are left in the replica.
Based on what I observed, the base offsets (index values) of the rolled log segments generated by the replica differ from those generated by the leader. As a result, the segments uploaded to S3 do not correspond to the log files present locally on the replica, so the replica's local logs can never be deleted.
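
To make the mismatch concrete, here is a minimal sketch for listing the base offsets of the rolled segments in a partition directory; run it on the leader and on the replica and compare the two outputs (the directory and topic name below are placeholders, not taken from this report):

import os

# Placeholder partition directory; substitute the broker's actual log.dirs path
# and the topic-partition directory being investigated.
PARTITION_DIR = "/data01/kafka-logs/xxxxxx-0"

def base_offsets(partition_dir):
    # Kafka names each segment file after its zero-padded base offset, e.g.
    # 00000000000123456789.log, so the sorted integer prefixes are the roll points.
    return sorted(
        int(name[: -len(".log")])
        for name in os.listdir(partition_dir)
        if name.endswith(".log")
    )

print(base_offsets(PARTITION_DIR))
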
leader config:

num.partitions=3
default.replication.factor=2
delete.topic.enable=true
auto.create.topics.enable=false
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=2
transaction.state.log.min.isr=1
offsets.retention.minutes=4320
log.roll.ms=86400000
log.local.retention.ms=600000
log.segment.bytes=536870912
num.replica.fetchers=1
log.retention.ms=15811200000
remote.log.manager.thread.pool.size=4
remote.log.reader.threads=4
remote.log.metadata.topic.replication.factor=3
remote.log.storage.system.enable=true
remote.log.metadata.topic.retention.ms=180000000
rsm.config.fetch.chunk.cache.class=io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCache
rsm.config.fetch.chunk.cache.path=/data01/kafka-tiered-storage-cache
# Pick some cache size, 32 GiB here:
rsm.config.fetch.chunk.cache.size=34359738368
rsm.config.fetch.chunk.cache.retention.ms=1200000
# Prefetching size, 32 MiB here:
rsm.config.fetch.chunk.cache.prefetch.max.size=33554432
rsm.config.storage.backend.class=io.aiven.kafka.tieredstorage.storage.s3.S3Storage
rsm.config.storage.s3.bucket.name=
rsm.config.storage.s3.region=us-west-1
rsm.config.storage.aws.secret.access.key=
rsm.config.storage.aws.access.key.id=
rsm.config.chunk.size=8388608
remote.log.storage.manager.class.path=/home/admin/core-0.0.1-SNAPSHOT/*:/home/admin/s3-0.0.1-SNAPSHOT/*
remote.log.storage.manager.class.name=io.aiven.kafka.tieredstorage.RemoteStorageManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
remote.log.metadata.manager.listener.name=PLAINTEXT
rsm.config.upload.rate.limit.bytes.per.second=31457280

replica config:

num.partitions=3
default.replication.factor=2
delete.topic.enable=true
auto.create.topics.enable=false
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=2
transaction.state.log.min.isr=1
offsets.retention.minutes=4320
log.roll.ms=86400000
log.local.retention.ms=600000
log.segment.bytes=536870912
num.replica.fetchers=1
log.retention.ms=15811200000
remote.log.manager.thread.pool.size=4
remote.log.reader.threads=4
remote.log.metadata.topic.replication.factor=3
remote.log.storage.system.enable=true
#remote.log.metadata.topic.retention.ms=180000000
rsm.config.fetch.chunk.cache.class=io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCache
rsm.config.fetch.chunk.cache.path=/data01/kafka-tiered-storage-cache
# Pick some cache size, 32 GiB here:
rsm.config.fetch.chunk.cache.size=34359738368
rsm.config.fetch.chunk.cache.retention.ms=1200000
# Prefetching size, 32 MiB here:
rsm.config.fetch.chunk.cache.prefetch.max.size=33554432
rsm.config.storage.backend.class=io.aiven.kafka.tieredstorage.storage.s3.S3Storage
rsm.config.storage.s3.bucket.name=
rsm.config.storage.s3.region=us-west-1
rsm.config.storage.aws.secret.access.key=
rsm.config.storage.aws.access.key.id=
rsm.config.chunk.size=8388608
remote.log.storage.manager.class.path=/home/admin/core-0.0.1-SNAPSHOT/*:/home/admin/s3-0.0.1-SNAPSHOT/*
remote.log.storage.manager.class.name=io.aiven.kafka.tieredstorage.RemoteStorageManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
remote.log.metadata.manager.listener.name=PLAINTEXT
rsm.config.upload.rate.limit.bytes.per.second=31457280

topic config:

Dynamic configs for topic xxxxxx are:
  local.retention.ms=600000 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:local.retention.ms=600000, STATIC_BROKER_CONFIG:log.local.retention.ms=600000, DEFAULT_CONFIG:log.local.retention.ms=-2}
  remote.storage.enable=true sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:remote.storage.enable=true}
  retention.ms=15811200000 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:retention.ms=15811200000, STATIC_BROKER_CONFIG:log.retention.ms=15811200000, DEFAULT_CONFIG:log.retention.hours=168}
  segment.bytes=536870912 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:segment.bytes=536870912, STATIC_BROKER_CONFIG:log.segment.bytes=536870912, DEFAULT_CONFIG:log.segment.bytes=1073741824}
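
For readability, a quick sketch converting the raw values above to human-readable units (the numbers are copied from the configs in this report):

# Values copied from the topic/broker configs above, converted to readable units.
print(600_000 / 60_000)             # local.retention.ms -> 10.0 minutes
print(15_811_200_000 / 86_400_000)  # retention.ms       -> 183.0 days
print(536_870_912 / 1024 ** 2)      # segment.bytes      -> 512.0 MiB
print(86_400_000 / 3_600_000)       # log.roll.ms        -> 24.0 hours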

Examining the topic's segments in S3 for that time period shows that the segment indices (base offsets) of the two brokers differ.
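
For reference, a hedged sketch of how the uploaded segments for that period can be listed from S3 and compared against the local base offsets (the bucket name and key prefix below are placeholders; the plugin's actual object key layout may differ):

import boto3

# Placeholder bucket and prefix; adjust to the real bucket and to the key
# layout the tiered-storage plugin uses for this topic-partition.
BUCKET = "my-tiered-storage-bucket"
PREFIX = "xxxxxx-0/"

s3 = boto3.client("s3", region_name="us-west-1")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # The key and upload time can be compared against the base offsets
        # of the segments still present locally on the leader and the replica.
        print(obj["LastModified"], obj["Key"])
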
Searching the broker logs for the residual segment's index shows no deletion entries on either the leader or the replica. However, the segments that are in S3 for that time period appear in the leader's logs but not in the replica's. I therefore believe the issue is caused by the leader and the replica generating different log files.
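
A simple sketch of that log search (the offset and log paths below are placeholders; substitute the residual segment's actual base offset and each broker's server.log location):

# Placeholder residual base offset and broker log paths.
OFFSET = "00000000000123456789"
LOG_FILES = [
    "/path/to/leader/logs/server.log",
    "/path/to/replica/logs/server.log",
]

for path in LOG_FILES:
    with open(path, errors="replace") as f:
        for line in f:
            # Print any broker log line that mentions the residual segment's offset.
            if OFFSET in line:
                print(path, line.rstrip())
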

What did you expect to happen?

What else do we need to know?

@funky-eyes
Contributor Author

funky-eyes commented Jun 22, 2024

Restarting does not resolve the issue. The only workaround is to delete the log directory of the affected partition on the replica where the segment anomaly occurred and let it resynchronize from the leader.

@funky-eyes
Contributor Author

I'm not sure whether this is a Kafka bug or a bug in the tiered storage plugin, so I have also reported it to the Kafka community.
https://issues.apache.org/jira/browse/KAFKA-17020

@jeqo
Contributor

jeqo commented Jul 25, 2024

@funky-eyes thanks for reporting this one, and sorry for the late reply.
I agree that this is more of a broker issue than a plugin one.

I wonder if it's related to https://issues.apache.org/jira/browse/KAFKA-16890. Let's follow up on the Kafka ticket and close it here once we have a resolution from the community. Thanks!
