PDP 27: Admin Tools
Status: Under construction
Related issues: Issue 2228.
This proposal outlines the need for a set of admin tools that can help debug and fix problems with a Pravega cluster, and how those tools can be accessed. The tools are meant for accessing Pravega data structures (in ZooKeeper, Tier1, etc.) and modifying them in order to fix any problems that may have led to an unrecoverable state. They are not meant for everyday deployment activities, such as starting or stopping services, provisioning and so on.
Shell:
- Run from the command line (no UI) or from the IDE
  - Running from the IDE is desirable since some problems are subtle enough that diagnosing them properly may require stepping through the debugger.
- Be able to load/specify a configuration compatible with that already loaded into Pravega (SegmentStore and/or Controller)
- Support a pluggable model into which new commands can easily be added.
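One way to picture the pluggable model is a registry keyed by component and command name, so that new commands are added by registering a handler rather than by changing the Shell itself. The sketch below is illustrative only: `CommandRegistry`, `register`, `execute` and `commands` are hypothetical names that are not part of Pravega, and handlers are assumed to simply return the text to print.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical registry that maps "<component> <command>" to a handler. */
final class CommandRegistry {
    // Each handler receives the argument list and returns the text to print.
    private final Map<String, Function<List<String>, String>> handlers = new TreeMap<>();

    /** Registers a handler; duplicates are rejected so plugins cannot silently override each other. */
    void register(String component, String command, Function<List<String>, String> handler) {
        String key = component + " " + command;
        if (handlers.putIfAbsent(key, handler) != null) {
            throw new IllegalArgumentException("Command already registered: " + key);
        }
    }

    /** Runs the matching handler, or reports that the command is unknown. */
    String execute(String component, String command, List<String> arguments) {
        Function<List<String>, String> handler = handlers.get(component + " " + command);
        if (handler == null) {
            return "Unknown command: " + component + " " + command;
        }
        return handler.apply(arguments);
    }

    /** Lists the registered commands in sorted order, for example for a help screen. */
    List<String> commands() {
        return new ArrayList<>(handlers.keySet());
    }
}
```

Under this assumption, a new command would be wired in with a single call such as `registry.register("bookkeeper", "list-logs", handler)`, leaving the Shell's read-and-dispatch loop unchanged.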
The following are an initial set of goals for individual components. They will evolve as the need arises and as we gain more experience with Pravega itself.
Tier1/BookKeeper:
- Examine the `BookKeeperLog` metadata for a particular Log (stored in ZooKeeper)
- Edit/Delete the contents of a `BookKeeperLog`'s metadata
- Execute orphan ledger cleanup (see Issue 1165)
SegmentStore:
- Be able to diagnose a particular issue with Container Recovery (i.e., why recovery fails)
  - This should only be used for data corruption failures that would prevent a Container from ever recovering again. Failures due to external factors (such as full disks or lost network connectivity) are considered transient and can be fixed by a system admin.
- Be able to add, replace or delete the contents of an Operation in the Tier1 `DurableLog` (presumably this operation is the cause of the recovery failure).
- Be able to replace the contents of a `DataFrame` written in Tier1 (if this entire frame is believed to be corrupted)
  - This could go into the Tier1/BookKeeper section above; however, for Tier1/BookKeeper, every entry is a sequence of bytes and Tier1/BookKeeper guarantees its correctness. A `DataFrame` is a SegmentStore concept, with its own layout, and, as such, is prone to errors stemming from badly written code.
Every command inside the Shell should follow this syntax:
<component> <command> <args>
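As a minimal sketch of how a Shell line in this syntax might be split into its parts, the following assumes whitespace-separated tokens with no quoting or escaping, which the proposal does not specify. `ParsedCommand` and its fields are hypothetical names chosen for illustration, and the sketch only handles lines that have at least a component and a command.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for one parsed line of the form: component, command, then arguments. */
final class ParsedCommand {
    final String component;
    final String command;
    final List<String> arguments;

    private ParsedCommand(String component, String command, List<String> arguments) {
        this.component = component;
        this.command = command;
        this.arguments = arguments;
    }

    /** Splits a line on whitespace (assumption: no quoting) into component, command and arguments. */
    static ParsedCommand parse(String line) {
        String trimmed = line.trim();
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException("Expected at least a component and a command");
        }
        // Everything after the first two tokens is passed through as arguments.
        List<String> rest = new ArrayList<>(Arrays.asList(tokens).subList(2, tokens.length));
        return new ParsedCommand(tokens[0], tokens[1], Collections.unmodifiableList(rest));
    }
}
```

With these assumptions, `ParsedCommand.parse("container recover 3 OperationSummary")` yields the component `container`, the command `recover` and the two arguments `3` and `OperationSummary`.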
- `config <key1=value1> ...`: sets or updates the configuration to use for subsequent commands.
  - Key here refers to the keys in the config file (e.g. `pravegaservice.listeningPort`), so essentially this would be in lieu of an actual config file.
  - Most of the uses for this would be to set the ZK address so that subsequent commands can connect appropriately.
- `bookkeeper list-logs`: list all BK Logs (no details)
- `bookkeeper log-meta <log-id>`: output info about a log (metadata)
- `bookkeeper log-details <log-id>`: output info about a log (metadata + BK ledgers)
- `bookkeeper log-delete <log-id>`: delete a log's contents
- `bookkeeper ledger-cleanup`: delete orphan ledgers
- `container set-status <container-id> <online|offline>`:
  - `online`: the Container can be used by SegmentStore, but it cannot be debugged.
  - `offline`: SegmentStore cannot load the Container, but it can be debugged.
- `container recover <container-id> <verbosity>`: do a dummy recovery of the Container, without processing anything or otherwise affecting the state of the data in Tier1 (still have a ContainerMetadata, InMemLog and a ReadIndex, but no actual cache storage)
  - verbosity: `DataFrame|OperationSummary|OperationDetail` (from just reading frames without extracting operations, to just listing operations, to actually validating operations)
- `container replace-op <seq-no> <TBD>`: replace the operation with the given SeqNo with another. TBD how to load up the new operation contents.
- `container delete-op <seq-no>`: variation of `replace-op`, but it replaces an op with a no-op
- `container add-op <after-seq-no> <TBD>`: insert an operation immediately after the existing operation with the given SeqNo.
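The `config` command above takes `key=value` pairs that stand in for a config file. A minimal sketch of how such pairs might be layered over an existing configuration is shown below, assuming the configuration is held in a `java.util.Properties` object. `ConfigArgs` and `apply` are hypothetical names, and the proposal does not say how malformed pairs should be handled; rejecting them is an assumption made here.

```java
import java.util.List;
import java.util.Properties;

/** Hypothetical helper that applies key=value arguments on top of an existing configuration. */
final class ConfigArgs {
    /** Returns a copy of the current properties with the given key=value pairs set or updated. */
    static Properties apply(Properties current, List<String> keyValues) {
        Properties updated = new Properties();
        updated.putAll(current); // copy, so the caller's configuration is left untouched
        for (String keyValue : keyValues) {
            int separator = keyValue.indexOf('=');
            if (separator <= 0) {
                // Assumption: a pair with no '=' or an empty key is an error.
                throw new IllegalArgumentException("Expected key=value, got: " + keyValue);
            }
            String key = keyValue.substring(0, separator).trim();
            String value = keyValue.substring(separator + 1).trim();
            updated.setProperty(key, value);
        }
        return updated;
    }
}
```

Returning a copy rather than mutating the input keeps a failed `config` invocation from leaving the Shell with a half-applied configuration, which seems a reasonable default for a tool intended for repair work.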