
[RFC] Core Peer Forwarding #700

Closed · 17 tasks done

dlvenable opened this issue Dec 3, 2021 · 3 comments
Labels: proposal (Proposed major changes to Data Prepper)

dlvenable commented Dec 3, 2021

Background

The background for this change is explained in #699.

Proposal

Data Prepper will include peer forwarding as a core feature which any plugin can use. The aggregate plugin defined in #699 will use this new feature.

Design

The proposed design is to create a more general Peer Forwarder as part of Data Prepper Core. In this design, any plugin can request peer forwarding of events between Data Prepper nodes. Peer Forwarder takes Events, groups them by plugin-defined correlation values, and then sends each group to the correct Data Prepper node. It continues to use the existing hash ring approach to determine the destination.
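As a rough sketch of this routing step, grouping events by a correlation value and mapping each value onto a ring of peers might look like the following. This is illustrative only: the class, its method names, and the simple modulo in place of real consistent hashing are all assumptions, not Data Prepper's code.

```java
import java.util.*;

// Illustrative sketch (not Data Prepper's implementation) of grouping events
// by correlation value and picking a destination peer. Math.floorMod stands
// in here for the real consistent-hashing ring.
class PeerRoutingSketch {
    // Returns a batch of event ids per peer; events sharing a correlation
    // value always land on the same peer.
    static Map<String, List<String>> groupByPeer(final List<String> peers,
                                                 final Map<String, String> eventToCorrelationValue) {
        final Map<String, List<String>> batches = new HashMap<>();
        for (final Map.Entry<String, String> entry : eventToCorrelationValue.entrySet()) {
            final int slot = Math.floorMod(entry.getValue().hashCode(), peers.size());
            batches.computeIfAbsent(peers.get(slot), peer -> new ArrayList<>()).add(entry.getKey());
        }
        return batches;
    }
}
```

Any events with the same correlation value hash to the same slot, so a stateful plugin such as aggregate sees all related events on one node.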

The following diagram shows the flow of Events with the proposed Peer Forwarder.

[Diagram: Aggregation-PeerForwarding]

Peer Forwarder Configuration

The user will configure Peer Forwarder in the existing data-prepper-config.yaml file. Below is a snippet depicting how a user can configure peer-forwarding and what options are available. For brevity, the example does not show all the existing configurations related to peer discovery.

peer_forwarder:
  max_batch_event_count: 48
  port: 4910
  time_out: 300
  discovery_mode: "dns"
  domain_name: "data-prepper-cluster.my-domain.net"

This design allows for one peer-forwarder in Data Prepper. See the Alternatives and Questions below for a discussion on supporting multiple peer-forwarders.

Service Discovery Configuration

The core Peer Forwarder will use the existing service discovery options. Presently, peers can be discovered via:

  • Pre-configured static IP list
  • DNS entry
  • AWS CloudMap

Security Configuration

The peer-forwarder will support authentication and TLS. For TLS encryption, peer-forwarder can utilize the work which is planned for unifying certificate loading #364.

For authentication, peer-forwarder can use the same mechanism for securing its endpoint as was provided in #464. Additionally, it will need a new concept for authenticating requests when it acts as the client. This could be based on the server's authentication configuration so that the username and password need not be repeated.

Here is a possible secured configuration.

peer_forwarder:
  max_batch_event_count: 48
  port: 4910
  time_out: 300
  ssl: true
  certificate:
    file:
      certificate_path: /usr/share/my/path/public.cert
      private_key_path: /usr/share/my/path/private.key
  authentication:
    http_basic:
      username: admin
      password: admin
  discovery_mode: "dns"
  domain_name: "data-prepper-cluster.my-domain.net"

Peer Forwarder Communication

Peer Forwarder will send batches of Event objects. It will send them over HTTP/2 to a user-configurable port.

The model for communication is loosely defined as:

public class ForwardedEvent {
    private String event;
    private String destinationPlugin;
}

public class ForwardedEvents {
    private List<ForwardedEvent> events;
}

Each event is a string holding the serialized JSON for that event.

The Peer Forwarder also specifies the destination plugin. It must do this so that multiple aggregate plugins can use one shared peer-forwarder.
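The destination plugin field lets a receiving node dispatch each event in a batch to the correct plugin. A minimal sketch of that receiving side follows; the class and method names are hypothetical, not Data Prepper's API.

```java
import java.util.*;

// Hypothetical sketch of the receiving side of peer forwarding: a batch of
// forwarded events is split by destination plugin so that several stateful
// plugins can share one peer-forwarder. Names are illustrative only.
class ForwardedBatchDispatcher {
    static class ForwardedEvent {
        final String event;              // serialized JSON for the event
        final String destinationPlugin;  // plugin that should process it
        ForwardedEvent(final String event, final String destinationPlugin) {
            this.event = event;
            this.destinationPlugin = destinationPlugin;
        }
    }

    // Group the batch so each plugin receives only its own events.
    static Map<String, List<String>> splitByPlugin(final List<ForwardedEvent> batch) {
        final Map<String, List<String>> byPlugin = new HashMap<>();
        for (final ForwardedEvent forwarded : batch) {
            byPlugin.computeIfAbsent(forwarded.destinationPlugin, plugin -> new ArrayList<>())
                    .add(forwarded.event);
        }
        return byPlugin;
    }
}
```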

Peer Forwarder Implementation

The peer forwarder will continue to use consistent hashing and a hash ring to determine the destination node. One significant implementation change is that it will now support multiple keys for determining the hash. Peer Forwarder will perform this by appending the values together into a single string or byte array value.
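The multi-key case described above can be sketched as a small helper that appends the values into one string before hashing. This is an assumption about the approach, not the actual implementation; note the sorted iteration so every node builds an identical key.

```java
import java.util.*;

// Illustrative sketch of deriving a single hash-ring input from multiple
// correlation keys by appending the values together, as the design above
// describes. Not Data Prepper's actual code.
class CorrelationKeySketch {
    static String combinedKey(final Map<String, Object> event, final Set<String> correlationKeys) {
        final StringBuilder combined = new StringBuilder();
        // Iterate in sorted order so every node builds the identical string
        // regardless of the set's iteration order.
        for (final String key : new TreeSet<>(correlationKeys)) {
            combined.append(Objects.toString(event.get(key), ""))
                    .append('\u0000'); // separator avoids ambiguous concatenations
        }
        return combined.toString();
    }
}
```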

Peer Forwarder Plugins

Plugins requiring peer-forwarding must implement the following interface. Data Prepper will detect plugins which implement this interface and configure the peer-forwarder for that plugin.

/**
 * Add this interface to a Processor which must have peer forwarding
 * prior to processing events.
 */
interface RequiresPeerForwarding {
  /**
   * Gets the correlation keys which Peer Forwarder uses to allocate
   * Events to specific Data Prepper nodes.
   *
   * @return A set of keys
   */
  Set<String> getCorrelationKeys();
}

Data Prepper will wrap the plugin with a peer-forwarder. With this, plugins will not need to write code to route to peer-forwarder or receive from peer-forwarder. The Data Prepper pipeline will resolve the peer-forwarding.

The plugin only needs to implement the getCorrelationKeys() method. The plugin returns a set of key names which the peer-forwarder will use to determine the destination node. For example, in Trace Analytics, this could be implemented as follows.

@Override
public Set<String> getCorrelationKeys() {
  return Collections.singleton("traceId");
}

Alternatives and Questions

How will the Peer Forwarder Migrate?

This proposal is to refactor the current peer-forwarder plugin to support generic peer forwarding. Until the next major release (2.0), the existing plugin must remain available and should be left unchanged.

What Plugin Types can use Peer Forwarding?

The initial implementation will allow peer-forwarding only on Processor plugins. If you need a Source or Sink to peer-forward, please create a new GitHub issue to expand the functionality.

Multiple Peer Forwarders

Data Prepper could support multiple peer forwarders. Users would assign names so that different aggregate plugins could specify which to use. Below is a small example.

peer_forwarder:
  - name: default
    max_batch_event_count: 48
    port: 4910
    time_out: 300
    discovery_mode: "dns"
    domain_name: "data-prepper-cluster.my-domain.net"
  - name: other_forwarder
    max_batch_event_count: 48
    port: 4912
    time_out: 300
    discovery_mode: "dns"
    domain_name: "data-prepper-cluster.my-domain.net"

This could be confusing for users and there may not be a need for it. If you know of a specific use-case that would require this, please comment and explain in the issue.

Distinct Plugins

This RFC proposes core support for peer-forwarding and is based on #699. One alternative I considered is keeping peer-forwarder as distinct plugin which must run prior to the aggregate plugin.

Here is a notional pipeline definition (the details are left out for brevity).

aggregate-pipeline:
  source:
    http:
  processor:
    - grok:
    - peer-forwarder:
    - aggregate:
  sink:
    - opensearch:

Pros to proposed solution:

  • Pipeline authors need not add boilerplate peer-forwarder plugins before the aggregate plugin. This makes it easier for pipeline authors to create correct pipelines.
  • Other plugins could use peer-forwarding

Pros to alternative solution:

  • It would match the existing design of a peer-forwarder plugin and service-map-stateful plugin.
  • The peer-forwarder configuration is closer to where it is needed by being in the pipeline configuration rather than a different configuration file.
  • Single node clusters don’t need peer-forwarding and it would be easy to leave it out in such cases.

Peer Forwarder as Processor and Source

Another solution would be to create a Peer Forwarder Source and a Peer Forwarder Processor. In this approach, a pipeline author must configure the pipeline to have both the source and processor.

Here is a notional pipeline definition (the details are left out for brevity).

pre-forwarding-pipeline:
  source:
    http:
  processor:
    - grok:
    - peer-forwarder:
  sink:
    - pipeline:
        name: post-forwarded-pipeline
post-forwarded-pipeline:
  source:
    - peer-forwarder:
    - pipeline:
        name: pre-forwarding-pipeline
  processor:
    - grok:
  sink:
    - opensearch:

Pros to the proposed solution:

  • Authors don’t have to think about which plugins need peer-forwarding.
  • Authors don’t have to split their pipelines in order to get input from other nodes into the desired plugin.

Pros to the alternative solution:

  • This fits the current model better because processors are not currently able to add to the buffer.
  • There would be no need for additional support within Data Prepper core.

Peer Forwarding gRPC

The Peer Forwarder can use gRPC for communication instead of raw HTTP. This may not be necessary since Peer Forwarder can use HTTP/2 and binary messages. However, the protocol must not change within a major version since this would make two Data Preppers of the same major version incompatible with each other.

@dlvenable (Member, Author) commented:
Here is some of my thinking on how the RequiresPeerForwarding interface would work.

When the PipelineParser is creating processors, it can also find any Processors which implement RequiresPeerForwarding.

Then it can wrap those in a PeerForwardingProcessorDecorator. This class could use the decorator pattern to decorate the inner processor with peer-forwarding capabilities. It might look somewhat like:

class PeerForwardingProcessorDecorator implements Processor<Record<Event>, Record<Event>> {
    private final Processor<Record<Event>, Record<Event>> innerProcessor;
    private final PeerForwarder peerForwarder;

    PeerForwardingProcessorDecorator(final Processor<Record<Event>, Record<Event>> innerProcessor,
                                     final PeerForwarder peerForwarder) {
        this.innerProcessor = innerProcessor;
        this.peerForwarder = peerForwarder;
    }

    @Override
    public Collection<Record<Event>> execute(Collection<Record<Event>> records) {
        // Forward records belonging to other nodes; keep only those which
        // the hash ring assigns to this instance.
        final Collection<Record<Event>> recordsForThisInstance = peerForwarder.forward(records);
        return innerProcessor.execute(recordsForThisInstance);
    }
}
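The detection step in PipelineParser could then be as simple as an instanceof check while building the processor chain. A simplified, self-contained sketch follows; the nested interfaces and class names are minimal stand-ins, not Data Prepper's real types.

```java
import java.util.*;

// Simplified sketch of how PipelineParser might wrap processors that
// implement RequiresPeerForwarding. Interfaces here are stand-ins.
class WrappingSketch {
    interface Processor { Collection<String> execute(Collection<String> records); }
    interface RequiresPeerForwarding { Set<String> getCorrelationKeys(); }

    // A decorator that would hand records to the peer forwarder first;
    // here it simply delegates, since forwarding itself is out of scope.
    static class PeerForwardingProcessorDecorator implements Processor {
        private final Processor innerProcessor;
        PeerForwardingProcessorDecorator(final Processor innerProcessor) {
            this.innerProcessor = innerProcessor;
        }
        public Collection<String> execute(final Collection<String> records) {
            return innerProcessor.execute(records);
        }
    }

    // Wrap only processors that ask for peer forwarding.
    static Processor wrapIfNeeded(final Processor processor) {
        if (processor instanceof RequiresPeerForwarding) {
            return new PeerForwardingProcessorDecorator(processor);
        }
        return processor;
    }

    // Example processor that requests peer forwarding on "traceId".
    static class TraceAggregator implements Processor, RequiresPeerForwarding {
        public Collection<String> execute(final Collection<String> records) { return records; }
        public Set<String> getCorrelationKeys() { return Collections.singleton("traceId"); }
    }
}
```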

@dlvenable (Member, Author) commented:
The original RFC provides authentication using HTTP Basic. I'd like to suggest changing this to mTLS. We expect that each peer in a Data Prepper cluster is configured to use the same SSL certificate to provide encryption and verification of the server. Building on top of this, each server could also use the same SSL certificate to determine whether it trusts the client.

So the default behavior for core peer-forwarding would be to use a single certificate and private key. This single certificate/key pair would be used for normal SSL verification on the client and also allow the server to authenticate the clients.

A possible configuration might look like the following:

peer_forwarder:
  ssl: true
  certificate:
    file:
      certificate_path: /usr/share/my/path/public.cert
      private_key_path: /usr/share/my/path/private.key
  authentication:
    mtls:
  discovery_mode: "dns"
  domain_name: "data-prepper-cluster.my-domain.net"

Additionally, a Data Prepper administrator could disable authentication using a configuration along the lines of the following.

peer_forwarder:
  ssl: true
  certificate:
    file:
      certificate_path: /usr/share/my/path/public.cert
      private_key_path: /usr/share/my/path/private.key
  authentication:
    unauthenticated:
  discovery_mode: "dns"
  domain_name: "data-prepper-cluster.my-domain.net"

(The exact syntax might change, but this should at least convey the basic concept.)

In future versions of Data Prepper, we could permit other authentication schemes. But, I propose that this be the initial solution.

@dlvenable (Member, Author) commented:
Core Peer Forwarder is implemented in Data Prepper 2.0.
