
Support generic parsers/codecs #1532

Closed
dlvenable opened this issue Jun 23, 2022 · 4 comments · Fixed by #2519, #2527 or #2715
Labels: enhancement (New feature or request)

@dlvenable (Member)

Is your feature request related to a problem? Please describe.

Both the S3 Source and the HTTP Source use similar concepts of codecs for parsing input data. The S3 Source currently makes these codecs available as plugins, so they can be extended for the S3 Source. But if another source wanted to use these plugins, it would be unable to.

Describe the solution you'd like

Create a core concept in Data Prepper of source-based codecs or parsers. These should be generic enough to take any Java InputStream and produce events from them.

I propose that we base this concept on the S3 codec. It has a few advantages:

  1. It uses an InputStream. This is advantageous for large inputs.
  2. It has a Consumer for each event. This allows the source using the codec to receive Event objects and decide, independently of the codec, how best to handle them.
  3. It is not connected to HTTP in any way.

Describe alternatives you've considered (Optional)

Data Prepper can have a similar concept for output codecs/parsers. However, I see no reason to force these to be the same concept. (Implementors may choose to pair them together to avoid code duplication).
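
As a rough illustration of what a paired-but-separate sink-side concept could look like, here is a hypothetical sketch (OutputCodec and writeEvent are assumed names, not an existing interface):

public interface OutputCodec {
    /**
     * Writes a single event to the given {@link OutputStream}.
     * A sink would call this once per outgoing event.
     */
    void writeEvent(Event event, OutputStream outputStream) throws IOException;
}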

Additional context

S3 Codec interface:

public interface Codec {
    /**
     * Parses an {@link InputStream}. Implementors should call the {@link Consumer} for each
     * {@link Record} loaded from the {@link InputStream}.
     *
     * @param inputStream The input stream for the S3 object
     * @param eventConsumer The consumer which handles each event from the stream
     */
    void parse(InputStream inputStream, Consumer<Record<Event>> eventConsumer) throws IOException;
}
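
To make the calling pattern concrete, here is a minimal, hedged sketch of an implementation against this interface. LineCodec is hypothetical; it simply treats each line of the stream as one event. Record, Event, and JacksonEvent come from data-prepper-api (imports for them omitted).

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

public class LineCodec implements Codec {
    @Override
    public void parse(final InputStream inputStream, final Consumer<Record<Event>> eventConsumer) throws IOException {
        final BufferedReader reader = new BufferedReader(
                new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            // Wrap each line in an Event and hand it to the source's consumer.
            eventConsumer.accept(new Record<>(JacksonEvent.fromMessage(line)));
        }
    }
}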

HTTP Codec interface:

public interface Codec<T> {
    /**
     * Parses the request into a custom type.
     *
     * @param httpData The content of the original HTTP request
     * @return The target data type
     */
    T parse(HttpData httpData) throws IOException;
}

@dlvenable (Member, Author)

Here is a concept for how Data Prepper can provide this to pipeline authors.

Define Interfaces in data-prepper-api

The source codec interface can exist in the data-prepper-api package. In this way, any source will have access to the interface. None of the implementations should be in data-prepper-api.

It might look like the following.

public interface InputCodec {
  void parse(InputStream inputStream, Consumer<Record<Event>> eventConsumer) throws IOException; 
}

Create plugin projects for each type

Under data-prepper-plugins, create new projects which hold the implementations for each type. These projects could also contain sink codecs.

Taking CSV as an example, we could have a project: data-prepper-plugins/csv-codecs. It would be able to have both the CSV source codec and the CSV sink codec. In this way, they can share some logic and dependencies.

Use the plugin framework for loading codecs

The Data Prepper plugin framework supports arbitrary interfaces. This can follow the same pattern as HTTP authentication in Armeria.

Following the CSV example, we might have the following class in data-prepper-plugins/csv-codecs:

@DataPrepperPlugin(name = "csv",
        pluginType = InputCodec.class,
        pluginConfigurationType = CsvCodecConfig.class)
public class CsvInputCodec implements InputCodec {
 ...
}

Sources load using the plugin framework

The S3 source, for example, can load codecs from the plugin framework, similar to how the HTTP source loads its authentication provider. Unlike the HTTP source, the S3 source should not have a default value.
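
For illustration, resolving the configured codec inside a source might look roughly like the following sketch. Here pluginFactory, codecPluginName, and codecPluginSettings are hypothetical variables, and the exact PluginFactory/PluginSetting usage is an assumption about the plugin framework's API.

// Sketch: load the codec plugin named in the source's configuration.
final PluginSetting codecSetting = new PluginSetting(codecPluginName, codecPluginSettings);
final InputCodec codec = pluginFactory.loadPlugin(InputCodec.class, codecSetting);

// The source then streams each S3 object's bytes through the codec.
codec.parse(s3ObjectInputStream, record -> {
    // Hand each parsed record to the pipeline buffer (details elided).
});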

@dlvenable (Member, Author)

Based on some of the changes coming to support this, I think the interface should have some modifications to support additional flexibility.

Here are some additional capabilities that InputCodec implementations may need:

  • The content length of the stream. If the Parquet input codec needs to read all the data before creating events, it could be useful to have the total content length. With it, the Parquet input codec could use an in-memory byte array for smaller files and the file system for larger files.
  • Two forms of exception callbacks:
    • When the whole stream cannot be parsed (say it is not JSON), the parse method should throw an exception. This is already in place.
    • When a single event cannot be parsed (say a single invalid JSON object in an array), a method to report to the source that a single event could not be parsed.
  • A context. We may wish to re-use a single file for the Parquet input codec. The source can set this value to the pipeline name, or, if it uses multiple threads, to the pipeline name plus an identifier.

A sketch of interfaces supporting these follows.

public interface InputCodecStream {
  InputStream getInputStream();

  /**
   * Optional. If provided, gives the total length of the content supplied.
   */
  Integer getContentLength();
}

public interface InputCodecContext {
  Consumer<Record<Event>> getEventConsumer();

  /**
   * Gets a `Consumer` for handling any errors parsing individual events.
   * This is called when a single event cannot be parsed from the stream.
   */
  Consumer<Exception> getEventExceptionConsumer();

  /**
   * Represents a name that is unique across calls and is not used by multiple calls in parallel.
   * This could be used by the Parquet codec to re-use a file.
   */
  String getContextName();
}

public interface InputCodec {
  void parse(InputCodecStream inputCodecStream, InputCodecContext inputCodecContext) throws IOException; 
}
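
To illustrate the per-event exception callback, here is a hedged sketch of a hypothetical newline-delimited JSON codec written against the proposed interfaces. Record, Event, and JacksonEvent come from data-prepper-api, and Jackson's ObjectMapper does the JSON parsing (imports omitted).

public class NdjsonInputCodec implements InputCodec {
  private final ObjectMapper objectMapper = new ObjectMapper();

  @Override
  public void parse(final InputCodecStream inputCodecStream, final InputCodecContext inputCodecContext) throws IOException {
    final BufferedReader reader = new BufferedReader(
        new InputStreamReader(inputCodecStream.getInputStream(), StandardCharsets.UTF_8));
    String line;
    while ((line = reader.readLine()) != null) {
      try {
        // Each line is expected to be one JSON object.
        final Map<String, Object> data =
            objectMapper.readValue(line, new TypeReference<Map<String, Object>>() {});
        final Event event = JacksonEvent.builder()
            .withEventType("event")
            .withData(data)
            .build();
        inputCodecContext.getEventConsumer().accept(new Record<>(event));
      } catch (final JsonProcessingException e) {
        // Report the single bad event to the source instead of failing the whole stream.
        inputCodecContext.getEventExceptionConsumer().accept(e);
      }
    }
  }
}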

Thoughts on this approach? @kkondaka , @graytaylor0 , @umairofficial

github-project-automation bot moved this from In progress to To do in Data Prepper Tracking Board on Apr 24, 2023
@dlvenable (Member, Author)

I'm re-opening this issue to improve the interface before we release 2.3.

@dlvenable (Member, Author)

In order to support Parquet codecs, we may need to update the codec interface to support seekable input.

Ideally, this means we have two forms of codecs: a base codec and a seekable codec. It is possible that not all sources will be able to provide random access to the underlying bytes.
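
One possible shape, sketched with hypothetical names (SeekableInputCodec and SeekableInputStream are assumptions, not settled API):

public interface SeekableInputCodec extends InputCodec {
  /**
   * Parses input that supports random access, which formats such as Parquet
   * need in order to read footers and column chunks.
   */
  void parse(SeekableInputStream seekableInputStream, InputCodecContext inputCodecContext) throws IOException;
}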
