Receive log data from S3 as a Source #251
There are at least two use-cases for this feature:
+1
I have created an initial draft for the S3 Source. Please see the description for details.
Does it make sense to require users to use the
@graytaylor0, that's a good question. At first, I was thinking that it may be valuable to use the
So, I am for changing this to output
I'm also interested in including bucket and key data in the S3 object. So perhaps the JSON should go into a
As an example, here is a possible input S3 object:
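For instance, a hypothetical input object (contents purely illustrative) could be a JSON array of three objects:

```json
[
  { "time": "2022-01-01T12:00:00Z", "text": "first log line" },
  { "time": "2022-01-01T12:00:01Z", "text": "second log line" },
  { "time": "2022-01-01T12:00:02Z", "text": "third log line" }
]
```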
The output would be three Events:
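A sketch of those Events if bucket and key data were added (the `bucket` and `key` field names are assumptions for discussion, not settled output):

```json
[
  { "time": "2022-01-01T12:00:00Z", "text": "first log line",  "bucket": "my-log-bucket", "key": "logs/example.json" },
  { "time": "2022-01-01T12:00:01Z", "text": "second log line", "bucket": "my-log-bucket", "key": "logs/example.json" },
  { "time": "2022-01-01T12:00:02Z", "text": "third log line",  "bucket": "my-log-bucket", "key": "logs/example.json" }
]
```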
Thoughts?
I have a few points to raise:
Some of the additional SQS properties to consider for polling:
Are we planning to use long polling to receive the messages?
Thanks @cmanning09 for bringing up these points.
@dinujoh, Regarding the
The plugin configuration
By default, the S3 plugin will use long polling. This is defined in the plugin configuration in the
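For illustration, long polling could surface in the `sqs` block roughly as follows; apart from `queue_url`, the option names are assumptions for discussion rather than settled configuration:

```yaml
source:
  s3:
    notification_type: sqs
    sqs:
      queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/s3-event-queue"
      # Hypothetical option names, for discussion only:
      maximum_messages: 10     # passed to ReceiveMessage (MaxNumberOfMessages)
      visibility_timeout: 30s  # visibility timeout for in-flight messages
      wait_time: 20s           # ReceiveMessage wait time; a value > 0 enables long polling
```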
I made a change to the proposal for how JSON is loaded. The new proposal is to choose the first JSON array found. The expectation for Data Prepper is to receive a JSON object which is a single array of many different events. This should work with systems such as AWS CloudTrail. The original proposal was to use a JSON Pointer to select the array. The disadvantage to that approach is that it is not compatible with Jackson's streaming API. Using the new proposal will allow Data Prepper to use Jackson's streaming API. This will allow Data Prepper to load only parts of the JSON into memory at a time. And it could allow for retrying large files.
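To illustrate the "first JSON array found" rule on a hypothetical document shaped like a CloudTrail log file, the codec would skip the leading scalar fields and emit one Event per element of the `Records` array:

```json
{
  "version": "1.0",
  "generated": "2022-01-01T00:00:00Z",
  "Records": [
    { "message": "first event" },
    { "message": "second event" }
  ]
}
```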
I'd like to propose adding a few other metrics:
Taking another look, I recommend that we remove the
Use-Case
Many users have external systems which write their logs to Amazon S3. These users want to use OpenSearch to analyze these logs. Data Prepper is an ingestion tool which can aid teams in extracting these logs from S3 and sending them to OpenSearch or elsewhere.
This proposal is to receive S3 event notifications, read the corresponding objects from S3, and create log Events from them.
Basic Configuration
This plugin will be a single source plugin which:
The following example shows what a basic configuration would look like.
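A minimal sketch of such a pipeline, using the option names described later in this proposal (`notification_type`, `sqs.queue_url`, `codec`, `compression`); the values and exact YAML shape are illustrative rather than final:

```yaml
log-pipeline:
  source:
    s3:
      notification_type: sqs
      compression: none
      codec:
        single-line:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/s3-event-queue"
  sink:
    - stdout:
```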
Detailed Process
The S3 Source will start a new thread for reading from S3. (The number of threads can be configured).
This thread will perform the following steps repeatedly until shutdown:

1. Call the `ReceiveMessage` API to receive messages from SQS.
   a. Parse the Message as an S3Event.
   b. Download the S3 Object which the S3Event indicates was created.
   c. Decompress the object if configured to do so.
   d. Parse the decompressed file using the configured `codec` into a list of Log `Event` objects.
   e. Write the `Log` objects into the Data Prepper buffer.
2. Call `DeleteMessageBatch` with all of the messages which were successfully processed.

Error Handling
The S3 Source will suppress exceptions which occur during processing. Any Message which is not processed correctly will not be included in the `DeleteMessageBatch` request. Thus, the message will appear in SQS again. Data Prepper expects that the SQS queue is correctly configured with a DLQ or MessageRetentionPeriod to prevent the SQS queue from filling up with invalid messages.
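One way to satisfy that expectation is to give the queue a dead-letter queue and a bounded retention period, for example with a CloudFormation definition along these lines (names and values illustrative):

```yaml
S3EventQueue:
  Type: AWS::SQS::Queue
  Properties:
    MessageRetentionPeriod: 345600          # 4 days, in seconds
    RedrivePolicy:
      deadLetterTargetArn: !GetAtt S3EventDLQ.Arn
      maxReceiveCount: 10                   # move repeatedly failing messages to the DLQ

S3EventDLQ:
  Type: AWS::SQS::Queue
```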
Codecs
The S3 Source will use configurable codecs to support multiple data formats in the S3 objects. Initially, two codecs are planned:

- `single-line` - This is used for logs which should be separated by a newline.
- `json` - A codec for parsing JSON logs.

Single Line
The `single-line` codec has no configuration items.
Below is an example S3 object.
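For instance, an object with three newline-separated lines (contents illustrative):

```
127.0.0.1 - - [01/Jan/2022:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024
127.0.0.1 - - [01/Jan/2022:12:00:01 +0000] "GET /favicon.ico HTTP/1.1" 404 0
127.0.0.1 - - [01/Jan/2022:12:00:02 +0000] "POST /login HTTP/1.1" 302 512
```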
With `single-line`, the S3 source will produce 3 Events, each with the following structure.
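Assuming each line is placed under a single key such as `message` (a reasonable reading of this proposal rather than confirmed output), each Event would look roughly like:

```json
{ "message": "127.0.0.1 - - [01/Jan/2022:12:00:00 +0000] \"GET /index.html HTTP/1.1\" 200 1024" }
```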
JSON
The `json` codec supports reading a JSON file and will create Events for each JSON object in an array. This S3 plugin is starting with the expectation that the incoming JSON is formed as a large JSON array of JSON objects. Each JSON object in that array is an Event. Thus, this codec will find the first JSON array in the JSON and will output the objects within that array as Events.

Future iterations of this plugin could allow for more customization. One possibility is to use JSON Pointer. However, the first iteration should meet many use-cases and allow for streaming the JSON to support parsing large JSON objects.
Below is an example configuration. This configures the S3 Source to read a JSON array; in the example object below, that array is under the `items` key.
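A sketch of that configuration (shape illustrative; consistent with the proposal above, the codec locates the array itself, so no pointer option appears):

```yaml
source:
  s3:
    notification_type: sqs
    sqs:
      queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/s3-event-queue"
    codec:
      json:
```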
Given the following S3 Object:
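A hypothetical object of that shape (contents illustrative):

```json
{
  "bucket-owner": "example-account",
  "items": [
    { "status": 200, "path": "/index.html" },
    { "status": 404, "path": "/favicon.ico" },
    { "status": 302, "path": "/login" }
  ]
}
```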
The S3 source will output 3 Log events:
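Each Event corresponds to one object in the `items` array, roughly:

```
{ "status": 200, "path": "/index.html" }
{ "status": 404, "path": "/favicon.ico" }
{ "status": 302, "path": "/login" }
```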
Compression
The S3 Source will support three configurations for compression:

- `none` - The object will be treated as uncompressed.
- `gzip` - The object will be decompressed using the gzip decompression algorithm.
- `automatic` - The S3 Source will examine the object key to guess if it is compressed or not. If the key ends with `.gz`, the S3 Source will attempt to decompress it using gzip. It can support other heuristics to determine if the file is compressed in future iterations.

Full Configuration Options
The configuration options include the `sqs` settings and `compression`, which accepts `none`, `gzip`, or `automatic` and defaults to `none`.
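Pulling the options above together, a fuller (still illustrative) pipeline might look like the following; any option not named elsewhere in this proposal should be read as an assumption:

```yaml
log-pipeline:
  source:
    s3:
      notification_type: sqs
      compression: automatic
      codec:
        json:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/s3-event-queue"
      # Optional filters described under "S3 Events" below:
      buckets:
        - my-log-bucket
      account_ids:
        - "123456789012"
  sink:
    - stdout:
```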
S3 Events
The S3 Source will parse all SQS Messages according to the S3 Event message structure.
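For reference, an abbreviated notification of that structure (fields trimmed, values illustrative):

```json
{
  "Records": [
    {
      "eventSource": "aws:s3",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": { "name": "my-log-bucket" },
        "object": { "key": "logs/2022/01/01/example.json.gz", "size": 1024 }
      }
    }
  ]
}
```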
The S3 Source will also parse out any event types which are not `s3:ObjectCreated:*`. These events will be silently ignored. That is, the S3 Source will remove them from the SQS Queue and will not create Events for them.

Additionally, this source will have optional `buckets` and `account_ids` lists. If supplied by the pipeline author, Data Prepper will only read objects for S3 events which match those lists. For the `buckets` list, only S3 buckets in the list are used. For the `account_ids` list, only buckets owned by accounts with matching Ids are used. If these lists are not provided, Data Prepper will read from any bucket which is owned by the accountId of the SQS queue. Use of these lists is optional.

AWS Permissions Needed
The S3 Source will require the following permissions:
- `s3:GetObject`
- `sqs:ReceiveMessage` on the queue specified in `sqs.queue_url`
- `sqs:DeleteMessageBatch` on the queue specified in `sqs.queue_url`
Possible Future Enhancements
Direct SNS Notification
The `notification_type` currently only supports SQS. Some teams may want Data Prepper to receive notifications directly from SNS and thus remove the need for an SQS queue. The `notification_type` could support an `sns` value in the future.

Additional Codecs
As needed, Data Prepper can support other codecs. Some possible candidates to consider are:
Metrics
Not Included
Tasks
Retry failed buffer writes as long as the connection is alive #1500 (Not for the initial feature release)