Best Practices for Transforming Multiple Log Sources #110
Hey @Bin-security 👋 There are two potential answers here, so I'll try to answer both and then we can discuss more.

All data evaluation (filtering, differentiation, etc.) happens via conditions that are applied to processors, and the best way to learn how this works is by following this recipe. This is required no matter which architecture you use.

That said, what you are describing may be more of an architecture question, and the architecture you decide on directly impacts how you write configurations and filter data. Here's a diagram of the architecture that I think you're describing (feel free to redraw it in a reply if needed, I use mermaid.live):

```mermaid
graph TD
    %% core infrastructure
    s3_1(S3 Bucket)
    s3_2(S3 Bucket)
    s3_3(S3 Bucket)
    sns_topic(SNS Topic)

    %% Lambda data processing
    lambda[Lambda]

    %% ingest
    s3_1 ---|Push| sns_topic
    s3_2 ---|Push| sns_topic
    s3_3 ---|Push| sns_topic
    sns_topic ---|Push| lambda
```
Multiple S3 buckets containing different datasets aggregate into a single SNS topic and a single Substation node. If you use this architecture, then your Lambda needs to differentiate and parse every dataset using a single configuration. This requires more complex configurations, but it's possible to do. If you go this route, then I recommend using the S3 metadata as part of your conditions to filter the data. For example, this inspector will only match data that came from a specific S3 bucket:
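A rough sketch of what that inspector could look like, assuming the source bucket is recorded under a `metadata` key by the ingest handler (the field names here are illustrative, not Substation's exact schema; check the condition docs for your version):

```jsonnet
// Illustrative condition inspector: matches events whose source S3
// bucket (assumed to be stored under a metadata key during ingest)
// equals a given name. Field names are hypothetical sketches of the
// inspector schema, not the exact API.
{
  type: 'strings',
  settings: {
    key: 'metadata.bucket',           // hypothetical metadata key for the source bucket
    type: 'equals',                   // match mode (assumed)
    expression: 'my-dataset-bucket',  // hypothetical bucket name
  },
}
```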
You would need to combine that inspector with other inspectors to process data in each dataset individually. Here's an example of a config that drops specific events if they came from a specific S3 bucket:
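A sketch of that config, again with illustrative field names (Substation processors accept an optional condition made of inspectors, but verify the exact settings and operator names against your version):

```jsonnet
// Illustrative config fragment: drop heartbeat events, but only when
// they arrived from one specific bucket. The condition ANDs two
// inspectors; everything else passes through untouched. The metadata
// key, event field, and 'heartbeat' value are all hypothetical.
{
  type: 'drop',
  settings: {
    condition: {
      operator: 'all',  // every inspector must match (assumed operator name)
      inspectors: [
        // the event came from this bucket...
        {
          type: 'strings',
          settings: {
            key: 'metadata.bucket',
            type: 'equals',
            expression: 'my-dataset-bucket',
          },
        },
        // ...and is an event type we don't want to keep
        {
          type: 'strings',
          settings: {
            key: 'event.type',
            type: 'equals',
            expression: 'heartbeat',
          },
        },
      ],
    },
  },
}
```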
Alternatively, you can change your architecture to this (which is closer to a traditional pub-sub model, with different consumers that each process their own data) ...

```mermaid
graph TD
    %% core infrastructure
    s3_1(S3 Bucket)
    s3_2(S3 Bucket)
    s3_3(S3 Bucket)
    sns_topic(SNS Topic)

    %% Lambda data processing
    sns_sink_lambda1[Lambda]
    sns_sink_lambda2[Lambda]
    sns_sink_lambda3[Lambda]

    %% ingest
    s3_1 ---|Push| sns_topic
    s3_2 ---|Push| sns_topic
    s3_3 ---|Push| sns_topic
    sns_topic ---|Push| sns_sink_lambda1
    sns_topic ---|Push| sns_sink_lambda2
    sns_topic ---|Push| sns_sink_lambda3
```
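One detail worth noting for this model: a plain SNS subscription delivers every message to every Lambda, so each consumer still has to select its own dataset, either with a bucket condition in its Substation config (as above) or at the subscription itself. Assuming your buckets publish the standard S3 event notification JSON and your subscriptions use payload-based filtering (`FilterPolicyScope: MessageBody`), a filter policy like this would limit one consumer to one bucket (the bucket name is hypothetical):

```json
{
  "Records": {
    "s3": {
      "bucket": {
        "name": ["my-dataset-1-bucket"]
      }
    }
  }
}
```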
... or this (a set of fully independent pipelines):

```mermaid
graph TD
    %% core infrastructure
    s3_1(S3 Bucket)
    s3_2(S3 Bucket)
    s3_3(S3 Bucket)
    sns_topic1(SNS Topic)
    sns_topic2(SNS Topic)
    sns_topic3(SNS Topic)

    %% Lambda data processing
    sns_sink_lambda1[Lambda]
    sns_sink_lambda2[Lambda]
    sns_sink_lambda3[Lambda]

    %% ingest
    s3_1 ---|Push| sns_topic1
    s3_2 ---|Push| sns_topic2
    s3_3 ---|Push| sns_topic3
    sns_topic1 ---|Push| sns_sink_lambda1
    sns_topic2 ---|Push| sns_sink_lambda2
    sns_topic3 ---|Push| sns_sink_lambda3
```
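Because each Lambda in this layout only ever receives one dataset, its config needs no source-discrimination conditions at all; it reduces to that dataset's own processors. A minimal sketch (the processor name and settings are illustrative):

```jsonnet
// Illustrative per-dataset config for the isolated-pipeline layout:
// no bucket-matching condition is needed because this Lambda is the
// only consumer of its topic. Processor name and settings are
// hypothetical sketches, not the exact API.
{
  processors: [
    // normalize this dataset's timestamp field (hypothetical keys)
    {
      type: 'copy',
      settings: {
        key: 'ts',
        set_key: 'timestamp',
      },
    },
  ],
}
```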
At Brex we tend to deploy pipelines that follow the third diagram: our pipelines are isolated by dataset. This simplifies our configurations and gives us more control over the infrastructure. Hope that helps, feel free to add to the discussion if needed!
Hi @jshlbrd and @brexhq/substation. What are the best practices for supporting multiple log sources in Substation? We have many log sources, each with its own schema. Many of them are stored in S3 buckets, and the buckets publish notification events to a single SNS topic that Substation subscribes to. We plan to do event-specific filtering and enrichment on the logs in Substation. Does Substation support differentiating between multiple event types in the transform configuration? What are the best practices for transforming multiple event types in a single Substation instance? Thanks for reading.