Best Practices for Transforming Multiple Log Sources #110
Hey @Bin-security 👋 There are two potential answers here, so I'll try to answer both and then we can discuss more.

All data evaluation (filtering, differentiation, etc.) happens via conditions that are applied to processors, and the best way to learn how this works is by following this recipe. This is required no matter which architecture you use.

That said, what you are describing may be more of an architecture question, and the architecture you decide on directly impacts how you write configurations and filter data. Here's a diagram of the architecture that I think you're describing (feel free to redraw it in a reply if needed, I use mermaid.live):

```mermaid
graph TD
    %% core infrastructure
    s3_1(S3 Bucket)
    s3_2(S3 Bucket)
    s3_3(S3 Bucket)
    sns_topic(SNS Topic)

    %% Lambda data processing
    lambda[Lambda]

    %% ingest
    s3_1 ---|Push| sns_topic
    s3_2 ---|Push| sns_topic
    s3_3 ---|Push| sns_topic
    sns_topic ---|Push| lambda
```
Multiple S3 buckets containing different datasets aggregate into a single SNS topic and a single Substation node. If you use this architecture, then your Lambda needs to differentiate and parse every dataset using a single configuration. This requires more complex configurations, but it's possible to do. If you go this route, then I recommend using the S3 metadata as part of your conditions to filter the data. For example, this inspector will only match data that came from a specific S3 bucket:
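A rough sketch of what that inspector could look like, assuming the source bucket is recorded under a `metadata` key by the ingest handler (the field names here are illustrative, not Substation's exact schema; check the condition docs for your version):

```jsonnet
// Illustrative condition inspector: matches events whose source S3
// bucket (assumed to be stored under a metadata key during ingest)
// equals a given name. Field names are hypothetical sketches of the
// inspector schema, not the exact API.
{
  type: 'strings',
  settings: {
    key: 'metadata.bucket',           // hypothetical metadata key for the source bucket
    type: 'equals',                   // match mode (assumed)
    expression: 'my-dataset-bucket',  // hypothetical bucket name
  },
}
```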
You would need to combine that inspector with other inspectors to process data in each dataset individually. Here's an example of a config that drops specific events if they came from a specific S3 bucket:
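A sketch of that config, again with illustrative field names (Substation processors accept an optional condition made of inspectors, but verify the exact settings and operator names against your version):

```jsonnet
// Illustrative config fragment: drop heartbeat events, but only when
// they arrived from one specific bucket. The condition ANDs two
// inspectors; everything else passes through untouched. The metadata
// key, event field, and 'heartbeat' value are all hypothetical.
{
  type: 'drop',
  settings: {
    condition: {
      operator: 'all',  // every inspector must match (assumed operator name)
      inspectors: [
        // the event came from this bucket...
        {
          type: 'strings',
          settings: {
            key: 'metadata.bucket',
            type: 'equals',
            expression: 'my-dataset-bucket',
          },
        },
        // ...and is an event type we don't want to keep
        {
          type: 'strings',
          settings: {
            key: 'event.type',
            type: 'equals',
            expression: 'heartbeat',
          },
        },
      ],
    },
  },
}
```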
Alternatively, you can change your architecture to this (which is closer to a traditional pub-sub model, with different consumers that each process their own data) ...

```mermaid
graph TD
    %% core infrastructure
    s3_1(S3 Bucket)
    s3_2(S3 Bucket)
    s3_3(S3 Bucket)
    sns_topic(SNS Topic)

    %% Lambda data processing
    sns_sink_lambda1[Lambda]
    sns_sink_lambda2[Lambda]
    sns_sink_lambda3[Lambda]

    %% ingest
    s3_1 ---|Push| sns_topic
    s3_2 ---|Push| sns_topic
    s3_3 ---|Push| sns_topic
    sns_topic ---|Push| sns_sink_lambda1
    sns_topic ---|Push| sns_sink_lambda2
    sns_topic ---|Push| sns_sink_lambda3
```
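One detail worth noting for this model: a plain SNS subscription delivers every message to every Lambda, so each consumer still has to select its own dataset, either with a bucket condition in its Substation config (as above) or at the subscription itself. Assuming your buckets publish the standard S3 event notification JSON and your subscriptions use payload-based filtering (`FilterPolicyScope: MessageBody`), a filter policy like this would limit one consumer to one bucket (the bucket name is hypothetical):

```json
{
  "Records": {
    "s3": {
      "bucket": {
        "name": ["my-dataset-1-bucket"]
      }
    }
  }
}
```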
... or this (a set of fully independent pipelines):

```mermaid
graph TD
    %% core infrastructure
    s3_1(S3 Bucket)
    s3_2(S3 Bucket)
    s3_3(S3 Bucket)
    sns_topic1(SNS Topic)
    sns_topic2(SNS Topic)
    sns_topic3(SNS Topic)

    %% Lambda data processing
    sns_sink_lambda1[Lambda]
    sns_sink_lambda2[Lambda]
    sns_sink_lambda3[Lambda]

    %% ingest
    s3_1 ---|Push| sns_topic1
    s3_2 ---|Push| sns_topic2
    s3_3 ---|Push| sns_topic3
    sns_topic1 ---|Push| sns_sink_lambda1
    sns_topic2 ---|Push| sns_sink_lambda2
    sns_topic3 ---|Push| sns_sink_lambda3
```
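Because each Lambda in this layout only ever receives one dataset, its config needs no source-discrimination conditions at all; it reduces to that dataset's own processors. A minimal sketch (the processor name and settings are illustrative):

```jsonnet
// Illustrative per-dataset config for the isolated-pipeline layout:
// no bucket-matching condition is needed because this Lambda is the
// only consumer of its topic. Processor name and settings are
// hypothetical sketches, not the exact API.
{
  processors: [
    // normalize this dataset's timestamp field (hypothetical keys)
    {
      type: 'copy',
      settings: {
        key: 'ts',
        set_key: 'timestamp',
      },
    },
  ],
}
```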
At Brex we tend to deploy pipelines that follow the third diagram: our pipelines are isolated by dataset. This simplifies our configurations and gives us more control over the infrastructure. Hope that helps, feel free to add to the discussion if needed!
Hi @jshlbrd and @brexhq/substation. What are the best practices for supporting multiple log sources in Substation? We have many log sources, each with its own schema. Many of them are stored in S3 buckets, and the buckets publish notification events to a single SNS topic that Substation subscribes to. We plan to do event-specific filtering and enrichment on the logs in Substation. Does Substation support differentiating between multiple event types in the transform configuration? What are the best practices for transforming multiple event types in a single Substation instance? Thanks for reading.