
Externalize BOM ingestion pipeline #633

Open
nscuro opened this issue Jun 27, 2023 · 2 comments · May be fixed by DependencyTrack/hyades-apiserver#794
Labels: architecture, component/api-server, enhancement (New feature or request), p3 (Nice-to-have features), size/XL (Higher effort)
nscuro commented Jun 27, 2023

At the moment, processing of uploaded BOMs happens entirely in memory.

`BomUploadProcessingTask`s are enqueued on the internal task queue (see https://github.com/DependencyTrack/hyades/blob/main/WTF.md#why) and processed by the `EventService` thread pool.

The current design has some downsides:

  1. When the API server crashes or is stopped, queued tasks are lost. Users have to re-upload their BOM after the API server restarts, which may or may not be practical depending on their workflows.
  2. Processing cannot be shared among multiple instances of the API server.
  3. We do not store the original BOM, because it is not practical to store many large documents in an RDBMS.
  4. If BOMs are re-uploaded for the same project in close succession, we'll run into race conditions, because BOM ingestion is not executed in a single, large DB transaction.

The proposed enhancement involves storing uploaded BOMs in a Koala-compatible system (e.g. the CycloneDX BOM Repository Server) and publishing "BOM uploaded" events to Kafka. Consumers (the API server or specialized workers) consume from the Kafka topic and perform the actual ingestion into the database.

```mermaid
sequenceDiagram
    Client->>+API Server: Upload BOM
    API Server->>Koala: Upload BOM
    Koala->>Koala: Validate BOM
    Koala->>API Server: Location of BOM in Koala (URL)
    API Server->>API Server: Generate and persist correlation ID<br/>identifying the upload
    API Server->>Kafka: Publish event to "BOM uploaded" topic
    Note over API Server, Kafka: Key=Project UUID<br/>Value=Koala URL<br/>Header=Correlation ID
    API Server->>Client: Report correlation ID
    loop continuously
        API Server->>Kafka: Consume from "BOM uploaded" topic
        loop for each event
            API Server->>Koala: Fetch BOM
            API Server->>API Server: Process BOM
            alt processing failed
                API Server->>API Server: Update status of upload in DB to "failed"
                API Server->>Kafka: Publish event to "BOM Processing failed" topic
            else processing succeeded
                API Server->>API Server: Update status of upload in DB to "successful"
                API Server->>Kafka: Publish event to "BOM Processed" topic
                API Server->>API Server: Trigger vuln analysis etc.
            end
        end
    end
```
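The upload path of the diagram can be sketched with plain JDK types, using an in-memory map in place of the status table and a queue in place of the Kafka topic. All class and member names here are hypothetical illustrations, not actual Hyades code; the event shape mirrors the diagram (key = project UUID, value = BOM location, header = correlation ID).

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class BomUploadFlow {

    // Event shape from the diagram: key = project UUID,
    // value = BOM location in storage, header = correlation ID.
    record BomUploadedEvent(UUID projectUuid, String bomUrl, String correlationId) {}

    // Stand-ins for the DB status table and the "BOM uploaded" Kafka topic.
    final Map<String, String> uploadStatusByCorrelationId = new ConcurrentHashMap<>();
    final BlockingQueue<BomUploadedEvent> bomUploadedTopic = new LinkedBlockingQueue<>();

    /** Handles an upload: persist a correlation ID, publish the event, report the ID. */
    String handleUpload(UUID projectUuid, String bomUrl) {
        String correlationId = UUID.randomUUID().toString();
        uploadStatusByCorrelationId.put(correlationId, "PROCESSING");
        bomUploadedTopic.add(new BomUploadedEvent(projectUuid, bomUrl, correlationId));
        return correlationId;
    }
}
```

The important property is that the correlation ID is persisted *before* the event is published, so a client can always resolve the ID it was given, even if the consumer has not picked up the event yet.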

Warning
Because BOM processing can take multiple minutes for huge BOMs, it is not viable to perform it in Kafka Streams, where short processing times are mandatory (see #529). We either need to offload processing to a separate thread pool, or write custom logic around the low-level Kafka consumer.
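The "separate thread pool" option could look roughly like the following stdlib-only sketch (class and method names hypothetical). A semaphore bounds the number of in-flight BOMs; a real implementation around the low-level Kafka consumer would additionally `pause()` the partitions while saturated and `resume()` them later, so `poll()` keeps being called and the consumer is not evicted from the group, and would commit offsets only after processing completes.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class OffloadedProcessor {

    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    // Backpressure: at most 8 BOMs in flight instead of unbounded queueing.
    private final Semaphore inFlight = new Semaphore(8);
    final AtomicInteger processed = new AtomicInteger();

    /** Called with a batch of BOM locations, e.g. one poll()'s worth of records. */
    void onRecords(List<String> bomUrls) {
        for (String url : bomUrls) {
            inFlight.acquireUninterruptibly(); // blocks when saturated
            pool.submit(() -> {
                try {
                    processBom(url); // potentially minutes of work
                    processed.incrementAndGet();
                } finally {
                    inFlight.release();
                }
            });
        }
    }

    void processBom(String bomUrl) { /* fetch from storage, ingest into DB */ }

    void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```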

We need to look into proper AuthN / AuthZ for the Koala service. The BOM Repository Server does not have those built in.

Focusing a little more on the client side, existing workflows should continue to work:

```mermaid
sequenceDiagram
    Client->>+API Server: Upload BOM
    API Server->>Client: Report correlation ID
    loop continuously
        Client->>API Server: Is BOM still being processed?<br/>(Using correlation ID)
        alt processing ongoing
            API Server->>Client: "true"
        else processing completed
            API Server->>Client: "false"
            Client->>Client: Stop polling
        end
    end
    Client->>API Server: Fetch findings
    Client->>API Server: Fetch policy violations
```
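The client-side polling loop is a small amount of glue; a stdlib sketch follows. The `isProcessing` predicate stands in for the HTTP call that queries the API server by correlation ID, so the sketch stays self-contained; all names are hypothetical.

```java
import java.time.Duration;
import java.util.function.Predicate;

public class UploadPoller {

    /**
     * Polls "is the BOM still being processed?" until the server answers false,
     * or the attempt budget is exhausted. Returns true once processing completed.
     */
    static boolean awaitProcessed(String correlationId,
                                  Predicate<String> isProcessing,
                                  Duration interval,
                                  int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (!isProcessing.test(correlationId)) {
                return true; // processing completed; stop polling
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // budget exhausted; caller decides how to proceed
    }
}
```

Once `awaitProcessed` returns true, the client proceeds to fetch findings and policy violations as before.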
@nscuro nscuro added this to Hyades Jun 30, 2023
@nscuro nscuro moved this to Todo in Hyades Jun 30, 2023
@nscuro nscuro added size/XL Higher effort and removed size/L High effort labels Jul 7, 2023
@mehab mehab added p1 Critical bugs that prevent DT from being used, or features that must be implemented ASAP p2 Non-critical bugs, and features that help organizations to identify and reduce risk and removed p1 Critical bugs that prevent DT from being used, or features that must be implemented ASAP labels Jul 7, 2023
nscuro commented Jul 11, 2023

Decoupled the centralized tracking of processing status(es) into #664. Deprioritizing this issue.

@nscuro nscuro removed this from Hyades Jul 11, 2023
@nscuro nscuro added p3 Nice-to-have features and removed p2 Non-critical bugs, and features that help organizations to identify and reduce risk labels Jul 11, 2023
nscuro commented Jul 18, 2024

Project Koala / Transparency Exchange API will not arrive anytime soon, so we need to evaluate alternatives.

I don't think we should introduce a generic blob storage for this yet. Instead, we might want to consider storing uploaded BOMs in a new table, as a BYTEA column.

BOMs can be arbitrarily large. While Postgres compresses large values, we still need to send all that data over the wire twice (once for storage, once for retrieval), and the default compression is not particularly good.

We already bring in zstd-jni via kafka-clients. I did some testing, and it is possible to compress a ~22MB JSON BOM to ~1MB with reasonable resource consumption using zstd. I would thus propose that we perform compression and decompression in the application.
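The application-side compress/decompress round-trip could look like this sketch. To keep the example dependency-free it uses `java.util.zip` (Deflate); the actual proposal is zstd via the zstd-jni library that kafka-clients already pulls in, which compresses typical JSON BOMs considerably better. Class and method names are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class BomCompression {

    /** Compress a BOM before writing it to the BYTEA column. */
    static byte[] compress(byte[] bom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos =
                 new DeflaterOutputStream(out, new Deflater(Deflater.BEST_COMPRESSION))) {
            dos.write(bom);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen with in-memory streams
        }
        return out.toByteArray();
    }

    /** Decompress a BOM after reading it back from the database. */
    static byte[] decompress(byte[] compressed) {
        try (InflaterInputStream iis =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            return iis.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Doing this in the application also means the wire-transfer cost mentioned above is paid on the compressed size, not the raw size.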

@nscuro nscuro added this to the 0.6.0 milestone Jul 18, 2024
@nscuro nscuro self-assigned this Aug 2, 2024
nscuro added a commit that referenced this issue Aug 5, 2024
@nscuro nscuro modified the milestones: 0.6.0, 0.7.0 Aug 22, 2024
nscuro added a commit that referenced this issue Sep 20, 2024
@nscuro nscuro mentioned this issue Oct 2, 2024
@nscuro nscuro added this to Hyades Oct 19, 2024
Projects: Hyades (status: In Progress)