Overhaul data model of dependency graphs for better query performance and relationship type support #3452

nscuro · 2024-02-08T09:55:04Z

Current Behavior

How Dependency Graphs Work Today

Both the PROJECT and COMPONENT object have a DIRECT_DEPENDENCIES column, which contains a JSON array of serialized ComponentIdentity objects.

This roughly resembles how dependencies are represented in CycloneDX v1.5 and earlier.

While this approach works, it has a few downsides:

Referential integrity can't be verified by the RDBMS. Since IDs are encoded in JSON, foreign keys can't be utilized.
Poor query performance. Traversing the graph requires expensive LIKE conditions such as "DIRECT_DEPENDENCIES" LIKE ('%' || :childComponentUuid || '%').
- Per default, DT does not create a working index on the DIRECT_DEPENDENCIES column
- A LIKE condition with wildcards on both ends requires a special index, such as GIN in PostgreSQL, which in turn requires an extension that is not enabled per default.
It bloats table sizes and REST API responses, since JSON arrays can become quite large for projects with busy dependency graphs.

To give an example of how traversal queries look like in the current data model, here's one from Hyades that identifies whether a specific component (identified by the parameter :leadComponentUuid) is introduced through another component with the name foo:

Example Query

-- Determine the project the given leaf component is part of.
WITH RECURSIVE
"CTE_PROJECT" AS (
  SELECT
    "PROJECT_ID" AS "ID"
  FROM
    "COMPONENT"
  WHERE
    "UUID" = :leafComponentUuid
),
-- Identify the IDs of all components in the project that
-- match the desired criteria.
"CTE_MATCHES" AS (
  SELECT
    "ID"
  FROM
    "COMPONENT"
  WHERE
    "PROJECT_ID" = (SELECT "ID" FROM "CTE_PROJECT")
    -- Do not consider other leaf nodes (typically the majority of components).
    -- Because we're looking for parent nodes, they MUST have direct dependencies defined.
    AND "DIRECT_DEPENDENCIES" IS NOT NULL
    AND ("NAME" = 'foo') -- NB: These could be arbitrarily many conditions
),
"CTE_DEPENDENCIES" ("UUID", "PROJECT_ID", "FOUND", "PATH") AS (
  SELECT
    "C"."UUID"                                       AS "UUID",
    "C"."PROJECT_ID"                                 AS "PROJECT_ID",
    ("C"."ID" = ANY(SELECT "ID" FROM "CTE_MATCHES")) AS "FOUND",
    ARRAY ["C"."ID"]::BIGINT[]                       AS "PATH"
  FROM
    "COMPONENT" AS "C"
  WHERE
    -- Short-circuit the recursive query if we don't have any matches at all.
    EXISTS(SELECT 1 FROM "CTE_MATCHES")
    -- Otherwise, find components of which the given leaf component is a direct dependency.
    AND "C"."DIRECT_DEPENDENCIES" LIKE ('%' || :leafComponentUuid || '%')
  UNION ALL
  SELECT
    "C"."UUID"                                       AS "UUID",
    "C"."PROJECT_ID"                                 AS "PROJECT_ID",
    ("C"."ID" = ANY(SELECT "ID" FROM "CTE_MATCHES")) AS "FOUND",
    ARRAY_APPEND("PREVIOUS"."PATH", "C"."ID")        AS "PATH"
  FROM
    "COMPONENT" AS "C"
  INNER JOIN
    "CTE_DEPENDENCIES" AS "PREVIOUS" ON "PREVIOUS"."PROJECT_ID" = "C"."PROJECT_ID"
  WHERE
    -- If the previous row was a match already, we're done.
    NOT "PREVIOUS"."FOUND"
    -- Also, ensure we haven't seen this component before, to prevent cycles.
    AND NOT ("C"."ID" = ANY("PREVIOUS"."PATH"))
    -- Otherwise, the previous component must appear in the current direct dependencies.
    AND "C"."DIRECT_DEPENDENCIES" LIKE ('%' || "PREVIOUS"."UUID" || '%')
)
SELECT BOOL_OR("FOUND") FROM "CTE_DEPENDENCIES";

CycloneDX v1.6

CycloneDX v1.6 will introduce the provides attribute to its Dependency model (source):

message Dependency {
  // References a component or service by the its bom-ref attribute
  string ref = 1;
  // The bom-ref identifiers of the components or services that are dependencies of this dependency object.
  repeated Dependency dependencies = 2;
  // The bom-ref identifiers of the components or services that define a given specification or standard, which are provided or implemented by this dependency object.
  repeated string provides = 3;
}

In order to support this, and potential future additions to relationships, we need to rework how we handle dependency graphs.

Proposed Behavior

I propose to revisit how we store and traverse dependency graphs.

Ideally, the graph structure would live in separate tables, and would be able to refer to multiple object types, such as Component, Service, Data, etc.

Instead of JSON, more efficient data types should be used, that are trivial to index and allow for referential integrity verification.

Just to put something out there:

CREATE TABLE "DEPENDENCY_GRAPH_NODE" (
  "ID" BIGSERIAL PRIMARY KEY,
  "PROJECT_ID"   BIGINT REFERENCES "PROJECT" ("ID") NULL,
  "COMPONENT_ID" BIGINT REFERENCES "COMPONENT" ("ID") NULL,
  "SERVICE_ID"   BIGINT REFERENCES "SERVICECOMPONENT" ("ID") NULL,
  -- Ensure that exactly one reference is provided.
  CHECK (
    ("PROJECT_ID" IS NOT NULL)::INT
      + ("COMPONENT_ID" IS NOT NULL)::INT
      + ("SERVICE_ID" IS NOT NULL)::INT = 1
  )
);

CREATE TABLE "DEPENDENCY_GRAPH_EDGE" (
  "FROM" BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
  "TO"   BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
  "TYPE" TEXT NOT NULL,
  CHECK ("TYPE" = ANY(ARRAY['DEPENDS_ON', 'PROVIDES']))
);

Or a less strict variant using JSON to store properties:

CREATE TABLE "DEPENDENCY_GRAPH_NODE" (
  "ID" BIGSERIAL PRIMARY KEY,
  "PROPERTIES" JSON NOT NULL
);

CREATE TABLE "DEPENDENCY_GRAPH_EDGE" (
  "FROM" BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
  "TO"   BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
  "PROPERTIES" JSON NULL
);

However, JSON support and the ability to index such columns varies wildly between RDBMSes.

We need to ensure that whatever data model we end up using, works well with the queries we're going to execute.

In the best case, the new graph structure would allow to also represent relationships between projects.

Checklist

I have read and understand the contributing guidelines
I have checked the existing issues for whether this enhancement was already requested

The text was updated successfully, but these errors were encountered:

nscuro added enhancement New feature or request p2 Non-critical bugs, and features that help organizations to identify and reduce risk cdx-1.6 Related to CycloneDX specification v1.6 performance spike / research labels Feb 8, 2024

nscuro added the size/L High effort label Mar 7, 2024

nscuro mentioned this issue Oct 22, 2024

feat: CBOM support DependencyTrack/hyades-apiserver#933

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overhaul data model of dependency graphs for better query performance and relationship type support #3452

Overhaul data model of dependency graphs for better query performance and relationship type support #3452

nscuro commented Feb 8, 2024 •

edited

Loading

Overhaul data model of dependency graphs for better query performance and relationship type support #3452

Overhaul data model of dependency graphs for better query performance and relationship type support #3452

Comments

nscuro commented Feb 8, 2024 • edited Loading

Current Behavior

How Dependency Graphs Work Today

CycloneDX v1.6

Proposed Behavior

Checklist

nscuro commented Feb 8, 2024 •

edited

Loading