Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul data model of dependency graphs for better query performance and relationship type support #3452

Open
2 tasks done
nscuro opened this issue Feb 8, 2024 · 0 comments
Labels
cdx-1.6 Related to CycloneDX specification v1.6 enhancement New feature or request p2 Non-critical bugs, and features that help organizations to identify and reduce risk performance size/L High effort spike / research

Comments

@nscuro
Copy link
Member

nscuro commented Feb 8, 2024

Current Behavior

How Dependency Graphs Work Today

Both the PROJECT and COMPONENT object have a DIRECT_DEPENDENCIES column, which contains a JSON array of serialized ComponentIdentity objects.

This roughly resembles how dependencies are represented in CycloneDX v1.5 and earlier.

While this approach works, it has a few downsides:

  • Referential integrity can't be verified by the RDBMS. Since IDs are encoded in JSON, foreign keys can't be utilized.
  • Poor query performance. Traversing the graph requires expensive LIKE conditions such as "DIRECT_DEPENDENCIES" LIKE ('%' || :childComponentUuid || '%').
    • Per default, DT does not create a working index on the DIRECT_DEPENDENCIES column
    • A LIKE condition with wildcards on both ends requires a special index, such as GIN in PostgreSQL, which in turn requires an extension that is not enabled per default.
  • It bloats table sizes and REST API responses, since JSON arrays can become quite large for projects with busy dependency graphs.

To give an example of how traversal queries look like in the current data model, here's one from Hyades that identifies whether a specific component (identified by the parameter :leadComponentUuid) is introduced through another component with the name foo:

Example Query
-- Determine the project the given leaf component is part of.
WITH RECURSIVE
"CTE_PROJECT" AS (
  SELECT
    "PROJECT_ID" AS "ID"
  FROM
    "COMPONENT"
  WHERE
    "UUID" = :leafComponentUuid
),
-- Identify the IDs of all components in the project that
-- match the desired criteria.
"CTE_MATCHES" AS (
  SELECT
    "ID"
  FROM
    "COMPONENT"
  WHERE
    "PROJECT_ID" = (SELECT "ID" FROM "CTE_PROJECT")
    -- Do not consider other leaf nodes (typically the majority of components).
    -- Because we're looking for parent nodes, they MUST have direct dependencies defined.
    AND "DIRECT_DEPENDENCIES" IS NOT NULL
    AND ("NAME" = 'foo') -- NB: These could be arbitrarily many conditions
),
"CTE_DEPENDENCIES" ("UUID", "PROJECT_ID", "FOUND", "PATH") AS (
  SELECT
    "C"."UUID"                                       AS "UUID",
    "C"."PROJECT_ID"                                 AS "PROJECT_ID",
    ("C"."ID" = ANY(SELECT "ID" FROM "CTE_MATCHES")) AS "FOUND",
    ARRAY ["C"."ID"]::BIGINT[]                       AS "PATH"
  FROM
    "COMPONENT" AS "C"
  WHERE
    -- Short-circuit the recursive query if we don't have any matches at all.
    EXISTS(SELECT 1 FROM "CTE_MATCHES")
    -- Otherwise, find components of which the given leaf component is a direct dependency.
    AND "C"."DIRECT_DEPENDENCIES" LIKE ('%' || :leafComponentUuid || '%')
  UNION ALL
  SELECT
    "C"."UUID"                                       AS "UUID",
    "C"."PROJECT_ID"                                 AS "PROJECT_ID",
    ("C"."ID" = ANY(SELECT "ID" FROM "CTE_MATCHES")) AS "FOUND",
    ARRAY_APPEND("PREVIOUS"."PATH", "C"."ID")        AS "PATH"
  FROM
    "COMPONENT" AS "C"
  INNER JOIN
    "CTE_DEPENDENCIES" AS "PREVIOUS" ON "PREVIOUS"."PROJECT_ID" = "C"."PROJECT_ID"
  WHERE
    -- If the previous row was a match already, we're done.
    NOT "PREVIOUS"."FOUND"
    -- Also, ensure we haven't seen this component before, to prevent cycles.
    AND NOT ("C"."ID" = ANY("PREVIOUS"."PATH"))
    -- Otherwise, the previous component must appear in the current direct dependencies.
    AND "C"."DIRECT_DEPENDENCIES" LIKE ('%' || "PREVIOUS"."UUID" || '%')
)
SELECT BOOL_OR("FOUND") FROM "CTE_DEPENDENCIES";

CycloneDX v1.6

CycloneDX v1.6 will introduce the provides attribute to its Dependency model (source):

message Dependency {
  // References a component or service by the its bom-ref attribute
  string ref = 1;
  // The bom-ref identifiers of the components or services that are dependencies of this dependency object.
  repeated Dependency dependencies = 2;
  // The bom-ref identifiers of the components or services that define a given specification or standard, which are provided or implemented by this dependency object.
  repeated string provides = 3;
}

In order to support this, and potential future additions to relationships, we need to rework how we handle dependency graphs.

Proposed Behavior

I propose to revisit how we store and traverse dependency graphs.

Ideally, the graph structure would live in separate tables, and would be able to refer to multiple object types, such as Component, Service, Data, etc.

Instead of JSON, more efficient data types should be used, that are trivial to index and allow for referential integrity verification.

Just to put something out there:

CREATE TABLE "DEPENDENCY_GRAPH_NODE" (
  "ID" BIGSERIAL PRIMARY KEY,
  "PROJECT_ID"   BIGINT REFERENCES "PROJECT" ("ID") NULL,
  "COMPONENT_ID" BIGINT REFERENCES "COMPONENT" ("ID") NULL,
  "SERVICE_ID"   BIGINT REFERENCES "SERVICECOMPONENT" ("ID") NULL,
  -- Ensure that exactly one reference is provided.
  CHECK (
    ("PROJECT_ID" IS NOT NULL)::INT
      + ("COMPONENT_ID" IS NOT NULL)::INT
      + ("SERVICE_ID" IS NOT NULL)::INT = 1
  )
);

CREATE TABLE "DEPENDENCY_GRAPH_EDGE" (
  "FROM" BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
  "TO"   BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
  "TYPE" TEXT NOT NULL,
  CHECK ("TYPE" = ANY(ARRAY['DEPENDS_ON', 'PROVIDES']))
);

Or a less strict variant using JSON to store properties:

CREATE TABLE "DEPENDENCY_GRAPH_NODE" (
  "ID" BIGSERIAL PRIMARY KEY,
  "PROPERTIES" JSON NOT NULL
);

CREATE TABLE "DEPENDENCY_GRAPH_EDGE" (
  "FROM" BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
  "TO"   BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
  "PROPERTIES" JSON NULL
);

However, JSON support and the ability to index such columns varies wildly between RDBMSes.

We need to ensure that whatever data model we end up using, works well with the queries we're going to execute.

In the best case, the new graph structure would allow to also represent relationships between projects.

Checklist

@nscuro nscuro added enhancement New feature or request p2 Non-critical bugs, and features that help organizations to identify and reduce risk cdx-1.6 Related to CycloneDX specification v1.6 performance spike / research labels Feb 8, 2024
@nscuro nscuro added the size/L High effort label Mar 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cdx-1.6 Related to CycloneDX specification v1.6 enhancement New feature or request p2 Non-critical bugs, and features that help organizations to identify and reduce risk performance size/L High effort spike / research
Projects
None yet
Development

No branches or pull requests

1 participant