Remove the need of comparison artifact in order to make propagation self-sufficient #60

Open
davidferlay opened this issue Nov 19, 2024 · 5 comments

Comments

@davidferlay
Contributor

davidferlay commented Nov 19, 2024

Current

cd readme:

  • Overall Propagation Workflow:
    • Search and download the artifact to compare the build to.
    • Find the list of different files between the build and the artifact.
    • Identify list of resources which version should be propagated.
    • Build propagation map.
    • Update resources in build dir.

In summary, the current build is compared with an artifact built from a previous commit in order to propagate the versions of resources that have changed.

Expected

  • Main idea:

    • We can collect the date of last change for all resources in parallel with building the dependency tree, and use the resulting timeline to build the propagation map
    • The idea would be to avoid storing results of propagation at all by re-propagating all resources and all variables every time sync is executed
    • It means that all versions will be re-computed every time sync is executed, but it has the great advantage of not relying on an external artifact file
    • It also means that every resource will have a different comparison point, roughly based on the latest changes made to it
  • Process could be:

    • Build the dependency tree
    • Take note of which namespace (which git repo) resource comes from
    • Get the date of last change for the version value in meta/plasma.yaml using something similar to git --no-pager log -1 --oneline -- integration/flows/roles/account_person_identification/meta/plasma.yaml (but one that actually looks only at the version value and filters out any other change)
    • For each resource which has parents, compare both child and parent dates: if the child has been modified later than the parent, the child version can be propagated to the parent. If not, the child version is not propagated to the parent.
    • Do that bottom-to-top comparison iteratively on all resources and all their parents to build the full propagation map
    • In case a parent resource has multiple child resources, the process can be sped up by comparing the meta version change dates of all children with each other to decide which child will give its version to the parent resource, without iterating over every other child of the same parent
    • Concerning the propagation of group_vars and vault variables: each variable used in a specific resource can be part of the same "candidate list" where child resources are listed. The variable's change/addition/removal date will be compared to the meta version change date of all children, and only the most recent one will give its version to the parent.
  • So if we look back and try to make sense of it all from the start, process could be:

    • make list of all resources post-composition
    • find date of last change for all resources meta version
    • make list of all variables from group_vars
    • find date of last change for all variables from group_vars
    • find all resources where these variables are used
    • make list of all variables from vaults
    • find all resources where these variables are used
    • find date of last change for all variables from vaults
    • then for each of all resources, iterate this way:
      • make "candidate list" which includes: all children resources + all variables from group_vars used in this resource + all variables from vaults used in this resource
      • compare the "change date" of all candidates in the "candidate list" to decide which one will give its version to the current resource (the version of the candidate with the most recent change date)
      • propagate version of elected candidate to resource (as ZZZ in current dual syntax XXX-ZZZ)
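
The per-resource step above can be sketched roughly as follows. This is only an illustration in Python, not the project's implementation: the `Candidate` type, the `elect` and `propagate` names, and the sample version strings are all hypothetical. It also assumes, following the child/parent date comparison described earlier, that a candidate only propagates when it changed later than the resource itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    name: str         # child resource or variable (hypothetical identifier)
    version: str      # version the candidate would give to the resource
    changed_on: date  # date of last change of the candidate


def elect(candidates):
    """Return the candidate with the most recent change date, or None."""
    return max(candidates, key=lambda c: c.changed_on, default=None)


def propagate(own_version, own_changed_on, candidates):
    """Return XXX-ZZZ when a newer candidate exists, else XXX unchanged."""
    winner = elect(candidates)
    # Assumption: a candidate older than the resource itself is ignored.
    if winner is None or winner.changed_on <= own_changed_on:
        return own_version
    return f"{own_version}-{winner.version}"
```

With children and variables merged into one list, a resource needs a single `max` over its candidates instead of one comparison pass per child.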

For variables from vaults, it's a bit trickier to find out precisely in which commit each variable was last changed, since git can't tell: the whole file is encrypted. Therefore it's probably necessary to compare with previous vault states until the commit where each vault variable was last updated is found.
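
A minimal sketch of that backward comparison, assuming the vault has already been decrypted at each commit that touched it and is available as a list of `(commit_id, variables)` pairs ordered newest first. The input shape and the function name `last_change_commits` are assumptions made for illustration.

```python
def last_change_commits(snapshots):
    """Map each variable of the newest snapshot to the commit that last changed it.

    snapshots: list of (commit_id, {variable: value}), newest first.
    """
    if not snapshots:
        return {}
    head_commit, head_vars = snapshots[0]
    result = {}
    for name, value in head_vars.items():
        last = head_commit
        # Walk back while the value is unchanged; stop at the first difference.
        for commit_id, older_vars in snapshots[1:]:
            if name not in older_vars or older_vars[name] != value:
                break
            last = commit_id
        result[name] = last
    return result
```

Only variables present in the newest snapshot are reported, so a variable that was deleted from the vault gets no entry in this sketch.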

Performance optimization

Let's consider an example where optimization may be required:

  • Resource A is child of resource B
  • Resource B is child of resource C
  • Variable X from group_vars is used in resource C

Let's also consider change date for each of these:

  • Resource A change date is 2nd of November
  • Resource B change date is 3rd of November
  • Resource C change date is 1st of November
  • Variable X change date is 4th of November

Using a standard topological sort to process the graph bottom-to-top, ensuring that children are processed before their parents, will in this case result in a lot of iterations:

  • Version of resource A will first propagate to resources B and C
  • Version of resource B will then propagate to resource C
  • Version of variable X will finally propagate to resource C
    Meaning resource C will have its version changed 3 times, whereas 2 of these changes are wasted, since only 1 of them will remain.

How can we make sure that no unnecessary processing is performed in such a case?

  • Instead of propagating versions iteratively in multiple passes using the bottom-to-top approach, we can chronologically sort all resources + variables by change date and compute the propagation map from most recent to least recent. Once a resource version has been updated, it can be marked as 'processed' and skipped in further processing, ensuring that the resource's version is updated only once, directly to its final state.
  • It's important to note that the resources + variables sorted by change date define which "candidate group" to process, and it's their parent, if any, that will thereafter be marked as processed. In further iterations, any "candidate group" competing to update the version of a parent resource will skip any parent already processed.
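
The newest-first pass can be sketched on the A/B/C/X example above. This is a hedged illustration only: `build_propagation_map`, the `changed_on` and `parents` dictionaries and the year 2024 are assumptions, and the sketch walks all ancestors of each item rather than only direct parents.

```python
from datetime import date


def build_propagation_map(changed_on, parents):
    """Return {resource: item whose version it receives}.

    changed_on: {item: date of last change}, resources and variables alike.
    parents:    {item: [resources that directly depend on / use it]}.
    """
    propagation = {}
    # Newest first: the first item to reach a parent is its final source.
    for item in sorted(changed_on, key=changed_on.get, reverse=True):
        stack = list(parents.get(item, []))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            stack.extend(parents.get(ancestor, []))
            if ancestor in propagation:
                continue  # already processed, final version is set
            if changed_on[ancestor] >= changed_on[item]:
                continue  # ancestor changed later than this item
            propagation[ancestor] = item
    return propagation


changed_on = {
    "A": date(2024, 11, 2),
    "B": date(2024, 11, 3),
    "C": date(2024, 11, 1),
    "X": date(2024, 11, 4),
}
parents = {"A": ["B"], "B": ["C"], "X": ["C"]}
result = build_propagation_map(changed_on, parents)  # {"C": "X"}
```

Resource C is written once, directly with the version of X. B receives nothing because it changed later than its only child A.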

Final result

  • All resources with children should have been propagated and end up with dual version syntax
@davidferlay
Contributor Author

davidferlay commented Dec 20, 2024

  • Updated resources in domain repo
  • Successive updates of different resources in domain repo (with no dependency on each other)
  • Successive updates of different resources in domain repo then package (with no dependency on each other)
  • Successive updates of different resources in package (with no dependency on each other)
  • Successive updates of different resources in package repo then domain repo (with no dependency on each other)
  • Added resources in a package
  • Added resources in domain repo (app)
  • Added resources in domain repo (service)
  • Added resources in domain repo (software)
  • Updated resources in domain repo (services)
  • Removed resources in domain repo (softwares)
  • Removed resources in a package (softwares)
  • Successive updates of resources with low dep in domain repo and dependants in package repo (with dependency on each other)
  • Successive updates of resources with low dep in package repo and dependants in package repo (with dependency on each other)
  • Successive updates of resources with low dep in package repo and dependants in domain repo (with dependency on each other)
  • Updated variable in an existing group_vars in domain repo
  • [KO] Removed variable in an existing group_vars in domain repo
  • Added variable in an existing group_vars in domain repo
  • [KO] Removed group_vars file in domain repo
  • Added group_vars file in domain repo
  • Updated variable in an existing vault in domain repo
  • [KO] Removed variable in an existing vault in domain repo
  • Added variable in an existing vault in domain repo
  • [KO] Removed vault file in domain repo
  • Added vault file in domain repo
  • Successive updates of resources with low dep in package repo and dependants in domain repo (with dependency on each other) then updated variable in an existing group_vars/vault in domain repo
  • --list-impacted-resources removed
  • Print a warning during propagation when version of resource is not a bump commit but a regular commit (with -vvv)
  • Throw a warning during propagation when version in meta does not match an existing commit
  • Throw an error during propagation when version in build dir does not match head of resource (if --allow-override is not set)
  • Throw a warning during propagation when version in build dir does not match head of resource (if --allow-override is set)
  • Move propagated version also in logs
  • Add progress bar or throbber
  • Add an INFO print "Processing propagation..."
  • Move warning from warning to debug: "duplicate found, parsing YAML file manually"
  • Iterate timeline from latest to oldest to avoid processing same resource more than one time
  • Reduce git history iteration
  • Determine only used resources and look up their versions
  • Determine only used variables and look up their versions

TLDR: overall works as expected 👌

The only behavior lost in translation is detection/propagation of deleted group_vars/vault variables, which we could probably live with (at least for now) considering the other advantages these changes bring to the table.

@iignatevich
Collaborator

iignatevich commented Dec 20, 2024

@davidferlay
how much of the ideas from the description is implemented, @iignatevich?

  • does each resource use a kind of "candidate list"?
  • is each resource processed only once for performance? (similar to what is described in "Performance optimization")

The logic before "then for each of all resources, iterate this way" already worked as described; I just improved speed (code quality).

Iterating over git history and retrieving versions of variables/resources are the longest operations (as you could see from the progress bars).

There is no candidate list as described above, but something close to it. Every dependency between them is pre-calculated before processing git history:

  • resource used in resources [a,b,c]
  • variable used in variables [a,b,c]
  • variable used in resources [a,b,c]

Later, this data is used while iterating over timeline items to gather the list of dependent resources/variables that should receive the propagated version.
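
A pre-calculated "used in" index of this kind could look roughly like the following sketch. The `invert` helper and the shape of its input (`{consumer: [things it uses]}`) are assumptions for illustration, not the actual data structures of the code.

```python
from collections import defaultdict


def invert(uses):
    """Turn {consumer: [things it uses]} into {thing: [consumers]}."""
    used_in = defaultdict(set)
    for consumer, dependencies in uses.items():
        for dependency in dependencies:
            used_in[dependency].add(consumer)
    # Sorted lists keep the output deterministic.
    return {dep: sorted(consumers) for dep, consumers in used_in.items()}
```

The same helper can be applied three times, once per relation (resource to resources, variable to variables, variable to resources), so that the lookup during timeline iteration is a plain dictionary access.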

Resources and variables are processed once, but some variables and resources can depend on resources/variables from the next timeline item, so they are iterated again; everything uses already-calculated data.

Most of the time is spent iterating over git and gathering versions, as we open meta.yaml / vars.yaml / vault.yaml in each commit.

My suggestion for improvement here is:

Right now, to get the latest version for a resource/variable, we iterate over the full git history, meaning we open YAML files from bytes for each commit.

Idea is to:

- iterate full git history first
- determine commits where meta / group_vars / vault files were updated
- iterate only these commits to open yaml files and process what we need to process.

Basically, simulate git log --oneline -- x to get the list of commits where a file was changed/added,
as go-git can't do that properly.
It will cut processing of unnecessary commits -> less YAML parsing -> should improve speed.
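
The two-pass idea can be sketched as a simple filter, assuming a first pass over history already yields the changed paths of each commit. The input shape, the function name `commits_touching` and the watched file names are illustrative assumptions, and this does not use or model the go-git API.

```python
def commits_touching(history, watched_suffixes):
    """Return ids of commits that changed at least one watched file.

    history: list of (commit_id, [changed paths]) in traversal order.
    watched_suffixes: path endings to watch, e.g. "meta/plasma.yaml".
    """
    watched = tuple(watched_suffixes)
    return [
        commit_id
        for commit_id, changed_paths in history
        if any(path.endswith(watched) for path in changed_paths)
    ]
```

Only the commits returned here would then have their YAML files opened and parsed; every other commit is skipped before any parsing happens.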

@iignatevich
Collaborator

iignatevich commented Dec 20, 2024

@davidferlay
why not tag a resource already processed as "not to be processed anymore"?
because according to the timeline, only the last change of a resource/dep/var gives its version
the goal of the "candidate list" is partly that, but the main mechanism is to process the timeline from latest to oldest
see what I mean?

I didn't consider this and took the logic that was working before. I think it's a good change to iterate the timeline from newest to oldest; in this case there will be fewer checks and version overrides.

@iignatevich
Collaborator

iignatevich commented Dec 20, 2024

@davidferlay

we can improve how list of all resources to process is created

  • current: we look for meta/plasma.yaml recursively?
    • proposal: we can parse playbooks like platform/platform.yaml, interaction/interaction.yaml etc. to get the list of top-level resources and then list their dependencies recursively
    • That will have the advantage of not processing any resource which is in the repo but not used by ansible (which can be a lot)
  • we can improve how list of variables to process is created
    • current: we look for all group_vars and vault files to list all existing variables, then we check in which resources they are used
    • proposal: we can use the list of "actually used resources" described in the previous point to check which variables are used in them, then look for these variables in group_vars and vaults, then process only these variables
    • That will have the advantage of not processing any variable which is in the repo but not used by any of the actually-used resources (which can be a lot)
  • processing latest-to-oldest idea we already discussed
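
The "only used resources, then only used variables" proposal can be sketched as a graph walk from the top-level resources. This is a sketch under assumptions: the playbooks are taken as already parsed into a `top_level` list, and `dependencies` and `variables_by_resource` are hypothetical pre-built dictionaries.

```python
def used_resources(top_level, dependencies):
    """Collect top-level resources and everything they depend on, recursively."""
    used = set()
    stack = list(top_level)
    while stack:
        resource = stack.pop()
        if resource in used:
            continue
        used.add(resource)
        stack.extend(dependencies.get(resource, []))
    return used


def used_variables(resources, variables_by_resource):
    """Collect the variables referenced by the given resources."""
    found = set()
    for resource in resources:
        found.update(variables_by_resource.get(resource, ()))
    return found
```

Resources that exist in the repo but are unreachable from any playbook never enter `used`, so neither they nor the variables only they reference are looked up in git history.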

Agreed on these points; we can tweak it more. It boils down to this list:

  • reduce git history iteration
  • determine only used resources and look for their versions
  • determine only used variables and look for their versions
  • iterate timeline latest to oldest, to reduce the amount of dependency cherry-picking

Not a big change for variables, because we parse per file, not per variable,
unless none of the variables in a file are used. To be implemented last, I guess.

@davidferlay
Contributor Author

davidferlay commented Dec 24, 2024

Merging current state for real situation testing

Let's create another PR for performance optimisation
