Remove the need of comparison artifact in order to make propagation self-sufficient #60

davidferlay · 2024-11-19T14:05:19Z

Current

Overall Propagation Workflow:
- Search and download the artifact to compare the build to.
- Find the list of different files between the build and the artifact.
- Identify the list of resources whose version should be propagated.
- Build propagation map.
- Update resources in build dir.

In summary, the current build is compared with an artifact built from a previous commit in order to propagate the versions of resources that have changed.

Expected

Main idea:
- We can collect the date of last change for every resource while building the dependency tree, and use the resulting timeline to build the propagation map
- The idea is to avoid storing propagation results at all, by re-propagating all resources and all variables every time sync is executed
- This means all versions are recomputed every time sync is executed, but it has the great advantage of not relying on an external artifact file
- It also means that, in effect, every resource has its own comparison point, based on the latest changes made to it
Process could be:
- Build the dependency tree
- Take note of which namespace (which git repo) each resource comes from
- Get the date of the last change to the version value in meta/plasma.yaml, using something similar to git --no-pager log -1 --oneline -- integration/flows/roles/account_person_identification/meta/plasma.yaml (but looking only at the version value and filtering out any other change)
- For each resource that has parents, compare the child and parent dates: if the child was modified later than the parent, the child version can be propagated to the parent. Otherwise, it is not propagated.
- Do that bottom-to-top comparison iteratively on all resources and all their parents to build the full propagation map
- When a parent resource has multiple child resources, the process can be sped up by comparing the meta version change dates of all children with each other to decide which child gives its version to the parent, without iterating over every other child of the same parent
- Concerning the propagation of group_vars and vault variables: each variable used in a specific resource can be part of the same "candidate list" in which child resources are listed. The variable's change/addition/removal date is compared to the meta version change dates of all children, and only the most recent candidate gives its version to the parent.
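A minimal sketch of the date lookup and the child/parent comparison described above, in Python for brevity. The helper names (`git_last_change_cmd`, `last_change_date`, `child_propagates`) are made up for illustration and are not part of the tool. Note one simplification: this dates the last commit touching the whole file, whereas the proposal wants the last change of the version value only.

```python
import subprocess
from datetime import datetime, timezone
from typing import List, Optional


def git_last_change_cmd(path: str) -> List[str]:
    """Build a git command printing the committer timestamp (%ct, unix
    seconds) of the most recent commit that touched `path`."""
    return ["git", "--no-pager", "log", "-1", "--format=%ct", "--", path]


def last_change_date(repo_dir: str, path: str) -> Optional[datetime]:
    """Run the command in `repo_dir`; return None if no commit touched `path`.
    Simplification: any change to the file counts, not only the version value."""
    out = subprocess.run(
        git_last_change_cmd(path),
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.strip()
    return datetime.fromtimestamp(int(out), tz=timezone.utc) if out else None


def child_propagates(child_date: datetime, parent_date: datetime) -> bool:
    """A child gives its version to a parent only if it changed later."""
    return child_date > parent_date
```

Using the unix timestamp avoids depending on how a given git version formats ISO dates.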
So if we look back and try to make sense of it all from the start, process could be:
- make list of all resources post-composition
- find date of last change for all resources meta version
- make list of all variables from group_vars
- find date of last change for all variables from group_vars
- find all resources where these variables are used
- make list of all variables from vaults
- find all resources where these variables are used
- find date of last change for all variables from vaults
- then for each of all resources, iterate this way:
  - make "candidate list" which includes: all children resources + all variables from group_vars used in this resource + all variables from vaults used in this resource
  - compare the "change date" of all candidates in the "candidate list" to decide which one gives its version to the current resource (the version of the candidate with the most recent change date)
  - propagate version of elected candidate to resource (as ZZZ in current dual syntax XXX-ZZZ)
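The election step could look roughly like the sketch below. The `Candidate` type, its fields, and the assumption that XXX is the resource's own version are illustrative only; the version strings are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str      # child resource or variable name
    kind: str      # "resource", "group_var" or "vault_var" (hypothetical labels)
    version: str   # version this candidate would give to the parent
    changed: date  # date of last change


def elect(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the most recent change date, if any."""
    return max(candidates, key=lambda c: c.changed, default=None)


def dual_version(own: str, elected: Optional[Candidate]) -> str:
    """Render XXX-ZZZ, assuming XXX is the resource's own version and
    ZZZ the elected candidate's version."""
    return f"{own}-{elected.version}" if elected else own
```

A resource without any candidate keeps its plain version, which matches the expectation that only resources with children end up with the dual syntax.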

For variables from vaults, it is trickier to find out precisely in which commit each variable was last changed, since git cannot tell: the whole file is encrypted. It is therefore probably necessary to compare against previous vault states until the commit where each vault variable was last updated is found.
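One way to picture that walk through previous vault states, assuming the vault has already been decrypted at each commit that touched it (decryption itself is out of scope here, and `last_change_commit` is a hypothetical helper):

```python
from typing import Any, Dict, List, Optional, Tuple

# (commit id, decrypted vault content at that commit)
Snapshot = Tuple[str, Dict[str, Any]]

_MISSING = object()  # marks "variable absent", so removals count as changes


def last_change_commit(snapshots_newest_first: List[Snapshot], var: str) -> Optional[str]:
    """Walk from the newest state back in time and return the commit in
    which `var` last changed (added, modified or removed)."""
    pairs = zip(snapshots_newest_first, snapshots_newest_first[1:])
    for (commit, state), (_, older) in pairs:
        if state.get(var, _MISSING) != older.get(var, _MISSING):
            return commit
    # No difference found: the variable dates from the oldest snapshot, if present.
    if snapshots_newest_first and var in snapshots_newest_first[-1][1]:
        return snapshots_newest_first[-1][0]
    return None
```

The loop stops at the first difference, so a variable changed recently costs only a few comparisons.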

Performance optimization

Let's consider an example where optimization may be required:

Resource A is child of resource B
Resource B is child of resource C
Variable X from group_vars is used in resource C

Let's also consider change date for each of these:

Resource A change date is 2nd of November
Resource B change date is 3rd of November
Resource C change date is 1st of November
Variable X change date is 4th of November

Using a standard topological sort to process the graph bottom-to-top, ensuring that children are processed before their parents, will in this case result in a lot of iterations:

Version of resource A will first propagate to resource B and C
Version of resource B will then propagate to resource C
Version of variable X will finally propagate to resource C
Meaning resource C will have its version changed 3 times, whereas 2 of these changes were not useful, since only 1 of them remains.

How can we make sure that no unnecessary processing is performed in such a case?

Instead of propagating versions iteratively in multiple passes using a bottom-to-top approach, we can chronologically sort all resources + variables by change date and compute the propagation map from most recent to least recent. Once a resource version has been updated, it can be marked as 'processed' and skipped in further processing, which ensures that the resource’s version is updated only once, directly to its final state.
It's important to note that the resources + variables sorted by change date define which "candidate group" to process, and it is their parent, if any, that is thereafter marked as processed. In later iterations, any "candidate group" competing to update the version of a parent resource skips any parent already processed.
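As a rough sketch of this newest-to-oldest pass, applied to the A/B/C/X example above. It assumes, as in the example, that each timeline item propagates to all of its not-yet-processed ancestors; `propagation_map` and the dictionary layout are illustrative, not the tool's actual data model.

```python
from datetime import date
from typing import Dict, List


def propagation_map(changed: Dict[str, date], parents: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each parent resource to the item whose version it receives.
    A parent present in the result counts as 'processed'."""
    result: Dict[str, str] = {}
    # Newest first; ties are broken by name to keep the output deterministic.
    for item in sorted(changed, key=lambda n: (changed[n], n), reverse=True):
        stack = list(parents.get(item, []))
        while stack:
            parent = stack.pop()
            if parent in result:
                # Already set by a more recent item, and so were its ancestors.
                continue
            result[parent] = item
            stack.extend(parents.get(parent, []))
    return result


changed = {
    "A": date(2024, 11, 2),
    "B": date(2024, 11, 3),
    "C": date(2024, 11, 1),
    "X": date(2024, 11, 4),
}
parents = {"A": ["B"], "B": ["C"], "X": ["C"]}  # child or variable -> parents
print(propagation_map(changed, parents))  # {'C': 'X', 'B': 'A'}
```

Resource C is written once, with the version of variable X, instead of three times.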

Final result

All resources with children should have been propagated and end up with dual version syntax


davidferlay · 2024-12-20T13:25:20Z

TLDR: overall works as expected 👌

The only behavior lost in translation is detection/propagation of deleted group_var/vault variables, which we could probably live with (at least for now) considering the other advantages these changes bring to the table.

iignatevich · 2024-12-20T13:33:58Z

@davidferlay
how much is implemented from description ideas of @iignatevich

does each resource use a kind of "candidate list"?

is each resource processed only once for performance? (similarly to what is described in "Performance optimization")

The logic before "then for each of all resources, iterate this way" already worked just as described; I just improved speed (code quality).

Iterating git history and retrieving versions of variables/resources are the longest operations (as you can see from the progress bars).

There is no candidate list as described above, but something close to it. Every dependency is pre-calculated before processing git history:

resource used in resources [a,b,c]
variable used in variables [a,b,c]
variable used in resources [a,b,c]

Later, this data is used while iterating timeline items to gather the list of dependent resources/variables on which to set the version to propagate.

Resources/variables are processed once, but some variables and resources can depend on resources/variables from the next timeline item, so they are iterated again; everything uses already-calculated data.

Most of the time is spent iterating git and gathering versions, as we open meta.yaml / vars.yaml / vault.yaml in each commit.

My suggestion for improvement here is:

Right now, to get the latest version of a resource/variable, we iterate the full git history, meaning we open yaml files from bytes for each commit.

Idea is to:

- iterate full git history first
- determine commits where meta / group_vars /vault files were updated
- iterate only these commits to open yaml files and process what we need to process.

Basically, simulate git log --oneline -- x to get the list of commits where a file was changed/added, as go-git can't do that properly.
It will cut processing of unnecessary commits -> less yaml parsing -> should improve speed.
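The filtering step might be sketched like this, in Python for brevity. It assumes the first pass over history already yields, per commit, the list of paths it changed; the `Commit` type and the watched file suffixes are assumptions for illustration.

```python
from typing import Iterable, List, NamedTuple, Sequence, Tuple


class Commit(NamedTuple):
    sha: str
    changed_paths: Sequence[str]  # paths touched by this commit


# Hypothetical suffixes for the meta / group_vars / vault files of interest.
WATCHED_SUFFIXES: Tuple[str, ...] = ("meta/plasma.yaml", "vars.yaml", "vault.yaml")


def relevant_commits(
    history: Iterable[Commit],
    suffixes: Tuple[str, ...] = WATCHED_SUFFIXES,
) -> List[Commit]:
    """Keep only commits touching at least one watched file, preserving
    order, so that yaml parsing happens for those commits only."""
    return [
        c for c in history
        if any(p.endswith(suffixes) for p in c.changed_paths)
    ]
```

Only the commits returned here would then need their yaml files opened and parsed.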

iignatevich · 2024-12-20T13:35:09Z

@davidferlay
why not tag a resource already processed as "not to be processed anymore"?
because according to the timeline, only the last change of a resource/deps/var gives its version
the goal of the "candidate list" is partly that, but the main mechanism is to process the timeline from latest to oldest
see what I mean?

I didn't consider this and took the logic that was working before. I think it's a good change to iterate the timeline from newest to oldest; in that case there will be fewer checks and version overrides.

iignatevich · 2024-12-20T13:38:01Z

@davidferlay

we can improve how list of all resources to process is created

current: we look for meta/plasma.yaml recursively?

proposal: we can parse playbooks like platform/platform.yaml, interaction/interaction.yaml etc. to get the list of top-level resources and then list their dependencies recursively

That will have the advantage of not processing any resource that is in the repo but not used by ansible (which can be a lot)
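A small sketch of that traversal, assuming the top-level resources and the dependency map have already been parsed out of the playbooks and meta files (parsing is not shown, and `used_resources` is a hypothetical name):

```python
from typing import Dict, Iterable, List, Set


def used_resources(top_level: Iterable[str], deps: Dict[str, List[str]]) -> Set[str]:
    """Return every resource reachable from the top-level ones by
    following dependencies; anything else in the repo is ignored."""
    used: Set[str] = set()
    stack = list(top_level)
    while stack:
        resource = stack.pop()
        if resource in used:
            continue  # already visited, also guards against cycles
        used.add(resource)
        stack.extend(deps.get(resource, []))
    return used
```

Resources that exist in the repo but are never reached from a playbook simply never enter the set, so their versions are never looked up.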

we can improve how list of variables to process is created

current: we look for all group_vars and vaults files to list all existing variables, then we check in which resources they are used

proposal: we can use the list of "actually used resources" described in the previous point to check which variables are used in them, then look for these variables in group_vars and vaults, then process only these variables

That will have the advantage of not processing any variable that is in the repo but not used by any of the actually used resources (which can be a lot)

processing latest-to-oldest idea we already discussed

Agreed on these points; we can tweak it more. It comes down to this list:

- reduce git history iteration
- determine only used resources and look for their versions
- determine only used variables and look for their versions
- iterate the timeline latest to oldest, to reduce the amount of dependency cherry-picking

Not a big change for variables, because we parse per file, not per variable, unless none of the variables in a file are used. To be implemented last, I guess.

davidferlay · 2024-12-24T12:55:35Z

Merging current state for real situation testing

Let's create another PR for performance optimisation

iignatevich added a commit that referenced this issue Dec 5, 2024

#60: propagation without artifact

573e732

iignatevich added a commit that referenced this issue Dec 12, 2024

#60: propagation without artifact

5478516

iignatevich added a commit that referenced this issue Dec 12, 2024

#60: ability to hack build with uncommitted values

0307bfe

iignatevich added a commit that referenced this issue Dec 24, 2024

#60: propagation without artifact

4d11262

davidferlay pushed a commit that referenced this issue Dec 24, 2024

#60: propagation without artifact

8e05a0b
