Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite Selective Check in Python #22327

Closed

Conversation

edithturn
Copy link
Contributor

@edithturn edithturn commented Mar 16, 2022

@edithturn
Copy link
Contributor Author

edithturn commented Mar 19, 2022

@potiuk I created two different files: scc_get_changed_files.py and selective_ci_checks.py and in ci.yaml I am just calling to selective_ci_checks.py.
I have a doubt about this statement:
Screenshot from 2022-03-18 20-21-44
I understand this:

  1. Direct Push (PR is merged) doesn't generate a commit hash (SHA)
  2. Merge Commit generates a commit hash

Both are "merged" why Direct Push doesn't generate a commit SHA?

On the other hand, could you explain to me the selected part of this script?
I know that we are trying to get all the names of the changed files from a commit SHA. I saw also that this symbol "^" in bash is to keep lower case the first letter of a string (In python I am having the same issue "ambiguous argument")

CHANGED_FILES=$(git diff-tree --no-commit-id --name-only \
-r "${INCOMING_COMMIT_SHA}^" "${INCOMING_COMMIT_SHA}" || true)

Screenshot from 2022-03-18 20-44-27

Thanks in advance!

This is a great task, I am getting fun 💃🏼

@potiuk
Copy link
Member

potiuk commented Mar 20, 2022

Direct Push (PR is merged) doesn't generate a commit hash (SHA)

It also generates commit hash, but we do not care about it. For Direct Push/Merge we always want to run all posible tests so in this case selective check turns into "run everything possible".

Merge Commit generates a commit hash

Yep. We have it in all cases (because we are using it to assign the "tag" to images we build). But in case of Pull Request Github prepares a "merge commit" containing only changes coming from the PR, and the commit HASH we have is this particular commit.

CHANGED_FILES=$(git diff-tree --no-commit-id --name-only -r "${INCOMING_COMMIT_SHA}^" "${INCOMING_COMMIT_SHA} || true)

This is just a way how you can get the list of files when you get an incoming commit:

  1. git diff-tree produces information about difference between two commits
  2. --no-commit-id --name-only -> this causes git diff-tree to only show file names that were changed between the two commits
  3. -r "${INCOMING_COMMIT_SHA}^" -> this is the parent (^) of the incoming merge commit
  4. "${INCOMING_COMMIT_SHA}" -> this is the incoming commit

So what we get at the end is the list of changed files between the "parent of commit" and "commit" that is incoming from the PR.

@potiuk
Copy link
Member

potiuk commented Mar 20, 2022

I saw also that this symbol "^" in bash is to keep lower case the first letter of a string (In python I am having the same issue "ambiguous argument")

The ^ is not a bash construct. It's GIT's "commit parent". It is so called "commit-ish" reference - one of the many ways you can refer to a commit in git.

For example HEAD^ means direct (first) parent of the commit, HEAD^^ or HEAD~2 is (following first parents grand-parent and HEAD^2 means second parent (if commit is a merge commit and has more than one parent).

All the "commit-ish" references can be found here: https://mirrors.edge.kernel.org/pub/software/scm/git/docs/gitrevisions.html#_specifying_revisions

BTW. one of the great things in our "rebase" workflow is that we have no merge commits. Every commit (after pull request is merged) has exactly one parent, so it makes all the reasoning about changes WAY simpler (as you do not have commits with multiple parents, you know exactly which files has changed by making a diff between the commit and it's parent).

@edithturn edithturn force-pushed the test-rewrite-selective-check-python branch from 9ea3f67 to 678cc39 Compare March 20, 2022 15:16
@edithturn
Copy link
Contributor Author

edithturn commented Mar 20, 2022

Thanks, @potiuk. I am retrieving all the files that are modified in the PR. Let me know now if the statement bellow is in the right way.

Process:
Steps 1, 2, and 3 will be in the Selective Check main script. The other steps will be independent scripts. This is just an example of many other python calls.

1. First Python Call: Build image
1.1. Return all changed files
1.2. Validate parameters of the commit like SHA, Kind of PR
1.3 Determinate if the IMAGE should be BUILD
2. Second Python Call: Python Version to Build
2.1. Return all changed files
2.2. Validate parameters of the commit like SHA, Kind of PR
2.3 Determinate what version of pythons build
3. Process the validation of 1.3 and 2.3

Also, I created another step in ci.yaml for selective_ci_checks.py

https://github.com/apache/airflow/runs/5619304320?check_suite_focus=true#step:6:587

@potiuk
Copy link
Member

potiuk commented Mar 21, 2022

Thanks, @potiuk. I am retrieving all the files that are modified in the PR. Let me know now if the statement bellow is in the right way.

Process: Steps 1, 2, and 3 will be in the Selective Check main script. The other steps will be independent scripts. This is just an example of many other python calls.

Also, I created another step in ci.yaml for selective_ci_checks.py

https://github.com/apache/airflow/runs/5619304320?check_suite_focus=true#step:6:587

Good start. I think there are lots of good things there :). One thing though I thin we should improve

  1. I think all the checks should still be in on file "selective_checks.py". This will make it easier to manage and share code.
  2. This selective_checks.py should be run through entrypoint (airflow-selective-checks) - very similarly to airlflow-freespace
  3. Then in CI instead of running ./scripts/ci/.... you will run airlflow-selective-checks same as freespace :)
  4. Now. I think the clue here is to get diffrent commands in this one airflow-selective-checks). I can imagine such commands:
  • airflow-selective-checks matrix-strategy

This one (based on the checks) should either print :

  • python-versions:

    • ::set-output name=python-versions::[''3.7', '3.8', '3.9', '3.10'] - if we determine that all python versions should be used
    • ::set-output name=python-versions::[''3.7'] - if we determine that only this one python version should be used
      Plus few more outputs that are related:
  • python-versions-list-as-string

  • kubernetes-versions

  • kubernetes-versions-list-as-string

  • all-python-versions

  • mysql-versions

  • mssql-versions

  • default-mssql-version

  • .....

  • postgres-exclude

  • mysql-exclude

And so on - basically all the parameters that determine what should be the matrix of tests executed.

Then the next step will be to build sther commands:

  • airflow-selective-checks image-build-needed
    ::set-output name=image-build::true or ::set-output name=image-build::true depending if we determine that images are needed or not

  • airflow-selective-checks doc-build-needed
    similarly but for docs-build

  • airflow-selective-checks test-types
    Similarly but for test-types

But we can do it gradually. The matrix-strategy seems to be good start,

I hope it makes sense :)

@edithturn edithturn force-pushed the test-rewrite-selective-check-python branch from b5e509e to 792927b Compare March 22, 2022 14:48
@edithturn
Copy link
Contributor Author

edithturn commented Mar 25, 2022

@potiuk I understood the idea. Thanks!

I tried to make the first and second points. Created the entrypoint for "selective-checks-python" However, for freespace we need to install "Setup python" and "python -m pip install --editable ./dev/breeze/" which are installed in another job.

In the case of "selective check python", I added this dependency to make the entrypoint work. And I am getting "permission denied" => [Errno 13] Permission denied: '/usr/lib/python3.8/site-packages', Looking for this issue there are 03 possible solutions:
Screenshot from 2022-03-25 16-58-03
You can see more about these solutions here: googlesamples/assistant-sdk-python#236 (comment)

I tried "--user" and "venv", they are not working. build-info is the first job in ci.yaml, for that reason it doesn't have all the inputs necessary to build things. Is this right?
Here is my failed job: https://github.com/apache/airflow/runs/5698251497?check_suite_focus=true

@edithturn edithturn force-pushed the test-rewrite-selective-check-python branch 5 times, most recently from 752403d to 877b0b7 Compare March 29, 2022 23:51
@edithturn
Copy link
Contributor Author

@potiuk I have a doubt about how to send the parameter "${GITHUB_SHA}" to the entry point in ci.yaml, I was thinking something like this: airflow-selective-checks ${GITHUB_SHA}
I also think it should be like "freespace", when we run it we can send parameters like --help and others.

What would be the best here?

@potiuk
Copy link
Member

potiuk commented Mar 31, 2022

@potiuk I have a doubt about how to send the parameter "${GITHUB_SHA}" to the entry point in ci.yaml, I was thinking something like this: airflow-selective-checks ${GITHUB_SHA} I also think it should be like "freespace", when we run it we can send parameters like --help and others.

What would be the best here?

add --github-sha as Click parameter with GITHUB_SHA envvar. That should work.

@edithturn edithturn force-pushed the test-rewrite-selective-check-python branch 2 times, most recently from b1ba34c to ed6fd40 Compare April 1, 2022 13:10
@edithturn
Copy link
Contributor Author

@potiuk please ignore that this job is failing, I just wanna know if this is the right way about your last feedback: #22327 (comment)

Please, can see the code of the structure of the files. I restructured the files I am working on.

  • Selective check would work like free-space, with python click for matrix strategy.
    airflow-selective-checks -m "matrix_strategy"
  • There would be a python file of all validations and checks.
  • The matrix strategy function is calling all these validations before printing python versions (just for now)

@@ -0,0 +1,48 @@
# Licensed to the Apache Software Foundation (ASF) under one
Copy link
Member

@potiuk potiuk Apr 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those files should all go to selective_checks package under ci package.

@@ -0,0 +1,24 @@
#!/usr/bin/env python3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not need "read_only" - it was only to get more sanity for BASH code, but using it in Python is not needed.

@@ -0,0 +1,71 @@
#!/usr/bin/env python3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do not need a separate command for "get_changed_files". This is (at least for now) not very useful. But it should be a "utility" function. something that other real "commands" can use.

I think our commands should be centered around the "functionality" we need from selective checks.

Just to explain what I envision:

airflow-selective-checks  build-image

This should produce one output:

::set-output name=image-build::true

or

::set-output name=image-build::false

Depends on the set of changed files.

Similarly we should have other commands:

airflow-selective-checks  matrix-strategy

This one should produce all the "matrix" components (also as ga-output):

  • python, backend, kubernetes etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the above should use "get_changed_files" but as an internal Python method rather than click "command".

@edithturn edithturn force-pushed the test-rewrite-selective-check-python branch 5 times, most recently from faade06 to 43085f6 Compare April 12, 2022 17:53
@edithturn edithturn force-pushed the test-rewrite-selective-check-python branch 2 times, most recently from 9edfa58 to e0edcca Compare April 14, 2022 06:17
@edithturn
Copy link
Contributor Author

@potiuk thank you so much.
I have taken your feedback into account. Please tell me if the following is correct:

  • To build an Image, I have to identify which files have been modified, for example, if .py were modified I will build the image (here I would call the BUILD IMAGE function that is already built), and thus I would evaluate the types of files that were modified to know whether or not to build the image.

@edithturn edithturn force-pushed the test-rewrite-selective-check-python branch from e0edcca to ad3edb8 Compare April 14, 2022 14:12
@potiuk
Copy link
Member

potiuk commented Apr 25, 2022

Hey @edithturn - After #23193 and #23205 - this should be part of https://github.com/apache/airflow/blob/main/dev/breeze/src/airflow_breeze/commands/configuration_and_maintenance.py

Eventually it turned out that instead of standalone "freespace" etc. incorporating those tools into breeze command seems to be a much better and consistent solution.

I think maybe simply let's start "small" -> let's merge a tool first that alllows to print "changed files" list as first step?

for example breeze selective-checks list-changed-files :) ? . Click has the options to use sub-commands we have not used it so far) but this one is an obvious candidate). We will have to do the same for "kind" commands - so this might be a good opportunity for you to see how it is done.

Then I could attempt to split this change into a number of smallers "selective check tasks" to make it easier and possibly parallelise the work. Same with kind :).

@edithturn edithturn closed this Jul 7, 2022
@edithturn
Copy link
Contributor Author

Closing this one, It was worked for Jarek in another issue #24610 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Rewrite selective check script in Python
2 participants