Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use shared builds for some e2e jobs #4105

Merged
merged 17 commits into from
Aug 28, 2017

Conversation

BenTheElder
Copy link
Member

@BenTheElder BenTheElder commented Aug 17, 2017

This will allow some jobs to reuse the same kubernetes build. In particular our PR e2e jobs each do a build and then kubetest --extract=local ... instead of one shared build.

TODO:

  • Make run_after_success jobs able to find builds
  • This is done by logging the path to a text file in GCS similar to pr-logs/directory since we can't just move all the build locations. This file is at gs://some-shared-bucket/$PULL_REFS/(bazel-)?build-location.txt
  • add test(s) to make sure the config does not contain new invalid flag combinations
  • make jobs fall back to local builds
  • Configure PR e2e jobs to use a shared build step

FOLLOW-UP:

  • verify that this works for the test job, expand to all PR e2e jobs

This is a little ugly but after a lot of discussion on-and-offline (#4047, #4074, ...) it seems to be the best solution for the moment. We may want to revisit the entire run_after_success system at some point in the future.

@BenTheElder BenTheElder requested review from krzyzacy and spxtr August 17, 2017 22:15
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 17, 2017
@BenTheElder BenTheElder force-pushed the upload_build_path branch 2 times, most recently from fa9d4cb to 82e0fa6 Compare August 17, 2017 22:53
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 17, 2017
gcs_build = "gs://"+bucket+dest+latest
# and then write it to GCS
gcs_shared = args.gcs_shared + os.getenv('PULL_REFS', '') + '/build-location.txt'
upload_string(gcs_shared, gcs_build)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make the extra steps optional, since only pr build job need to upload to shared builds for now.

Maybe consider only do this when PULL_REFS is valid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM.

if args.use_shared_build:
link += args.use_shared_build + '-'
link += 'build-location.txt'
url = link.replace("gs://", "https://storage.googleapis.com/")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/"/'/ (and other places), and I think gsutil cat might be better

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing the strings, I'm not sure if gsutil cat is going to be any better.

url = link.replace("gs://", "https://storage.googleapis.com/")
build_loc = urllib2.urlopen(url).read()
# tell kubetest to extract from this location
args.kubetest_args += "--extract="+build_loc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll need to add a test to assert other --extract does not exist in config.json, or what's the expect behavior?

@BenTheElder
Copy link
Member Author

BenTheElder commented Aug 17, 2017 via email

@krzyzacy
Copy link
Member

krzyzacy commented Aug 17, 2017

We support multiple --extract flags, it will extract multiple times and it means different things in kubetest, mostly for upgrade from one version to another :-)

(It does not simply overwrite, if that's what you are expecting)

@krzyzacy krzyzacy requested a review from ixdy August 17, 2017 23:10
@BenTheElder
Copy link
Member Author

@krzyzacy Added an entry to this PR's todo list to ensure that the flags are valid in the config, I will do that now.

@spxtr spxtr mentioned this pull request Aug 17, 2017
@BenTheElder
Copy link
Member Author

BenTheElder commented Aug 18, 2017

@spxtr @krzyzacy PTAL. If this makes enough sense I can start migrating some job configs to use run_after_success + --use-shared_build.
EDIT: I have a big refactor of the job configs in #4002 that I'd also like to get in first.

@krzyzacy
Copy link
Member

Please also add some tests to assert the intended behavior.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 21, 2017
@BenTheElder
Copy link
Member Author

BenTheElder commented Aug 21, 2017

@krzyzacy PTAL, I've updated the tests, fixed a few things, and configured the non-blocking PR GPU job to run after a build step.

@BenTheElder BenTheElder changed the title [WIP - DO NOT MERGE] use shared builds for some e2e jobs Use shared builds for some e2e jobs Aug 21, 2017
dest += args.suffix
build_tar = tarfile.open("_output/release-tars/kubernetes.tar.gz")
latest = build_tar.getmember("kubernetes/version").read().strip()
gcs_build = 'gs://'+bucket+dest+latest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not give you the right path - maybe use os.path.join

and I would suggest you add a unittest for both bazel and build scenario as well

try:
build_loc = read_gcs_path(gcs_path)
# tell kubetest to extract from this location
args.kubetest_args.append("--extract="+build_loc)
Copy link
Member

@krzyzacy krzyzacy Aug 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: single quote, and below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, extra space: " + build_loc

@krzyzacy
Copy link
Member

/assign @ixdy

I might have more bikeshed comments but overall the idea should work. I'd also let @ixdy have a chance to go through it overall as well, as he knows more about the build stuff here :-)

@@ -37,6 +37,13 @@ def check_output(*cmd):
print >>sys.stderr, 'Run:', cmd
return subprocess.check_output(cmd)

def upload_string(gcs_path, text):
"""Uploads s to gcs_path"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/s/text/?

# log push-build location to path child jobs can find
# (gs://<shared-bucket>/$PULL_REFS/bazel-build-location.txt)
pull_refs = os.getenv('PULL_REFS', '')
gcs_shared = args.gcs_shared + pull_refs + '/bazel-build-location.txt'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'd probably be safer to use os.path.join here so that you don't end up with something weird if --gcs_shared is missing a trailing slash. i.e. change this to

gcs_shared = os.path.join(args.gcs_shared, pull_refs, 'bazel-build-location.txt`)

dest += args.suffix
build_tar = tarfile.open("_output/release-tars/kubernetes.tar.gz")
latest = build_tar.getmember("kubernetes/version").read().strip()
gcs_build = os.path.join('gs://', bucket, dest, latest)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicating the logic of push-build.sh seems pretty fragile... I'd rather either we remove that logic from push-build.sh or add an option to push-build.sh to create this pointer file. @david-mcmahon

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely fragile, I'm considering if we can just enable this feature for bazel build jobs and skip doing it for non-bazel builds since it's now scoped to just PR jobs anyhow. I'm pretty sure all jobs shouldn't need the special logic for finding builds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed out-of-band, I think we don't need to support this for non-bazel builds so for now I will remove this.

args.build = None
except urllib2.URLError as err:
print >>sys.stderr, 'Failed to get shared build location: %s'%err.reason
print >>sys.stderr, 'Falling back to local build...'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why fallback? this makes things complicated and may result in the job timing out instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback just seemed like a safer way to make this change. I can remove it.

@BenTheElder
Copy link
Member Author

BenTheElder commented Aug 23, 2017

NOTE: this needs to port over #4160 once that is in.
Edit: this is done now.

@BenTheElder
Copy link
Member Author

ping @ixdy @krzyzacy, this has reduced scope now and should be more straightforward. PTAL

extract_in_args = False
build_in_args = False
for arg in args:
if arg.startswith("--use-shared-build"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/"/'/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, also below

gcs_path = os.path.join(args.gcs_shared, os.getenv('PULL_REFS', ''), build_file)
try:
# tell kubetest to extract from this location
args.kubetest_args.append('--extract=' + read_gcs_path(gcs_path))
Copy link
Member

@krzyzacy krzyzacy Aug 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about print gcs_path so that we can have more debug clue on failures

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

'--use-shared-build', nargs='?', default=None, const='',
help='Use prebuilt kubernetes binaries if set, optionally specifying strategy')
parser.add_argument(
'--gcs-shared',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bikeshed: maybe --shared-build-location?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this flag can be used for other shared files and isn't actually the build location but where the "symlink" to the build is.

build_file += args.use_shared_build + '-'
build_file += 'build-location.txt'
gcs_path = os.path.join(args.gcs_shared, os.getenv('PULL_REFS', ''), build_file)
print >>sys.stderr, "Getting shared build location from: "+gcs_path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last nitpick for the quotes!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I seriously need gofmt, but for python.
done.

help='Use prebuilt kubernetes binaries if set, optionally specifying strategy')
parser.add_argument(
'--gcs-shared',
default="gs://kubernetes-jenkins/shared-results/",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here as well 😈

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

we really need a gofmt, but for python :|
# normal path
with Stub(kubernetes_e2e, 'check_env', self.fake_check_env):
with Stub(kubernetes_e2e, 'read_gcs_path', always_kubernetes):
kubernetes_e2e.main(args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, here in the test, I would like to see the gcs path pass into read_gcs_path is correct, both with normal build and bazel build.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@krzyzacy
Copy link
Member

and please squash :-)

@BenTheElder
Copy link
Member Author

Should be green, fixed nits. Will sqaush once everything checks out.

@krzyzacy
Copy link
Member

/lgtm

weee, let's see how it will work!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 28, 2017
@BenTheElder BenTheElder merged commit b75c234 into kubernetes:master Aug 28, 2017
@k8s-ci-robot
Copy link
Contributor

@BenTheElder: I updated Prow config for you!

In response to this:

This will allow some jobs to reuse the same kubernetes build. In particular our PR e2e jobs each do a build and then kubetest --extract=local ... instead of one shared build.

TODO:

  • Make run_after_success jobs able to find builds
  • This is done by logging the path to a text file in GCS similar to pr-logs/directory since we can't just move all the build locations. This file is at gs://some-shared-bucket/$PULL_REFS/(bazel-)?build-location.txt
  • add test(s) to make sure the config does not contain new invalid flag combinations
  • make jobs fall back to local builds
  • Configure PR e2e jobs to use a shared build step

FOLLOW-UP:

  • verify that this works for the test job, expand to all PR e2e jobs

This is a little ugly but after a lot of discussion on-and-offline (#4047, #4074, ...) it seems to be the best solution for the moment. We may want to revisit the entire run_after_success system at some point in the future.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

run_after_success:
- name: pull-kubernetes-e2e-gce-gpu
agent: jenkins
always_run: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does always_run and run_after_success interact?
How does run_after_success and /test all interact?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think @BenTheElder chatted with @mindprince offline, but worth document the behavior as well)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spxtr knows this better, this has since been reverted and is only being used on a canary job while we sort some other things out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref #4475 (comment)
We should definitely doc this more. I've since been a bit confused about the behavior myself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see also some discussion in sig-testing slack: https://kubernetes.slack.com/archives/C09QZ4DQB/p1504904920000295

@BenTheElder BenTheElder deleted the upload_build_path branch October 6, 2017 01:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants