Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request API in repository_ctx to download multiple files simultaneously #19674

Closed
jacky8hyf opened this issue Sep 28, 2023 · 12 comments
Closed
Assignees
Labels
help wanted Someone outside the Bazel team could own this P2 We'll consider working on this in future. (Assignee optional) team-Core Skyframe, bazel query, BEP, options parsing, bazelrc team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: feature request

Comments

@jacky8hyf
Copy link

Description of the feature request:

tl;dr: Request an API that is similar to repository_ctx.download, but download multiple files simultaneously.

Which category does this issue belong to?

Core, External Dependency

What underlying problem are you trying to solve with this feature?

Example to reproduce the issue:

my_repo_rule.bzl

def _impl(repository_ctx):
    repository_ctx.download(
        url = "https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb",
        output = "chrome.deb",
    )
    repository_ctx.download(
        url = "https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm",
        output = "chrome.rpm",
    )

    repository_ctx.file("WORKSPACE", "")
    repository_ctx.file("BUILD", """exports_files(glob(["**"]))""")

my_repo_rule = repository_rule(
    implementation = _impl,
)

WORKSPACE

load(":my_repo_rule.bzl", "my_repo_rule")

my_repo_rule(
    name = "my_ext_repo",
)

BUILD

# empty file

Then run

bazel fetch @my_ext_repo//...

You can see that the two files are downloaded in serial, not in parallel.

To workaround this issue, one could define two repositories. For example:

my_repo_rule.bzl

def _impl(repository_ctx):
    repository_ctx.download(
        url = repository_ctx.attr.url,
        output = "myfile",
    )

    repository_ctx.file("WORKSPACE", "")
    repository_ctx.file("BUILD", """exports_files(glob(["**"]))""")

my_repo_rule = repository_rule(
    implementation = _impl,
    attrs = {"url": attr.string()}
)

WORKSPACE

load(":my_repo_rule.bzl", "my_repo_rule")

my_repo_rule(
    name = "chrome_deb",
    url = "https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb"
)

my_repo_rule(
    name = "chrome_rpm",
    url = "https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm"
)

However, when there are a lot of files, this approach requires creating a lot of repositories.

The benefit of downloading in parallel is:

  • If there are a lot of small files to download, the time savings are big, because most of the time spent on setting up the connection is parallelized. If there are a few big files to download, the time savings are small.
  • I could also do something like the following with good parallelization (pseudocode):
def _impl(repository_ctx):
    repository_ctx.download(<metadata file>)
    metadata = repository_ctx.read(<metadata file>)
    repository_ctx.download_multiple(metadata.urls)

Without the API to download multiple files, the above can only be done in WORKSPACE only.

# WORKSPACE
load(":download.bzl", "download_metadata", "download")
download_metadata(name = "metadata")

load("@metadata//:metadata.bzl", "metadata")
[
  download(name = elem.name, url = elem.url)
  for elem in meatadata
]

I can't hide all these in a single macro because the load statements interleave with the repository definitions.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 6.3.2

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

https://bazel.build/rules/lib/builtins/repository_ctx#download

Any other information, logs, or outputs that you want to share?

No response

@iancha1992 iancha1992 added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Core Skyframe, bazel query, BEP, options parsing, bazelrc labels Sep 28, 2023
@jacky8hyf
Copy link
Author

jacky8hyf commented Sep 28, 2023

This is how Kleaf works around the issue right now (by defining multiple repositories):

https://cs.android.com/android/kernel/superproject/+/common-android-mainline:build/kernel/kleaf/download_repo.bzl;l=356;drc=93375eb99ad216911bca8366cd8b70dd3d00def8

This feature request blocks DDKv2 development (Internal bug 298416462)

@lberki
Copy link
Contributor

lberki commented Oct 9, 2023

@meteorcloudy @Wyverald WDYT?

I could imagine implementing something like ctx.download_multiple(); it would be a non-trivial but small addition to our API surface, although the official Bazel Way at the moment is, as you say, to define multiple repositories and achieve parallelism that way.

@jacky8hyf How big are your downloads and how many of them are there and what pain do you experience from this? As long as you don't have an enormous number, repository-level parallelization should work, even if it is, as you say, a bit clunky.

What I have pretty strong feelings about is that I'd much rather not get into the business of implementing something like Futures (tokens = [ctx.download(url) for url in urls]; [token.wait() for token in tokens]) in Starlark.

From that internal bug, do I understand correctly that this is not a hard blocker, "only" a significant usability hurdle?

@meteorcloudy meteorcloudy added P2 We'll consider working on this in future. (Assignee optional) help wanted Someone outside the Bazel team could own this and removed untriaged labels Oct 9, 2023
@meteorcloudy
Copy link
Member

I agree this would be a nice feature. But the team is quite overloaded currently, community contribution is welcome!

@jacky8hyf
Copy link
Author

jacky8hyf commented Oct 10, 2023

I am working on rules to download kernel prebuilts from ci.android.com, and here are some numbers that may provide some context:

For a typical GKI build, I need to download 24 files to support external driver developments. The files range from empty (modules_blocklist), a few bytes (e.g. toolchain_version) to hundreds of megabytes (vmlinux, 294M)

For a certain build number that involves driver development kit (DDK):

42983565   gki_prebuilts_boot_img_tar_gz/file/boot-img.tar.gz
35205632   gki_prebuilts_Image/file/Image
14210442   gki_prebuilts_Image_gz/file/Image.gz
16654085   gki_prebuilts_Image_lz4/file/Image.lz4
146731     gki_prebuilts_kernel_aarch64_config_outdir_tar_gz/file/kernel_aarch64_config_outdir.tar.gz
191        gki_prebuilts_kernel_aarch64_ddk_headers_archive_tar_gz/metadata.bzl
9607113    gki_prebuilts_kernel_aarch64_ddk_headers_archive_tar_gz/file/kernel_aarch64_ddk_headers_archive.tar.gz
13099      gki_prebuilts_kernel_aarch64_env_sh/file/kernel_aarch64_env.sh
181195     gki_prebuilts_kernel_aarch64_internal_outs_tar_gz/file/kernel_aarch64_internal_outs.tar.gz
8875380    gki_prebuilts_kernel_aarch64_module_env_tar_gz/file/kernel_aarch64_module_env.tar.gz
1673       gki_prebuilts_kernel_aarch64_modules/file/kernel_aarch64_modules
1661306    gki_prebuilts_kernel_uapi_headers_tar_gz/file/kernel-uapi-headers.tar.gz
20328      gki_prebuilts_modules_builtin/file/modules.builtin
133459     gki_prebuilts_modules_builtin_modinfo/file/modules.builtin.modinfo
3922357    gki_prebuilts_modules_prepare_outdir_tar_gz/file/modules_prepare_outdir.tar.gz
7640754    gki_prebuilts_modules_staging_dir_tar_gz/file/modules_staging_dir.tar.gz
0          gki_prebuilts_system_dlkm_modules_blocklist/file/system_dlkm.modules.blocklist
2135       gki_prebuilts_system_dlkm_modules_load/file/system_dlkm.modules.load
7684967    gki_prebuilts_system_dlkm_staging_archive_tar_gz/file/system_dlkm_staging_archive.tar.gz
4710929    gki_prebuilts_System_map/file/System.map
9          gki_prebuilts_toolchain_version/file/toolchain_version
35398730   gki_prebuilts_unstripped_modules_tar_gz/file/unstripped_modules.tar.gz
307406096  gki_prebuilts_vmlinux/file/vmlinux
779305     gki_prebuilts_vmlinux_symvers/file/vmlinux.symvers

From that internal bug, do I understand correctly that this is not a hard blocker, "only" a significant usability hurdle?

This is correct.

@colatkinson
Copy link

I think I'm hitting into a use case for this as well, though in my case splitting the downloads across multiple repos doesn't help, as passing a list of labels to a subsequent repo rule seems to still force downloads to happen serially. Details copied from the Slack thread.

To add some detail, I'm attempting to write a module extension that downloads several artifacts, extracts them, and then manipulates the output (in a way that's not amenable to e.g. a genrule).

Initially I went for the obvious implementation (rctx.download_and_extract() in a loop). Which worked, but had the disadvantage that prompted this FR, i.e. the downloads being serial.

I tried to work around this by changing it to more of a hub-and-spoke model: the module_extension generates a number of http_archives, and then another repo rule takes a label list with the targets from the http_archives.

However, this also seems to result in downloads happening serially -- I'm guessing this is because the label attr prefetch mechanism mentioned in the docs.

I've got a minimal-ish repro here if it helps clarify the use-case at all. Please let me know if there's anything I've missed that would make the workaround as described ineffective.

@lberki
Copy link
Contributor

lberki commented Nov 24, 2023

Draft pull request: #20309

@lberki lberki self-assigned this Dec 1, 2023
@Wyverald
Copy link
Member

Wyverald commented Dec 6, 2023

@bazel-io fork 7.1.0

@aignas
Copy link

aignas commented Dec 15, 2023

Just wanted to say thanks for developing this feature, it makes it much easier to implement bzlmod extensions that can do a few lightweight queries to the internet before calling repository rules without forcing users to wait a long time.

@lberki
Copy link
Contributor

lberki commented Dec 15, 2023

We're glad we could help :)

iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Jan 11, 2024
This is accomplished by adding a "block" argument to the "download" call. If set,
the call returns a "pending download" object with one single method that waits for the download to finish.

Appropriate care is taken that downloads don't hang around after the repository function finishes running.

Fixes bazelbuild#19674 .

RELNOTES: None.
PiperOrigin-RevId: 588320352
Change-Id: Ib0f48b6c7c2a07e93a4af602b0045120bd418829
bazel-io pushed a commit to bazel-io/bazel that referenced this issue Jan 11, 2024
This is accomplished by adding a "block" argument to the "download" call. If set,
the call returns a "pending download" object with one single method that waits for the download to finish.

Appropriate care is taken that downloads don't hang around after the repository function finishes running.

Fixes bazelbuild#19674 .

RELNOTES: None.
PiperOrigin-RevId: 588320352
Change-Id: Ib0f48b6c7c2a07e93a4af602b0045120bd418829
github-merge-queue bot pushed a commit that referenced this issue Jan 11, 2024
…0856)

This is accomplished by adding a "block" argument to the "download"
call. If set, the call returns a "pending download" object with one
single method that waits for the download to finish.

Appropriate care is taken that downloads don't hang around after the
repository function finishes running.

Fixes #19674 .

Commit
73c1a1e

RELNOTES: None.
PiperOrigin-RevId: 588320352
Change-Id: Ib0f48b6c7c2a07e93a4af602b0045120bd418829

Co-authored-by: Googler <[email protected]>
@jacky8hyf
Copy link
Author

May I ask if this is enabled on 7.1.0 onwards? We are on 7.0.2 right now so we are looking for the version number to update to.

@Wyverald
Copy link
Member

Yes, this is in 7.1.0.

@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 7.1.0 RC1. Please test out the release candidate and report any issues as soon as possible. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Someone outside the Bazel team could own this P2 We'll consider working on this in future. (Assignee optional) team-Core Skyframe, bazel query, BEP, options parsing, bazelrc team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: feature request
Projects
None yet
Development

No branches or pull requests

9 participants