backfill: add --sparse option
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, but the sparse-checkout
also reduces the number of objects that must be downloaded from a promisor
remote.

However, history investigations can be expensive, as computing blob diffs will
trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time when the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.
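
As an illustration of the batching described above, here is a minimal C
sketch. It assumes that all missing blobs seen at one path are queued
together and that a request is issued once the queue reaches the batch size,
which matches the documented note that the size may be exceeded by the last
set of blobs seen at a given path. The names 'struct sketch_batch',
'sketch_add_path' and 'sketch_finish' are hypothetical and are not part of
Git.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batch state; Git's real code keeps an oid_array instead. */
struct sketch_batch {
	size_t pending;    /* objects queued but not yet requested */
	size_t batch_size; /* minimum number of objects per request */
	size_t requests;   /* number of requests issued so far */
};

/*
 * Queue every missing blob seen at one path, then issue a single request
 * once the queue holds at least batch_size objects. A path is never split,
 * so a request may carry more than batch_size objects.
 * Returns 1 if a request was issued, 0 otherwise.
 */
int sketch_add_path(struct sketch_batch *b, size_t missing_at_path)
{
	assert(b != NULL);
	b->pending += missing_at_path;
	if (b->pending < b->batch_size)
		return 0;
	b->requests++;
	b->pending = 0;
	return 1;
}

/* Request whatever is still queued once the walk has finished. */
void sketch_finish(struct sketch_batch *b)
{
	assert(b != NULL);
	if (b->pending) {
		b->requests++;
		b->pending = 0;
	}
}
```

Under these assumptions, an interrupted run has lost at most the batch that
was in flight, since objects from completed requests are already present
locally and are no longer missing on the next run.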

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and non-cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout.
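
The difference between the two modes can be sketched as a small,
self-contained C function. It mirrors the check this patch adds to
add_children() in path-walk.c, but the enums and the function name
'sketch_should_skip' are simplified stand-ins rather than Git's own
definitions.

```c
#include <assert.h>

/* Simplified stand-ins for Git's object types and pattern match results. */
enum sketch_type { SKETCH_TREE, SKETCH_BLOB };
enum sketch_match {
	SKETCH_UNDECIDED,
	SKETCH_NOT_MATCHED,
	SKETCH_MATCHED,
	SKETCH_MATCHED_RECURSIVE
};

/*
 * Return 1 if the walk should skip this child path, 0 if it should keep it.
 *
 * Cone mode: any path that does not match is skipped, trees included, so
 * whole directories outside the cone are pruned from the walk.
 *
 * Non-cone mode: trees are always kept, since a matching file may sit at
 * any depth; only blobs that do not match are skipped.
 */
int sketch_should_skip(int use_cone_patterns,
		       enum sketch_type type,
		       enum sketch_match match)
{
	assert(type == SKETCH_TREE || type == SKETCH_BLOB);
	if (use_cone_patterns)
		return match == SKETCH_NOT_MATCHED;
	return type == SKETCH_BLOB && match != SKETCH_MATCHED;
}
```

For example, a non-matching tree is skipped in cone mode but still walked in
non-cone mode, which is why the non-cone test has to visit every directory.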

Signed-off-by: Derrick Stolee <[email protected]>
derrickstolee authored and dscho committed Jan 7, 2025
1 parent a8f4d79 commit 0e911de
Showing 10 changed files with 177 additions and 10 deletions.
6 changes: 5 additions & 1 deletion Documentation/git-backfill.txt
@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
SYNOPSIS
--------
[verse]
-'git backfill' [--batch-size=<n>]
+'git backfill' [--batch-size=<n>] [--[no-]sparse]

DESCRIPTION
-----------
@@ -46,6 +46,10 @@ OPTIONS
from the server. This size may be exceeded by the last set of
blobs seen at a given path. Default batch size is 16,000.

--[no-]sparse::
Only download objects if they appear at a path that matches the
current sparse-checkout.

SEE ALSO
--------
linkgit:git-clone[1].
8 changes: 8 additions & 0 deletions Documentation/technical/api-path-walk.txt
@@ -65,6 +65,14 @@ better off using the revision walk API instead.
the revision walk so that the walk emits commits marked with the
`UNINTERESTING` flag.

`pl`::
This pattern list pointer allows focusing the path-walk search to
a set of patterns, only emitting paths that match the given
patterns. See linkgit:gitignore[5] or
linkgit:git-sparse-checkout[1] for details about pattern lists.
When the pattern list uses cone-mode patterns, then the path-walk
API can prune the set of paths it walks to improve performance.

Examples
--------

20 changes: 19 additions & 1 deletion builtin/backfill.c
@@ -4,6 +4,7 @@
#include "parse-options.h"
#include "repository.h"
#include "commit.h"
#include "dir.h"
#include "hex.h"
#include "tree.h"
#include "tree-walk.h"
@@ -21,14 +22,15 @@
#include "path-walk.h"

static const char * const builtin_backfill_usage[] = {
-N_("git backfill [--batch-size=<n>]"),
+N_("git backfill [--batch-size=<n>] [--[no-]sparse]"),
NULL
};

struct backfill_context {
struct repository *repo;
struct oid_array current_batch;
size_t batch_size;
int sparse;
};

static void clear_backfill_context(struct backfill_context *ctx)
@@ -84,6 +86,15 @@ static int do_backfill(struct backfill_context *ctx)
struct path_walk_info info = PATH_WALK_INFO_INIT;
int ret;

if (ctx->sparse) {
CALLOC_ARRAY(info.pl, 1);
if (get_sparse_checkout_patterns(info.pl)) {
clear_pattern_list(info.pl);
free(info.pl);
return error(_("problem loading sparse-checkout"));
}
}

repo_init_revisions(ctx->repo, &revs, "");
handle_revision_arg("HEAD", &revs, 0, 0);

@@ -102,6 +113,10 @@ static int do_backfill(struct backfill_context *ctx)

clear_backfill_context(ctx);
release_revisions(&revs);
if (info.pl) {
clear_pattern_list(info.pl);
free(info.pl);
}
return ret;
}

@@ -111,10 +126,13 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
.repo = repo,
.current_batch = OID_ARRAY_INIT,
.batch_size = 50000,
.sparse = 0,
};
struct option options[] = {
OPT_INTEGER(0, "batch-size", &ctx.batch_size,
N_("Minimum number of objects to request at a time")),
OPT_BOOL(0, "sparse", &ctx.sparse,
N_("Restrict the missing objects to the current sparse-checkout")),
OPT_END(),
};

10 changes: 3 additions & 7 deletions dir.c
@@ -1092,10 +1092,6 @@ static void invalidate_directory(struct untracked_cache *uc,
dir->dirs[i]->recurse = 0;
}

-static int add_patterns_from_buffer(char *buf, size_t size,
-const char *base, int baselen,
-struct pattern_list *pl);

/* Flags for add_patterns() */
#define PATTERN_NOFOLLOW (1<<0)

@@ -1185,9 +1181,9 @@ static int add_patterns(const char *fname, const char *base, int baselen,
return 0;
}

-static int add_patterns_from_buffer(char *buf, size_t size,
-const char *base, int baselen,
-struct pattern_list *pl)
+int add_patterns_from_buffer(char *buf, size_t size,
+const char *base, int baselen,
+struct pattern_list *pl)
{
char *orig = buf;
int i, lineno = 1;
3 changes: 3 additions & 0 deletions dir.h
@@ -467,6 +467,9 @@ void add_patterns_from_file(struct dir_struct *, const char *fname);
int add_patterns_from_blob_to_list(struct object_id *oid,
const char *base, int baselen,
struct pattern_list *pl);
int add_patterns_from_buffer(char *buf, size_t size,
const char *base, int baselen,
struct pattern_list *pl);
void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
void add_pattern(const char *string, const char *base,
int baselen, struct pattern_list *pl, int srcpos);
18 changes: 18 additions & 0 deletions path-walk.c
@@ -10,6 +10,7 @@
#include "hex.h"
#include "object.h"
#include "oid-array.h"
#include "repository.h"
#include "revision.h"
#include "string-list.h"
#include "strmap.h"
@@ -119,6 +120,23 @@ static int add_children(struct path_walk_context *ctx,
if (type == OBJ_TREE)
strbuf_addch(&path, '/');

if (ctx->info->pl) {
int dtype;
enum pattern_match_result match;
match = path_matches_pattern_list(path.buf, path.len,
path.buf + base_len, &dtype,
ctx->info->pl,
ctx->repo->index);

if (ctx->info->pl->use_cone_patterns &&
match == NOT_MATCHED)
continue;
else if (!ctx->info->pl->use_cone_patterns &&
type == OBJ_BLOB &&
match != MATCHED)
continue;
}

if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
CALLOC_ARRAY(list, 1);
list->type = type;
11 changes: 11 additions & 0 deletions path-walk.h
@@ -6,6 +6,7 @@

struct rev_info;
struct oid_array;
struct pattern_list;

/**
* The type of a function pointer for the method that is called on a list of
@@ -46,6 +47,16 @@ struct path_walk_info {
* walk the children of such trees.
*/
int prune_all_uninteresting;

/**
* Specify a sparse-checkout definition to match our paths to. Do not
* walk outside of this sparse definition. If the patterns are in
* cone mode, then the search may prune directories that are outside
* of the cone. If not in cone mode, then all tree paths will be
* explored but the path_fn will only be called when the path matches
* the sparse-checkout patterns.
*/
struct pattern_list *pl;
};

#define PATH_WALK_INFO_INIT { \
21 changes: 20 additions & 1 deletion t/helper/test-path-walk.c
@@ -1,6 +1,7 @@
#define USE_THE_REPOSITORY_VARIABLE

#include "test-tool.h"
#include "dir.h"
#include "environment.h"
#include "hex.h"
#include "object-name.h"
@@ -9,6 +10,7 @@
#include "revision.h"
#include "setup.h"
#include "parse-options.h"
#include "strbuf.h"
#include "path-walk.h"
#include "oid-array.h"

@@ -67,7 +69,7 @@ static int emit_block(const char *path, struct oid_array *oids,

int cmd__path_walk(int argc, const char **argv)
{
-int res;
+int res, stdin_pl = 0;
struct rev_info revs = REV_INFO_INIT;
struct path_walk_info info = PATH_WALK_INFO_INIT;
struct path_walk_test_data data = { 0 };
@@ -82,6 +84,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of tree objects")),
OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
N_("toggle pruning of uninteresting paths")),
OPT_BOOL(0, "stdin-pl", &stdin_pl,
N_("read a pattern list over stdin")),
OPT_END(),
};

@@ -101,6 +105,17 @@ int cmd__path_walk(int argc, const char **argv)
info.path_fn = emit_block;
info.path_fn_data = &data;

if (stdin_pl) {
struct strbuf in = STRBUF_INIT;
CALLOC_ARRAY(info.pl, 1);

info.pl->use_cone_patterns = 1;

strbuf_fread(&in, 2048, stdin);
add_patterns_from_buffer(in.buf, in.len, "", 0, info.pl);
strbuf_release(&in);
}

res = walk_objects_by_path(&info);

printf("commits:%" PRIuMAX "\n"
@@ -109,6 +124,10 @@ int cmd__path_walk(int argc, const char **argv)
"tags:%" PRIuMAX "\n",
data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);

if (info.pl) {
clear_pattern_list(info.pl);
free(info.pl);
}
release_revisions(&revs);
return res;
}
55 changes: 55 additions & 0 deletions t/t5620-backfill.sh
@@ -77,6 +77,61 @@ test_expect_success 'do partial clone 2, backfill batch size' '
test_line_count = 0 revs2
'

test_expect_success 'backfill --sparse' '
git clone --sparse --filter=blob:none \
--single-branch --branch=main \
"file://$(pwd)/srv.bare" backfill3 &&
# Initial checkout includes four files at root.
git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
test_line_count = 44 missing &&
# Initial sparse-checkout is just the files at root, so we get the
# older versions of the four files at tip.
GIT_TRACE2_EVENT="$(pwd)/sparse-trace1" git \
-C backfill3 backfill --sparse &&
test_trace2_data promisor fetch_count 4 <sparse-trace1 &&
test_trace2_data path-walk paths 5 <sparse-trace1 &&
git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
test_line_count = 40 missing &&
# Expand the sparse-checkout to include 'd' recursively. This
# engages the algorithm to skip the trees for 'a'. Note that
# the "sparse-checkout set" command downloads the objects at tip
# to satisfy the current checkout.
git -C backfill3 sparse-checkout set d &&
GIT_TRACE2_EVENT="$(pwd)/sparse-trace2" git \
-C backfill3 backfill --sparse &&
test_trace2_data promisor fetch_count 8 <sparse-trace2 &&
test_trace2_data path-walk paths 15 <sparse-trace2 &&
git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
test_line_count = 24 missing
'

test_expect_success 'backfill --sparse without cone mode' '
git clone --no-checkout --filter=blob:none \
--single-branch --branch=main \
"file://$(pwd)/srv.bare" backfill4 &&
# No blobs yet
git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
test_line_count = 48 missing &&
# Define sparse-checkout by filename regardless of parent directory.
# This downloads 6 blobs to satisfy the checkout.
git -C backfill4 sparse-checkout set --no-cone "**/file.1.txt" &&
git -C backfill4 checkout main &&
GIT_TRACE2_EVENT="$(pwd)/no-cone-trace1" git \
-C backfill4 backfill --sparse &&
test_trace2_data promisor fetch_count 6 <no-cone-trace1 &&
# This walk needed to visit all directories to search for these paths.
test_trace2_data path-walk paths 12 <no-cone-trace1 &&
git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
test_line_count = 36 missing
'

. "$TEST_DIRECTORY"/lib-httpd.sh
start_httpd

35 changes: 35 additions & 0 deletions t/t6601-path-walk.sh
@@ -108,6 +108,41 @@ test_expect_success 'all' '
test_cmp expect.sorted out.sorted
'

test_expect_success 'base & topic, sparse' '
cat >patterns <<-EOF &&
/*
!/*/
/left/
EOF
test-tool path-walk --stdin-pl -- base topic <patterns >out &&
cat >expect <<-EOF &&
COMMIT::$(git rev-parse topic)
COMMIT::$(git rev-parse base)
COMMIT::$(git rev-parse base~1)
COMMIT::$(git rev-parse base~2)
commits:4
TREE::$(git rev-parse topic^{tree})
TREE::$(git rev-parse base^{tree})
TREE::$(git rev-parse base~1^{tree})
TREE::$(git rev-parse base~2^{tree})
TREE:left/:$(git rev-parse base:left)
TREE:left/:$(git rev-parse base~2:left)
trees:6
BLOB:a:$(git rev-parse base~2:a)
BLOB:left/b:$(git rev-parse base~2:left/b)
BLOB:left/b:$(git rev-parse base:left/b)
blobs:3
tags:0
EOF
sort expect >expect.sorted &&
sort out >out.sorted &&
test_cmp expect.sorted out.sorted
'

test_expect_success 'topic only' '
test-tool path-walk -- topic >out &&